mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-02-23 08:09:48 -05:00
WIP documentation!
parent
a1af18230c
commit
914289b414
280
job-conf.rst
@@ -1,17 +1,19 @@
brozzler job configuration
Brozzler Job Configuration
**************************

Jobs are defined using yaml files. Options may be specified either at the
top-level or on individual seeds. At least one seed url must be specified,
Jobs are defined using yaml files. At least one seed url must be specified,
everything else is optional.

an example
==========
.. contents::

Example
=======

::

    id: myjob
    time_limit: 60 # seconds
    proxy: 127.0.0.1:8000 # point at warcprox for archiving
    ignore_robots: false
    max_claimed_sites: 2
    warcprox_meta:
@@ -35,7 +37,7 @@ an example
scope:
surt: http://(org,example,

how inheritance works
How inheritance works
=====================

Most of the available options apply to seeds. Such options can also be
@@ -79,101 +81,140 @@ Notice that:
- Since ``buckets`` is a list, the merged result includes all the values from
  both the top level and the seed level.

settings reference
Settings reference
==================

Top-level settings
------------------

``id``
------
+-----------+--------+----------+--------------------------+
| scope     | type   | required | default                  |
+===========+========+==========+==========================+
| top-level | string | no       | *generated by rethinkdb* |
+-----------+--------+----------+--------------------------+
~~~~~~
+--------+----------+--------------------------+
| type   | required | default                  |
+========+==========+==========================+
| string | no       | *generated by rethinkdb* |
+--------+----------+--------------------------+
An arbitrary identifier for this job. Must be unique across this deployment of
brozzler.

``seeds``
---------
+-----------+------------------------+----------+---------+
| scope     | type                   | required | default |
+===========+========================+==========+=========+
| top-level | list (of dictionaries) | yes      | *n/a*   |
+-----------+------------------------+----------+---------+
List of seeds. Each item in the list is a dictionary (associative array) which
defines the seed. It must specify ``url`` (see below) and can additionally
specify any of the settings of scope *seed-level*.

``max_claimed_sites``
---------------------
+-----------+--------+----------+---------+
| scope     | type   | required | default |
+===========+========+==========+=========+
| top-level | number | no       | *none*  |
+-----------+--------+----------+---------+
~~~~~~~~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| number | no       | *none*  |
+--------+----------+---------+
Puts a cap on the number of sites belonging to a given job that can be brozzled
simultaneously across the cluster. Addresses the problem of a job with many
seeds starving out other jobs.
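
As a rough sketch (the job id and seed urls here are made up for illustration),
a job with several seeds that should never have more than two of its sites
brozzled at once might look like this::

    id: bigjob
    max_claimed_sites: 2
    seeds:
    - url: http://one.example.org/
    - url: http://two.example.org/
    - url: http://three.example.org/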

``seeds``
~~~~~~~~~
+------------------------+----------+---------+
| type                   | required | default |
+========================+==========+=========+
| list (of dictionaries) | yes      | *n/a*   |
+------------------------+----------+---------+
List of seeds. Each item in the list is a dictionary (associative array) which
defines the seed. It must specify ``url`` (see below) and can additionally
specify any *seed* settings.
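
For illustration, a minimal ``seeds`` list might look like the following. The
urls are hypothetical, and the second seed additionally sets one seed setting
(``time_limit``)::

    seeds:
    - url: http://one.example.org/
    - url: http://two.example.org/
      time_limit: 30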

Seed-level-only settings
------------------------
These settings can be specified only at the seed level, unlike most seed
settings, which can also be specified at the top level.

``url``
-------
+------------+--------+----------+---------+
| scope      | type   | required | default |
+============+========+==========+=========+
| seed-level | string | yes      | *n/a*   |
+------------+--------+----------+---------+
~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | yes      | *n/a*   |
+--------+----------+---------+
The seed url.

``username``
~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``password``
~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

Seed-level / top-level settings
-------------------------------
These are seed settings that can also be specified at the top level, in which
case they are inherited by all seeds.

``metadata``
------------
+-----------------------+------------+----------+---------+
| scope                 | type       | required | default |
+=======================+============+==========+=========+
| seed-level, top-level | dictionary | no       | *none*  |
+-----------------------+------------+----------+---------+
~~~~~~~~~~~~
+------------+----------+---------+
| type       | required | default |
+============+==========+=========+
| dictionary | no       | *none*  |
+------------+----------+---------+
Arbitrary information about the crawl job or site. Merely informative, not used
by brozzler for anything. Could be of use to some external process.
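
Since brozzler does not interpret ``metadata``, the keys are entirely up to the
operator. A sketch with invented keys and values::

    metadata:
      project: example-collection
      requested_by: someone@example.org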

``time_limit``
--------------
+-----------------------+--------+----------+---------+
| scope                 | type   | required | default |
+=======================+========+==========+=========+
| seed-level, top-level | number | no       | *none*  |
+-----------------------+--------+----------+---------+
~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| number | no       | *none*  |
+--------+----------+---------+
Time limit in seconds. If not specified, there is no time limit. Time limit is
enforced at the seed level. If a time limit is specified at the top level, it
is inherited by each seed as described above, and enforced individually on each
seed.
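
A small sketch of that behavior, with made-up urls, assuming a seed-level value
takes precedence over the inherited one. The first seed would be limited to 60
seconds and the second to 30, each counted separately::

    time_limit: 60
    seeds:
    - url: http://one.example.org/
    - url: http://two.example.org/
      time_limit: 30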

``proxy``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
HTTP proxy, with the format ``host:port``. Typically configured to point to
warcprox for archival crawling.
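
For instance, assuming warcprox is listening on port 8000 of the local machine,
as in the example at the top of this document (the seed url is made up)::

    proxy: 127.0.0.1:8000
    seeds:
    - url: http://one.example.org/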

``ignore_robots``
-----------------
+-----------------------+---------+----------+-----------+
| scope                 | type    | required | default   |
+=======================+=========+==========+===========+
| seed-level, top-level | boolean | no       | ``false`` |
+-----------------------+---------+----------+-----------+
~~~~~~~~~~~~~~~~~
+---------+----------+-----------+
| type    | required | default   |
+=========+==========+===========+
| boolean | no       | ``false`` |
+---------+----------+-----------+
If set to ``true``, brozzler will happily crawl pages that would otherwise be
blocked by robots.txt rules.
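
A sketch with hypothetical urls, in which only the second seed ignores
robots.txt while the first keeps the default::

    seeds:
    - url: http://one.example.org/
    - url: http://two.example.org/
      ignore_robots: true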

``user_agent``
--------------
+-----------------------+---------+----------+---------+
| scope                 | type    | required | default |
+=======================+=========+==========+=========+
| seed-level, top-level | string  | no       | *none*  |
+-----------------------+---------+----------+---------+
~~~~~~~~~~~~~~
+---------+----------+---------+
| type    | required | default |
+=========+==========+=========+
| string  | no       | *none*  |
+---------+----------+---------+
The ``User-Agent`` header brozzler will send to identify itself to web servers.
It's good etiquette to include a project URL with a notice to webmasters that
explains why you're crawling, how to block the crawler via robots.txt and how to
contact the operator if the crawl is causing problems.
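
One possible shape for such a value. The crawler name and the project URL are
invented for illustration, and the page behind the URL is assumed to carry the
notice to webmasters::

    user_agent: "my-crawler/1.0 (+https://example.org/crawler-info)"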

``warcprox_meta``
-----------------
+-----------------------+------------+----------+-----------+
| scope                 | type       | required | default   |
+=======================+============+==========+===========+
| seed-level, top-level | dictionary | no       | ``false`` |
+-----------------------+------------+----------+-----------+
~~~~~~~~~~~~~~~~~
+------------+----------+-----------+
| type       | required | default   |
+============+==========+===========+
| dictionary | no       | ``false`` |
+------------+----------+-----------+
Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
configured. The value of the Warcprox-Meta header is a json blob. It is used to
pass settings and information to warcprox. Warcprox does not forward the header
@@ -195,36 +236,95 @@ becomes::
    Warcprox-Meta: {"warc-prefix":"job1-seed1","stats":{"buckets":["job1-stats","job1-seed1-stats"]}}

``scope``
---------
+-----------------------+------------+----------+-----------+
| scope                 | type       | required | default   |
+=======================+============+==========+===========+
| seed-level, top-level | dictionary | no       | ``false`` |
+-----------------------+------------+----------+-----------+
~~~~~~~~~
+------------+----------+-----------+
| type       | required | default   |
+============+==========+===========+
| dictionary | no       | ``false`` |
+------------+----------+-----------+
Scope rules. *TODO*

Scoping
=======

*TODO* explanation of scoping and scope rules

Scope settings
--------------

``surt``
--------
+-------------+--------+----------+---------------------------+
| scope       | type   | required | default                   |
+=============+========+==========+===========================+
| scope-level | string | no       | *generated from seed url* |
+-------------+--------+----------+---------------------------+
~~~~~~~~
+--------+----------+---------------------------+
| type   | required | default                   |
+========+==========+===========================+
| string | no       | *generated from seed url* |
+--------+----------+---------------------------+

``accepts``
-----------
+-------------+------+----------+---------+
| scope       | type | required | default |
+=============+======+==========+=========+
| scope-level | list | no       | *none*  |
+-------------+------+----------+---------+
~~~~~~~~~~~
+------+----------+---------+
| type | required | default |
+======+==========+=========+
| list | no       | *none*  |
+------+----------+---------+

``blocks``
-----------
+-------------+------+----------+---------+
| scope       | type | required | default |
+=============+======+==========+=========+
| scope-level | list | no       | *none*  |
+-------------+------+----------+---------+
~~~~~~~~~~~
+------+----------+---------+
| type | required | default |
+======+==========+=========+
| list | no       | *none*  |
+------+----------+---------+


Scope rule settings
-------------------

``domain``
~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``substring``
~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``regex``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``ssurt``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``surt``
~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``parent_url_regex``
~~~~~~~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
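
None of the scope rule settings is described yet, so the following is only a
guess at how they might fit together. It assumes that each entry in ``accepts``
and ``blocks`` is a dictionary made up of scope rule settings. The ``surt``
value is taken from the example at the top of this document and all other
values are invented::

    scope:
      surt: http://(org,example,
      accepts:
      - domain: example.net
      blocks:
      - substring: logout
      - regex: ^.*/calendar/.*$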