mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-02-24 08:39:59 -05:00
WIP documentation!
parent
a1af18230c
commit
914289b414
280
job-conf.rst
@ -1,17 +1,19 @@
|
|||||||
brozzler job configuration
|
Brozzler Job Configuration
|
||||||
**************************
|
**************************
|
||||||
|
|
||||||
Jobs are defined using yaml files. Options may be specified either at the
|
Jobs are defined using yaml files. At least one seed url must be specified,
|
||||||
top-level or on individual seeds. At least one seed url must be specified,
|
|
||||||
everything else is optional.
|
everything else is optional.
|
||||||
|
|
||||||
an example
|
.. contents::
|
||||||
==========
|
|
||||||
|
Example
|
||||||
|
=======
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
id: myjob
|
id: myjob
|
||||||
time_limit: 60 # seconds
|
time_limit: 60 # seconds
|
||||||
|
proxy: 127.0.0.1:8000 # point at warcprox for archiving
|
||||||
ignore_robots: false
|
ignore_robots: false
|
||||||
max_claimed_sites: 2
|
max_claimed_sites: 2
|
||||||
warcprox_meta:
|
warcprox_meta:
|
||||||
@ -35,7 +37,7 @@ an example
|
|||||||
scope:
|
scope:
|
||||||
surt: http://(org,example,
|
surt: http://(org,example,
|
||||||
|
|
||||||
how inheritance works
|
How inheritance works
|
||||||
=====================
|
=====================
|
||||||
|
|
||||||
Most of the available options apply to seeds. Such options can also be
|
Most of the available options apply to seeds. Such options can also be
|
||||||
@ -79,101 +81,140 @@ Notice that:
|
|||||||
- Since ``buckets`` is a list, the merged result includes all the values from
|
- Since ``buckets`` is a list, the merged result includes all the values from
|
||||||
both the top level and the seed level.
|
both the top level and the seed level.
|
||||||
|
|
||||||
settings reference
|
Settings reference
|
||||||
==================
|
==================
|
||||||
|
|
||||||
|
Top-level settings
|
||||||
|
------------------
|
||||||
|
|
||||||
``id``
|
``id``
|
||||||
------
|
~~~~~~
|
||||||
+-----------+--------+----------+--------------------------+
|
+--------+----------+--------------------------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+===========+========+==========+==========================+
|
+========+==========+==========================+
|
||||||
| top-level | string | no | *generated by rethinkdb* |
|
| string | no | *generated by rethinkdb* |
|
||||||
+-----------+--------+----------+--------------------------+
|
+--------+----------+--------------------------+
|
||||||
An arbitrary identifier for this job. Must be unique across this deployment of
|
An arbitrary identifier for this job. Must be unique across this deployment of
|
||||||
brozzler.
|
brozzler.
|
||||||
|
|
||||||
``seeds``
|
|
||||||
---------
|
|
||||||
+-----------+------------------------+----------+---------+
|
|
||||||
| scope | type | required | default |
|
|
||||||
+===========+========================+==========+=========+
|
|
||||||
| top-level | list (of dictionaries) | yes | *n/a* |
|
|
||||||
+-----------+------------------------+----------+---------+
|
|
||||||
List of seeds. Each item in the list is a dictionary (associative array) which
|
|
||||||
defines the seed. It must specify ``url`` (see below) and can additionally
|
|
||||||
specify any of the settings of scope *seed-level*.
|
|
||||||
|
|
||||||
``max_claimed_sites``
|
``max_claimed_sites``
|
||||||
---------------------
|
~~~~~~~~~~~~~~~~~~~~~
|
||||||
+-----------+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+===========+========+==========+=========+
|
+========+==========+=========+
|
||||||
| top-level | number | no | *none* |
|
| number | no | *none* |
|
||||||
+-----------+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
Puts a cap on the number of sites belonging to a given job that can be brozzled
|
Puts a cap on the number of sites belonging to a given job that can be brozzled
|
||||||
simultaneously across the cluster. Addresses the problem of a job with many
|
simultaneously across the cluster. Addresses the problem of a job with many
|
||||||
seeds starving out other jobs.
|
seeds starving out other jobs.
|
||||||
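As a minimal sketch of the cap described above, the job below has three seeds but allows only two of its sites to be brozzled at once. The job id and seed URLs are placeholders, not values from the brozzler documentation.

```yaml
# Hypothetical job: id and URLs are illustrative placeholders.
id: many-seeds-job
max_claimed_sites: 2          # at most two of this job's sites are brozzled at once
seeds:
  - url: https://example.org/
  - url: https://example.com/
  - url: https://example.net/
```

With this setting the third site waits until one of the other two is released, which leaves workers free for other jobs.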
|
|
||||||
|
``seeds``
|
||||||
|
~~~~~~~~~
|
||||||
|
+------------------------+----------+---------+
|
||||||
|
| type | required | default |
|
||||||
|
+========================+==========+=========+
|
||||||
|
| list (of dictionaries) | yes | *n/a* |
|
||||||
|
+------------------------+----------+---------+
|
||||||
|
List of seeds. Each item in the list is a dictionary (associative array) which
|
||||||
|
defines the seed. It must specify ``url`` (see below) and can additionally
|
||||||
|
specify any *seed* settings.
|
||||||
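A minimal sketch of a ``seeds`` list, assuming placeholder URLs and credentials. Each item is a dictionary whose only required key is ``url``; the second item also sets the seed-level-only ``username`` and ``password`` settings listed below.

```yaml
# Hypothetical job: id, URLs and credentials are illustrative placeholders.
id: seeds-example
seeds:
  - url: https://example.org/            # only the required key
  - url: https://example.com/members/
    username: archive-user               # seed-level-only setting
    password: change-me                  # seed-level-only setting
```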
|
|
||||||
|
Seed-level-only settings
|
||||||
|
------------------------
|
||||||
|
These settings can be specified only at the seed level, unlike most seed
|
||||||
|
settings, which can also be specified at the top level.
|
||||||
|
|
||||||
``url``
|
``url``
|
||||||
-------
|
~~~~~~~
|
||||||
+------------+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+============+========+==========+=========+
|
+========+==========+=========+
|
||||||
| seed-level | string | yes | *n/a* |
|
| string | yes | *n/a* |
|
||||||
+------------+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
The seed url.
|
The seed url.
|
||||||
|
|
||||||
|
``username``
|
||||||
|
~~~~~~~~~~~~
|
||||||
|
+--------+----------+---------+
|
||||||
|
| type | required | default |
|
||||||
|
+========+==========+=========+
|
||||||
|
| string | no | *none* |
|
||||||
|
+--------+----------+---------+
|
||||||
|
|
||||||
|
``password``
|
||||||
|
~~~~~~~~~~~~
|
||||||
|
+--------+----------+---------+
|
||||||
|
| type | required | default |
|
||||||
|
+========+==========+=========+
|
||||||
|
| string | no | *none* |
|
||||||
|
+--------+----------+---------+
|
||||||
|
|
||||||
|
Seed-level / top-level settings
|
||||||
|
-------------------------------
|
||||||
|
These are seed settings that can also be specified at the top level, in which
|
||||||
|
case they are inherited by all seeds.
|
||||||
|
|
||||||
``metadata``
|
``metadata``
|
||||||
------------
|
~~~~~~~~~~~~
|
||||||
+-----------------------+------------+----------+---------+
|
+------------+----------+---------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+=======================+============+==========+=========+
|
+============+==========+=========+
|
||||||
| seed-level, top-level | dictionary | no | *none* |
|
| dictionary | no | *none* |
|
||||||
+-----------------------+------------+----------+---------+
|
+------------+----------+---------+
|
||||||
Arbitrary information about the crawl job or site. Merely informative, not used
|
Arbitrary information about the crawl job or site. Merely informative, not used
|
||||||
by brozzler for anything. Could be of use to some external process.
|
by brozzler for anything. Could be of use to some external process.
|
||||||
|
|
||||||
``time_limit``
|
``time_limit``
|
||||||
--------------
|
~~~~~~~~~~~~~~
|
||||||
+-----------------------+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+=======================+========+==========+=========+
|
+========+==========+=========+
|
||||||
| seed-level, top-level | number | no | *none* |
|
| number | no | *none* |
|
||||||
+-----------------------+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
Time limit in seconds. If not specified, there is no time limit. Time limit is
|
Time limit in seconds. If not specified, there is no time limit. Time limit is
|
||||||
enforced at the seed level. If a time limit is specified at the top level, it
|
enforced at the seed level. If a time limit is specified at the top level, it
|
||||||
is inherited by each seed as described above, and enforced individually on each
|
is inherited by each seed as described above, and enforced individually on each
|
||||||
seed.
|
seed.
|
||||||
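The per-seed enforcement can be sketched as follows. The URLs and values are placeholders, and the example assumes that a value set on a seed takes precedence over the inherited top-level value.

```yaml
# Hypothetical job: URLs and limits are illustrative placeholders.
time_limit: 3600                 # inherited by every seed that sets no value of its own
seeds:
  - url: https://example.org/    # gets its own 3600-second limit
  - url: https://example.com/
    time_limit: 600              # seed-level value, assumed to override the top level
```

Because the limit is enforced on each seed individually, the job as a whole is not capped at 3600 seconds; each seed is timed separately.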
|
|
||||||
|
``proxy``
|
||||||
|
~~~~~~~~~
|
||||||
|
+--------+----------+---------+
|
||||||
|
| type | required | default |
|
||||||
|
+========+==========+=========+
|
||||||
|
| string | no | *none* |
|
||||||
|
+--------+----------+---------+
|
||||||
|
HTTP proxy, with the format ``host:port``. Typically configured to point to
|
||||||
|
warcprox for archival crawling.
|
||||||
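A minimal sketch of an archival setup, assuming a warcprox instance is listening on the placeholder address ``127.0.0.1:8000`` (the same address used in the example near the top of this document). The ``warc-prefix`` value is illustrative.

```yaml
# Hypothetical job: the address, prefix and URL are illustrative placeholders.
proxy: 127.0.0.1:8000            # host:port, inherited by all seeds
warcprox_meta:
  warc-prefix: job1              # passed to warcprox in the Warcprox-Meta header
seeds:
  - url: https://example.org/
```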
|
|
||||||
``ignore_robots``
|
``ignore_robots``
|
||||||
-----------------
|
~~~~~~~~~~~~~~~~~
|
||||||
+-----------------------+---------+----------+-----------+
|
+---------+----------+-----------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+=======================+=========+==========+===========+
|
+=========+==========+===========+
|
||||||
| seed-level, top-level | boolean | no | ``false`` |
|
| boolean | no | ``false`` |
|
||||||
+-----------------------+---------+----------+-----------+
|
+---------+----------+-----------+
|
||||||
If set to ``true``, brozzler will happily crawl pages that would otherwise be
|
If set to ``true``, brozzler will happily crawl pages that would otherwise be
|
||||||
blocked by robots.txt rules.
|
blocked by robots.txt rules.
|
||||||
|
|
||||||
``user_agent``
|
``user_agent``
|
||||||
--------------
|
~~~~~~~~~~~~~~
|
||||||
+-----------------------+---------+----------+---------+
|
+---------+----------+---------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+=======================+=========+==========+=========+
|
+=========+==========+=========+
|
||||||
| seed-level, top-level | string | no | *none* |
|
| string | no | *none* |
|
||||||
+-----------------------+---------+----------+---------+
|
+---------+----------+---------+
|
||||||
The ``User-Agent`` header brozzler will send to identify itself to web servers.
|
The ``User-Agent`` header brozzler will send to identify itself to web servers.
|
||||||
It's good etiquette to include a project URL with a notice to webmasters that
|
It's good etiquette to include a project URL with a notice to webmasters that
|
||||||
explains why you're crawling, how to block the crawler using robots.txt and how to
|
explains why you're crawling, how to block the crawler using robots.txt and how to
|
||||||
contact the operator if the crawl is causing problems.
|
contact the operator if the crawl is causing problems.
|
||||||
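One way to follow that advice is sketched below. The bot name and the information URL are hypothetical; substitute a page that you actually maintain.

```yaml
# Hypothetical values: the bot name and URL are illustrative placeholders.
user_agent: "Mozilla/5.0 (compatible; examplebot/1.0; +https://example.org/crawler-info)"
seeds:
  - url: https://example.org/
```

The linked page is where the notice to webmasters would live: the reason for the crawl, robots.txt blocking instructions and operator contact details.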
|
|
||||||
``warcprox_meta``
|
``warcprox_meta``
|
||||||
-----------------
|
~~~~~~~~~~~~~~~~~
|
||||||
+-----------------------+------------+----------+-----------+
|
+------------+----------+-----------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+=======================+============+==========+===========+
|
+============+==========+===========+
|
||||||
| seed-level, top-level | dictionary | no | ``false`` |
|
| dictionary | no | ``false`` |
|
||||||
+-----------------------+------------+----------+-----------+
|
+------------+----------+-----------+
|
||||||
Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
|
Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
|
||||||
configured. The value of the Warcprox-Meta header is a json blob. It is used to
|
configured. The value of the Warcprox-Meta header is a json blob. It is used to
|
||||||
pass settings and information to warcprox. Warcprox does not forward the header
|
pass settings and information to warcprox. Warcprox does not forward the header
|
||||||
@ -195,36 +236,95 @@ becomes::
|
|||||||
Warcprox-Meta: {"warc-prefix":"job1-seed1","stats":{"buckets":["job1-stats","job1-seed1-stats"]}}
|
Warcprox-Meta: {"warc-prefix":"job1-seed1","stats":{"buckets":["job1-stats","job1-seed1-stats"]}}
|
||||||
|
|
||||||
``scope``
|
``scope``
|
||||||
---------
|
~~~~~~~~~
|
||||||
+-----------------------+------------+----------+-----------+
|
+------------+----------+-----------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+=======================+============+==========+===========+
|
+============+==========+===========+
|
||||||
| seed-level, top-level | dictionary | no | ``false`` |
|
| dictionary | no | ``false`` |
|
||||||
+-----------------------+------------+----------+-----------+
|
+------------+----------+-----------+
|
||||||
Scope rules. *TODO*
|
Scope rules. *TODO*
|
||||||
|
|
||||||
|
Scoping
|
||||||
|
=======
|
||||||
|
|
||||||
|
*TODO* explanation of scoping and scope rules
|
||||||
|
|
||||||
|
Scope settings
|
||||||
|
--------------
|
||||||
|
|
||||||
``surt``
|
``surt``
|
||||||
--------
|
~~~~~~~~
|
||||||
+-------------+--------+----------+---------------------------+
|
+--------+----------+---------------------------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+=============+========+==========+===========================+
|
+========+==========+===========================+
|
||||||
| scope-level | string | no | *generated from seed url* |
|
| string | no | *generated from seed url* |
|
||||||
+-------------+--------+----------+---------------------------+
|
+--------+----------+---------------------------+
|
||||||
|
|
||||||
``accepts``
|
``accepts``
|
||||||
-----------
|
~~~~~~~~~~~
|
||||||
+-------------+------+----------+---------+
|
+------+----------+---------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+=============+======+==========+=========+
|
+======+==========+=========+
|
||||||
| scope-level | list | no | *none* |
|
| list | no | *none* |
|
||||||
+-------------+------+----------+---------+
|
+------+----------+---------+
|
||||||
|
|
||||||
``blocks``
|
``blocks``
|
||||||
-----------
|
~~~~~~~~~~~
|
||||||
+-------------+------+----------+---------+
|
+------+----------+---------+
|
||||||
| scope | type | required | default |
|
| type | required | default |
|
||||||
+=============+======+==========+=========+
|
+======+==========+=========+
|
||||||
| scope-level | list | no | *none* |
|
| list | no | *none* |
|
||||||
+-------------+------+----------+---------+
|
+------+----------+---------+
|
||||||
|
|
||||||
|
|
||||||
|
Scope rule settings
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
``domain``
|
||||||
|
~~~~~~~~~~
|
||||||
|
+--------+----------+---------+
|
||||||
|
| type | required | default |
|
||||||
|
+========+==========+=========+
|
||||||
|
| string | no | *none* |
|
||||||
|
+--------+----------+---------+
|
||||||
|
|
||||||
|
``substring``
|
||||||
|
~~~~~~~~~~~~~
|
||||||
|
+--------+----------+---------+
|
||||||
|
| type | required | default |
|
||||||
|
+========+==========+=========+
|
||||||
|
| string | no | *none* |
|
||||||
|
+--------+----------+---------+
|
||||||
|
|
||||||
|
``regex``
|
||||||
|
~~~~~~~~~
|
||||||
|
+--------+----------+---------+
|
||||||
|
| type | required | default |
|
||||||
|
+========+==========+=========+
|
||||||
|
| string | no | *none* |
|
||||||
|
+--------+----------+---------+
|
||||||
|
|
||||||
|
``ssurt``
|
||||||
|
~~~~~~~~~
|
||||||
|
+--------+----------+---------+
|
||||||
|
| type | required | default |
|
||||||
|
+========+==========+=========+
|
||||||
|
| string | no | *none* |
|
||||||
|
+--------+----------+---------+
|
||||||
|
|
||||||
|
``surt``
|
||||||
|
~~~~~~~~
|
||||||
|
+--------+----------+---------+
|
||||||
|
| type | required | default |
|
||||||
|
+========+==========+=========+
|
||||||
|
| string | no | *none* |
|
||||||
|
+--------+----------+---------+
|
||||||
|
|
||||||
|
``parent_url_regex``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~
|
||||||
|
+--------+----------+---------+
|
||||||
|
| type | required | default |
|
||||||
|
+========+==========+=========+
|
||||||
|
| string | no | *none* |
|
||||||
|
+--------+----------+---------+
|
||||||
|
|
||||||
|