WIP documentation!

Noah Levitt 2018-03-19 14:14:37 -07:00
parent a1af18230c
commit 914289b414

@@ -1,17 +1,19 @@
brozzler job configuration
Brozzler Job Configuration
**************************
Jobs are defined using yaml files. Options may be specified either at the
top-level or on individual seeds. At least one seed url must be specified,
Jobs are defined using YAML files. At least one seed URL must be specified;
everything else is optional.
an example
==========
.. contents::
Example
=======
::

    id: myjob
    time_limit: 60 # seconds
    proxy: 127.0.0.1:8000 # point at warcprox for archiving
    ignore_robots: false
    max_claimed_sites: 2
    warcprox_meta:
@@ -35,7 +37,7 @@ an example
scope:
surt: http://(org,example,
how inheritance works
How inheritance works
=====================
Most of the available options apply to seeds. Such options can also be
@@ -79,101 +81,140 @@ Notice that:
- Since ``buckets`` is a list, the merged result includes all the values from
both the top level and the seed level.
settings reference
Settings reference
==================
Top-level settings
------------------
``id``
------
+-----------+--------+----------+--------------------------+
| scope     | type   | required | default                  |
+===========+========+==========+==========================+
| top-level | string | no       | *generated by rethinkdb* |
+-----------+--------+----------+--------------------------+
~~~~~~
+--------+----------+--------------------------+
| type   | required | default                  |
+========+==========+==========================+
| string | no       | *generated by rethinkdb* |
+--------+----------+--------------------------+
An arbitrary identifier for this job. Must be unique across this deployment of
brozzler.
``seeds``
---------
+-----------+------------------------+----------+---------+
| scope     | type                   | required | default |
+===========+========================+==========+=========+
| top-level | list (of dictionaries) | yes      | *n/a*   |
+-----------+------------------------+----------+---------+
List of seeds. Each item in the list is a dictionary (associative array) which
defines the seed. It must specify ``url`` (see below) and can additionally
specify any of the settings of scope *seed-level*.
``max_claimed_sites``
---------------------
+-----------+--------+----------+---------+
| scope     | type   | required | default |
+===========+========+==========+=========+
| top-level | number | no       | *none*  |
+-----------+--------+----------+---------+
~~~~~~~~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| number | no       | *none*  |
+--------+----------+---------+
Puts a cap on the number of sites belonging to a given job that can be brozzled
simultaneously across the cluster. Addresses the problem of a job with many
seeds starving out other jobs.
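
A minimal sketch of how the cap might be used, assuming a hypothetical job with
three seeds (the job id and URLs are placeholders, not taken from a real
deployment). With this configuration, no more than two of the job's three sites
would be brozzled at the same time::

    id: many-seeds-job          # hypothetical job id
    max_claimed_sites: 2        # at most two sites claimed at once
    seeds:
    - url: http://one.example.org/    # placeholder URLs
    - url: http://two.example.org/
    - url: http://three.example.org/
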
``seeds``
~~~~~~~~~
+------------------------+----------+---------+
| type                   | required | default |
+========================+==========+=========+
| list (of dictionaries) | yes      | *n/a*   |
+------------------------+----------+---------+
List of seeds. Each item in the list is a dictionary (associative array) which
defines the seed. It must specify ``url`` (see below) and can additionally
specify any *seed* settings.
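
As an illustration only, a ``seeds`` list might look like the sketch below. The
URLs are placeholders; the second seed shows seed settings (``time_limit`` and
``ignore_robots``) given alongside the required ``url``::

    seeds:
    - url: http://one.example.org/    # placeholder URL, only the required key
    - url: http://two.example.org/    # placeholder URL with extra seed settings
      time_limit: 30
      ignore_robots: true
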
Seed-level-only settings
------------------------
These settings can be specified only at the seed level, unlike most seed
settings, which can also be specified at the top level.
``url``
-------
+------------+--------+----------+---------+
| scope      | type   | required | default |
+============+========+==========+=========+
| seed-level | string | yes      | *n/a*   |
+------------+--------+----------+---------+
~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | yes      | *n/a*   |
+--------+----------+---------+
The seed url.
``username``
~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
``password``
~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
Seed-level / top-level settings
-------------------------------
These are seed settings that can also be specified at the top level, in which
case they are inherited by all seeds.
``metadata``
------------
+-----------------------+------------+----------+---------+
| scope                 | type       | required | default |
+=======================+============+==========+=========+
| seed-level, top-level | dictionary | no       | *none*  |
+-----------------------+------------+----------+---------+
~~~~~~~~~~~~
+------------+----------+---------+
| type       | required | default |
+============+==========+=========+
| dictionary | no       | *none*  |
+------------+----------+---------+
Arbitrary information about the crawl job or site. Merely informative, not used
by brozzler for anything. Could be of use to some external process.
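
For example, an external process might look for keys such as the ones in this
sketch. The key names and values here are invented for illustration; brozzler
does not define or interpret them::

    metadata:
      requested_by: web-archiving-team    # hypothetical key
      ticket: 1234                        # hypothetical key
    seeds:
    - url: http://example.org/            # placeholder URL
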
``time_limit``
--------------
+-----------------------+--------+----------+---------+
| scope                 | type   | required | default |
+=======================+========+==========+=========+
| seed-level, top-level | number | no       | *none*  |
+-----------------------+--------+----------+---------+
~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| number | no       | *none*  |
+--------+----------+---------+
Time limit in seconds. If not specified, there is no time limit. Time limit is
enforced at the seed level. If a time limit is specified at the top level, it
is inherited by each seed as described above, and enforced individually on each
seed.
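
A small sketch of the behavior described here, using placeholder URLs. It
assumes, as the inheritance section explains, that a value set on a seed takes
precedence over the inherited top-level value::

    time_limit: 60                    # inherited by every seed
    seeds:
    - url: http://one.example.org/    # limited to 60 seconds on its own
    - url: http://two.example.org/    # seed-level value applies instead
      time_limit: 30

The limit is not a shared budget for the job: each seed is timed separately.
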
``proxy``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
HTTP proxy, with the format ``host:port``. Typically configured to point to
warcprox for archival crawling.
``ignore_robots``
-----------------
+-----------------------+---------+----------+-----------+
| scope                 | type    | required | default   |
+=======================+=========+==========+===========+
| seed-level, top-level | boolean | no       | ``false`` |
+-----------------------+---------+----------+-----------+
~~~~~~~~~~~~~~~~~
+---------+----------+-----------+
| type    | required | default   |
+=========+==========+===========+
| boolean | no       | ``false`` |
+---------+----------+-----------+
If set to ``true``, brozzler will happily crawl pages that would otherwise be
blocked by robots.txt rules.
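
As a sketch with placeholder URLs, the setting can be left at its default for
most seeds and enabled for a single seed, assuming the seed-level value takes
precedence as described in the inheritance section::

    ignore_robots: false              # same as the default, shown for clarity
    seeds:
    - url: http://one.example.org/    # robots.txt rules are respected
    - url: http://two.example.org/    # robots.txt rules are ignored
      ignore_robots: true
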
``user_agent``
--------------
+-----------------------+---------+----------+---------+
| scope                 | type    | required | default |
+=======================+=========+==========+=========+
| seed-level, top-level | string  | no       | *none*  |
+-----------------------+---------+----------+---------+
~~~~~~~~~~~~~~
+---------+----------+---------+
| type    | required | default |
+=========+==========+=========+
| string  | no       | *none*  |
+---------+----------+---------+
The ``User-Agent`` header brozzler will send to identify itself to web servers.
It's good etiquette to include a project URL with a notice to webmasters that
explains why you're crawling, how to block the crawler using robots.txt and how to
contact the operator if the crawl is causing problems.
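
One possible shape for such a value is sketched below. The bot name and the
project URL are made up for illustration; substitute a page that you actually
maintain::

    # hypothetical bot name and project URL
    user_agent: "Mozilla/5.0 (compatible; examplebot/1.0; +http://example.org/crawler-info)"
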
``warcprox_meta``
-----------------
+-----------------------+------------+----------+-----------+
| scope                 | type       | required | default   |
+=======================+============+==========+===========+
| seed-level, top-level | dictionary | no       | ``false`` |
+-----------------------+------------+----------+-----------+
~~~~~~~~~~~~~~~~~
+------------+----------+-----------+
| type       | required | default   |
+============+==========+===========+
| dictionary | no       | ``false`` |
+------------+----------+-----------+
Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
configured. The value of the Warcprox-Meta header is a json blob. It is used to
pass settings and information to warcprox. Warcprox does not forward the header
@@ -195,36 +236,95 @@ becomes::
Warcprox-Meta: {"warc-prefix":"job1-seed1","stats":{"buckets":["job1-stats","job1-seed1-stats"]}}
``scope``
---------
+-----------------------+------------+----------+-----------+
| scope                 | type       | required | default   |
+=======================+============+==========+===========+
| seed-level, top-level | dictionary | no       | ``false`` |
+-----------------------+------------+----------+-----------+
~~~~~~~~~
+------------+----------+-----------+
| type       | required | default   |
+============+==========+===========+
| dictionary | no       | ``false`` |
+------------+----------+-----------+
Scope rules. *TODO*
Scoping
=======
*TODO* explanation of scoping and scope rules
Scope settings
--------------
``surt``
--------
+-------------+--------+----------+---------------------------+
| scope       | type   | required | default                   |
+=============+========+==========+===========================+
| scope-level | string | no       | *generated from seed url* |
+-------------+--------+----------+---------------------------+
~~~~~~~~
+--------+----------+---------------------------+
| type   | required | default                   |
+========+==========+===========================+
| string | no       | *generated from seed url* |
+--------+----------+---------------------------+
``accepts``
-----------
+-------------+------+----------+---------+
| scope       | type | required | default |
+=============+======+==========+=========+
| scope-level | list | no       | *none*  |
+-------------+------+----------+---------+
~~~~~~~~~~~
+------+----------+---------+
| type | required | default |
+======+==========+=========+
| list | no       | *none*  |
+------+----------+---------+
``blocks``
-----------
+-------------+------+----------+---------+
| scope       | type | required | default |
+=============+======+==========+=========+
| scope-level | list | no       | *none*  |
+-------------+------+----------+---------+
~~~~~~~~~~~
+------+----------+---------+
| type | required | default |
+======+==========+=========+
| list | no       | *none*  |
+------+----------+---------+
Scope rule settings
-------------------
``domain``
~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
``substring``
~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
``regex``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
``ssurt``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
``surt``
~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
``parent_url_regex``
~~~~~~~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+