diff --git a/job-conf.rst b/job-conf.rst
index 1174f1a..670b476 100644
--- a/job-conf.rst
+++ b/job-conf.rst
@@ -1,17 +1,19 @@
-brozzler job configuration
+Brozzler Job Configuration
 **************************
 
-Jobs are defined using yaml files. Options may be specified either at the
-top-level or on individual seeds. At least one seed url must be specified,
+Jobs are defined using yaml files. At least one seed url must be specified,
 everything else is optional.
 
-an example
-==========
+.. contents::
+
+Example
+=======
 
 ::
 
     id: myjob
     time_limit: 60 # seconds
+    proxy: 127.0.0.1:8000 # point at warcprox for archiving
     ignore_robots: false
     max_claimed_sites: 2
     warcprox_meta:
@@ -35,7 +37,7 @@ an example
       scope:
         surt: http://(org,example,
 
-how inheritance works
+How inheritance works
 =====================
 
 Most of the available options apply to seeds. Such options can also be
@@ -79,101 +81,140 @@ Notice that:
 - Since ``buckets`` is a list, the merged result includes all the values from
   both the top level and the seed level.
 
-settings reference
+Settings reference
 ==================
 
+Top-level settings
+------------------
+
 ``id``
-------
-+-----------+--------+----------+--------------------------+
-| scope     | type   | required | default                  |
-+===========+========+==========+==========================+
-| top-level | string | no       | *generated by rethinkdb* |
-+-----------+--------+----------+--------------------------+
+~~~~~~
++--------+----------+--------------------------+
+| type   | required | default                  |
++========+==========+==========================+
+| string | no       | *generated by rethinkdb* |
++--------+----------+--------------------------+
 An arbitrary identifier for this job. Must be unique across this deployment of
 brozzler.
 
-``seeds``
----------
-+-----------+------------------------+----------+---------+
-| scope     | type                   | required | default |
-+===========+========================+==========+=========+
-| top-level | list (of dictionaries) | yes      | *n/a*   |
-+-----------+------------------------+----------+---------+
-List of seeds. Each item in the list is a dictionary (associative array) which
-defines the seed. It must specify ``url`` (see below) and can additionally
-specify any of the settings of scope *seed-level*.
-
 ``max_claimed_sites``
----------------------
-+-----------+--------+----------+---------+
-| scope     | type   | required | default |
-+===========+========+==========+=========+
-| top-level | number | no       | *none*  |
-+-----------+--------+----------+---------+
+~~~~~~~~~~~~~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| number | no       | *none*  |
++--------+----------+---------+
 Puts a cap on the number of sites belonging to a given job that can be brozzled
 simultaneously across the cluster. Addresses the problem of a job with many
 seeds starving out other jobs.
 
+``seeds``
+~~~~~~~~~
++------------------------+----------+---------+
+| type                   | required | default |
++========================+==========+=========+
+| list (of dictionaries) | yes      | *n/a*   |
++------------------------+----------+---------+
+List of seeds. Each item in the list is a dictionary (associative array) which
+defines the seed. It must specify ``url`` (see below) and can additionally
+specify any *seed* settings.
+
+Seed-level-only settings
+------------------------
+These settings can be specified only at the seed level, unlike most seed
+settings, which can also be specified at the top level.
+
 ``url``
--------
-+------------+--------+----------+---------+
-| scope      | type   | required | default |
-+============+========+==========+=========+
-| seed-level | string | yes      | *n/a*   |
-+------------+--------+----------+---------+
+~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| string | yes      | *n/a*   |
++--------+----------+---------+
 The seed url.
 
+``username``
+~~~~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| string | no       | *none*  |
++--------+----------+---------+
+
+``password``
+~~~~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| string | no       | *none*  |
++--------+----------+---------+
+
+Seed-level / top-level settings
+-------------------------------
+These are seed settings that can also be specified at the top level, in which
+case they are inherited by all seeds.
+
 ``metadata``
-------------
-+-----------------------+------------+----------+---------+
-| scope                 | type       | required | default |
-+=======================+============+==========+=========+
-| seed-level, top-level | dictionary | no       | *none*  |
-+-----------------------+------------+----------+---------+
+~~~~~~~~~~~~
++------------+----------+---------+
+| type       | required | default |
++============+==========+=========+
+| dictionary | no       | *none*  |
++------------+----------+---------+
 Arbitrary information about the crawl job or site. Merely informative, not used
 by brozzler for anything. Could be of use to some external process.
 
 ``time_limit``
---------------
-+-----------------------+--------+----------+---------+
-| scope                 | type   | required | default |
-+=======================+========+==========+=========+
-| seed-level, top-level | number | no       | *none*  |
-+-----------------------+--------+----------+---------+
+~~~~~~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| number | no       | *none*  |
++--------+----------+---------+
 Time limit in seconds. If not specified, there no time limit. Time limit is
 enforced at the seed level. If a time limit is specified at the top level, it
 is inherited by each seed as described above, and enforced individually on each
 seed.
 
+``proxy``
+~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| string | no       | *none*  |
++--------+----------+---------+
+HTTP proxy, with the format ``host:port``. Typically configured to point to
+warcprox for archival crawling.
+
 ``ignore_robots``
------------------
-+-----------------------+---------+----------+-----------+
-| scope                 | type    | required | default   |
-+=======================+=========+==========+===========+
-| seed-level, top-level | boolean | no       | ``false`` |
-+-----------------------+---------+----------+-----------+
+~~~~~~~~~~~~~~~~~
++---------+----------+-----------+
+| type    | required | default   |
++=========+==========+===========+
+| boolean | no       | ``false`` |
++---------+----------+-----------+
 If set to ``true``, brozzler will happily crawl pages that would otherwise be
 blocked by robots.txt rules.
 
 ``user_agent``
---------------
-+-----------------------+---------+----------+---------+
-| scope                 | type    | required | default |
-+=======================+=========+==========+=========+
-| seed-level, top-level | string  | no       | *none*  |
-+-----------------------+---------+----------+---------+
+~~~~~~~~~~~~~~
++---------+----------+---------+
+| type    | required | default |
++=========+==========+=========+
+| string  | no       | *none*  |
++---------+----------+---------+
 The ``User-Agent`` header brozzler will send to identify itself to web servers.
 It's good ettiquette to include a project URL with a notice to webmasters that
 explains why you're crawling, how to block the crawler robots.txt and how to
 contact the operator if the crawl is causing problems.
 
 ``warcprox_meta``
------------------
-+-----------------------+------------+----------+-----------+
-| scope                 | type       | required | default   |
-+=======================+============+==========+===========+
-| seed-level, top-level | dictionary | no       | ``false`` |
-+-----------------------+------------+----------+-----------+
+~~~~~~~~~~~~~~~~~
++------------+----------+-----------+
+| type       | required | default   |
++============+==========+===========+
+| dictionary | no       | ``false`` |
++------------+----------+-----------+
 Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
 configured. The value of the Warcprox-Meta header is a json blob. It is used to
 pass settings and information to warcprox. Warcprox does not forward the header
@@ -195,36 +236,95 @@ becomes::
     Warcprox-Meta: {"warc-prefix":"job1-seed1","stats":{"buckets":["job1-stats","job1-seed1-stats"]}}
 
 ``scope``
----------
-+-----------------------+------------+----------+-----------+
-| scope                 | type       | required | default   |
-+=======================+============+==========+===========+
-| seed-level, top-level | dictionary | no       | ``false`` |
-+-----------------------+------------+----------+-----------+
+~~~~~~~~~
++------------+----------+-----------+
+| type       | required | default   |
++============+==========+===========+
+| dictionary | no       | ``false`` |
++------------+----------+-----------+
 Scope rules. *TODO*
 
+Scoping
+=======
+
+*TODO* explanation of scoping and scope rules
+
+Scope settings
+--------------
+
 ``surt``
---------
-+-------------+--------+----------+---------------------------+
-| scope       | type   | required | default                   |
-+=============+========+==========+===========================+
-| scope-level | string | no       | *generated from seed url* |
-+-------------+--------+----------+---------------------------+
+~~~~~~~~
++--------+----------+---------------------------+
+| type   | required | default                   |
++========+==========+===========================+
+| string | no       | *generated from seed url* |
++--------+----------+---------------------------+
 
 ``accepts``
------------
-+-------------+------+----------+---------+
-| scope       | type | required | default |
-+=============+======+==========+=========+
-| scope-level | list | no       | *none*  |
-+-------------+------+----------+---------+
+~~~~~~~~~~~
++------+----------+---------+
+| type | required | default |
++======+==========+=========+
+| list | no       | *none*  |
++------+----------+---------+
 
 ``blocks``
------------
-+-------------+------+----------+---------+
-| scope       | type | required | default |
-+=============+======+==========+=========+
-| scope-level | list | no       | *none*  |
-+-------------+------+----------+---------+
+~~~~~~~~~~~
++------+----------+---------+
+| type | required | default |
++======+==========+=========+
+| list | no       | *none*  |
++------+----------+---------+
 
+Scope rule settings
+-------------------
+
+``domain``
+~~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| string | no       | *none*  |
++--------+----------+---------+
+
+``substring``
+~~~~~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| string | no       | *none*  |
++--------+----------+---------+
+
+``regex``
+~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| string | no       | *none*  |
++--------+----------+---------+
+
+``ssurt``
+~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| string | no       | *none*  |
++--------+----------+---------+
+
+``surt``
+~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| string | no       | *none*  |
++--------+----------+---------+
+
+``parent_url_regex``
+~~~~~~~~~~~~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| string | no       | *none*  |
++--------+----------+---------+
+
 
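
Note on the inheritance rule referenced in this patch's context lines (list values such as ``buckets`` combine the top-level and seed-level entries): the behaviour can be illustrated with a short Python sketch. This is not brozzler's implementation. ``merge_settings`` is a hypothetical helper, and both the recursive merging of dictionaries and the seed-level override of scalar values are assumptions inferred from the ``Warcprox-Meta`` example quoted in the diff. The input values are invented for illustration.

```python
import json


def merge_settings(top, seed):
    """Hypothetical helper: combine a top-level value with a seed-level one.

    Assumed rules (not confirmed against brozzler's code):
    - dictionaries are merged key by key, recursively
    - lists are concatenated, top-level entries first
    - any other value at the seed level replaces the top-level value
    """
    if isinstance(top, dict) and isinstance(seed, dict):
        merged = dict(top)
        for key, value in seed.items():
            merged[key] = merge_settings(top[key], value) if key in top else value
        return merged
    if isinstance(top, list) and isinstance(seed, list):
        return top + seed
    return seed


# Invented inputs, chosen to reproduce the merged header quoted in the diff.
top_level = {"warc-prefix": "job1", "stats": {"buckets": ["job1-stats"]}}
seed_level = {"warc-prefix": "job1-seed1", "stats": {"buckets": ["job1-seed1-stats"]}}

merged = merge_settings(top_level, seed_level)
print("Warcprox-Meta: " + json.dumps(merged, separators=(",", ":")))
```

Under these assumptions, a scalar such as ``time_limit`` set on a seed would simply replace the inherited top-level value, while list settings accumulate.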