
Brozzler Job Configuration
**************************

Jobs are defined using YAML files. At least one seed url must be specified;
everything else is optional.

.. contents::

Example
=======

::

    id: myjob
    time_limit: 60 # seconds
    proxy: 127.0.0.1:8000 # point at warcprox for archiving
    ignore_robots: false
    max_claimed_sites: 2
    warcprox_meta:
      warc-prefix: job1
      stats:
        buckets:
        - job1-stats
    metadata: {}
    seeds:
    - url: http://one.example.org/
      warcprox_meta:
        warc-prefix: job1-seed1
        stats:
          buckets:
          - job1-seed1-stats
    - url: http://two.example.org/
      time_limit: 30
    - url: http://three.example.org/
      time_limit: 10
      ignore_robots: true
      scope:
        surt: http://(org,example,

How inheritance works
=====================

Most of the settings that apply to seeds can also be specified at the top
level, in which case all seeds inherit those settings. If an option is
specified both at the top level and at the level of an individual seed, the
results are merged with the seed-level value taking precedence in case of
conflicts. It's probably easiest to make sense of this by way of an example.

In the example yaml above, ``warcprox_meta`` is specified at the top level and
at the seed level for the seed http://one.example.org/. At the top level we
have::

  warcprox_meta:
    warc-prefix: job1
    stats:
      buckets:
      - job1-stats

At the seed level we have::

    warcprox_meta:
      warc-prefix: job1-seed1
      stats:
        buckets:
        - job1-seed1-stats

The merged configuration as applied to the seed http://one.example.org/ will
be::

    warcprox_meta:
      warc-prefix: job1-seed1
      stats:
        buckets:
        - job1-stats
        - job1-seed1-stats

Notice that:

- There is a collision on ``warc-prefix`` and the seed-level value wins.
- Since ``buckets`` is a list, the merged result includes all the values from
  both the top level and the seed level.
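
Scalar settings follow the same precedence rule. As a sketch derived from the
example job above, and assuming the merge behaves as described for every
inheritable setting, the seed http://two.example.org/ overrides only
``time_limit`` and would end up with::

    url: http://two.example.org/
    time_limit: 30 # seed-level value replaces the top-level 60
    proxy: 127.0.0.1:8000 # inherited from the top level
    ignore_robots: false # inherited from the top level
    metadata: {} # inherited from the top level
    warcprox_meta: # inherited unchanged, the seed defines none of its own
      warc-prefix: job1
      stats:
        buckets:
        - job1-stats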

Settings reference
==================

Top-level settings
------------------

``id``
~~~~~~
+--------+----------+--------------------------+
| type   | required | default                  |
+========+==========+==========================+
| string | no       | *generated by rethinkdb* |
+--------+----------+--------------------------+
An arbitrary identifier for this job. Must be unique across this deployment of
brozzler.

``max_claimed_sites``
~~~~~~~~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| number | no       | *none*  |
+--------+----------+---------+
Puts a cap on the number of sites belonging to a given job that can be brozzled
simultaneously across the cluster. Addresses the problem of a job with many
seeds starving out other jobs.
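
As an illustration, the sketch below defines five seeds but allows at most two
of the job's sites to be brozzled at the same time. The job id and seed urls
are made up for the example::

    id: big-job
    max_claimed_sites: 2 # at most two of the five sites are worked on at once
    seeds:
    - url: http://a.example.org/
    - url: http://b.example.org/
    - url: http://c.example.org/
    - url: http://d.example.org/
    - url: http://e.example.org/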

``seeds``
~~~~~~~~~
+------------------------+----------+---------+
| type                   | required | default |
+========================+==========+=========+
| list (of dictionaries) | yes      | *n/a*   |
+------------------------+----------+---------+
List of seeds. Each item in the list is a dictionary (associative array) which
defines the seed. It must specify ``url`` (see below) and can additionally
specify any seed settings.

Seed-level-only settings
------------------------
These settings can be specified only at the seed level, unlike most seed
settings, which can also be specified at the top level.

``url``
~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | yes      | *n/a*   |
+--------+----------+---------+
The seed url. Crawling starts here.

``username``
~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
If set, used to populate automatically detected login forms. See the
explanation under ``password`` below.

``password``
~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
If set, used to populate automatically detected login forms. If ``username``
and ``password`` are configured for a seed, brozzler will look for a login form
on each page it crawls for that seed. A form that has a single text or email
field (the username), a single password field (``<input type="password">``),
and has ``method="POST"`` is considered to be a login form. The form may have
other fields like checkboxes and hidden fields. For these, brozzler will leave
the default values in place. Login form detection and submission happen after
page load, then brozzling proceeds as usual.
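
A minimal sketch of a seed with login credentials might look like the
following. The url and the credential values are placeholders, not defaults.
Because these are seed-level-only settings, they sit inside the seed's
dictionary rather than at the top level::

    seeds:
    - url: http://login.example.org/
      username: archive-user # placeholder value
      password: not-a-real-password # placeholder value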

Seed-level / top-level settings
-------------------------------
These are seed settings that can also be specified at the top level, in which
case they are inherited by all seeds.

``metadata``
~~~~~~~~~~~~
+------------+----------+---------+
| type       | required | default |
+============+==========+=========+
| dictionary | no       | *none*  |
+------------+----------+---------+
Arbitrary information about the crawl job or site. Merely informative, not used
by brozzler for anything. Could be of use to some external process.

``time_limit``
~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| number | no       | *none*  |
+--------+----------+---------+
Time limit in seconds. If not specified, there is no time limit. Time limit is
enforced at the seed level. If a time limit is specified at the top level, it
is inherited by each seed as described above, and enforced individually on each
seed.

``proxy``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+
HTTP proxy, with the format ``host:port``. Typically configured to point to
warcprox for archival crawling.

``ignore_robots``
~~~~~~~~~~~~~~~~~
+---------+----------+-----------+
| type    | required | default   |
+=========+==========+===========+
| boolean | no       | ``false`` |
+---------+----------+-----------+
If set to ``true``, brozzler will happily crawl pages that would otherwise be
blocked by robots.txt rules.

``user_agent``
~~~~~~~~~~~~~~
+---------+----------+---------+
| type    | required | default |
+=========+==========+=========+
| string  | no       | *none*  |
+---------+----------+---------+
The ``User-Agent`` header brozzler will send to identify itself to web servers.
It's good etiquette to include a project URL with a notice to webmasters that
explains why you're crawling, how to block the crawler using robots.txt, and how to
contact the operator if the crawl is causing problems.
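
For instance, a value along these lines would cover those points. The crawler
name, the project URL and the contact address are made-up placeholders, and the
overall shape of the string is only a common convention, not something brozzler
requires::

    user_agent: "Mozilla/5.0 (compatible; examplebot/1.0; +http://crawler.example.org/about; crawl-admin@example.org)"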

``warcprox_meta``
~~~~~~~~~~~~~~~~~
+------------+----------+-----------+
| type       | required | default   |
+============+==========+===========+
| dictionary | no       | ``false`` |
+------------+----------+-----------+
Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
configured. The value of the Warcprox-Meta header is a json blob. It is used to
pass settings and information to warcprox. Warcprox does not forward the header
on to the remote site. See the warcprox docs for more information (XXX not yet
written).

Brozzler takes the configured value of ``warcprox_meta``, converts it to
json and populates the Warcprox-Meta header with that value. For example::

    warcprox_meta:
      warc-prefix: job1-seed1
      stats:
        buckets:
        - job1-stats
        - job1-seed1-stats

becomes::

    Warcprox-Meta: {"warc-prefix":"job1-seed1","stats":{"buckets":["job1-stats","job1-seed1-stats"]}}

``scope``
~~~~~~~~~
+------------+----------+-----------+
| type       | required | default   |
+============+==========+===========+
| dictionary | no       | ``false`` |
+------------+----------+-----------+
Scope rules. *TODO*

Scoping
=======

*TODO* explanation of scoping and scope rules
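
Until that explanation is written, the sketch below shows how the settings
listed in the following sections might fit together. It assumes that
``accepts`` and ``blocks`` each hold a list of scope rules, and that each rule
is a dictionary built from the scope rule settings. The hosts and patterns are
invented for illustration, and the matching semantics of each rule type are not
covered here::

    seeds:
    - url: http://three.example.org/
      scope:
        surt: http://(org,example,
        accepts:
        - domain: cdn.example.net # hypothetical extra host to allow
        blocks:
        - substring: /logout # hypothetical url fragment to avoid
        - regex: '^https?://three\.example\.org/calendar/.*$'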

Scope settings
--------------

``surt``
~~~~~~~~
+--------+----------+---------------------------+
| type   | required | default                   |
+========+==========+===========================+
| string | no       | *generated from seed url* |
+--------+----------+---------------------------+

``accepts``
~~~~~~~~~~~
+------+----------+---------+
| type | required | default |
+======+==========+=========+
| list | no       | *none*  |
+------+----------+---------+

``blocks``
~~~~~~~~~~~
+------+----------+---------+
| type | required | default |
+======+==========+=========+
| list | no       | *none*  |
+------+----------+---------+


Scope rule settings
-------------------

``domain``
~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``substring``
~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``regex``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``ssurt``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``surt``
~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``parent_url_regex``
~~~~~~~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+