document a bunch of job settings

Noah Levitt 2016-09-29 16:15:44 -07:00
parent 8c9a9c5666
commit bfd4c1f8c6
2 changed files with 128 additions and 4 deletions

@ -1,12 +1,12 @@
brozzler job configuration
==========================
**************************
Jobs are defined using yaml files. Options may be specified either at the
top-level or on individual seeds. A job id and at least one seed url
must be specified, everything else is optional.
an example
----------
==========
::
@ -37,7 +37,7 @@ an example
surt: http://(org,example,
how inheritance works
---------------------
=====================
Most of the available options apply to seeds. Such options can also be
specified at the top level, in which case the seeds inherit the options. If
@ -79,3 +79,127 @@ Notice that:
- There is a collision on ``warc-prefix`` and the seed-level value wins.
- Since ``buckets`` is a list, the merged result includes all the values from
  both the top level and the seed level.
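The merge rules above can be sketched in a few lines of Python. This is only an illustration, not brozzler's actual implementation: the function name ``merge_settings`` is hypothetical, the example values are made up, and the order of the merged list (top-level values first) is an assumption.

```python
def merge_settings(seed_value, top_value):
    """Merge a seed-level setting with the top-level setting it inherits.

    Hypothetical helper for illustration only.
    """
    if isinstance(seed_value, dict) and isinstance(top_value, dict):
        # start from the top-level values, then layer the seed on top
        merged = dict(top_value)
        for key, value in seed_value.items():
            if key in top_value:
                merged[key] = merge_settings(value, top_value[key])
            else:
                merged[key] = value
        return merged
    if isinstance(seed_value, list) and isinstance(top_value, list):
        # lists keep the values from both levels (ordering is an assumption)
        return top_value + seed_value
    # collision on a plain value: the seed-level value wins
    return seed_value


# made-up example values
top = {
    "warcprox_meta": {
        "warc-prefix": "job1",
        "stats": {"buckets": ["job1-stats"]},
    },
}
seed = {
    "url": "http://example.com/",
    "warcprox_meta": {
        "warc-prefix": "job1-seed1",
        "stats": {"buckets": ["job1-seed1-stats"]},
    },
}
merged_seed = merge_settings(seed, top)
```

Here ``merged_seed`` keeps the seed's ``warc-prefix`` and carries both bucket names.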
settings reference
==================
id
--
+-----------+--------+----------+---------+
| scope     | type   | required | default |
+===========+========+==========+=========+
| top-level | string | yes?     | *n/a*   |
+-----------+--------+----------+---------+
An arbitrary identifier for this job. Must be unique across this deployment of
brozzler.
seeds
-----
+-----------+------------------------+----------+---------+
| scope     | type                   | required | default |
+===========+========================+==========+=========+
| top-level | list (of dictionaries) | yes      | *n/a*   |
+-----------+------------------------+----------+---------+
List of seeds. Each item in the list is a dictionary (associative array) which
defines the seed. It must specify ``url`` (see below) and can additionally
specify any of the settings of scope *seed-level*.
url
---
+------------+--------+----------+---------+
| scope      | type   | required | default |
+============+========+==========+=========+
| seed-level | string | yes      | *n/a*   |
+------------+--------+----------+---------+
The seed url.
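As a rough sketch of the requirements stated so far (a job ``id``, a non-empty ``seeds`` list, and a ``url`` on every seed), a parsed job configuration could be checked like this. The function name ``validate_job_conf`` and the error messages are hypothetical; brozzler's own validation may differ.

```python
def validate_job_conf(job_conf):
    """Return a list of problems found in a parsed job configuration.

    Hypothetical helper: job_conf is the dictionary obtained by parsing
    the yaml job file.
    """
    problems = []
    if not job_conf.get("id"):
        problems.append("missing job id")
    seeds = job_conf.get("seeds")
    if not isinstance(seeds, list) or not seeds:
        problems.append("at least one seed is required")
    else:
        for i, seed in enumerate(seeds):
            # every seed is a dictionary and must carry a url
            if not isinstance(seed, dict) or not seed.get("url"):
                problems.append("seed %d has no url" % i)
    return problems
```

An empty list means the required settings are present; optional settings are not checked.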
time_limit
----------
+-----------------------+--------+----------+---------+
| scope                 | type   | required | default |
+=======================+========+==========+=========+
| seed-level, top-level | number | no       | *none*  |
+-----------------------+--------+----------+---------+
Time limit in seconds. If not specified, there is no time limit. Time limit is
enforced at the seed level. If a time limit is specified at the top level, it
is inherited by each seed as described above, and enforced individually on each
seed.
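A minimal sketch of that behavior, assuming each seed records its own start time. Both function names are hypothetical and this is not how brozzler necessarily implements the check.

```python
import time


def effective_time_limit(job_conf, seed):
    """Seed-level time_limit if set, else the inherited top-level value."""
    return seed.get("time_limit", job_conf.get("time_limit"))


def time_limit_reached(started, time_limit, now=None):
    """True once a seed has been crawling for at least time_limit seconds.

    started and now are timestamps in seconds; a time_limit of None
    means no limit.
    """
    if time_limit is None:
        return False
    if now is None:
        now = time.time()
    return now - started >= time_limit
```

Because the check takes a single seed's start time, one seed running out of time does not affect the others.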
proxy
-----
+-----------------------+--------+----------+---------+
| scope                 | type   | required | default |
+=======================+========+==========+=========+
| seed-level, top-level | string | no       | *none*  |
+-----------------------+--------+----------+---------+
HTTP proxy, with the format ``host:port``. Typically configured to point to
warcprox for archival crawling.
enable_warcprox_features
------------------------
+-----------------------+---------+----------+---------+
| scope                 | type    | required | default |
+=======================+=========+==========+=========+
| seed-level, top-level | boolean | no       | false   |
+-----------------------+---------+----------+---------+
If true for a given seed, and the seed is configured to use a proxy, enables
special features that assume the proxy is an instance of warcprox. As of this
writing, the special features that are enabled are:
- sending screenshots and thumbnails to warcprox using a WARCPROX_WRITE_RECORD
  request
- sending youtube-dl metadata json to warcprox using a WARCPROX_WRITE_RECORD
  request
See the warcprox docs for information on the WARCPROX_WRITE_RECORD method (XXX
not yet written).
*Note that if* ``warcprox_meta`` *and* ``proxy`` *are configured, the
Warcprox-Meta header will be sent even if* ``enable_warcprox_features`` *is not
set.*
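The interaction between ``proxy``, ``enable_warcprox_features`` and ``warcprox_meta`` can be summarized in a small sketch. The function name and the keys of the returned dictionary are hypothetical, chosen only to restate the rules above.

```python
def warcprox_behavior(seed):
    """Summarize which warcprox-related actions apply to a seed.

    Hypothetical helper: seed is a dictionary of the seed's settings
    after inheritance has been applied.
    """
    has_proxy = bool(seed.get("proxy"))
    return {
        # the header depends only on proxy and warcprox_meta
        "send_warcprox_meta_header": has_proxy and bool(seed.get("warcprox_meta")),
        # screenshots, thumbnails and youtube-dl json need the feature flag
        "write_extra_records": has_proxy and bool(seed.get("enable_warcprox_features")),
    }
```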
ignore_robots
-------------
+-----------------------+---------+----------+---------+
| scope                 | type    | required | default |
+=======================+=========+==========+=========+
| seed-level, top-level | boolean | no       | false   |
+-----------------------+---------+----------+---------+
If set to ``true``, brozzler will happily crawl pages that would otherwise be
blocked by robots.txt rules.
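To illustrate the effect of the setting, here is a sketch using Python's standard ``urllib.robotparser``. It is not brozzler's own robots.txt handling; the function name ``may_fetch`` and the ``brozzler`` user agent string are assumptions.

```python
import urllib.robotparser


def may_fetch(url, robots_txt_lines, ignore_robots=False, user_agent="brozzler"):
    """Decide whether url may be crawled given the lines of a robots.txt.

    Hypothetical helper; the user_agent default is an assumption.
    """
    if ignore_robots:
        # robots.txt rules are not consulted at all
        return True
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser.can_fetch(user_agent, url)
```

With ``ignore_robots`` left at its default of false, a ``Disallow`` rule that matches the url blocks the fetch.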
warcprox_meta
-------------
+-----------------------+------------+----------+---------+
| scope                 | type       | required | default |
+=======================+============+==========+=========+
| seed-level, top-level | dictionary | no       | false   |
+-----------------------+------------+----------+---------+
Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
configured. The value of the Warcprox-Meta header is a json blob. It is used to
pass settings and information to warcprox. Warcprox does not forward the header
on to the remote site. See the warcprox docs for more information (XXX not yet
written).
Brozzler takes the configured value of ``warcprox_meta``, converts it to
json and populates the Warcprox-Meta header with that value. For example::

    warcprox_meta:
      warc-prefix: job1-seed1
      stats:
        buckets:
        - job1-stats
        - job1-seed1-stats

becomes::

    Warcprox-Meta: {"warc-prefix":"job1-seed1","stats":{"buckets":["job1-stats","job1-seed1-stats"]}}
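The conversion can be reproduced with Python's standard ``json`` module. This is a sketch of the idea rather than brozzler's code: the dictionary stands in for the parsed yaml, and the compact ``separators`` are chosen here only to match the header shown above.

```python
import json

# the warcprox_meta setting as it would look after parsing the yaml
warcprox_meta = {
    "warc-prefix": "job1-seed1",
    "stats": {"buckets": ["job1-stats", "job1-seed1-stats"]},
}

# compact separators avoid spaces after "," and ":"
header_value = json.dumps(warcprox_meta, separators=(",", ":"))
header_line = "Warcprox-Meta: " + header_value
```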
scope
-----
+-----------------------+------------+----------+---------+
| scope                 | type       | required | default |
+=======================+============+==========+=========+
| seed-level, top-level | dictionary | no       | false   |
+-----------------------+------------+----------+---------+
Scope rules. *TODO*

@ -32,7 +32,7 @@ def find_package_data(package):
setuptools.setup(
name='brozzler',
version='1.1b6.dev86',
version='1.1b6.dev87',
description='Distributed web crawling with browsers',
url='https://github.com/internetarchive/brozzler',
author='Noah Levitt',