mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-04-19 23:35:54 -04:00
document a bunch of job settings
parent
8c9a9c5666
commit
bfd4c1f8c6
130
job-conf.rst
@@ -1,12 +1,12 @@
brozzler job configuration
==========================
**************************

Jobs are defined using yaml files. Options may be specified either at the
top level or on individual seeds. A job id and at least one seed url
must be specified; everything else is optional.

an example
----------
==========

::

@@ -37,7 +37,7 @@ an example
        surt: http://(org,example,

how inheritance works
---------------------
=====================

Most of the available options apply to seeds. Such options can also be
specified at the top level, in which case the seeds inherit the options. If
@@ -79,3 +79,127 @@ Notice that:
- There is a collision on ``warc-prefix`` and the seed-level value wins.
- Since ``buckets`` is a list, the merged result includes all the values from
  both the top level and the seed level.

settings reference
==================

id
--
+-----------+--------+----------+---------+
| scope     | type   | required | default |
+===========+========+==========+=========+
| top-level | string | yes?     | *n/a*   |
+-----------+--------+----------+---------+

An arbitrary identifier for this job. Must be unique across this deployment of
brozzler.

seeds
-----
+-----------+------------------------+----------+---------+
| scope     | type                   | required | default |
+===========+========================+==========+=========+
| top-level | list (of dictionaries) | yes      | *n/a*   |
+-----------+------------------------+----------+---------+

List of seeds. Each item in the list is a dictionary (associative array) which
defines the seed. It must specify ``url`` (see below) and can additionally
specify any of the settings of scope *seed-level*.

url
---
+------------+--------+----------+---------+
| scope      | type   | required | default |
+============+========+==========+=========+
| seed-level | string | yes      | *n/a*   |
+------------+--------+----------+---------+

The seed url.
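
As an illustration of the required settings described so far, a minimal job
file might look like the sketch below. The job id and the seed urls are
made-up placeholders, not values from a real deployment::

    # hypothetical minimal job: only id and seeds (each with a url) are set
    id: example-job
    seeds:
    - url: http://one.example.org/
    - url: http://two.example.org/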

time_limit
----------
+-----------------------+--------+----------+---------+
| scope                 | type   | required | default |
+=======================+========+==========+=========+
| seed-level, top-level | number | no       | *none*  |
+-----------------------+--------+----------+---------+

Time limit in seconds. If not specified, there is no time limit. The time limit
is enforced at the seed level. If a time limit is specified at the top level, it
is inherited by each seed as described above, and enforced individually on each
seed.
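
A sketch of how this plays out, using made-up values: the first seed inherits
the top-level limit of 60 seconds, while the second sets its own limit of 30
seconds. Each limit is counted separately for its own seed::

    id: example-job                   # placeholder id
    time_limit: 60                    # seconds; inherited by seeds with no limit
    seeds:
    - url: http://one.example.org/    # effective time_limit: 60
    - url: http://two.example.org/
      time_limit: 30                  # seed-level value wins over the top level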

proxy
-----
+-----------------------+--------+----------+---------+
| scope                 | type   | required | default |
+=======================+========+==========+=========+
| seed-level, top-level | string | no       | *none*  |
+-----------------------+--------+----------+---------+

HTTP proxy, with the format ``host:port``. Typically configured to point to
warcprox for archival crawling.

enable_warcprox_features
------------------------
+-----------------------+---------+----------+---------+
| scope                 | type    | required | default |
+=======================+=========+==========+=========+
| seed-level, top-level | boolean | no       | false   |
+-----------------------+---------+----------+---------+

If true for a given seed, and the seed is configured to use a proxy, enables
special features that assume the proxy is an instance of warcprox. As of this
writing, the special features that are enabled are:

- sending screenshots and thumbnails to warcprox using a WARCPROX_WRITE_RECORD
  request
- sending youtube-dl metadata json to warcprox using a WARCPROX_WRITE_RECORD
  request

See the warcprox docs for information on the WARCPROX_WRITE_RECORD method (XXX
not yet written).

*Note that if* ``warcprox_meta`` *and* ``proxy`` *are configured, the
Warcprox-Meta header will be sent even if* ``enable_warcprox_features`` *is not
set.*
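
For example, a job that crawls through a warcprox instance and turns on these
features could be configured roughly as follows. The address
``localhost:8000`` is an assumption made for illustration; substitute whatever
host and port your warcprox instance actually listens on::

    id: example-job                   # placeholder id
    proxy: localhost:8000             # assumed warcprox address, host:port
    enable_warcprox_features: true    # only has an effect because proxy is set
    seeds:
    - url: http://one.example.org/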

ignore_robots
-------------
+-----------------------+---------+----------+---------+
| scope                 | type    | required | default |
+=======================+=========+==========+=========+
| seed-level, top-level | boolean | no       | false   |
+-----------------------+---------+----------+---------+

If set to ``true``, brozzler will happily crawl pages that would otherwise be
blocked by robots.txt rules.
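
Since the setting is accepted at both levels, it can be switched on for a
single seed while the rest of the job keeps honoring robots.txt. A sketch with
placeholder urls::

    id: example-job                   # placeholder id
    ignore_robots: false              # the default, written out for clarity
    seeds:
    - url: http://one.example.org/    # robots.txt rules are respected
    - url: http://two.example.org/
      ignore_robots: true             # robots.txt is ignored for this seed only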

warcprox_meta
-------------
+-----------------------+------------+----------+---------+
| scope                 | type       | required | default |
+=======================+============+==========+=========+
| seed-level, top-level | dictionary | no       | false   |
+-----------------------+------------+----------+---------+

Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
configured. The value of the Warcprox-Meta header is a json blob. It is used to
pass settings and information to warcprox. Warcprox does not forward the header
on to the remote site. See the warcprox docs for more information (XXX not yet
written).

Brozzler takes the configured value of ``warcprox_meta``, converts it to
json and populates the Warcprox-Meta header with that value. For example::

    warcprox_meta:
      warc-prefix: job1-seed1
      stats:
        buckets:
        - job1-stats
        - job1-seed1-stats

becomes::

    Warcprox-Meta: {"warc-prefix":"job1-seed1","stats":{"buckets":["job1-stats","job1-seed1-stats"]}}

scope
-----
+-----------------------+------------+----------+---------+
| scope                 | type       | required | default |
+=======================+============+==========+=========+
| seed-level, top-level | dictionary | no       | false   |
+-----------------------+------------+----------+---------+

Scope rules. *TODO*