mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-06-20 12:54:23 -04:00
document a bunch of job settings
parent
8c9a9c5666
commit
bfd4c1f8c6
2 changed files with 128 additions and 4 deletions
130
job-conf.rst
@@ -1,12 +1,12 @@
 brozzler job configuration
-==========================
+**************************

 Jobs are defined using yaml files. Options may be specified either at the
 top-level or on individual seeds. A job id and at least one seed url
 must be specified, everything else is optional.

 an example
-----------
+==========

 ::

@@ -37,7 +37,7 @@ an example
 surt: http://(org,example,

 how inheritance works
----------------------
+=====================

 Most of the available options apply to seeds. Such options can also be
 specified at the top level, in which case the seeds inherit the options. If
@@ -79,3 +79,127 @@ Notice that:
 - There is a collision on ``warc-prefix`` and the seed-level value wins.
 - Since ``buckets`` is a list, the merged result includes all the values from
   both the top level and the seed level.
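The merge behaviour noted in these two bullets can be sketched in Python. This is a minimal illustration, not brozzler's implementation: ``merge_settings`` is a hypothetical helper, the top-level ``warc-prefix`` value ``job1`` is made up for the example, and putting top-level list items before seed-level ones is an assumption.

```python
import copy


def merge_settings(top_level, seed_level):
    """Return seed-level settings merged over top-level settings.

    Hypothetical helper, not brozzler's actual code. Scalars from the seed
    win, dictionaries are merged recursively, and lists are concatenated
    with the top-level items first (the ordering is an assumption).
    """
    merged = copy.deepcopy(top_level)
    for key, seed_value in seed_level.items():
        top_value = merged.get(key)
        if isinstance(top_value, dict) and isinstance(seed_value, dict):
            merged[key] = merge_settings(top_value, seed_value)
        elif isinstance(top_value, list) and isinstance(seed_value, list):
            merged[key] = top_value + copy.deepcopy(seed_value)
        else:
            # Collision on a scalar (or mismatched types): seed level wins.
            merged[key] = copy.deepcopy(seed_value)
    return merged


# Illustrative values only; "job1" as the top-level prefix is invented here.
top = {
    "warcprox_meta": {
        "warc-prefix": "job1",
        "stats": {"buckets": ["job1-stats"]},
    }
}
seed = {
    "warcprox_meta": {
        "warc-prefix": "job1-seed1",
        "stats": {"buckets": ["job1-seed1-stats"]},
    }
}
merged = merge_settings(top, seed)
```

Deep copies keep the top-level dictionary untouched, so the same top-level settings can be merged into each seed in turn.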
+
+settings reference
+==================
+
+id
+--
++-----------+--------+----------+---------+
+| scope     | type   | required | default |
++===========+========+==========+=========+
+| top-level | string | yes?     | *n/a*   |
++-----------+--------+----------+---------+
+An arbitrary identifier for this job. Must be unique across this deployment of
+brozzler.
+
+seeds
+-----
++-----------+------------------------+----------+---------+
+| scope     | type                   | required | default |
++===========+========================+==========+=========+
+| top-level | list (of dictionaries) | yes      | *n/a*   |
++-----------+------------------------+----------+---------+
+List of seeds. Each item in the list is a dictionary (associative array) which
+defines the seed. It must specify ``url`` (see below) and can additionally
+specify any of the settings of scope *seed-level*.
+
+url
+---
++------------+--------+----------+---------+
+| scope      | type   | required | default |
++============+========+==========+=========+
+| seed-level | string | yes      | *n/a*   |
++------------+--------+----------+---------+
+The seed url.
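The three required settings (``id``, ``seeds``, and each seed's ``url``) could be checked on an already-parsed job configuration roughly as follows. This is a sketch under assumptions: ``validate_job_conf`` is a hypothetical function, not part of brozzler, and it treats ``id`` as required even though the table marks it ``yes?``.

```python
def validate_job_conf(job_conf):
    """Return a list of problems found in a parsed job configuration.

    Hypothetical check, not part of brozzler. It only enforces the
    requirements stated in the reference tables: a top-level string id,
    a non-empty list of seeds, and a string url on every seed.
    """
    problems = []
    job_id = job_conf.get("id")
    if not isinstance(job_id, str) or not job_id:
        problems.append("missing top-level 'id'")
    seeds = job_conf.get("seeds")
    if not isinstance(seeds, list) or not seeds:
        problems.append("'seeds' must be a non-empty list")
        return problems
    for index, seed in enumerate(seeds):
        if not isinstance(seed, dict) or not isinstance(seed.get("url"), str):
            problems.append("seed %d is missing 'url'" % index)
    return problems


# Example input; the job id and url are placeholders.
ok_conf = {"id": "job1", "seeds": [{"url": "http://example.org/"}]}
bad_conf = {"seeds": [{}]}
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in a job file at once.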
+
+time_limit
+----------
++-----------------------+--------+----------+---------+
+| scope                 | type   | required | default |
++=======================+========+==========+=========+
+| seed-level, top-level | number | no       | *none*  |
++-----------------------+--------+----------+---------+
+Time limit in seconds. If not specified, there is no time limit. Time limit is
+enforced at the seed level. If a time limit is specified at the top level, it
+is inherited by each seed as described above, and enforced individually on each
+seed.
+
+proxy
+-----
++-----------------------+--------+----------+---------+
+| scope                 | type   | required | default |
++=======================+========+==========+=========+
+| seed-level, top-level | string | no       | *none*  |
++-----------------------+--------+----------+---------+
+HTTP proxy, with the format ``host:port``. Typically configured to point to
+warcprox for archival crawling.
+
+enable_warcprox_features
+------------------------
++-----------------------+---------+----------+---------+
+| scope                 | type    | required | default |
++=======================+=========+==========+=========+
+| seed-level, top-level | boolean | no       | false   |
++-----------------------+---------+----------+---------+
+If true for a given seed, and the seed is configured to use a proxy, enables
+special features that assume the proxy is an instance of warcprox. As of this
+writing, the special features that are enabled are:
+
+- sending screenshots and thumbnails to warcprox using a WARCPROX_WRITE_RECORD
+  request
+- sending youtube-dl metadata json to warcprox using a WARCPROX_WRITE_RECORD
+  request
+
+See the warcprox docs for information on the WARCPROX_WRITE_RECORD method (XXX
+not yet written).
+
+*Note that if* ``warcprox_meta`` *and* ``proxy`` *are configured, the
+Warcprox-Meta header will be sent even if* ``enable_warcprox_features`` *is not
+set.*
+
+ignore_robots
+-------------
++-----------------------+---------+----------+---------+
+| scope                 | type    | required | default |
++=======================+=========+==========+=========+
+| seed-level, top-level | boolean | no       | false   |
++-----------------------+---------+----------+---------+
+If set to ``true``, brozzler will happily crawl pages that would otherwise be
+blocked by robots.txt rules.
+
+warcprox_meta
+-------------
++-----------------------+------------+----------+---------+
+| scope                 | type       | required | default |
++=======================+============+==========+=========+
+| seed-level, top-level | dictionary | no       | false   |
++-----------------------+------------+----------+---------+
+Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
+configured. The value of the Warcprox-Meta header is a json blob. It is used to
+pass settings and information to warcprox. Warcprox does not forward the header
+on to the remote site. See the warcprox docs for more information (XXX not yet
+written).
+
+Brozzler takes the configured value of ``warcprox_meta``, converts it to
+json and populates the Warcprox-Meta header with that value. For example::
+
+    warcprox_meta:
+      warc-prefix: job1-seed1
+      stats:
+        buckets:
+        - job1-stats
+        - job1-seed1-stats
+
+becomes::
+
+    Warcprox-Meta: {"warc-prefix":"job1-seed1","stats":{"buckets":["job1-stats","job1-seed1-stats"]}}
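The conversion from the configured dictionary to the header value can be approximated with the standard ``json`` module. This is only a sketch: ``warcprox_meta_header`` is a hypothetical helper, and the compact separators are chosen to reproduce the header shown in the example, not taken from brozzler's code.

```python
import json

# The warcprox_meta setting from the example, as it would look after the
# yaml job file has been parsed into Python objects.
warcprox_meta = {
    "warc-prefix": "job1-seed1",
    "stats": {"buckets": ["job1-stats", "job1-seed1-stats"]},
}


def warcprox_meta_header(meta):
    """Build a request-header dict carrying meta as compact json.

    Hypothetical helper for illustration; separators=(",", ":") drops the
    whitespace json.dumps would otherwise add after commas and colons.
    """
    return {"Warcprox-Meta": json.dumps(meta, separators=(",", ":"))}


headers = warcprox_meta_header(warcprox_meta)
```

The resulting ``headers`` dictionary could then be merged into the headers of each request sent through the configured proxy.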
+
+scope
+-----
++-----------------------+------------+----------+---------+
+| scope                 | type       | required | default |
++=======================+============+==========+=========+
+| seed-level, top-level | dictionary | no       | false   |
++-----------------------+------------+----------+---------+
+Scope rules. *TODO*
2
setup.py
@@ -32,7 +32,7 @@ def find_package_data(package):

 setuptools.setup(
         name='brozzler',
-        version='1.1b6.dev86',
+        version='1.1b6.dev87',
         description='Distributed web crawling with browsers',
         url='https://github.com/internetarchive/brozzler',
         author='Noah Levitt',