Brozzler Job Configuration
**************************

Jobs are defined using YAML files. At least one seed URL must be specified;
everything else is optional.

.. contents::

Example
=======

::

    id: myjob
    time_limit: 60 # seconds
    proxy: 127.0.0.1:8000 # point at warcprox for archiving
    ignore_robots: false
    max_claimed_sites: 2
    warcprox_meta:
      warc-prefix: job1
      stats:
        buckets:
        - job1-stats
    metadata: {}
    seeds:
    - url: http://one.example.org/
      warcprox_meta:
        warc-prefix: job1-seed1
        stats:
          buckets:
          - job1-seed1-stats
    - url: http://two.example.org/
      time_limit: 30
    - url: http://three.example.org/
      time_limit: 10
      ignore_robots: true
      scope:
        surt: http://(org,example,

How inheritance works
=====================

Most of the settings that apply to seeds can also be specified at the top
level, in which case all seeds inherit those settings. If an option is
specified both at the top level and at the level of an individual seed, the
results are merged, with the seed-level value taking precedence in case of
conflicts. It's probably easiest to make sense of this by way of an example.

In the example YAML above, ``warcprox_meta`` is specified at the top level and
at the seed level for the seed http://one.example.org/. At the top level we
have::

    warcprox_meta:
      warc-prefix: job1
      stats:
        buckets:
        - job1-stats

At the seed level we have::

    warcprox_meta:
      warc-prefix: job1-seed1
      stats:
        buckets:
        - job1-seed1-stats

The merged configuration as applied to the seed http://one.example.org/ will
be::

    warcprox_meta:
      warc-prefix: job1-seed1
      stats:
        buckets:
        - job1-stats
        - job1-seed1-stats

Notice that:

- There is a collision on ``warc-prefix`` and the seed-level value wins.
- Since ``buckets`` is a list, the merged result includes all the values from
  both the top level and the seed level.

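The merge rules above can be illustrated with a short Python sketch. This is
not brozzler's actual implementation; the function name ``merge_settings`` is
hypothetical, and the rules it encodes (dictionaries merge key by key, lists
are concatenated with top-level values first, and the seed-level value wins
for anything else) are only what the example above suggests.

```python
def merge_settings(top, seed):
    """Merge a top-level setting with a seed-level setting (hypothetical helper)."""
    if isinstance(top, dict) and isinstance(seed, dict):
        # Dictionaries are merged key by key, recursing on shared keys.
        merged = dict(top)
        for key, seed_value in seed.items():
            if key in top:
                merged[key] = merge_settings(top[key], seed_value)
            else:
                merged[key] = seed_value
        return merged
    if isinstance(top, list) and isinstance(seed, list):
        # Lists keep every value: top-level entries first, then seed-level ones.
        return top + seed
    # Any other collision: the seed-level value takes precedence.
    return seed


top_level = {"warc-prefix": "job1", "stats": {"buckets": ["job1-stats"]}}
seed_level = {
    "warc-prefix": "job1-seed1",
    "stats": {"buckets": ["job1-seed1-stats"]},
}

merged = merge_settings(top_level, seed_level)
print(merged)
```

Applied to the two ``warcprox_meta`` values from the example, the sketch
yields ``warc-prefix: job1-seed1`` and a ``buckets`` list containing both
``job1-stats`` and ``job1-seed1-stats``, matching the merged configuration
shown above.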
Settings reference
==================

Top-level settings
------------------

``id``
~~~~~~
+--------+----------+--------------------------+
| type   | required | default                  |
+========+==========+==========================+
| string | no       | *generated by rethinkdb* |
+--------+----------+--------------------------+

An arbitrary identifier for this job. Must be unique across this deployment of
brozzler.

``max_claimed_sites``
~~~~~~~~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| number | no       | *none*  |
+--------+----------+---------+

Puts a cap on the number of sites belonging to a given job that can be brozzled
simultaneously across the cluster. Addresses the problem of a job with many
seeds starving out other jobs.

``seeds``
~~~~~~~~~
+------------------------+----------+---------+
| type                   | required | default |
+========================+==========+=========+
| list (of dictionaries) | yes      | *n/a*   |
+------------------------+----------+---------+

List of seeds. Each item in the list is a dictionary (associative array) which
defines the seed. It must specify ``url`` (see below) and can additionally
specify any seed settings.

Seed-level-only settings
------------------------
These settings can be specified only at the seed level, unlike most seed
settings, which can also be specified at the top level.

``url``
~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | yes      | *n/a*   |
+--------+----------+---------+

The seed url. Crawling starts here.

``username``
~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

If set, used to populate automatically detected login forms. See explanation at
"password" below.

``password``
~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

If set, used to populate automatically detected login forms. If ``username``
and ``password`` are configured for a seed, brozzler will look for a login form
on each page it crawls for that seed. A form that has a single text or email
field (the username), a single password field (``<input type="password">``),
and has ``method="POST"`` is considered to be a login form. The form may have
other fields like checkboxes and hidden fields. For these, brozzler will leave
the default values in place. Login form detection and submission happen after
page load, then brozzling proceeds as usual.

Seed-level / top-level settings
-------------------------------
These are seed settings that can also be specified at the top level, in which
case they are inherited by all seeds.

``metadata``
~~~~~~~~~~~~
+------------+----------+---------+
| type       | required | default |
+============+==========+=========+
| dictionary | no       | *none*  |
+------------+----------+---------+

Arbitrary information about the crawl job or site. Merely informative, not used
by brozzler for anything. Could be of use to some external process.

``time_limit``
~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| number | no       | *none*  |
+--------+----------+---------+

Time limit in seconds. If not specified, there is no time limit. Time limit is
enforced at the seed level. If a time limit is specified at the top level, it
is inherited by each seed as described above, and enforced individually on each
seed.

``proxy``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

HTTP proxy, with the format ``host:port``. Typically configured to point to
warcprox for archival crawling.

``ignore_robots``
~~~~~~~~~~~~~~~~~
+---------+----------+-----------+
| type    | required | default   |
+=========+==========+===========+
| boolean | no       | ``false`` |
+---------+----------+-----------+

If set to ``true``, brozzler will happily crawl pages that would otherwise be
blocked by robots.txt rules.

``user_agent``
~~~~~~~~~~~~~~
+---------+----------+---------+
| type    | required | default |
+=========+==========+=========+
| string  | no       | *none*  |
+---------+----------+---------+

The ``User-Agent`` header brozzler will send to identify itself to web servers.
It's good etiquette to include a project URL with a notice to webmasters that
explains why you're crawling, how to block the crawler using robots.txt, and
how to contact the operator if the crawl is causing problems.

``warcprox_meta``
~~~~~~~~~~~~~~~~~
+------------+----------+-----------+
| type       | required | default   |
+============+==========+===========+
| dictionary | no       | ``false`` |
+------------+----------+-----------+

Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
configured. The value of the Warcprox-Meta header is a JSON blob. It is used to
pass settings and information to warcprox. Warcprox does not forward the header
on to the remote site. See the warcprox docs for more information (XXX not yet
written).

Brozzler takes the configured value of ``warcprox_meta``, converts it to JSON
and populates the Warcprox-Meta header with that value. For example::

    warcprox_meta:
      warc-prefix: job1-seed1
      stats:
        buckets:
        - job1-stats
        - job1-seed1-stats

becomes::

    Warcprox-Meta: {"warc-prefix":"job1-seed1","stats":{"buckets":["job1-stats","job1-seed1-stats"]}}

``scope``
~~~~~~~~~
+------------+----------+-----------+
| type       | required | default   |
+============+==========+===========+
| dictionary | no       | ``false`` |
+------------+----------+-----------+

Scope specification for the seed. See the "Scoping" section which follows.

Scoping
=======

The scope of a seed determines which links are scheduled for crawling and which
are not. Example::

    scope:
      accepts:
      - parent_url_regex: ^https?://(www\.)?youtube.com/(user|channel)/.*$
        regex: ^https?://(www\.)?youtube.com/watch\?.*$
      - surt: +http://(com,google,video,
      - surt: +http://(com,googlevideo,
      blocks:
      - domain: youngscholars.unimelb.edu.au
        substring: wp-login.php?action=logout
      - domain: malware.us
      max_hops: 20
      max_hops_off_surt: 0

Scope settings
--------------

``surt``
~~~~~~~~
+--------+----------+---------------------------+
| type   | required | default                   |
+========+==========+===========================+
| string | no       | *generated from seed url* |
+--------+----------+---------------------------+

``accepts``
~~~~~~~~~~~
+------+----------+---------+
| type | required | default |
+======+==========+=========+
| list | no       | *none*  |
+------+----------+---------+

``blocks``
~~~~~~~~~~
+------+----------+---------+
| type | required | default |
+======+==========+=========+
| list | no       | *none*  |
+------+----------+---------+

``max_hops``
~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| number | no       | *none*  |
+--------+----------+---------+

``max_hops_off_surt``
~~~~~~~~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| number | no       | 0       |
+--------+----------+---------+

Scope rule settings
-------------------

``domain``
~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``substring``
~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``regex``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``ssurt``
~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``surt``
~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+

``parent_url_regex``
~~~~~~~~~~~~~~~~~~~~
+--------+----------+---------+
| type   | required | default |
+========+==========+=========+
| string | no       | *none*  |
+--------+----------+---------+