From 88214236bb1642de355b5730dc013ce192b83db5 Mon Sep 17 00:00:00 2001
From: Noah Levitt
Date: Mon, 19 Mar 2018 17:23:49 -0700
Subject: [PATCH] WIP starting to flesh out "scoping" section

---
 job-conf.rst | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/job-conf.rst b/job-conf.rst
index e5f79db..756c232 100644
--- a/job-conf.rst
+++ b/job-conf.rst
@@ -251,12 +251,26 @@ becomes::
 +============+==========+===========+
 | dictionary | no       | ``false`` |
 +------------+----------+-----------+
-Scope rules. *TODO*
+Scope specification for the seed. See the "Scoping" section which follows.
 
 Scoping
 =======
 
-*TODO* explanation of scoping and scope rules
+The scope of a seed determines which links are scheduled for crawling and which
+are not. Example::
+
+    scope:
+      accepts:
+      - parent_url_regex: ^https?://(www\.)?youtube.com/(user|channel)/.*$
+        regex: ^https?://(www\.)?youtube.com/watch\?.*$
+      - surt: +http://(com,google,video,
+      - surt: +http://(com,googlevideo,
+      blocks:
+      - domain: youngscholars.unimelb.edu.au
+        substring: wp-login.php?action=logout
+      - domain: malware.us
+      max_hops: 20
+      max_hops_off_surt: 0
 
 Scope settings
 --------------
@@ -285,6 +299,21 @@ Scope settings
 | list | no       | *none*  |
 +------+----------+---------+
 
+``max_hops``
+~~~~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| number | no       | *none*  |
++--------+----------+---------+
+
+``max_hops_off_surt``
+~~~~~~~~~~~~~~~~~~~~~
++--------+----------+---------+
+| type   | required | default |
++========+==========+=========+
+| number | no       | 0       |
++--------+----------+---------+
 
 Scope rule settings
 -------------------