mirror of https://github.com/internetarchive/brozzler.git (synced 2025-07-26 00:05:42 -04:00)
more explication of scoping
parent 2cf474aa1d
commit a327cb626f
1 changed file with 41 additions and 11 deletions
job-conf.rst (52 changed lines)
@@ -261,9 +261,9 @@ are not. Example::

     scope:
       accepts:
+      - ssurt: com,example,//https:/
       - parent_url_regex: ^https?://(www\.)?youtube.com/(user|channel)/.*$
         regex: ^https?://(www\.)?youtube.com/watch\?.*$
-      - ssurt: com,example,//https:/
       - surt: http://(com,google,video,
       - surt: http://(com,googlevideo,
       blocks:
@@ -279,14 +279,43 @@ each link to determine whether it is in scope or out of scope for the crawl.
 Then, newly discovered links that are in scope are scheduled to be crawled, and
 previously discovered links get a priority bump.

-How scope rules are applied
----------------------------
-1. If any ``block`` rule matches, the url is out of scope.
-2. Otherwise, if any ``accept`` rule matches, the url is in scope.
-3. Otherwise (no rules match), the url is out of scope.
+Applying scope rules
+--------------------

-In other words, by default urls are not in scope, and ``block`` rules take
-precedence over ``accept`` rules.
+Each scope rule has one or more conditions. If all of the conditions match,
+then the scope rule as a whole matches. For example::

+    - domain: youngscholars.unimelb.edu.au
+      substring: wp-login.php?action=logout
+
+This rule applies if the domain of the url is "youngscholars.unimelb.edu.au" or
+a subdomain, and the string "wp-login.php?action=logout" is found somewhere in
+the url.
+
+Brozzler applies these logical steps to decide whether a page url is in or out
+of scope:
+
+1. If the number of hops from seed is greater than ``max_hops``, the url is
+   **out of scope**.
+2. Otherwise, if any ``block`` rule matches, the url is **out of scope**.
+3. Otherwise, if any ``accept`` rule matches, the url is **in scope**.
+4. Otherwise, if the url is at most ``max_hops_off`` hops from the last page
+   that was in scope thanks to an ``accept`` rule, the url is **in scope**.
+5. Otherwise (no rules match), the url is **out of scope**.
+
+Notably, ``block`` rules take precedence over ``accept`` rules.
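
The five steps listed in this hunk can be sketched in Python. This is an illustration under stated assumptions, not brozzler's actual implementation: the names ``rule_matches`` and ``in_scope`` are hypothetical, only the ``domain`` and ``substring`` conditions are modeled, ``max_hops=None`` is assumed to mean "no limit", and ``hops_off`` is assumed to count the hops between the url and the last page that matched an ``accept`` rule.

```python
from urllib.parse import urlsplit


def rule_matches(rule, url):
    """Return True only if every condition in the rule matches the url.

    Hypothetical helper: models just the ``domain`` and ``substring``
    conditions mentioned in the text.
    """
    host = urlsplit(url).hostname or ""
    checks = []
    if "domain" in rule:
        domain = rule["domain"]
        # The domain itself or any subdomain of it counts as a match.
        checks.append(host == domain or host.endswith("." + domain))
    if "substring" in rule:
        checks.append(rule["substring"] in url)
    # A rule with no recognized conditions is treated as not matching.
    return bool(checks) and all(checks)


def in_scope(url, hops_from_seed, hops_off, blocks, accepts,
             max_hops=None, max_hops_off=0):
    """Apply the five documented steps in order (illustrative only)."""
    # 1. Too many hops from the seed: out of scope.
    if max_hops is not None and hops_from_seed > max_hops:
        return False
    # 2. Any block rule matches: out of scope.
    if any(rule_matches(rule, url) for rule in blocks):
        return False
    # 3. Any accept rule matches: in scope.
    if any(rule_matches(rule, url) for rule in accepts):
        return True
    # 4. Within max_hops_off of the last accepted page: in scope.
    if hops_off <= max_hops_off:
        return True
    # 5. Nothing matched: out of scope.
    return False
```

Because step 2 runs before step 3, a url that matches both a ``block`` rule and an ``accept`` rule ends up out of scope, which is the precedence the text describes.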
+
+It may also be helpful to think about a list of scope rules as a boolean
+expression. For example::
+
+    blocks:
+    - domain: youngscholars.unimelb.edu.au
+      substring: wp-login.php?action=logout
+    - domain: malware.us
+
+means block the url IF::
+
+    (domain: youngscholars.unimelb.edu.au AND substring: wp-login.php?action=logout) OR domain: malware.us
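
Written out as code, the boolean expression for this ``blocks`` list might look like the following. It is a sketch only: ``domain_matches`` and ``is_blocked`` are hypothetical names, and the subdomain handling is an assumption based on the description of the ``domain`` condition, not brozzler's own matching code.

```python
from urllib.parse import urlsplit


def domain_matches(url, domain):
    """Hypothetical helper: True for the domain itself or any subdomain."""
    host = urlsplit(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def is_blocked(url):
    """(domain AND substring) OR domain, mirroring the two block rules."""
    return (
        domain_matches(url, "youngscholars.unimelb.edu.au")
        and "wp-login.php?action=logout" in url
    ) or domain_matches(url, "malware.us")
```

Conditions inside one rule are joined with AND, and the rules in the list are joined with OR.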
+
 Automatic scoping based on seed urls
 ------------------------------------
@@ -346,6 +375,7 @@ List of scope rules.
 +======+==========+=========+
 | list | no       | *none*  |
 +------+----------+---------+
+List of scope rules.

 ``max_hops``
 ~~~~~~~~~~~~
@@ -356,15 +386,15 @@ List of scope rules.
 +--------+----------+---------+

 ``max_hops_off``
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~
 +--------+----------+---------+
 | type   | required | default |
 +========+==========+=========+
 | number | no       | 0       |
 +--------+----------+---------+

-Scope rule settings
--------------------
+Scope rule conditions
+---------------------

 ``domain``
 ~~~~~~~~~