mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-02-24 00:29:53 -05:00
describe scope rule conditions
plus a bunch of tweaks and fixes
parent
a327cb626f
commit
de1f240e25
59
job-conf.rst
@@ -42,9 +42,9 @@ How inheritance works
|
|||||||
|
|
||||||
Most of the settings that apply to seeds can also be specified at the top
|
Most of the settings that apply to seeds can also be specified at the top
|
||||||
level, in which case all seeds inherit those settings. If an option is
|
level, in which case all seeds inherit those settings. If an option is
|
||||||
specified both at the top level and at the level of an individual seed, the
|
specified both at the top level and at seed level, the results are merged with
|
||||||
results are merged with the seed-level value taking precedence in case of
|
the seed-level value taking precedence in case of conflicts. It's probably
|
||||||
conflicts. It's probably easiest to make sense of this by way of an example.
|
easiest to make sense of this by way of an example.
|
||||||
|
|
||||||
In the example yaml above, ``warcprox_meta`` is specified at the top level and
|
In the example yaml above, ``warcprox_meta`` is specified at the top level and
|
||||||
at the seed level for the seed http://one.example.org/. At the top level we
|
at the seed level for the seed http://one.example.org/. At the top level we
|
||||||
@@ -155,8 +155,8 @@ on each page it crawls for that seed. A form that has a single text or email
|
|||||||
field (the username), a single password field (``<input type="password">``),
|
field (the username), a single password field (``<input type="password">``),
|
||||||
and has ``method="POST"`` is considered to be a login form. The form may have
|
and has ``method="POST"`` is considered to be a login form. The form may have
|
||||||
other fields like checkboxes and hidden fields. For these, brozzler will leave
|
other fields like checkboxes and hidden fields. For these, brozzler will leave
|
||||||
the default values in place. Login form detection and submission happen after
|
the default values in place. Brozzler submits login forms after page load.
|
||||||
page load, then brozzling proceeds as usual.
|
Then brozzling proceeds as usual.
|
||||||
|
|
||||||
Seed-level / top-level settings
|
Seed-level / top-level settings
|
||||||
-------------------------------
|
-------------------------------
|
||||||
@@ -180,7 +180,7 @@ by brozzler for anything. Could be of use to some external process.
|
|||||||
+========+==========+=========+
|
+========+==========+=========+
|
||||||
| number | no | *none* |
|
| number | no | *none* |
|
||||||
+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
Time limit in seconds. If not specified, there no time limit. Time limit is
|
Time limit in seconds. If not specified, there is no time limit. Time limit is
|
||||||
enforced at the seed level. If a time limit is specified at the top level, it
|
enforced at the seed level. If a time limit is specified at the top level, it
|
||||||
is inherited by each seed as described above, and enforced individually on each
|
is inherited by each seed as described above, and enforced individually on each
|
||||||
seed.
|
seed.
|
||||||
@@ -279,8 +279,8 @@ each link to determine whether it is in scope or out of scope for the crawl.
|
|||||||
Then, newly discovered links that are in scope are scheduled to be crawled, and
|
Then, newly discovered links that are in scope are scheduled to be crawled, and
|
||||||
previously discovered links get a priority bump.
|
previously discovered links get a priority bump.
|
||||||
|
|
||||||
Applying scope rules
|
How brozzler applies scope rules
|
||||||
--------------------
|
--------------------------------
|
||||||
|
|
||||||
Each scope rule has one or more conditions. If all of the conditions match,
|
Each scope rule has one or more conditions. If all of the conditions match,
|
||||||
then the scope rule as a whole matches. For example::
|
then the scope rule as a whole matches. For example::
|
||||||
@@ -292,8 +292,8 @@ This rule applies if the domain of the url is "youngscholars.unimelb.edu.au" or
|
|||||||
a subdomain, and the string "wp-login.php?action=logout" is found somewhere in
|
a subdomain, and the string "wp-login.php?action=logout" is found somewhere in
|
||||||
the url.
|
the url.
|
||||||
|
|
||||||
Brozzler applies these logical steps to decide whether a page url is in or out
|
Brozzler applies these logical steps to decide whether a url is in or out of
|
||||||
of scope:
|
scope:
|
||||||
|
|
||||||
1. If the number of hops from seed is greater than ``max_hops``, the url is
|
1. If the number of hops from seed is greater than ``max_hops``, the url is
|
||||||
**out of scope**.
|
**out of scope**.
|
||||||
@@ -315,23 +315,23 @@ expression. For example::
|
|||||||
|
|
||||||
means block the url IF::
|
means block the url IF::
|
||||||
|
|
||||||
(domain: youngscholars.unimelb.edu.au AND substring: wp-login.php?action=logout) OR domain: malware.us
|
("domain: youngscholars.unimelb.edu.au" AND "substring: wp-login.php?action=logout") OR "domain: malware.us"
|
||||||
|
|
||||||
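The boolean reading above can be sketched in Python. This is a minimal illustration, not brozzler's implementation: the helper names ``condition_matches``, ``rule_matches`` and ``is_blocked`` are hypothetical, only the ``domain`` and ``substring`` conditions are handled, and the url is used as given rather than canonicalized.

```python
from urllib.parse import urlsplit


def condition_matches(name, value, url):
    """Check one condition of a scope rule against a url (hypothetical helper)."""
    if name == "domain":
        host = urlsplit(url).hostname or ""
        # Matches the domain itself or any subdomain of it.
        return host == value or host.endswith("." + value)
    if name == "substring":
        return value in url
    raise ValueError("condition not covered by this sketch: %r" % name)


def rule_matches(rule, url):
    """A rule matches only if ALL of its conditions match (AND)."""
    return all(condition_matches(name, value, url) for name, value in rule.items())


def is_blocked(blocks, url):
    """The url is blocked if ANY rule in the list matches (OR)."""
    return any(rule_matches(rule, url) for rule in blocks)


# The two rules from the example, written as plain dicts.
blocks = [
    {
        "domain": "youngscholars.unimelb.edu.au",
        "substring": "wp-login.php?action=logout",
    },
    {"domain": "malware.us"},
]
```

With these rules, a logout url on ``youngscholars.unimelb.edu.au`` and any url on ``malware.us`` or its subdomains are blocked, while other pages on ``youngscholars.unimelb.edu.au`` are not.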
Automatic scoping based on seed urls
|
Automatic scoping based on seed urls
|
||||||
------------------------------------
|
------------------------------------
|
||||||
Brozzler usually generates an ``accept`` scope rule based on the seed url. It
|
Brozzler usually generates an ``accept`` scope rule based on the seed url. It
|
||||||
does this to fulfill a crawl operator's expectation that everything "under" the
|
does this to fulfill the usual expectation that everything "under" the seed
|
||||||
seed will be crawled.
|
will be crawled.
|
||||||
|
|
||||||
To generate the rule, brozzler canonicalizes the seed url using the `urlcanon
|
To generate the rule, brozzler canonicalizes the seed url using the `urlcanon
|
||||||
<https://github.com/iipc/urlcanon>`_ library's "semantic" canonicalizer, then
|
<https://github.com/iipc/urlcanon>`_ library's "semantic" canonicalizer, then
|
||||||
removing the query string if any, and finally serializing the result in SURT
|
removes the query string if any, and finally serializes the result in SSURT
|
||||||
form. For example, a seed url of
|
[1]_ form. For example, a seed url of
|
||||||
``https://www.EXAMPLE.com:443/foo//bar?a=b&c=d#fdiap`` becomes
|
``https://www.EXAMPLE.com:443/foo//bar?a=b&c=d#fdiap`` becomes
|
||||||
``com,example,www,//https:/foo/bar?a=b&c=d``.
|
``com,example,www,//https:/foo/bar?a=b&c=d``.
|
||||||
|
|
||||||
If the url in the browser location bar at the end of brozzling the seed page
|
If the url in the browser location bar at the end of brozzling the seed page
|
||||||
differs from the seed url, brozzler automatically adds a second "accept" rule
|
differs from the seed url, brozzler automatically adds a second ``accept`` rule
|
||||||
to ensure the site is in scope, as if the new url were the original seed url.
|
to ensure the site is in scope, as if the new url were the original seed url.
|
||||||
It does this so that, for example, if ``http://example.com/`` redirects to
|
It does this so that, for example, if ``http://example.com/`` redirects to
|
||||||
``http://www.example.com/``, the rest of the ``www.example.com`` is in scope.
|
``http://www.example.com/``, the rest of the ``www.example.com`` is in scope.
|
||||||
@@ -343,7 +343,7 @@ differs in a few respects.
|
|||||||
slash.
|
slash.
|
||||||
2. Canonicalization does not attempt to match heritrix exactly, though it
|
2. Canonicalization does not attempt to match heritrix exactly, though it
|
||||||
usually does match.
|
usually does match.
|
||||||
3. When generating a SURT for an https url, heritrix changes the scheme to
|
3. When generating a surt for an https url, heritrix changes the scheme to
|
||||||
http. For example, the heritrix surt for ``https://www.example.com/`` is
|
http. For example, the heritrix surt for ``https://www.example.com/`` is
|
||||||
``http://(com,example,www,)`` and this means that all of
|
``http://(com,example,www,)`` and this means that all of
|
||||||
``http://www.example.com/*`` and ``https://www.example.com/*`` are in
|
``http://www.example.com/*`` and ``https://www.example.com/*`` are in
|
||||||
@@ -351,10 +351,10 @@ differs in a few respects.
|
|||||||
not match anything. Brozzler does no scheme munging.
|
not match anything. Brozzler does no scheme munging.
|
||||||
4. Brozzler identifies seed "redirects" by retrieving the url from the
|
4. Brozzler identifies seed "redirects" by retrieving the url from the
|
||||||
browser's location bar at the end of brozzling the seed page, whereas
|
browser's location bar at the end of brozzling the seed page, whereas
|
||||||
heritrix follows http redirects.
|
heritrix follows http 3xx redirects.
|
||||||
5. Brozzler uses ssurt instead of surt.
|
5. Brozzler uses ssurt instead of surt.
|
||||||
6. There is currently no brozzler option to disable the automatic ``accept``
|
6. There is currently no brozzler option to disable the automatically generated
|
||||||
surt(s).
|
``accept`` rules.
|
||||||
|
|
||||||
Scope settings
|
Scope settings
|
||||||
--------------
|
--------------
|
||||||
@@ -366,7 +366,9 @@ Scope settings
|
|||||||
+======+==========+=========+
|
+======+==========+=========+
|
||||||
| list | no | *none* |
|
| list | no | *none* |
|
||||||
+------+----------+---------+
|
+------+----------+---------+
|
||||||
List of scope rules.
|
List of scope rules. If any of the rules match, and the url is within
|
||||||
|
``max_hops`` from seed, and none of the ``block`` rules apply, the url is in
|
||||||
|
scope.
|
||||||
|
|
||||||
``blocks``
|
``blocks``
|
||||||
~~~~~~~~~~~
|
~~~~~~~~~~~
|
||||||
@@ -375,7 +377,7 @@ List of scope rules.
|
|||||||
+======+==========+=========+
|
+======+==========+=========+
|
||||||
| list | no | *none* |
|
| list | no | *none* |
|
||||||
+------+----------+---------+
|
+------+----------+---------+
|
||||||
List of scope rules.
|
List of scope rules. If any of the rules match, the url is deemed out of scope.
|
||||||
|
|
||||||
``max_hops``
|
``max_hops``
|
||||||
~~~~~~~~~~~~
|
~~~~~~~~~~~~
|
||||||
@@ -384,6 +386,7 @@ List of scope rules.
|
|||||||
+========+==========+=========+
|
+========+==========+=========+
|
||||||
| number | no | *none* |
|
| number | no | *none* |
|
||||||
+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
|
Maximum number of hops from seed.
|
||||||
|
|
||||||
``max_hops_off``
|
``max_hops_off``
|
||||||
~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~
|
||||||
@@ -392,6 +395,8 @@ List of scope rules.
|
|||||||
+========+==========+=========+
|
+========+==========+=========+
|
||||||
| number | no | 0 |
|
| number | no | 0 |
|
||||||
+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
|
Expands the scope to include urls up to this many hops from the last page that
|
||||||
|
was in scope thanks to an ``accept`` rule.
|
||||||
|
|
||||||
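Taken together, the four scope settings described above suggest a decision along the following lines. This is a rough sketch under stated assumptions, not brozzler's code: the function ``in_scope`` and its inputs ``hops_from_seed``, ``hops_off``, ``accepted`` and ``blocked`` are hypothetical, and the order of the checks is inferred from the setting descriptions.

```python
def in_scope(hops_from_seed, hops_off, accepted, blocked,
             max_hops=None, max_hops_off=0):
    """Decide whether a url is in scope (illustrative sketch only).

    hops_from_seed: number of hops from the seed to this url
    hops_off: hops since the last page that was in scope via an accept rule
    accepted: True if any ``accepts`` rule matches the url
    blocked: True if any ``blocks`` rule matches the url
    """
    # No max_hops means no limit on distance from the seed.
    if max_hops is not None and hops_from_seed > max_hops:
        return False
    # A matching block rule puts the url out of scope.
    if blocked:
        return False
    # A matching accept rule puts the url in scope.
    if accepted:
        return True
    # Otherwise max_hops_off may still expand the scope to this url.
    return hops_off <= max_hops_off
```

For example, with the default ``max_hops_off`` of 0, a url that no ``accepts`` rule matches is one hop off at best and so stays out of scope; raising ``max_hops_off`` to 1 would bring it in.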
Scope rule conditions
|
Scope rule conditions
|
||||||
---------------------
|
---------------------
|
||||||
@@ -403,6 +408,8 @@ Scope rule conditions
|
|||||||
+========+==========+=========+
|
+========+==========+=========+
|
||||||
| string | no | *none* |
|
| string | no | *none* |
|
||||||
+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
|
Matches if the host part of the canonicalized url is ``domain`` or a
|
||||||
|
subdomain.
|
||||||
|
|
||||||
``substring``
|
``substring``
|
||||||
~~~~~~~~~~~~~
|
~~~~~~~~~~~~~
|
||||||
@@ -411,6 +418,7 @@ Scope rule conditions
|
|||||||
+========+==========+=========+
|
+========+==========+=========+
|
||||||
| string | no | *none* |
|
| string | no | *none* |
|
||||||
+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
|
Matches if ``substring`` is found anywhere in the canonicalized url.
|
||||||
|
|
||||||
``regex``
|
``regex``
|
||||||
~~~~~~~~~
|
~~~~~~~~~
|
||||||
@@ -419,6 +427,7 @@ Scope rule conditions
|
|||||||
+========+==========+=========+
|
+========+==========+=========+
|
||||||
| string | no | *none* |
|
| string | no | *none* |
|
||||||
+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
|
Matches if the full canonicalized url matches ``regex``.
|
||||||
|
|
||||||
``ssurt``
|
``ssurt``
|
||||||
~~~~~~~~~
|
~~~~~~~~~
|
||||||
@@ -427,6 +436,7 @@ Scope rule conditions
|
|||||||
+========+==========+=========+
|
+========+==========+=========+
|
||||||
| string | no | *none* |
|
| string | no | *none* |
|
||||||
+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
|
Matches if the canonicalized url in SSURT [1]_ form starts with ``ssurt``.
|
||||||
|
|
||||||
``surt``
|
``surt``
|
||||||
~~~~~~~~
|
~~~~~~~~
|
||||||
@@ -435,6 +445,7 @@ Scope rule conditions
|
|||||||
+========+==========+=========+
|
+========+==========+=========+
|
||||||
| string | no | *none* |
|
| string | no | *none* |
|
||||||
+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
|
Matches if the canonicalized url in SURT [2]_ form starts with ``surt``.
|
||||||
|
|
||||||
``parent_url_regex``
|
``parent_url_regex``
|
||||||
~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~
|
||||||
@@ -443,4 +454,8 @@ Scope rule conditions
|
|||||||
+========+==========+=========+
|
+========+==========+=========+
|
||||||
| string | no | *none* |
|
| string | no | *none* |
|
||||||
+--------+----------+---------+
|
+--------+----------+---------+
|
||||||
|
Matches if the full canonicalized parent url matches ``regex``. The parent url
|
||||||
|
is the url of the page in which the link was found.
|
||||||
|
|
||||||
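The regex and prefix conditions above could be approximated as follows. This is a hedged sketch only: the function names are hypothetical, the urls are assumed to be already canonicalized (and, for the prefix check, already serialized in SSURT or SURT form), and reading "the full url matches" as a whole-string match via ``re.fullmatch`` is an assumption about the intended semantics.

```python
import re


def regex_matches(pattern, url):
    """``regex``: the pattern must cover the whole url (assumed semantics)."""
    return re.fullmatch(pattern, url) is not None


def parent_url_regex_matches(pattern, parent_url):
    """``parent_url_regex``: same check, applied to the page the link was found on."""
    # Assumption: a url with no known parent page never matches.
    if parent_url is None:
        return False
    return regex_matches(pattern, parent_url)


def prefix_matches(prefix, serialized_url):
    """``ssurt`` / ``surt``: the serialized url must start with the given prefix."""
    return serialized_url.startswith(prefix)
```

For instance, ``prefix_matches("com,example,www,//https:/foo", "com,example,www,//https:/foo/bar")`` is true, which is how a seed-derived prefix can cover everything "under" the seed.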
|
.. [1] SSURT is described at https://github.com/iipc/urlcanon/blob/master/ssurt.rst
|
||||||
|
.. [2] SURT is described at http://crawler.archive.org/articles/user_manual/glossary.html
|
||||||
|