mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-09-22 13:44:51 -04:00
update docs to match new seed ssurt behavior
parent
fc05cac338
commit
2cf474aa1d
1 changed file with 32 additions and 33 deletions
65
job-conf.rst
|
@ -260,12 +260,12 @@ The scope of a seed determines which links are scheduled for crawling and which
|
||||||
are not. Example::
|
are not. Example::
|
||||||
|
|
||||||
scope:
|
scope:
|
||||||
surt: https://(com,example,)/
|
|
||||||
accepts:
|
accepts:
|
||||||
- parent_url_regex: ^https?://(www\.)?youtube.com/(user|channel)/.*$
|
- parent_url_regex: ^https?://(www\.)?youtube.com/(user|channel)/.*$
|
||||||
regex: ^https?://(www\.)?youtube.com/watch\?.*$
|
regex: ^https?://(www\.)?youtube.com/watch\?.*$
|
||||||
- surt: +http://(com,google,video,
|
- ssurt: com,example,//https:/
|
||||||
- surt: +http://(com,googlevideo,
|
- surt: http://(com,google,video,
|
||||||
|
- surt: http://(com,googlevideo,
|
||||||
blocks:
|
blocks:
|
||||||
- domain: youngscholars.unimelb.edu.au
|
- domain: youngscholars.unimelb.edu.au
|
||||||
substring: wp-login.php?action=logout
|
substring: wp-login.php?action=logout
|
||||||
|
@ -279,40 +279,33 @@ each link to determine whether it is in scope or out of scope for the crawl.
|
||||||
Then, newly discovered links that are in scope are scheduled to be crawled, and
|
Then, newly discovered links that are in scope are scheduled to be crawled, and
|
||||||
previously discovered links get a priority bump.
|
previously discovered links get a priority bump.
|
||||||
|
|
||||||
Scope settings
|
How scope rules are applied
|
||||||
--------------
|
---------------------------
|
||||||
|
1. If any ``block`` rule matches, the url is out of scope.
|
||||||
|
2. Otherwise, if any ``accept`` rule matches, the url is in scope.
|
||||||
|
3. Otherwise (no rules match), the url is out of scope.
|
||||||
|
|
||||||
``surt``
|
In other words, by default urls are not in scope, and ``block`` rules take
|
||||||
~~~~~~~~
|
precedence over ``accept`` rules.
|
||||||
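The three-step rule order added in this hunk can be sketched in a few lines of Python. This is only an illustration of the precedence, not brozzler's implementation: the function name ``is_in_scope`` is hypothetical, and each rule is modeled as a plain predicate on the url string, whereas real rules combine conditions such as ``domain``, ``substring``, ``regex`` and ``parent_url_regex``.

```python
from typing import Callable, Iterable

# Assumption: a "rule" is reduced to a predicate on the url string.
Rule = Callable[[str], bool]


def is_in_scope(url: str, accepts: Iterable[Rule], blocks: Iterable[Rule]) -> bool:
    """Hypothetical helper mirroring the documented precedence."""
    # 1. If any block rule matches, the url is out of scope.
    if any(rule(url) for rule in blocks):
        return False
    # 2. Otherwise, if any accept rule matches, the url is in scope.
    if any(rule(url) for rule in accepts):
        return True
    # 3. Otherwise (no rules match), the url is out of scope.
    return False


# Illustrative rules only; they stand in for entries under scope.accepts / scope.blocks.
accepts = [lambda u: u.startswith("https://example.com/")]
blocks = [lambda u: "wp-login.php?action=logout" in u]

print(is_in_scope("https://example.com/page", accepts, blocks))
print(is_in_scope("https://example.com/wp-login.php?action=logout", accepts, blocks))
print(is_in_scope("https://other.example.org/", accepts, blocks))
```

Checking ``blocks`` first is what makes a block rule win even when an accept rule also matches the same url.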
+--------+----------+---------------------------+
|
|
||||||
| type | required | default |
|
|
||||||
+========+==========+===========================+
|
|
||||||
| string | no | *generated from seed url* |
|
|
||||||
+--------+----------+---------------------------+
|
|
||||||
This setting can be thought of as the fundamental scope setting for the seed.
|
|
||||||
Every seed has a ``scope.surt``. Brozzler will generate it from the seed url if
|
|
||||||
it is not specified explicitly.
|
|
||||||
|
|
||||||
SURT is defined at
|
Automatic scoping based on seed urls
|
||||||
http://crawler.archive.org/articles/user_manual/glossary.html#surt.
|
------------------------------------
|
||||||
|
Brozzler usually generates an ``accept`` scope rule based on the seed url. It
|
||||||
|
does this to fulfill a crawl operator's expectation that everything "under" the
|
||||||
|
seed will be crawled.
|
||||||
|
|
||||||
SURT stands for Sort-friendly URI Reordering Transform, and is a
|
To generate the rule, brozzler canonicalizes the seed url using the `urlcanon
|
||||||
transformation applied to URIs which makes their left-to-right
|
<https://github.com/iipc/urlcanon>`_ library's "semantic" canonicalizer, then
|
||||||
representation better match the natural hierarchy of domain names.
|
removing the query string if any, and finally serializing the result in SURT
|
||||||
|
form. For example, a seed url of
|
||||||
Brozzler generates ``surt`` if not specified by canonicalizing the seed url
|
|
||||||
using the `urlcanon <https://github.com/iipc/urlcanon>`_ library's "semantic"
|
|
||||||
canonicalizer, then removing the query string if any, and finally serializing
|
|
||||||
the result in SURT form. For example, a seed url of
|
|
||||||
``https://www.EXAMPLE.com:443/foo//bar?a=b&c=d#fdiap`` becomes
|
``https://www.EXAMPLE.com:443/foo//bar?a=b&c=d#fdiap`` becomes
|
||||||
``https://(com,example,www,)/foo/bar``.
|
``com,example,www,//https:/foo/bar?a=b&c=d``.
|
||||||
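As a rough illustration of the steps the prose describes (canonicalize, drop the query string, serialize host-first), here is a standard-library approximation. It does not use urlcanon and is not a substitute for its "semantic" canonicalizer: the function name ``seed_to_ssurt_sketch`` is hypothetical, only the normalizations visible in the example are covered (lowercased host, default port dropped, repeated slashes collapsed, fragment dropped), and non-default ports are rejected rather than guessed at.

```python
import re
from urllib.parse import urlsplit

# Assumption: only http and https seeds are handled by this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def seed_to_ssurt_sketch(seed_url: str) -> str:
    """Approximate the seed-to-ssurt steps using only the standard library."""
    parts = urlsplit(seed_url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # urlsplit already lowercases hostname
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        # Serializing a non-default port is deliberately left out of this sketch.
        raise ValueError("non-default ports are not handled by this sketch")
    # Collapse runs of slashes in the path; an empty path becomes "/".
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    # Reverse the host labels: www.example.com -> com,example,www,
    reversed_host = ",".join(reversed(host.split("."))) + ","
    # The query string and fragment are dropped, following the prose description.
    return f"{reversed_host}//{scheme}:{path}"


print(seed_to_ssurt_sketch("https://www.EXAMPLE.com:443/foo//bar?a=b&c=d#fdiap"))
print(seed_to_ssurt_sketch("https://example.com"))
```

Because the sketch removes the query string as the prose says, its result for the example seed carries no ``?a=b&c=d`` suffix. Putting the reversed host first means a simple prefix comparison against the generated value covers everything "under" the seed.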
|
|
||||||
If the url in the browser location bar at the end of brozzling the seed page
|
If the url in the browser location bar at the end of brozzling the seed page
|
||||||
differs from the seed url, brozzler automatically adds an "accept" rule to
|
differs from the seed url, brozzler automatically adds a second "accept" rule
|
||||||
ensure the site is in scope, as if the new url were the original seed url.
|
to ensure the site is in scope, as if the new url were the original seed url.
|
||||||
It does this so that, for example, if ``http://example.com/`` redirects to
|
It does this so that, for example, if ``http://example.com/`` redirects to
|
||||||
``http://www.example.com/``, the rest of the ``www.example.com`` will also be
|
``http://www.example.com/``, the rest of the ``www.example.com`` is in scope.
|
||||||
in scope.
|
|
||||||
|
|
||||||
Brozzler derives its general approach to the seed surt from Heritrix, but
|
Brozzler derives its general approach to the seed surt from Heritrix, but
|
||||||
differs in a few respects.
|
differs in a few respects.
|
||||||
|
@ -320,16 +313,22 @@ differs in a few respects.
|
||||||
1. Unlike heritrix, brozzler does not strip the path segment after the last
|
1. Unlike heritrix, brozzler does not strip the path segment after the last
|
||||||
slash.
|
slash.
|
||||||
2. Canonicalization does not attempt to match heritrix exactly, though it
|
2. Canonicalization does not attempt to match heritrix exactly, though it
|
||||||
usually will match.
|
usually does match.
|
||||||
3. When generating a SURT for an https url, heritrix changes the scheme to
|
3. When generating a SURT for an https url, heritrix changes the scheme to
|
||||||
http. For example, the heritrix surt for ``https://www.example.com/`` is
|
http. For example, the heritrix surt for ``https://www.example.com/`` is
|
||||||
``http://(com,example,www,)`` and this means that all of
|
``http://(com,example,www,)`` and this means that all of
|
||||||
``http://www.example.com/*`` and ``https://www.example.com/*`` will be in
|
``http://www.example.com/*`` and ``https://www.example.com/*`` are in
|
||||||
scope. It also means that a manually specified surt with scheme https will
|
scope. It also means that a manually specified surt with scheme "https" does
|
||||||
not match anything. Brozzler does no scheme munging.
|
not match anything. Brozzler does no scheme munging.
|
||||||
4. Brozzler identifies seed "redirects" by retrieving the url from the
|
4. Brozzler identifies seed "redirects" by retrieving the url from the
|
||||||
browser's location bar at the end of brozzling the seed page, whereas
|
browser's location bar at the end of brozzling the seed page, whereas
|
||||||
heritrix follows http redirects.
|
heritrix follows http redirects.
|
||||||
|
5. Brozzler uses ssurt instead of surt.
|
||||||
|
6. There is currently no brozzler option to disable the automatic ``accept``
|
||||||
|
surt(s).
|
||||||
|
|
||||||
|
Scope settings
|
||||||
|
--------------
|
||||||
|
|
||||||
``accepts``
|
``accepts``
|
||||||
~~~~~~~~~~~
|
~~~~~~~~~~~
|
||||||
|
|