From 98ce67ef363c99b5e990d435518b477dc0fe0ca4 Mon Sep 17 00:00:00 2001
From: Noah Levitt
Date: Tue, 20 Mar 2018 17:31:42 -0700
Subject: [PATCH] WIP some words on scoping

---
 job-conf.rst | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/job-conf.rst b/job-conf.rst
index 756c232..4a0dbf5 100644
--- a/job-conf.rst
+++ b/job-conf.rst
@@ -80,8 +80,8 @@ Notice that:
 - Since ``buckets`` is a list, the merged result includes all the values from
   both the top level and the seed level.
 
-Settings reference
-==================
+Settings
+========
 
 Top-level settings
 ------------------
@@ -260,6 +260,7 @@ The scope of a seed determines which links are scheduled for crawling and which
 are not. Example::
 
     scope:
+      surt: https://(com,example,)/
       accepts:
       - parent_url_regex: ^https?://(www\.)?youtube.com/(user|channel)/.*$
         regex: ^https?://(www\.)?youtube.com/watch\?.*$
@@ -272,6 +273,12 @@ are not. Example::
       max_hops: 20
       max_hops_off_surt: 0
 
+Toward the end of the process of brozzling a page, brozzler obtains a list of
+navigational links (``<a href="...">`` and similar) on the page, and evaluates
+each link to determine whether it is in scope or out of scope for the crawl.
+Then, newly discovered links that are in scope are scheduled to be crawled, and
+previously discovered links get a priority bump.
+
 Scope settings
 --------------
 
@@ -282,6 +289,47 @@ Scope settings
 +========+==========+===========================+
 | string | no       | *generated from seed url* |
 +--------+----------+---------------------------+
+This setting can be thought of as the fundamental scope setting for the seed.
+Every seed has a ``scope.surt``. Brozzler will generate it from the seed url if
+it is not specified explicitly.
+
+SURT is defined at
+http://crawler.archive.org/articles/user_manual/glossary.html#surt.
+
+    SURT stands for Sort-friendly URI Reordering Transform, and is a
+    transformation applied to URIs which makes their left-to-right
+    representation better match the natural hierarchy of domain names.
+
+Brozzler generates ``surt`` if not specified by canonicalizing the seed url
+using the `urlcanon <https://github.com/iipc/urlcanon>`_ library's "semantic"
+canonicalizer, then removing the query string if any, and finally serializing
+the result in SURT form. For example, a seed url of
+``https://www.EXAMPLE.com:443/foo//bar?a=b&c=d#fdiap`` becomes
+``https://(com,example,www,)/foo/bar``.
+
+If the url in the browser location bar at the end of brozzling the seed page
+differs from the seed url, brozzler automatically adds an "accept" rule to
+ensure the site is in scope, as if the new url were the original seed url.
+It does this so that, for example, if ``http://example.com/`` redirects to
+``http://www.example.com/``, the rest of the ``www.example.com`` will also be
+in scope.
+
+Brozzler derives its general approach to the seed surt from Heritrix, but
+differs in a few respects.
+
+1. Unlike heritrix, brozzler does not strip the path segment after the last
+   slash.
+2. Canonicalization does not attempt to match heritrix exactly, though it
+   usually will match.
+3. When generating a SURT for an https url, heritrix changes the scheme to
+   http. For example, the heritrix surt for ``https://www.example.com/`` is
+   ``http://(com,example,www,)`` and this means that all of
+   ``http://www.example.com/*`` and ``https://www.example.com/*`` will be in
+   scope. It also means that a manually specified surt with scheme https will
+   not match anything. Brozzler does no scheme munging.
+4. Brozzler identifies seed "redirects" by retrieving the url from the
+   browser's location bar at the end of brozzling the seed page, whereas
+   heritrix follows http redirects.
 
 ``accepts``
 ~~~~~~~~~~~
@@ -290,6 +338,7 @@ Scope settings
 +======+==========+=========+
 | list | no       | *none*  |
 +------+----------+---------+
+List of scope rules.
 
 ``blocks``
 ~~~~~~~~~~~
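
Note on the patch: the ``surt`` generation steps described in the added text (canonicalize, drop the query string, serialize in SURT form) can be illustrated with a rough, standard-library-only sketch. This is not brozzler's implementation and does not use urlcanon; the function name ``sketch_seed_surt`` and the ``DEFAULT_PORTS`` table are hypothetical, and the canonicalization here covers only what the documented example needs (lowercasing, default-port removal, collapsing repeated slashes).

```python
import re
from urllib.parse import urlsplit

# Hypothetical helper table: ports that are dropped because they are the
# scheme's default.
DEFAULT_PORTS = {"http": 80, "https": 443}


def sketch_seed_surt(seed_url):
    """Approximate the seed SURT described in the patch text.

    Simplified illustration only; real canonicalization (urlcanon's
    "semantic" canonicalizer) handles many more cases.
    """
    parts = urlsplit(seed_url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Reverse the host labels and end with a comma: www.example.com
    # becomes com,example,www,
    surt_host = ",".join(reversed(host.split("."))) + ","

    # Keep a port only when it is not the default for the scheme.
    # Placing it after the trailing comma is an assumption of this sketch.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        surt_host += ":%d" % parts.port

    # Collapse repeated slashes; the query string and fragment are simply
    # never copied into the result.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"

    return "%s://(%s)%s" % (scheme, surt_host, path)


print(sketch_seed_surt("https://www.EXAMPLE.com:443/foo//bar?a=b&c=d#fdiap"))
```

For the seed url used as the example in the patch, this sketch yields ``https://(com,example,www,)/foo/bar``, and the scheme is left untouched, in line with the "no scheme munging" remark.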