mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-08-24 13:59:50 -04:00
documentation tweak
parent aa2d491009
commit 5fdb2dd39c
1 changed files with 6 additions and 6 deletions
12
job-conf.rst
@@ -339,12 +339,12 @@ Brozzler derives its general approach to the seed surt from `heritrix
 slash.
 2. Canonicalization does not attempt to match heritrix exactly, though it
 usually does match.
-3. When generating a SURT for an HTTPS URL, heritrix changes the scheme to
-HTTP. For example, the heritrix SURT for ``https://www.example.com/`` is
-``http://(com,example,www,)`` and this means that all of
-``http://www.example.com/*`` and ``https://www.example.com/*`` are in
-scope. It also means that a manually specified SURT with scheme "https" does
-not match anything. Brozzler does no scheme munging.
+3. Brozzler does no scheme munging. (When generating a SURT for an HTTPS URL,
+heritrix changes the scheme to HTTP. For example, the heritrix SURT for
+``https://www.example.com/`` is ``http://(com,example,www,)`` and this means
+that all of ``http://www.example.com/*`` and ``https://www.example.com/*``
+are in scope. It also means that a manually specified SURT with scheme
+"https" does not match anything.)
 4. Brozzler identifies seed "redirects" by retrieving the URL from the
 browser's location bar at the end of brozzling the seed page, whereas
 heritrix follows HTTP 3XX redirects. If the URL in the browser
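The scheme behaviour described in item 3 can be illustrated with a rough sketch. This is not brozzler's or heritrix's actual code: `simple_surt` and `in_scope` are hypothetical helpers written for this note, they ignore the path, port and canonicalization entirely, and they only reproduce the host-reversed prefix form quoted in the text.

```python
from urllib.parse import urlsplit


def simple_surt(url, munge_https=False):
    """Hypothetical, simplified SURT prefix: scheme plus reversed host.

    With munge_https=True the "https" scheme is rewritten to "http",
    imitating the heritrix behaviour described in item 3. With the
    default (False) the scheme is left alone, as the text says
    brozzler does.
    """
    parts = urlsplit(url)
    scheme = parts.scheme
    if munge_https and scheme == "https":
        scheme = "http"
    host = parts.hostname or ""
    # "www.example.com" -> "com,example,www,"
    reversed_host = ",".join(reversed(host.split("."))) + ","
    return f"{scheme}://({reversed_host})"


def in_scope(url, scope_surt, munge_https=False):
    """Hypothetical scope test: does the URL's SURT start with scope_surt?"""
    return simple_surt(url, munge_https).startswith(scope_surt)


# Heritrix-style munging: both schemes collapse onto the "http" SURT.
munged = simple_surt("https://www.example.com/", munge_https=True)
# No munging: the scheme of the URL is preserved.
unmunged = simple_surt("https://www.example.com/")
```

Under these assumptions, a manually specified scope of `https://(com,example,www,)` matches nothing when munging is on, because every candidate URL is rewritten to `http` before comparison. Without munging, the same scope matches `https` URLs only.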