mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-07-31 10:38:53 -04:00
explain brozzler use of warcprox_meta
parent aef4c40993
commit a00b5a7fd5
2 changed files with 265 additions and 5 deletions

job-conf.rst (52 lines changed)
@@ -224,11 +224,11 @@ contact the operator if the crawl is causing problems.
 +============+==========+===========+
 | dictionary | no       | ``false`` |
 +------------+----------+-----------+
-Specifies the Warcprox-Meta header to send with every request, if ``proxy`` is
-configured. The value of the Warcprox-Meta header is a json blob. It is used to
-pass settings and information to warcprox. Warcprox does not forward the header
-on to the remote site. See the warcprox docs for more information (XXX not yet
-written).
+Specifies the ``Warcprox-Meta`` header to send with every request, if ``proxy``
+is configured. The value of the ``Warcprox-Meta`` header is a json blob. It is
+used to pass settings and information to warcprox. Warcprox does not forward
+the header on to the remote site. For full documentation on ``warcprox-meta``
+see https://github.com/internetarchive/warcprox/blob/master/api.rst#warcprox-meta-http-request-header

 Brozzler takes the configured value of ``warcprox_meta``, converts it to
 json and populates the Warcprox-Meta header with that value. For example::
@@ -457,5 +457,47 @@ Matches if the canonicalized url in SURT [2]_ form starts with ``surt``.
Matches if the full canonicalized parent url matches ``regex``. The parent url
is the url of the page in which the link was found.

Using ``warcprox_meta``
=======================
``warcprox_meta`` deserves some more discussion. It plays a very important role
in brozzler job configuration. ``warcprox_meta`` is the way you set the
filenames of the warcs for your crawl. For example, if each seed should have a
different warc name prefix, you might have a job configured this way::

    seeds:
    - url: https://example.com/
      warcprox_meta:
        warc-prefix: seed1
    - url: https://archive.org/
      warcprox_meta:
        warc-prefix: seed2
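
As a rough illustration of the configuration above, the sketch below serializes
each seed's ``warcprox_meta`` to JSON and places it in a ``Warcprox-Meta``
request header. The ``seeds`` list and the ``warcprox_meta_header`` helper are
hypothetical names used for illustration only; this is not brozzler's actual
code.

```python
import json

# Hypothetical in-memory form of the job configuration above.
seeds = [
    {"url": "https://example.com/", "warcprox_meta": {"warc-prefix": "seed1"}},
    {"url": "https://archive.org/", "warcprox_meta": {"warc-prefix": "seed2"}},
]


def warcprox_meta_header(seed):
    """Return request headers carrying the seed's warcprox_meta as JSON."""
    meta = seed.get("warcprox_meta")
    if not meta:
        # Nothing configured: send no Warcprox-Meta header at all.
        return {}
    return {"Warcprox-Meta": json.dumps(meta, separators=(",", ":"))}


for seed in seeds:
    print(seed["url"], warcprox_meta_header(seed))
```

Each seed ends up with its own header value, which is how two seeds in the same
job can be written to warcs with different prefixes.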

``warcprox_meta`` is also the way to put limits on the size of the crawl job.
For example, this configuration will stop the crawl after about 100 MB of novel
content has been crawled::

    seeds:
    - url: https://example.com/
    - url: https://archive.org/
    warcprox_meta:
      stats:
        buckets:
        - my-job
      limits:
        my-job/new/wire_bytes: 100000000
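
To make the limit key easier to read, here is a minimal sketch that splits a
key such as ``my-job/new/wire_bytes`` into a bucket, a tally and a metric, and
compares the limit with running statistics. The three-part interpretation, the
shape of ``stats`` and the ``limit_reached`` helper are assumptions made for
illustration; the warcprox API documentation is the authority on the actual
semantics.

```python
def limit_reached(limits, stats):
    """Return the first limit key whose value has been met, else None.

    Assumes each key has the form "<bucket>/<tally>/<metric>" and that
    `stats` is nested as stats[bucket][tally][metric] (illustrative only).
    """
    for key, limit in limits.items():
        bucket, tally, metric = key.rsplit("/", 2)
        current = stats.get(bucket, {}).get(tally, {}).get(metric, 0)
        if current >= limit:
            return key
    return None


limits = {"my-job/new/wire_bytes": 100000000}
# Hypothetical statistics for the "my-job" bucket.
stats = {"my-job": {"new": {"wire_bytes": 100000512, "urls": 4321}}}
print(limit_reached(limits, stats))
```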

To prevent any urls from a host from being captured, it's not sufficient to use
a ``scope`` rule as described above. That kind of scoping only applies to
navigational links discovered in crawled pages. To make absolutely sure no url
from a given host is fetched, not even (say) an image embedded in a page, use
``warcprox_meta`` like so::

    warcprox_meta:
      blocks:
      - domain: spammy.com
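
The effect of a ``domain`` block can be pictured with the following sketch,
which treats a host as blocked when it equals the domain or is a subdomain of
it. This matching rule and the ``is_blocked`` helper are assumptions for
illustration; warcprox's own matching is described in its API documentation.

```python
from urllib.parse import urlsplit


def is_blocked(url, blocks):
    """Return True if the url's host matches any {"domain": ...} block."""
    host = (urlsplit(url).hostname or "").lower()
    for rule in blocks:
        domain = rule.get("domain", "").lower()
        # Assumed rule: the domain itself and all of its subdomains match.
        if domain and (host == domain or host.endswith("." + domain)):
            return True
    return False


blocks = [{"domain": "spammy.com"}]
print(is_blocked("http://img.spammy.com/banner.png", blocks))
print(is_blocked("https://notspammy.com/", blocks))
```

Because the check is applied to every request, it also covers embedded
resources such as images, which a ``scope`` rule on navigational links would
not catch.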

For complete documentation on the ``warcprox-meta`` request header, see
https://github.com/internetarchive/warcprox/blob/master/api.rst#warcprox-meta-http-request-header

.. [1] SSURT is described at https://github.com/iipc/urlcanon/blob/master/ssurt.rst
.. [2] SURT is described at http://crawler.archive.org/articles/user_manual/glossary.html

readme.rst (Normal file, 218 lines added)
@@ -0,0 +1,218 @@
.. image:: https://travis-ci.org/internetarchive/brozzler.svg?branch=master
    :target: https://travis-ci.org/internetarchive/brozzler

.. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b5/brozzler/webconsole/static/brozzler.svg
    :width: 60px

|logo| brozzler
===============
"browser" \| "crawler" = "brozzler"

Brozzler is a distributed web crawler (爬虫) that uses a real browser (chrome
or chromium) to fetch pages and embedded urls and to extract links. It also
uses `youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media
capture capabilities.

Brozzler is designed to work in conjunction with
`warcprox <https://github.com/internetarchive/warcprox>`_ for web
archiving.

Requirements
------------

- Python 3.4 or later
- RethinkDB deployment
- Chromium or Google Chrome >= version 64

Worth noting is that the browser requires a graphical environment to run. You
already have this on your laptop, but on a server it will probably require
deploying some additional infrastructure (typically X11; note that Xvfb does
not support screenshots; Xvnc4, from package vnc4server, does). The vagrant
configuration in the brozzler repository (still a work in progress) has an
example setup.

Getting Started
---------------

The easiest way to get started with brozzler for web archiving is with
``brozzler-easy``. Brozzler-easy runs brozzler-worker, warcprox,
`pywb <https://github.com/ikreymer/pywb>`_, and brozzler-dashboard, configured
to work with each other, in a single process.

Mac instructions:

::

    # install and start rethinkdb
    brew install rethinkdb
    # no brew? try rethinkdb's installer: https://www.rethinkdb.com/docs/install/osx/
    rethinkdb &>>rethinkdb.log &

    # install brozzler with special dependencies pywb and warcprox
    # (quoted so that shells such as zsh do not treat the brackets as a glob)
    pip install 'brozzler[easy]'  # in a virtualenv if desired

    # queue a site to crawl
    brozzler-new-site http://example.com/

    # or a job
    brozzler-new-job job1.yml

    # start brozzler-easy
    brozzler-easy

At this point brozzler-easy will start brozzling your site. Results will be
immediately available for playback in pywb at http://localhost:8880/brozzler/.

*Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
does not take advantage of brozzler's distributed nature.*

Installation and Usage
----------------------

To install brozzler only::

    pip install brozzler  # in a virtualenv if desired

Launch one or more workers::

    brozzler-worker --warcprox-auto

Submit jobs::

    brozzler-new-job myjob.yaml

Submit sites not tied to a job::

    brozzler-new-site --time-limit=600 http://example.com/

Job Configuration
-----------------

Jobs are defined using yaml files. Options may be specified either at the
top level or on individual seeds. At least one seed url must be specified;
everything else is optional. For details, see `<job-conf.rst>`_.

::

    id: myjob
    time_limit: 60 # seconds
    proxy: 127.0.0.1:8000 # point at warcprox for archiving
    ignore_robots: false
    warcprox_meta: null
    metadata: {}
    seeds:
      - url: http://one.example.org/
      - url: http://two.example.org/
        time_limit: 30
      - url: http://three.example.org/
        time_limit: 10
        ignore_robots: true
        scope:
          surt: http://(org,example,
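
One way to read this example is that each seed starts from the top-level
options and then applies its own. The sketch below computes such effective
settings under the assumption that a seed-level value simply replaces the
top-level one. The ``effective_settings`` helper is hypothetical, and
``job-conf.rst`` remains the reference for how brozzler actually combines
settings.

```python
# Hypothetical in-memory form of part of the job configuration above.
job = {
    "id": "myjob",
    "time_limit": 60,
    "ignore_robots": False,
    "seeds": [
        {"url": "http://one.example.org/"},
        {"url": "http://two.example.org/", "time_limit": 30},
        {"url": "http://three.example.org/", "time_limit": 10,
         "ignore_robots": True},
    ],
}


def effective_settings(job, seed):
    """Top-level options, overridden by anything set on the seed (assumed)."""
    defaults = {k: v for k, v in job.items() if k not in ("id", "seeds")}
    return {**defaults, **seed}


for seed in job["seeds"]:
    print(effective_settings(job, seed))
```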

Brozzler Dashboard
------------------

Brozzler comes with a rudimentary web application for viewing crawl job status.
To install brozzler with the dependencies required to run this app, run

::

    pip install 'brozzler[dashboard]'


To start the app, run

::

    brozzler-dashboard

See ``brozzler-dashboard --help`` for configuration options.

Brozzler Wayback
----------------

Brozzler comes with a customized version of
`pywb <https://github.com/ikreymer/pywb>`_ which supports using the rethinkdb
"captures" table (populated by warcprox) as its index.

To use, first install dependencies.

::

    pip install 'brozzler[easy]'

Write a configuration file pywb.yml.

::

    # 'archive_paths' should point to the output directory of warcprox
    archive_paths: warcs/  # pywb will fail without a trailing slash
    collections:
      brozzler:
        index_paths: !!python/object:brozzler.pywb.RethinkCDXSource
          db: brozzler
          table: captures
          servers:
          - localhost
    enable_auto_colls: false
    enable_cdx_api: true
    framed_replay: true
    port: 8880

Run pywb like so:

::

    $ PYWB_CONFIG_FILE=pywb.yml brozzler-wayback

Then browse http://localhost:8880/brozzler/.


Headless Chrome (experimental)
--------------------------------

`Headless Chromium <https://chromium.googlesource.com/chromium/src/+/master/headless/README.md>`_
is now available in stable Chrome releases for 64-bit Linux and may be
used to run the browser without a visible window or X11 at all.

To try this out, create a wrapper script like ~/bin/chrome-headless.sh:

::

    #!/bin/bash
    exec /opt/google/chrome/chrome --headless --disable-gpu "$@"

Run brozzler passing the path to the wrapper script as the ``--chrome-exe``
option:

::

    chmod +x ~/bin/chrome-headless.sh
    brozzler-worker --chrome-exe ~/bin/chrome-headless.sh

Beware: Chrome's headless mode is still very new and has a number of
`unresolved issues. <https://bugs.chromium.org/p/chromium/issues/list?can=2&q=Proj%3DHeadless>`_
You may experience hangs or crashes with some types of content. Brozzler
has not had much testing with it. For the moment we recommend using
Chrome's regular mode instead.

License
-------

Copyright 2015-2018 Internet Archive

Licensed under the Apache License, Version 2.0 (the "License"); you may
not use this software except in compliance with the License. You may
obtain a copy of the License at

::

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.