more README.rst updates

Gretchen Miller 2025-03-05 14:27:03 -08:00
parent d7a6d7ae66
commit 29a46a1925

@@ -1,6 +1,3 @@
-.. image:: https://api.travis-ci.org/internetarchive/brozzler.svg?branch=master
-:target: https://travis-ci.org/internetarchive/brozzler
 .. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b12/brozzler/dashboard/static/brozzler.svg
 :width: 60px
@@ -10,11 +7,12 @@
 Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
 or Chromium) to fetch pages and embedded URLs and to extract links. It employs
-`yt-dlp <https://github.com/yt-dlp/yt-dlp>`_ (formerly youtube-dl) to enhance media capture
-capabilities and `rethinkdb <https://github.com/rethinkdb/rethinkdb>`_ to
-manage crawl state.
+`yt-dlp <https://github.com/yt-dlp/yt-dlp>`_ (formerly youtube-dl) to enhance
+media capture capabilities and `rethinkdb
+<https://github.com/rethinkdb/rethinkdb>`_ to manage crawl state.
-Brozzler is designed to work in conjunction with warcprox for web archiving.
+Brozzler is designed to work in conjunction with `warcprox
+<https://github.com/internetarchive/warcprox>`_ for web archiving.
 Requirements
 ------------
@@ -34,8 +32,10 @@ so Xvnc4 is preferred at this time.)
 Getting Started
 ---------------
-The simplest way to get started with brozzler is to use the ``brozzle-page``
-command-line tool to pass in a single URL to crawl.
+The simplest way to get started with Brozzler is to use the ``brozzle-page``
+command-line utility to pass in a single URL to crawl. You can also add a new
+job defined with a YAML file (see `job-const.rst`) and start a local Brozzler
+worker for a more complex crawl.
 Mac instructions:
@@ -49,23 +49,21 @@ Mac instructions:
 # optional: create a virtualenv
 python -m venv .venv
-# install brozzler
-pip install brozzler
+# install brozzler with rethinkdb extra
+pip install brozzler[rethinkdb]
-# queue a site to crawl
-brozzle-page http://example.com/
+# crawl a single site
+brozzle-page https://example.org
-# or a job
+# or enqueue a job and start brozzler-worker
 brozzler-new-job job1.yml
-# start brozzler-worker
 brozzler-worker
-At this point brozzler-easy will start archiving your site. Results will be
-immediately available for playback in pywb at http://localhost:8880/brozzler/.
+At this point Brozzler will start archiving your site.
-*Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
-does not take advantage of brozzler's distributed nature.*
+*Running Brozzler locally in this manner demonstrates the full Brozzler
+archival crawling workflow, but does not take advantage of Brozzler's
+distributed nature.*
 Installation and Usage
 ----------------------
@@ -84,7 +82,7 @@ Submit jobs::
 Submit sites not tied to a job::
-brozzler-new-site --time-limit=600 http://example.com/
+brozzler-new-site --time-limit=600 https://example.org/
 .. [*] A note about ``--warcprox-auto``: this option tells brozzler to
 look for a healthy warcprox instance in the `rethinkdb service registry
@@ -109,14 +107,14 @@ however everything else is optional. For details, see `<job-conf.rst>`_.
 warcprox_meta: null
 metadata: {}
 seeds:
-  - url: http://one.example.org/
-  - url: http://two.example.org/
+  - url: https://one.example.org/
+  - url: https://two.example.org/
     time_limit: 30
-  - url: http://three.example.org/
+  - url: https://three.example.org/
     time_limit: 10
     ignore_robots: true
     scope:
-      surt: http://(org,example,
+      surt: https://(org,example,
 Brozzler Dashboard
 ------------------
@@ -135,53 +133,12 @@ To start the app, run
 brozzler-dashboard
-At this point Brozzler Dashboard will be accessible at http://localhost:8000/.
+At this point Brozzler Dashboard will be accessible at https://localhost:8000/.
 .. image:: Brozzler-Dashboard.png
 See ``brozzler-dashboard --help`` for configuration options.
-Brozzler Wayback
-----------------
-Brozzler comes with a customized version of `pywb
-<https://github.com/ikreymer/pywb>`_, which supports using the rethinkdb
-"captures" table (populated by warcprox) as its index.
-To use, first install dependencies.
-::
-pip install brozzler[easy]
-Write a configuration file pywb.yml.
-::
-# 'archive_paths' should point to the output directory of warcprox
-archive_paths: warcs/ # pywb will fail without a trailing slash
-collections:
-  brozzler:
-    index_paths: !!python/object:brozzler.pywb.RethinkCDXSource
-      db: brozzler
-      table: captures
-      servers:
-      - localhost
-enable_auto_colls: false
-enable_cdx_api: true
-framed_replay: true
-port: 8880
-Run pywb like so:
-::
-$ PYWB_CONFIG_FILE=pywb.yml brozzler-wayback
-Then browse http://localhost:8880/brozzler/.
-.. image:: Brozzler-Wayback.png
 Headless Chrome (experimental)
 ------------------------------