Update README.rst to remove brozzler-easy and Wayback sections + other cleanup (#336)

* update instructions for brozzler-easy + add pywb extras

* revert pywb extra + updated README

* ruffing up

* more README.rst updates

* revert https change for local URL scheme
Gretchen Leigh Miller 2025-03-05 14:33:47 -08:00 committed by GitHub
parent 05b72906bd
commit 5350c202dc
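
Reviewer note: the updated Getting Started text in this commit points readers at a YAML job file passed to ``brozzler-new-job``. As a rough illustration only, the sketch below writes out a minimal job file using just the ``seeds``, ``url`` and ``time_limit`` keys that appear in the README example in this diff. ``render_job`` is a hypothetical helper, not part of Brozzler, and the two-space nesting is an assumption based on ordinary YAML layout rather than anything confirmed by the diff.

```python
def render_job(seeds):
    """Return YAML text for a job file containing only a ``seeds`` list.

    Hypothetical helper for illustration; not part of Brozzler.
    Each seed is a dict with a required "url" and an optional "time_limit".
    """
    lines = ["seeds:"]
    for seed in seeds:
        # Each seed becomes one item in a YAML block sequence.
        lines.append(f"  - url: {seed['url']}")
        if "time_limit" in seed:
            # Per-seed time limit, nested under the same list item.
            lines.append(f"    time_limit: {seed['time_limit']}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    job_yaml = render_job([
        {"url": "https://one.example.org/"},
        {"url": "https://two.example.org/", "time_limit": 30},
    ])
    print(job_yaml, end="")
```

The printed text could be saved as ``job1.yml`` and submitted with ``brozzler-new-job job1.yml``, as in the updated instructions. It is not validated against Brozzler's job schema, so check ``job-conf.rst`` before relying on it.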


@@ -1,6 +1,3 @@
-.. image:: https://api.travis-ci.org/internetarchive/brozzler.svg?branch=master
-:target: https://travis-ci.org/internetarchive/brozzler
 .. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b12/brozzler/dashboard/static/brozzler.svg
 :width: 60px
@@ -10,11 +7,12 @@
 Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
 or Chromium) to fetch pages and embedded URLs and to extract links. It employs
-`yt-dlp <https://github.com/yt-dlp/yt-dlp>`_ (formerly youtube-dl) to enhance media capture
-capabilities and `rethinkdb <https://github.com/rethinkdb/rethinkdb>`_ to
-manage crawl state.
+`yt-dlp <https://github.com/yt-dlp/yt-dlp>`_ (formerly youtube-dl) to enhance
+media capture capabilities and `rethinkdb
+<https://github.com/rethinkdb/rethinkdb>`_ to manage crawl state.
-Brozzler is designed to work in conjunction with warcprox for web archiving.
+Brozzler is designed to work in conjunction with `warcprox
+<https://github.com/internetarchive/warcprox>`_ for web archiving.
 Requirements
 ------------
@@ -34,10 +32,10 @@ so Xvnc4 is preferred at this time.)
 Getting Started
 ---------------
-The easiest way to get started with brozzler for web archiving is with
-``brozzler-easy``. Brozzler-easy runs brozzler-worker, warcprox, brozzler
-wayback, and brozzler-dashboard, configured to work with each other in a single
-process.
+The simplest way to get started with Brozzler is to use the ``brozzle-page``
+command-line utility to pass in a single URL to crawl. You can also add a new
+job defined with a YAML file (see `job-const.rst`) and start a local Brozzler
+worker for a more complex crawl.
 Mac instructions:
@@ -48,23 +46,24 @@ Mac instructions:
 # no brew? try rethinkdb's installer: https://www.rethinkdb.com/docs/install/osx/
 rethinkdb &>>rethinkdb.log &
-# install brozzler with special dependencies pywb and warcprox
-pip install brozzler[easy] # in a virtualenv if desired
+# optional: create a virtualenv
+python -m venv .venv
-# queue a site to crawl
-brozzler-new-site http://example.com/
+# install brozzler with rethinkdb extra
+pip install brozzler[rethinkdb]
-# or a job
+# crawl a single site
+brozzle-page https://example.org
+# or enqueue a job and start brozzler-worker
 brozzler-new-job job1.yml
+brozzler-worker
-# start brozzler-easy
-brozzler-easy
+At this point Brozzler will start archiving your site.
-At this point brozzler-easy will start archiving your site. Results will be
-immediately available for playback in pywb at http://localhost:8880/brozzler/.
-*Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
-does not take advantage of brozzler's distributed nature.*
+*Running Brozzler locally in this manner demonstrates the full Brozzler
+archival crawling workflow, but does not take advantage of Brozzler's
+distributed nature.*
 Installation and Usage
 ----------------------
@@ -83,7 +82,7 @@ Submit jobs::
 Submit sites not tied to a job::
-brozzler-new-site --time-limit=600 http://example.com/
+brozzler-new-site --time-limit=600 https://example.org/
 .. [*] A note about ``--warcprox-auto``: this option tells brozzler to
 look for a healthy warcprox instance in the `rethinkdb service registry
@@ -108,14 +107,14 @@ however everything else is optional. For details, see `<job-conf.rst>`_.
 warcprox_meta: null
 metadata: {}
 seeds:
-- url: http://one.example.org/
-- url: http://two.example.org/
+- url: https://one.example.org/
+- url: https://two.example.org/
 time_limit: 30
-- url: http://three.example.org/
+- url: https://three.example.org/
 time_limit: 10
 ignore_robots: true
 scope:
-surt: http://(org,example,
+surt: https://(org,example,
 Brozzler Dashboard
 ------------------
@@ -140,47 +139,6 @@ At this point Brozzler Dashboard will be accessible at http://localhost:8000/.
 See ``brozzler-dashboard --help`` for configuration options.
-Brozzler Wayback
-----------------
-Brozzler comes with a customized version of `pywb
-<https://github.com/ikreymer/pywb>`_, which supports using the rethinkdb
-"captures" table (populated by warcprox) as its index.
-To use, first install dependencies.
-::
-pip install brozzler[easy]
-Write a configuration file pywb.yml.
-::
-# 'archive_paths' should point to the output directory of warcprox
-archive_paths: warcs/ # pywb will fail without a trailing slash
-collections:
-brozzler:
-index_paths: !!python/object:brozzler.pywb.RethinkCDXSource
-db: brozzler
-table: captures
-servers:
-- localhost
-enable_auto_colls: false
-enable_cdx_api: true
-framed_replay: true
-port: 8880
-Run pywb like so:
-::
-$ PYWB_CONFIG_FILE=pywb.yml brozzler-wayback
-Then browse http://localhost:8880/brozzler/.
-.. image:: Brozzler-Wayback.png
 Headless Chrome (experimental)
 ------------------------------