more README.rst updates

This commit is contained in:
Gretchen Miller 2025-03-05 14:27:03 -08:00
parent d7a6d7ae66
commit 29a46a1925

@ -1,6 +1,3 @@
.. image:: https://api.travis-ci.org/internetarchive/brozzler.svg?branch=master
:target: https://travis-ci.org/internetarchive/brozzler
.. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b12/brozzler/dashboard/static/brozzler.svg
:width: 60px
@ -10,11 +7,12 @@
Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
or Chromium) to fetch pages and embedded URLs and to extract links. It employs
`yt-dlp <https://github.com/yt-dlp/yt-dlp>`_ (formerly youtube-dl) to enhance media capture
capabilities and `rethinkdb <https://github.com/rethinkdb/rethinkdb>`_ to
manage crawl state.
`yt-dlp <https://github.com/yt-dlp/yt-dlp>`_ (formerly youtube-dl) to enhance
media capture capabilities and `rethinkdb
<https://github.com/rethinkdb/rethinkdb>`_ to manage crawl state.
Brozzler is designed to work in conjunction with warcprox for web archiving.
Brozzler is designed to work in conjunction with `warcprox
<https://github.com/internetarchive/warcprox>`_ for web archiving.
Requirements
------------
@ -34,8 +32,10 @@ so Xvnc4 is preferred at this time.)
Getting Started
---------------
The simplest way to get started with brozzler is to use the ``brozzle-page``
command-line tool to pass in a single URL to crawl.
The simplest way to get started with Brozzler is to use the ``brozzle-page``
command-line utility to pass in a single URL to crawl. You can also add a new
job defined with a YAML file (see `<job-conf.rst>`_) and start a local Brozzler
worker for a more complex crawl.
Mac instructions:
@ -49,23 +49,21 @@ Mac instructions:
# optional: create a virtualenv
python -m venv .venv
# install brozzler
pip install brozzler
# install brozzler with rethinkdb extra
pip install brozzler[rethinkdb]
# queue a site to crawl
brozzle-page http://example.com/
# crawl a single page
brozzle-page https://example.org
# or a job
# or enqueue a job and start brozzler-worker
brozzler-new-job job1.yml
# start brozzler-worker
brozzler-worker
At this point brozzler-easy will start archiving your site. Results will be
immediately available for playback in pywb at http://localhost:8880/brozzler/.
At this point Brozzler will start archiving your site.
*Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
does not take advantage of brozzler's distributed nature.*
*Running Brozzler locally in this manner demonstrates the full Brozzler
archival crawling workflow, but does not take advantage of Brozzler's
distributed nature.*
Installation and Usage
----------------------
@ -84,7 +82,7 @@ Submit jobs::
Submit sites not tied to a job::
brozzler-new-site --time-limit=600 http://example.com/
brozzler-new-site --time-limit=600 https://example.org/
.. [*] A note about ``--warcprox-auto``: this option tells brozzler to
look for a healthy warcprox instance in the `rethinkdb service registry
@ -109,14 +107,14 @@ however everything else is optional. For details, see `<job-conf.rst>`_.
warcprox_meta: null
metadata: {}
seeds:
- url: http://one.example.org/
- url: http://two.example.org/
- url: https://one.example.org/
- url: https://two.example.org/
time_limit: 30
- url: http://three.example.org/
- url: https://three.example.org/
time_limit: 10
ignore_robots: true
scope:
surt: http://(org,example,
surt: https://(org,example,
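The ``surt`` scope value above is a URL prefix written with the host labels
reversed, so that ``https://(org,example,`` covers ``example.org`` and all of
its subdomains. The following is a minimal sketch of that idea using only the
Python standard library. The helper names ``simplified_surt`` and ``in_scope``
are hypothetical and are not part of Brozzler, which performs its own URL
canonicalization; this only illustrates why all three seeds in the example fall
inside the scope.

```python
from urllib.parse import urlsplit


def simplified_surt(url):
    """Return a simplified SURT-like form of ``url``.

    Illustration only: real SURT canonicalization handles ports,
    userinfo, query strings and more. This sketch just reverses the
    host labels and keeps the scheme and path.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = ",".join(reversed(host.split(".")))
    return f"{parts.scheme}://({labels},){parts.path or '/'}"


def in_scope(url, surt_prefix):
    """Hypothetical check: does the simplified SURT start with the prefix?"""
    return simplified_surt(url).startswith(surt_prefix)


# Scope prefix taken from the job configuration example.
scope_prefix = "https://(org,example,"

for seed in (
    "https://one.example.org/",
    "https://two.example.org/",
    "https://three.example.org/",
    "https://example.com/",  # different domain, expected to be out of scope
):
    print(seed, in_scope(seed, scope_prefix))
```

Because the prefix ends right after ``example,``, any further host label (such
as ``one`` or ``two``) still matches, while a different registered domain does
not.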
Brozzler Dashboard
------------------
@ -135,53 +133,12 @@ To start the app, run
brozzler-dashboard
At this point Brozzler Dashboard will be accessible at http://localhost:8000/.
At this point Brozzler Dashboard will be accessible at https://localhost:8000/.
.. image:: Brozzler-Dashboard.png
See ``brozzler-dashboard --help`` for configuration options.
Brozzler Wayback
----------------
Brozzler comes with a customized version of `pywb
<https://github.com/ikreymer/pywb>`_, which supports using the rethinkdb
"captures" table (populated by warcprox) as its index.
To use, first install dependencies.
::
pip install brozzler[easy]
Write a configuration file pywb.yml.
::
# 'archive_paths' should point to the output directory of warcprox
archive_paths: warcs/ # pywb will fail without a trailing slash
collections:
brozzler:
index_paths: !!python/object:brozzler.pywb.RethinkCDXSource
db: brozzler
table: captures
servers:
- localhost
enable_auto_colls: false
enable_cdx_api: true
framed_replay: true
port: 8880
Run pywb like so:
::
$ PYWB_CONFIG_FILE=pywb.yml brozzler-wayback
Then browse http://localhost:8880/brozzler/.
.. image:: Brozzler-Wayback.png
Headless Chrome (experimental)
------------------------------