From 29a46a19255eefe59587407fb5fd44ef9ed2770f Mon Sep 17 00:00:00 2001
From: Gretchen Miller <gretchen@gretchenleigh.zone>
Date: Wed, 5 Mar 2025 14:27:03 -0800
Subject: [PATCH] more README.rst updates

---
 README.rst | 91 ++++++++++++++----------------------------------------
 1 file changed, 24 insertions(+), 67 deletions(-)

diff --git a/README.rst b/README.rst
index e2dba4a..3e04c94 100644
--- a/README.rst
+++ b/README.rst
@@ -1,6 +1,3 @@
-.. image:: https://api.travis-ci.org/internetarchive/brozzler.svg?branch=master
-    :target: https://travis-ci.org/internetarchive/brozzler
-
 .. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b12/brozzler/dashboard/static/brozzler.svg
    :width: 60px
 
@@ -10,11 +7,12 @@
 
 Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
 or Chromium) to fetch pages and embedded URLs and to extract links. It employs
-`yt-dlp <https://github.com/yt-dlp/yt-dlp>`_ (formerly youtube-dl) to enhance media capture
-capabilities and `rethinkdb <https://github.com/rethinkdb/rethinkdb>`_ to
-manage crawl state.
+`yt-dlp <https://github.com/yt-dlp/yt-dlp>`_ (formerly youtube-dl) to enhance
+media capture capabilities and `rethinkdb
+<https://github.com/rethinkdb/rethinkdb>`_ to manage crawl state.
 
-Brozzler is designed to work in conjunction with warcprox for web archiving.
+Brozzler is designed to work in conjunction with `warcprox
+<https://github.com/internetarchive/warcprox>`_ for web archiving.
 
 Requirements
 ------------
@@ -34,8 +32,10 @@ so Xvnc4 is preferred at this time.)
 Getting Started
 ---------------
 
-The simplest way to get started with brozzler is to use the ``brozzle-page``
-command-line tool to pass in a single URL to crawl.
+The simplest way to get started with Brozzler is to use the ``brozzle-page``
+command-line utility to pass in a single URL to crawl. You can also add a new
+job defined with a YAML file (see `<job-conf.rst>`_) and start a local Brozzler
+worker for a more complex crawl.
 
 Mac instructions:
 
@@ -49,23 +49,21 @@ Mac instructions:
     # optional: create a virtualenv
     python -m venv .venv
 
-    # install brozzler
-    pip install brozzler
+    # install brozzler with rethinkdb extra
+    pip install 'brozzler[rethinkdb]'
 
-    # queue a site to crawl
-    brozzle-page http://example.com/
+    # crawl a single page
+    brozzle-page https://example.org
 
-    # or a job
+    # or enqueue a job and start brozzler-worker
     brozzler-new-job job1.yml
-
-    # start brozzler-worker
     brozzler-worker
 
-At this point brozzler-easy will start archiving your site. Results will be
-immediately available for playback in pywb at http://localhost:8880/brozzler/.
+At this point Brozzler will start archiving your site.
 
-*Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
-does not take advantage of brozzler's distributed nature.*
+*Running Brozzler locally in this manner demonstrates the full Brozzler
+archival crawling workflow, but does not take advantage of Brozzler's
+distributed nature.*
 
 Installation and Usage
 ----------------------
@@ -84,7 +82,7 @@ Submit jobs::
 
 Submit sites not tied to a job::
 
-    brozzler-new-site --time-limit=600 http://example.com/
+    brozzler-new-site --time-limit=600 https://example.org/
 
 .. [*] A note about ``--warcprox-auto``: this option tells brozzler to
    look for a healthy warcprox instance in the `rethinkdb service registry
@@ -109,14 +107,14 @@ however everything else is optional. For details, see `<job-conf.rst>`_.
     warcprox_meta: null
     metadata: {}
     seeds:
-      - url: http://one.example.org/
-      - url: http://two.example.org/
+      - url: https://one.example.org/
+      - url: https://two.example.org/
         time_limit: 30
-      - url: http://three.example.org/
+      - url: https://three.example.org/
         time_limit: 10
         ignore_robots: true
         scope:
-          surt: http://(org,example,
+          surt: https://(org,example,
 
 Brozzler Dashboard
 ------------------
@@ -135,53 +133,12 @@ To start the app, run
 
     brozzler-dashboard
 
-At this point Brozzler Dashboard will be accessible at http://localhost:8000/.
+At this point Brozzler Dashboard will be accessible at http://localhost:8000/.
 
 .. image:: Brozzler-Dashboard.png
 
 See ``brozzler-dashboard --help`` for configuration options.
 
-Brozzler Wayback
-----------------
-
-Brozzler comes with a customized version of `pywb
-<https://github.com/ikreymer/pywb>`_, which supports using the rethinkdb
-"captures" table (populated by warcprox) as its index.
-
-To use, first install dependencies.
-
-::
-
-    pip install brozzler[easy]
-
-Write a configuration file pywb.yml.
-
-::
-
-    # 'archive_paths' should point to the output directory of warcprox
-    archive_paths: warcs/  # pywb will fail without a trailing slash
-    collections:
-      brozzler:
-        index_paths: !!python/object:brozzler.pywb.RethinkCDXSource
-          db: brozzler
-          table: captures
-          servers:
-          - localhost
-    enable_auto_colls: false
-    enable_cdx_api: true
-    framed_replay: true
-    port: 8880
-
-Run pywb like so:
-
-::
-
-    $ PYWB_CONFIG_FILE=pywb.yml brozzler-wayback
-
-Then browse http://localhost:8880/brozzler/.
-
-.. image:: Brozzler-Wayback.png
-
 Headless Chrome (experimental)
 ------------------------------