From 29a46a19255eefe59587407fb5fd44ef9ed2770f Mon Sep 17 00:00:00 2001
From: Gretchen Miller
Date: Wed, 5 Mar 2025 14:27:03 -0800
Subject: [PATCH] more README.rst updates

---
 README.rst | 91 ++++++++++++++----------------------------------------
 1 file changed, 24 insertions(+), 67 deletions(-)

diff --git a/README.rst b/README.rst
index e2dba4a..3e04c94 100644
--- a/README.rst
+++ b/README.rst
@@ -1,6 +1,3 @@
-.. image:: https://api.travis-ci.org/internetarchive/brozzler.svg?branch=master
-    :target: https://travis-ci.org/internetarchive/brozzler
-
 .. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b12/brozzler/dashboard/static/brozzler.svg
    :width: 60px

@@ -10,11 +7,12 @@

 Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
 or Chromium) to fetch pages and embedded URLs and to extract links. It employs
-`yt-dlp `_ (formerly youtube-dl) to enhance media capture
-capabilities and `rethinkdb `_ to
-manage crawl state.
+`yt-dlp `_ (formerly youtube-dl) to enhance
+media capture capabilities and `rethinkdb
+`_ to manage crawl state.

-Brozzler is designed to work in conjunction with warcprox for web archiving.
+Brozzler is designed to work in conjunction with `warcprox
+`_ for web archiving.

 Requirements
 ------------
@@ -34,8 +32,10 @@ so Xvnc4 is preferred at this time.)
 Getting Started
 ---------------

-The simplest way to get started with brozzler is to use the ``brozzle-page``
-command-line tool to pass in a single URL to crawl.
+The simplest way to get started with Brozzler is to use the ``brozzle-page``
+command-line utility to pass in a single URL to crawl. You can also add a new
+job defined with a YAML file (see `job-const.rst`) and start a local Brozzler
+worker for a more complex crawl.

 Mac instructions:

@@ -49,23 +49,21 @@ Mac instructions:
     # optional: create a virtualenv
     python -m venv .venv

-    # install brozzler
-    pip install brozzler
+    # install brozzler with rethinkdb extra
+    pip install brozzler[rethinkdb]

-    # queue a site to crawl
-    brozzle-page http://example.com/
+    # crawl a single site
+    brozzle-page https://example.org

-    # or a job
+    # or enqueue a job and start brozzler-worker
     brozzler-new-job job1.yml
-
-    # start brozzler-worker
     brozzler-worker

-At this point brozzler-easy will start archiving your site. Results will be
-immediately available for playback in pywb at http://localhost:8880/brozzler/.
+At this point Brozzler will start archiving your site.

-*Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
-does not take advantage of brozzler's distributed nature.*
+*Running Brozzler locally in this manner demonstrates the full Brozzler
+archival crawling workflow, but does not take advantage of Brozzler's
+distributed nature.*

 Installation and Usage
 ----------------------
@@ -84,7 +82,7 @@ Submit jobs::

 Submit sites not tied to a job::

-    brozzler-new-site --time-limit=600 http://example.com/
+    brozzler-new-site --time-limit=600 https://example.org/

 .. [*] A note about ``--warcprox-auto``: this option tells brozzler to look for
    a healthy warcprox instance in the `rethinkdb service registry
@@ -109,14 +107,14 @@ however everything else is optional. For details, see ``_.
     warcprox_meta: null
     metadata: {}
     seeds:
-      - url: http://one.example.org/
-      - url: http://two.example.org/
+      - url: https://one.example.org/
+      - url: https://two.example.org/
         time_limit: 30
-      - url: http://three.example.org/
+      - url: https://three.example.org/
         time_limit: 10
         ignore_robots: true
         scope:
-          surt: http://(org,example,
+          surt: https://(org,example,

 Brozzler Dashboard
 ------------------
@@ -135,53 +133,12 @@ To start the app, run

     brozzler-dashboard

-At this point Brozzler Dashboard will be accessible at http://localhost:8000/.
+At this point Brozzler Dashboard will be accessible at https://localhost:8000/.

 .. image:: Brozzler-Dashboard.png

 See ``brozzler-dashboard --help`` for configuration options.

-Brozzler Wayback
-----------------
-
-Brozzler comes with a customized version of `pywb
-`_, which supports using the rethinkdb
-"captures" table (populated by warcprox) as its index.
-
-To use, first install dependencies.
-
-::
-
-    pip install brozzler[easy]
-
-Write a configuration file pywb.yml.
-
-::
-
-    # 'archive_paths' should point to the output directory of warcprox
-    archive_paths: warcs/  # pywb will fail without a trailing slash
-    collections:
-      brozzler:
-        index_paths: !!python/object:brozzler.pywb.RethinkCDXSource
-          db: brozzler
-          table: captures
-          servers:
-          - localhost
-    enable_auto_colls: false
-    enable_cdx_api: true
-    framed_replay: true
-    port: 8880
-
-Run pywb like so:
-
-::
-
-    $ PYWB_CONFIG_FILE=pywb.yml brozzler-wayback
-
-Then browse http://localhost:8880/brozzler/.
-
-.. image:: Brozzler-Wayback.png
-
 Headless Chrome (experimental)
 ------------------------------