more readme edits

Noah Levitt 2018-07-23 19:05:49 -05:00
parent 073fc713f4
commit 771d6aa626


@ -11,10 +11,10 @@
Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
or Chromium) to fetch pages and embedded URLs and to extract links. It employs
`youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media capture
capabilities, warcprox to write content to Web ARChive (WARC) files, `rethinkdb
<https://github.com/rethinkdb/rethinkdb>`_ to index captured URLs, a native
dashboard for crawl job monitoring, and a customized Python Wayback interface
for archival replay.
capabilities and `rethinkdb <https://github.com/rethinkdb/rethinkdb>`_ to
manage crawl state.
Brozzler is designed to work in conjunction with warcprox for web archiving.
Requirements
------------
@ -24,10 +24,12 @@ Requirements
- Chromium or Google Chrome >= version 64
Note: The browser requires a graphical environment to run. When brozzler is run
on a server, this may require deploying some additional infrastructure
(typically X11; Xvfb does not support screenshots, however Xvnc4 from package
vnc4server, does). The `vagrant configuration <vagrant/>`_ in the brozzler
repository (still a work in progress) has an example setup.
on a server, this may require deploying some additional infrastructure,
typically X11. Xvnc4 and Xvfb are X11 variants that are suitable for use on a
server, because they don't display anything to a physical screen. The `vagrant
configuration <vagrant/>`_ in the brozzler repository has an example setup
using Xvnc4. (When last tested, Chromium on Xvfb did not support screenshots,
so Xvnc4 is preferred at this time.)
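
Because the browser needs a graphical environment, it can help to check that an
X11 display is configured before starting a worker. The following is only a
minimal sketch, not part of brozzler: the function name ``require_display`` is
hypothetical, and it assumes an X server (for example Xvnc4) has already been
started separately and that its display is exported through the standard
``DISPLAY`` environment variable.

```shell
#!/bin/bash
# Hypothetical helper (not part of brozzler): succeed only when an X11
# display has been configured through the standard DISPLAY variable.
require_display() {
    if [ -z "${DISPLAY:-}" ]; then
        # No display configured; the browser would have nowhere to render.
        echo "DISPLAY is not set; start an X server (e.g. Xvnc4) first" >&2
        return 1
    fi
    echo "using X display ${DISPLAY}"
}
```

A possible use is ``require_display && brozzler-worker``, so the worker is only
launched when a display is available. This checks that the variable is set; it
does not verify that the X server is actually reachable.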
Getting Started
---------------
@ -168,35 +170,11 @@ Run pywb like so:
Then browse http://localhost:8880/brozzler/.
Headless Chrome (experimental)
--------------------------------
------------------------------
`Headless Chromium
<https://chromium.googlesource.com/chromium/src/+/master/headless/README.md>`_
is now available in stable Chrome releases for 64-bit Linux and may be used to
run the browser without a visible window or X11.
To try this out, create a wrapper script like ~/bin/chrome-headless.sh:
::
#!/bin/bash
exec /opt/google/chrome/chrome --headless --disable-gpu "$@"
Run brozzler passing the path to the wrapper script as the ``--chrome-exe``
option:
::
chmod +x ~/bin/chrome-headless.sh
brozzler-worker --chrome-exe ~/bin/chrome-headless.sh
Beware: Chrome's headless mode is still very new and has `unresolved issues
<https://bugs.chromium.org/p/chromium/issues/list?can=2&q=Proj%3DHeadless>`_.
Its use with brozzler has not yet been extensively tested. You may experience
hangs or crashes with some types of content. For the moment we recommend using
Chrome's regular mode instead.
Brozzler is known to work nominally with Chrome/Chromium in headless mode, but
this has not yet been extensively tested.
License
-------