more readme edits

Noah Levitt 2018-07-23 19:05:49 -05:00
parent 073fc713f4
commit 771d6aa626

@ -11,10 +11,10 @@
Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
or Chromium) to fetch pages and embedded URLs and to extract links. It employs or Chromium) to fetch pages and embedded URLs and to extract links. It employs
`youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media capture `youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media capture
capabilities, warcprox to write content to Web ARChive (WARC) files, `rethinkdb capabilities and `rethinkdb <https://github.com/rethinkdb/rethinkdb>`_ to
<https://github.com/rethinkdb/rethinkdb>`_ to index captured URLs, a native manage crawl state.
dashboard for crawl job monitoring, and a customized Python Wayback interface
for archival replay. Brozzler is designed to work in conjunction with warcprox for web archiving.
Requirements Requirements
------------ ------------
@ -24,10 +24,12 @@ Requirements
- Chromium or Google Chrome >= version 64 - Chromium or Google Chrome >= version 64
Note: The browser requires a graphical environment to run. When brozzler is run Note: The browser requires a graphical environment to run. When brozzler is run
on a server, this may require deploying some additional infrastructure on a server, this may require deploying some additional infrastructure,
(typically X11; Xvfb does not support screenshots, however Xvnc4 from package typically X11. Xvnc4 and Xvfb are X11 variants that are suitable for use on a
vnc4server, does). The `vagrant configuration <vagrant/>`_ in the brozzler server, because they don't display anything to a physical screen. The `vagrant
repository (still a work in progress) has an example setup. configuration <vagrant/>`_ in the brozzler repository has an example setup
using Xvnc4. (When last tested, Chromium on Xvfb did not support screenshots,
so Xvnc4 is preferred at this time.)
Getting Started Getting Started
--------------- ---------------
@ -168,35 +170,11 @@ Run pywb like so:
Then browse http://localhost:8880/brozzler/. Then browse http://localhost:8880/brozzler/.
Headless Chrome (experimental) Headless Chrome (experimental)
-------------------------------- ------------------------------
`Headless Chromium Brozzler is known to work nominally with Chrome/Chromium in headless mode, but
<https://chromium.googlesource.com/chromium/src/+/master/headless/README.md>`_ this has not yet been extensively tested.
is now available in stable Chrome releases for 64-bit Linux and may be used to
run the browser without a visible window or X11.
To try this out, create a wrapper script like ~/bin/chrome-headless.sh:
::
#!/bin/bash
exec /opt/google/chrome/chrome --headless --disable-gpu "$@"
Run brozzler passing the path to the wrapper script as the ``--chrome-exe``
option:
::
chmod +x ~/bin/chrome-headless.sh
brozzler-worker --chrome-exe ~/bin/chrome-headless.sh
Beware: Chrome's headless mode is still very new and has `unresolved issues
<https://bugs.chromium.org/p/chromium/issues/list?can=2&q=Proj%3DHeadless>`_.
Its use with brozzler has not yet been extensively tested. You may experience
hangs or crashes with some types of content. For the moment we recommend using
Chrome's regular mode instead.
License License
------- -------