mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-02-23 16:19:49 -05:00
more readme edits
parent
073fc713f4
commit
771d6aa626
48
README.rst
@@ -11,10 +11,10 @@
Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
or Chromium) to fetch pages and embedded URLs and to extract links. It employs
`youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media capture
capabilities, warcprox to write content to Web ARChive (WARC) files, `rethinkdb
<https://github.com/rethinkdb/rethinkdb>`_ to index captured URLs, a native
dashboard for crawl job monitoring, and a customized Python Wayback interface
for archival replay.
capabilities and `rethinkdb <https://github.com/rethinkdb/rethinkdb>`_ to
manage crawl state.

Brozzler is designed to work in conjunction with warcprox for web archiving.

Requirements
------------
@@ -24,10 +24,12 @@ Requirements
- Chromium or Google Chrome >= version 64

Note: The browser requires a graphical environment to run. When brozzler is run
on a server, this may require deploying some additional infrastructure
(typically X11; Xvfb does not support screenshots, however Xvnc4 from package
vnc4server, does). The `vagrant configuration <vagrant/>`_ in the brozzler
repository (still a work in progress) has an example setup.
on a server, this may require deploying some additional infrastructure,
typically X11. Xvnc4 and Xvfb are X11 variants that are suitable for use on a
server, because they don't display anything to a physical screen. The `vagrant
configuration <vagrant/>`_ in the brozzler repository has an example setup
using Xvnc4. (When last tested, chromium on Xvfb did not support screenshots,
so Xvnc4 is preferred at this time.)

Getting Started
---------------
@@ -168,35 +170,11 @@ Run pywb like so:

Then browse http://localhost:8880/brozzler/.


Headless Chrome (experimental)
--------------------------------
------------------------------

`Headless Chromium
<https://chromium.googlesource.com/chromium/src/+/master/headless/README.md>`_
is now available in stable Chrome releases for 64-bit Linux and may be used to
run the browser without a visible window or X11.

To try this out, create a wrapper script like ~/bin/chrome-headless.sh:

::

    #!/bin/bash
    exec /opt/google/chrome/chrome --headless --disable-gpu "$@"

Run brozzler passing the path to the wrapper script as the ``--chrome-exe``
option:

::

    chmod +x ~/bin/chrome-headless.sh
    brozzler-worker --chrome-exe ~/bin/chrome-headless.sh

Beware: Chrome's headless mode is still very new and has `unresolved issues
<https://bugs.chromium.org/p/chromium/issues/list?can=2&q=Proj%3DHeadless>`_.
Its use with brozzler has not yet been extensively tested. You may experience
hangs or crashes with some types of content. For the moment we recommend using
Chrome's regular mode instead.
Brozzler is known to work nominally with Chrome/Chromium in headless mode, but
this has not yet been extensively tested.

License
-------