mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-02-24 00:29:53 -05:00
more readme edits
parent
073fc713f4
commit
771d6aa626
48
README.rst
@@ -11,10 +11,10 @@
|
|||||||
Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
|
Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
|
||||||
or Chromium) to fetch pages and embedded URLs and to extract links. It employs
|
or Chromium) to fetch pages and embedded URLs and to extract links. It employs
|
||||||
`youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media capture
|
`youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media capture
|
||||||
capabilities, warcprox to write content to Web ARChive (WARC) files, `rethinkdb
|
capabilities and `rethinkdb <https://github.com/rethinkdb/rethinkdb>`_ to
|
||||||
<https://github.com/rethinkdb/rethinkdb>`_ to index captured URLs, a native
|
manage crawl state.
|
||||||
dashboard for crawl job monitoring, and a customized Python Wayback interface
|
|
||||||
for archival replay.
|
Brozzler is designed to work in conjunction with warcprox for web archiving.
|
||||||
|
|
||||||
Requirements
|
Requirements
|
||||||
------------
|
------------
|
||||||
@@ -24,10 +24,12 @@ Requirements
|
|||||||
- Chromium or Google Chrome >= version 64
|
- Chromium or Google Chrome >= version 64
|
||||||
|
|
||||||
Note: The browser requires a graphical environment to run. When brozzler is run
|
Note: The browser requires a graphical environment to run. When brozzler is run
|
||||||
on a server, this may require deploying some additional infrastructure
|
on a server, this may require deploying some additional infrastructure,
|
||||||
(typically X11; Xvfb does not support screenshots, however Xvnc4 from package
|
typically X11. Xvnc4 and Xvfb are X11 variants that are suitable for use on a
|
||||||
vnc4server, does). The `vagrant configuration <vagrant/>`_ in the brozzler
|
server, because they don't display anything to a physical screen. The `vagrant
|
||||||
repository (still a work in progress) has an example setup.
|
configuration <vagrant/>`_ in the brozzler repository has an example setup
|
||||||
|
using Xvnc4. (When last tested, Chromium on Xvfb did not support screenshots,
|
||||||
|
so Xvnc4 is preferred at this time.)
|
||||||
|
|
||||||
Getting Started
|
Getting Started
|
||||||
---------------
|
---------------
|
||||||
@@ -168,35 +170,11 @@ Run pywb like so:
|
|||||||
|
|
||||||
Then browse http://localhost:8880/brozzler/.
|
Then browse http://localhost:8880/brozzler/.
|
||||||
|
|
||||||
|
|
||||||
Headless Chrome (experimental)
|
Headless Chrome (experimental)
|
||||||
--------------------------------
|
------------------------------
|
||||||
|
|
||||||
`Headless Chromium
|
Brozzler is known to work nominally with Chrome/Chromium in headless mode, but
|
||||||
<https://chromium.googlesource.com/chromium/src/+/master/headless/README.md>`_
|
this has not yet been extensively tested.
|
||||||
is now available in stable Chrome releases for 64-bit Linux and may be used to
|
|
||||||
run the browser without a visible window or X11.
|
|
||||||
|
|
||||||
To try this out, create a wrapper script like ~/bin/chrome-headless.sh:
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
#!/bin/bash
|
|
||||||
exec /opt/google/chrome/chrome --headless --disable-gpu "$@"
|
|
||||||
|
|
||||||
Run brozzler passing the path to the wrapper script as the ``--chrome-exe``
|
|
||||||
option:
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
chmod +x ~/bin/chrome-headless.sh
|
|
||||||
brozzler-worker --chrome-exe ~/bin/chrome-headless.sh
|
|
||||||
|
|
||||||
Beware: Chrome's headless mode is still very new and has `unresolved issues
|
|
||||||
<https://bugs.chromium.org/p/chromium/issues/list?can=2&q=Proj%3DHeadless>`_.
|
|
||||||
Its use with brozzler has not yet been extensively tested. You may experience
|
|
||||||
hangs or crashes with some types of content. For the moment we recommend using
|
|
||||||
Chrome's regular mode instead.
|
|
||||||
|
|
||||||
License
|
License
|
||||||
-------
|
-------
|
||||||
|