add section about brozzler-easy to the readme

Noah Levitt 2016-08-05 18:28:30 -07:00
parent c04bf85f4e
commit 531b26aabb
2 changed files with 64 additions and 38 deletions

@@ -1,4 +1,4 @@
.. |logo| image:: https://cdn.rawgit.com/nlevitt/brozzler/d1158ab2242815b28fe7bb066042b5b5982e4627/webconsole/static/brozzler.svg .. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b5/brozzler/webconsole/static/brozzler.svg
:width: 7% :width: 7%
brozzler |logo| brozzler |logo|
@@ -6,37 +6,76 @@ brozzler |logo|
"browser" \| "crawler" = "brozzler" "browser" \| "crawler" = "brozzler"
Brozzler is a distributed web crawler (爬虫) that uses a real browser Brozzler is a distributed web crawler (爬虫) that uses a real browser (chrome
(chrome or chromium) to fetch pages and embedded urls and to extract or chromium) to fetch pages and embedded urls and to extract links. It also
links. It also uses `youtube-dl <https://github.com/rg3/youtube-dl>`__ uses `youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media
to enhance media capture capabilities. capture capabilities.
It is forked from https://github.com/internetarchive/umbra.
Brozzler is designed to work in conjunction with Brozzler is designed to work in conjunction with
`warcprox <https://github.com/internetarchive/warcprox>`__ for web `warcprox <https://github.com/internetarchive/warcprox>`_ for web
archiving. archiving.
Installation Requirements
------------ ------------
Brozzler requires python 3.4 or later. - Python 3.4 or later
- RethinkDB deployment
- Chromium or Google Chrome browser
Worth noting is that the browser requires a graphical environment to run. You
already have this on your laptop, but on a server it will probably require
deploying some additional infrastructure (typically X11). The vagrant
configuration in the brozzler repository (still a work in progress) has an
example setup.
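The Python and browser requirements listed above can be checked before installing. The following is a minimal sketch, not part of brozzler: the executable names in ``BROWSER_CANDIDATES`` are assumptions (the README only says "Chromium or Google Chrome"), and the script does not check for RethinkDB or a graphical environment.

```python
import shutil
import sys

# Assumed executable names; the README does not name the binaries.
BROWSER_CANDIDATES = ("chromium", "chromium-browser", "google-chrome")


def python_ok(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.4 or later."""
    return tuple(version_info[:2]) >= (3, 4)


def find_browser(candidates=BROWSER_CANDIDATES, which=shutil.which):
    """Return the path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print("python 3.4+:", python_ok())
    print("browser:", find_browser() or "not found on PATH")
```

If the browser is installed under a different name or outside ``PATH``, pass your own tuple of candidates to ``find_browser``.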
Getting Started
---------------
The easiest way to get started with brozzler for web archiving is with
``brozzler-easy``. Brozzler-easy runs brozzler-worker, warcprox, and
`pywb <https://github.com/ikreymer/pywb>`_, configured to work with each other,
in a single process.
Mac instructions:
:: ::
# set up virtualenv if desired # install and start rethinkdb
pip install brozzler brew install rethinkdb
rethinkdb &>>rethinkdb.log &
Brozzler also requires a rethinkdb deployment. # install brozzler with special dependencies pywb and warcprox
pip install brozzler[easy] # in a virtualenv if desired
Usage # queue a site to crawl
----- brozzler-new-site http://example.com/
# or a job
brozzler-new-job job1.yml
# start brozzler-easy
brozzler-easy
At this point brozzler-easy will start brozzling your site. Results will be
immediately available for playback in pywb at http://localhost:8091/brozzler/.
*Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
does not take advantage of brozzler's distributed nature.*
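To look up a specific captured page rather than the collection root, a playback URL can be built from the base address given above. This is a sketch under an assumption: it relies on pywb's usual ``collection/timestamp/url`` layout, and the helper name ``playback_url`` is hypothetical, not something brozzler or pywb provides.

```python
# Base address taken from the README text above.
PYWB_BASE = "http://localhost:8091/brozzler/"


def playback_url(page_url, timestamp=None, base=PYWB_BASE):
    """Build a pywb playback URL for page_url.

    timestamp is an optional string of digits (for example
    "20160805182830"); when omitted, the URL carries no timestamp
    and pywb is left to pick the capture.
    Assumes pywb's collection/timestamp/url layout.
    """
    if not base.endswith("/"):
        base += "/"
    if timestamp is None:
        return base + page_url
    if not (timestamp.isdigit() and len(timestamp) <= 14):
        raise ValueError("timestamp must be 1 to 14 digits")
    return base + timestamp + "/" + page_url


if __name__ == "__main__":
    print(playback_url("http://example.com/"))
```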
Installation and Usage
----------------------
To install brozzler only:
::
pip install brozzler # in a virtualenv if desired
Launch one or more workers: Launch one or more workers:
:: ::
brozzler-worker -e chromium brozzler-worker
Submit jobs: Submit jobs:
@@ -44,6 +83,13 @@ Submit jobs:
brozzler-new-job myjob.yaml brozzler-new-job myjob.yaml
Submit sites not tied to a job:
::
brozzler-new-site --proxy=localhost:8000 --enable-warcprox-features \
--time-limit=600 http://example.com/
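When sites are queued from a script, the same command line can be assembled programmatically. The sketch below uses only the flags that appear in the example above; the function name ``new_site_command`` and its parameters are hypothetical, and the actual ``subprocess`` call is left commented out because it needs brozzler installed and a reachable RethinkDB.

```python
import shlex


def new_site_command(url, proxy=None, time_limit=None, warcprox_features=False):
    """Return an argv list for brozzler-new-site.

    Only the flags shown in the README example are used.
    """
    cmd = ["brozzler-new-site"]
    if proxy:
        cmd.append("--proxy=" + proxy)
    if warcprox_features:
        cmd.append("--enable-warcprox-features")
    if time_limit is not None:
        cmd.append("--time-limit=" + str(int(time_limit)))
    cmd.append(url)
    return cmd


if __name__ == "__main__":
    cmd = new_site_command(
        "http://example.com/",
        proxy="localhost:8000",
        time_limit=600,
        warcprox_features=True,
    )
    # Print the shell-quoted command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To run it (requires brozzler and rethinkdb):
    # import subprocess; subprocess.check_call(cmd)
```

Passing an argv list rather than a shell string avoids quoting problems when URLs contain characters such as ``&`` or ``?``.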
Job Configuration Job Configuration
----------------- -----------------
@@ -70,14 +116,6 @@ must be specified, everything else is optional.
scope: scope:
surt: http://(org,example, surt: http://(org,example,
Submit a Site to Crawl Without Configuring a Job
------------------------------------------------
::
brozzler-new-site --proxy=localhost:8000 --enable-warcprox-features \
--time-limit=600 http://example.com/
Brozzler Web Console Brozzler Web Console
-------------------- --------------------
@@ -95,19 +133,7 @@ To start the app, run
brozzler-webconsole brozzler-webconsole
See ``brozzler-webconsole --help`` for configuration options.
XXX configuration stuff
Fonts (for decent screenshots)
------------------------------
On ubuntu 14.04 trusty I installed these packages:
xfonts-base ttf-mscorefonts-installer fonts-arphic-bkai00mp
fonts-arphic-bsmi00lp fonts-arphic-gbsn00lp fonts-arphic-gkai00mp
fonts-arphic-ukai fonts-farsiweb fonts-nafees fonts-sil-abyssinica
fonts-sil-ezra fonts-sil-padauk fonts-unfonts-extra fonts-unfonts-core
ttf-indic-fonts fonts-thai-tlwg fonts-lklug-sinhala
License License
------- -------

@@ -32,7 +32,7 @@ def find_package_data(package):
setuptools.setup( setuptools.setup(
name='brozzler', name='brozzler',
version='1.1b6.dev70', version='1.1b6.dev71',
description='Distributed web crawling with browsers', description='Distributed web crawling with browsers',
url='https://github.com/internetarchive/brozzler', url='https://github.com/internetarchive/brozzler',
author='Noah Levitt', author='Noah Levitt',