mirror of https://github.com/internetarchive/brozzler.git, synced 2025-06-24 23:00:41 -04:00
add section about brozzler-easy to the readme
parent c04bf85f4e
commit 531b26aabb
2 changed files with 64 additions and 38 deletions
README.rst | 100
@@ -1,4 +1,4 @@
-.. |logo| image:: https://cdn.rawgit.com/nlevitt/brozzler/d1158ab2242815b28fe7bb066042b5b5982e4627/webconsole/static/brozzler.svg
+.. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b5/brozzler/webconsole/static/brozzler.svg
 :width: 7%
 
 brozzler |logo|
@@ -6,37 +6,76 @@ brozzler |logo|
 
 "browser" \| "crawler" = "brozzler"
 
-Brozzler is a distributed web crawler (爬虫) that uses a real browser
-(chrome or chromium) to fetch pages and embedded urls and to extract
-links. It also uses `youtube-dl <https://github.com/rg3/youtube-dl>`__
-to enhance media capture capabilities.
-
-It is forked from https://github.com/internetarchive/umbra.
+Brozzler is a distributed web crawler (爬虫) that uses a real browser (chrome
+or chromium) to fetch pages and embedded urls and to extract links. It also
+uses `youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media
+capture capabilities.
 
 Brozzler is designed to work in conjunction with
-`warcprox <https://github.com/internetarchive/warcprox>`__ for web
+`warcprox <https://github.com/internetarchive/warcprox>`_ for web
 archiving.
 
-Installation
+Requirements
 ------------
 
-Brozzler requires python 3.4 or later.
+- Python 3.4 or later
+- RethinkDB deployment
+- Chromium or Google Chrome browser
+
+Worth noting is that the browser requires a graphical environment to run. You
+already have this on your laptop, but on a server it will probably require
+deploying some additional infrastructure (typically X11). The vagrant
+configuration in the brozzler repository (still a work in progress) has an
+example setup.
+
+Getting Started
+---------------
+
+The easiest way to get started with brozzler for web archiving is with
+``brozzler-easy``. Brozzler-easy runs brozzler-worker, warcprox, and
+`pywb <https://github.com/ikreymer/pywb>`_, configured to work with each other,
+in a single process.
+
+Mac instructions:
 
 ::
 
-# set up virtualenv if desired
-pip install brozzler
+# install and start rethinkdb
+brew install rethinkdb
+rethinkdb &>>rethinkdb.log &
 
-Brozzler also requires a rethinkdb deployment.
+# install brozzler with special dependencies pywb and warcprox
+pip install brozzler[easy] # in a virtualenv if desired
 
-Usage
------
+# queue a site to crawl
+brozzler-new-site http://example.com/
+
+# or a job
+brozzler-new-job job1.yml
+
+# start brozzler-easy
+brozzler-easy
+
+At this point brozzler-easy will start brozzling your site. Results will be
+immediately available for playback in pywb at http://localhost:8091/brozzler/.
+
+*Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
+does not take advantage of brozzler's distributed nature.*
+
+Installation and Usage
+----------------------
+
+To install brozzler only:
+
+::
+
+pip install brozzler # in a virtualenv if desired
 
 Launch one or more workers:
 
 ::
 
-brozzler-worker -e chromium
+brozzler-worker
 
 Submit jobs:
 
@@ -44,6 +83,13 @@ Submit jobs:
 
 brozzler-new-job myjob.yaml
 
+Submit sites not tied to a job:
+
+::
+
+brozzler-new-site --proxy=localhost:8000 --enable-warcprox-features \
+--time-limit=600 http://example.com/
+
 Job Configuration
 -----------------
 
@@ -70,14 +116,6 @@ must be specified, everything else is optional.
 scope:
 surt: http://(org,example,
 
-Submit a Site to Crawl Without Configuring a Job
-------------------------------------------------
-
-::
-
-brozzler-new-site --proxy=localhost:8000 --enable-warcprox-features \
---time-limit=600 http://example.com/
-
 Brozzler Web Console
 --------------------
 
@@ -95,19 +133,7 @@ To start the app, run
 
 brozzler-webconsole
 
-XXX configuration stuff
-
-Fonts (for decent screenshots)
-------------------------------
-
-On ubuntu 14.04 trusty I installed these packages:
-
-xfonts-base ttf-mscorefonts-installer fonts-arphic-bkai00mp
-fonts-arphic-bsmi00lp fonts-arphic-gbsn00lp fonts-arphic-gkai00mp
-fonts-arphic-ukai fonts-farsiweb fonts-nafees fonts-sil-abyssinica
-fonts-sil-ezra fonts-sil-padauk fonts-unfonts-extra fonts-unfonts-core
-ttf-indic-fonts fonts-thai-tlwg fonts-lklug-sinhala
+See ``brozzler-webconsole --help`` for configuration options.
 
 License
 -------
setup.py | 2
@@ -32,7 +32,7 @@ def find_package_data(package):
 
 setuptools.setup(
 name='brozzler',
-version='1.1b6.dev70',
+version='1.1b6.dev71',
 description='Distributed web crawling with browsers',
 url='https://github.com/internetarchive/brozzler',
 author='Noah Levitt',
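
Illustrative note, not part of this commit: the README change documents a `brozzler-new-site` invocation with `--proxy`, `--enable-warcprox-features`, and `--time-limit`. The sketch below shows one way a script might assemble that same command line. The helper name `new_site_command` and its keyword arguments are hypothetical; only the executable name and the three flags come from the README text in the diff above. It builds the argument list and does not run anything.

```python
import shlex


def new_site_command(url, proxy=None, enable_warcprox_features=False,
                     time_limit=None):
    """Build an argv list for brozzler-new-site.

    Hypothetical helper: it only uses the flags that appear in the
    README diff (--proxy, --enable-warcprox-features, --time-limit).
    """
    argv = ["brozzler-new-site"]
    if proxy is not None:
        argv.append(f"--proxy={proxy}")
    if enable_warcprox_features:
        argv.append("--enable-warcprox-features")
    if time_limit is not None:
        # The README example passes the limit as a whole number (600).
        argv.append(f"--time-limit={int(time_limit)}")
    argv.append(url)
    return argv


if __name__ == "__main__":
    argv = new_site_command(
        "http://example.com/",
        proxy="localhost:8000",
        enable_warcprox_features=True,
        time_limit=600,
    )
    # Print the command as a shell-safe string instead of executing it.
    print(shlex.join(argv))
```

To actually queue the site, the list could be handed to `subprocess.run(argv, check=True)`, assuming brozzler is installed and a RethinkDB instance is reachable as described in the Requirements section of the new README.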