umbra
=====
Umbra is a browser automation tool, developed for the web archiving service
https://archive-it.org/.

Umbra receives URLs via AMQP. It opens them in the Chrome or Chromium browser,
with which it communicates using the Chrome remote debug protocol (see
https://developer.chrome.com/devtools/docs/debugger-protocol). It runs
JavaScript behaviors to simulate user interaction with the page. It publishes
information about the URLs requested by the browser back to AMQP. The
format of the incoming and outgoing AMQP messages is described in `pydoc
umbra.controller`.

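
As a rough illustration of the remote debug protocol mentioned above: commands
are JSON objects with an `id`, a `method`, and optional `params`, sent to the
browser over a WebSocket. The sketch below only builds and inspects such
messages. It is not umbra's code, and the helper names `make_command` and
`is_reply_to` are hypothetical. `Network.enable`, `Page.navigate`, and
`Network.requestWillBeSent` are names defined by the protocol itself.

```python
import itertools
import json

# Hypothetical helpers for illustration only; not part of umbra.
_command_ids = itertools.count(1)


def make_command(method, params=None):
    """Serialize one protocol command as a JSON string with a fresh id."""
    msg = {"id": next(_command_ids), "method": method}
    if params:
        msg["params"] = params
    return json.dumps(msg)


def is_reply_to(raw_message, command_id):
    """Return True if raw_message is the response to the given command id.

    Events pushed by the browser (e.g. Network.requestWillBeSent) carry a
    "method" but no "id", so they never match.
    """
    return json.loads(raw_message).get("id") == command_id


# Ask the browser to report network activity, then load a page.
enable_network = make_command("Network.enable")
navigate = make_command("Page.navigate", {"url": "http://example.com/"})
```

Matching responses by `id` is what lets a client tell command replies apart
from the stream of events the browser sends while a page loads.
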

Umbra can be used with the Heritrix web crawler, using these Heritrix modules:

* [AMQPUrlReceiver](https://github.com/internetarchive/heritrix3/blob/master/contrib/src/main/java/org/archive/crawler/frontier/AMQPUrlReceiver.java)
* [AMQPPublishProcessor](https://github.com/internetarchive/heritrix3/blob/master/contrib/src/main/java/org/archive/modules/AMQPPublishProcessor.java)

Install
------
Install via pip from this repo, e.g.

    pip install git+https://github.com/internetarchive/umbra.git

Umbra requires an AMQP messaging service like RabbitMQ. On Ubuntu,
`sudo apt-get install rabbitmq-server` will install and start RabbitMQ
at amqp://guest:guest@localhost:5672/%2f, which is the default AMQP URL for umbra.
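
The `%2f` at the end of the default URL is the percent-encoded form of `/`,
the name of RabbitMQ's default virtual host. A small sketch using only the
Python standard library makes the individual pieces visible. The helper
`parse_amqp_url` is hypothetical and is not part of umbra.

```python
from urllib.parse import unquote, urlsplit


def parse_amqp_url(url):
    """Split an AMQP URL into its connection parameters (illustrative only)."""
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The path minus its leading slash is the percent-encoded vhost name.
        "vhost": unquote(parts.path[1:]),
    }


params = parse_amqp_url("amqp://guest:guest@localhost:5672/%2f")
```

Changing the user, password, host, or virtual host for a non-default broker
means changing the corresponding part of the URL.
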

Run
---
The command `umbra` starts umbra with the default configuration. `umbra --help`
describes all command line options.

Umbra also comes with these utilities:

* browse-url - open URLs in Chrome/Chromium and run behaviors (without involving AMQP)
* queue-url - send a URL to umbra via AMQP
* drain-queue - consume messages from an AMQP queue