mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-02-24 08:39:59 -05:00
umbra
=====
Umbra is a browser automation tool, developed for the web archiving service
https://archive-it.org/.

Umbra receives URLs via AMQP. It opens them in the Chrome or Chromium browser,
with which it communicates using the Chrome remote debugging protocol (see
https://developer.chrome.com/devtools/docs/debugger-protocol). It runs
JavaScript behaviors to simulate user interaction with the page. It publishes
information about the URLs requested by the browser back to AMQP. The
format of the incoming and outgoing AMQP messages is described in
`pydoc umbra.controller`.

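As a rough illustration of the message flow described above, the sketch below builds a URL request and hands it to an AMQP broker using the kombu library. The field names (`clientId`, `url`, `metadata`), the exchange name `umbra`, and the routing key `urls` are assumptions made for this example, not details stated in this README; `pydoc umbra.controller` remains the authoritative description of the format. The helper names are hypothetical.

```python
import json


def build_url_request(url, client_id, metadata=None):
    """Build a URL request as a plain dict (field names are assumed)."""
    return {
        "clientId": client_id,
        "url": url,
        "metadata": metadata or {},
    }


def publish_url_request(message,
                        amqp_url="amqp://guest:guest@localhost:5672/%2f",
                        exchange_name="umbra",   # assumed exchange name
                        routing_key="urls"):     # assumed routing key
    """Publish the request to AMQP. Needs kombu and a running broker."""
    import kombu  # third-party; imported lazily so the builder works without it

    exchange = kombu.Exchange(name=exchange_name, type="direct", durable=True)
    with kombu.Connection(amqp_url) as conn:
        producer = conn.Producer(serializer="json")
        # declare=[exchange] makes sure the exchange exists before publishing
        producer.publish(message, exchange=exchange,
                         routing_key=routing_key, declare=[exchange])


if __name__ == "__main__":
    request = build_url_request("http://example.com/", "demo.client")
    print(json.dumps(request, indent=2))
```

Running the file only prints the request; calling `publish_url_request(request)` would additionally require kombu to be installed and a broker to be reachable.
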
Umbra can be used with the Heritrix web crawler, using these Heritrix modules:
* [AMQPUrlReceiver](https://github.com/internetarchive/heritrix3/blob/master/contrib/src/main/java/org/archive/crawler/frontier/AMQPUrlReceiver.java)
* [AMQPPublishProcessor](https://github.com/internetarchive/heritrix3/blob/master/contrib/src/main/java/org/archive/modules/AMQPPublishProcessor.java)

Install
-------
Install via pip from this repo, e.g.

    pip install git+https://github.com/internetarchive/umbra.git

Umbra requires an AMQP messaging service like RabbitMQ. On Ubuntu,
`sudo apt-get install rabbitmq-server` will install and start RabbitMQ
at amqp://guest:guest@localhost:5672/%2f, which is the default AMQP URL for umbra.

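The trailing `%2f` in the default URL is the percent-encoded form of `/`, RabbitMQ's default virtual host. A small sketch using only the Python standard library (the helper name `parse_amqp_url` is made up for this example) shows how the pieces of the URL break down:

```python
from urllib.parse import unquote, urlsplit


def parse_amqp_url(amqp_url):
    """Split an AMQP URL into its connection parameters."""
    parts = urlsplit(amqp_url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The path holds the percent-encoded virtual host after the leading "/".
        "vhost": unquote(parts.path[1:]),
    }


params = parse_amqp_url("amqp://guest:guest@localhost:5672/%2f")
print(params)
# {'user': 'guest', 'password': 'guest', 'host': 'localhost', 'port': 5672, 'vhost': '/'}
```
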
Run
---
The command `umbra` will start umbra with default configuration. `umbra --help`
describes all command line options.

Umbra also comes with these utilities:
* browse-url - open URLs in Chrome/Chromium and run behaviors (without involving AMQP)
* queue-url - send a URL to umbra via AMQP
* drain-queue - consume messages from an AMQP queue