From 6f61d0289b98ded032a4e446372915769aaf370c Mon Sep 17 00:00:00 2001
From: Noah Levitt
Date: Fri, 23 May 2014 13:34:51 -0700
Subject: [PATCH] improve readme, mentioning archive-it per kristine

---
 README.md | 38 +++++++++++++++++++++++++++++---------
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index 58db62c..239ed64 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,38 @@
 umbra
 =====
+Umbra is a browser automation tool, developed for the web archiving service
+https://archive-it.org/.
 
-Browser automation via chrome debug protocol
+Umbra receives urls via AMQP. It opens them in the chrome or chromium browser,
+with which it communicates using the chrome remote debug protocol (see
+https://developer.chrome.com/devtools/docs/debugger-protocol). It runs
+javascript behaviors to simulate user interaction with the page. It publishes
+information about the urls requested by the browser back to AMQP. The
+format of the incoming and outgoing AMQP messages is described in `pydoc
+umbra.controller`.
+
+Umbra can be used with the Heritrix web crawler, using these heritrix modules:
+* [AMQPUrlReceiver](https://github.com/internetarchive/heritrix3/blob/master/contrib/src/main/java/org/archive/crawler/frontier/AMQPUrlReceiver.java)
+* [AMQPPublishProcessor](https://github.com/internetarchive/heritrix3/blob/master/contrib/src/main/java/org/archive/modules/AMQPPublishProcessor.java)
 
 Install
-======
-Install via pip from this repo.
+------
+Install via pip from this repo, e.g.
+
+    pip install git+https://github.com/internetarchive/umbra.git
+
+Umbra requires an AMQP messaging service like RabbitMQ. On Ubuntu,
+`sudo apt-get install rabbitmq-server` will install and start RabbitMQ
+at amqp://guest:guest@localhost:5672/%2f, which is the default AMQP url for umbra.
 
 Run
-=====
-"umbra" script should be in bin/.
-load_url.py takes urls as arguments and puts them onto a rabbitmq queue
-dump_queue.py prints resources discovered by the browser and sent over the return queue.
+---
+The command `umbra` will start umbra with default configuration. `umbra --help`
+describes all command line options.
+
+Umbra also comes with these utilities:
+* browse-url - open urls in chrome/chromium and run behaviors (without involving AMQP)
+* queue-url - send url to umbra via AMQP
+* drain-queue - consume messages from AMQP queue
 
-On ubuntu, rabbitmq install with `sudo apt-get install rabbitmq-server` should automatically
-be set up for these three scripts to function on localhost ( the default amqp url ).
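
A note on the default AMQP url added to the README in this patch: the trailing `%2f` is the percent-encoded virtual host `/`, which is RabbitMQ's default vhost. The sketch below is not part of umbra. It only uses the Python standard library to split the url from the README into its parts, and the helper name `describe_amqp_url` is hypothetical.

```python
from urllib.parse import urlsplit, unquote

# Default AMQP url quoted in the README text of this patch.
DEFAULT_AMQP_URL = "amqp://guest:guest@localhost:5672/%2f"


def describe_amqp_url(url):
    """Split an AMQP url into its connection parameters.

    Hypothetical helper for illustration; umbra itself may parse
    the url differently.
    """
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The path carries the percent-encoded virtual host. Drop the
        # leading "/" separator, then decode what is left ("%2f" -> "/").
        "vhost": unquote(parts.path[1:]),
    }


if __name__ == "__main__":
    for key, value in describe_amqp_url(DEFAULT_AMQP_URL).items():
        print(f"{key}: {value}")
```

When pointing umbra at a broker other than the local default, the same breakdown is a quick way to check that the user, host, port, and vhost in a custom url are the ones intended.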