mirror of https://github.com/internetarchive/brozzler.git synced 2025-08-22 04:39:36 -04:00

brozzler - distributed browser-based web crawler

Noah Levitt a62a07e6b7 change magic first line of behavior js files to a commented-out json blob, which should include the fields 'url_regex' and 'request_idle_timeout_sec'; behavior.is_finished() incorporates the custom idle timeout into its check; also rename variables in behavior scripts with umbra/UMBRA_ prefix to sort of namespace them; and add "finished" logic to facebook and vimeo behaviors (flickr needs work to support it)		2014-05-05 11:58:55 -07:00
bin	handle multiple clients, browsers	2014-02-13 01:59:09 -08:00
umbra	change magic first line of behavior js files to a commented-out json blob, which should include the fields 'url_regex' and 'request_idle_timeout_sec'; behavior.is_finished() incorporates the custom idle timeout into its check; also rename variables in behavior scripts with umbra/UMBRA_ prefix to sort of namespace them; and add "finished" logic to facebook and vimeo behaviors (flickr needs work to support it)	2014-05-05 11:58:55 -07:00
.gitignore	Some refactor/testing and utility scripts	2014-01-22 18:03:02 +00:00
README.md	Update readme	2014-01-28 00:12:33 -05:00
setup.py	setup.py - include behaviors.d/*.js in installation	2014-03-13 00:00:32 -07:00

README.md

umbra

Browser automation via the Chrome debug protocol

Install

Install via pip from this repo.

Run

The "umbra" script should be in bin/. load_url.py takes URLs as arguments and puts them onto a RabbitMQ queue. dump_queue.py prints resources discovered by the browser and sent over the return queue.

On Ubuntu, installing RabbitMQ with sudo apt-get install rabbitmq-server should automatically set it up so that these three scripts work on localhost (the default AMQP URL).
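
As a rough illustration of the flow described above (URLs given as arguments become messages on a queue), here is a minimal sketch. It is not the code in load_url.py. The message shape ({"url": ...}), the function names, and the spelled-out default AMQP URL are all assumptions made for illustration, and the publish step is a plain callable so the sketch runs without a broker or an AMQP client library.

```python
import json
from urllib.parse import urlsplit

# Assumption: the conventional RabbitMQ default on localhost. The README only
# says "the default amqp url" and does not spell it out.
DEFAULT_AMQP_URL = "amqp://guest:guest@localhost:5672/%2f"


def build_messages(urls):
    """Turn URLs into JSON message bodies (hypothetical message shape)."""
    return [json.dumps({"url": url}) for url in urls]


def enqueue(urls, publish):
    """Send one message per URL through the supplied publish callable.

    `publish` stands in for a real AMQP client's publish call; it is
    hypothetical and only needs to accept a message body string.
    Returns the number of messages sent.
    """
    bodies = build_messages(urls)
    for body in bodies:
        publish(body)
    return len(bodies)


def broker_host_port(amqp_url=DEFAULT_AMQP_URL):
    """Extract (host, port) from an AMQP URL, falling back to port 5672."""
    parts = urlsplit(amqp_url)
    return parts.hostname, parts.port or 5672


if __name__ == "__main__":
    import sys

    outbox = []  # in-memory stand-in for the queue
    count = enqueue(sys.argv[1:], outbox.append)
    host, port = broker_host_port()
    print("would publish", count, "message(s) to", host, port)
```

Run with one or more URLs as arguments, the script collects the messages in a list instead of contacting RabbitMQ. Swapping outbox.append for a real client's publish call is the only change needed to send them to a broker.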