65 Commits

Author SHA1 Message Date
Noah Levitt
1e56bc8686 add only one site at a time, specify settings with command line switches 2015-07-21 06:32:00 +00:00
Noah Levitt
2ba5bd4d4b support adding extra http request headers 2015-07-17 13:45:27 -07:00
Noah Levitt
140a441eb5 honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs 2015-07-16 17:19:12 -07:00
Noah Levitt
e04247c3f7 add support for supplying json blob defining site with configuration to brozzler-add-site 2015-07-16 14:48:01 -07:00
Noah Levitt
923cd98652 save screenshots as metadata records using warcprox PUTMETA (same format as kenji's wide crawl) 2015-07-15 16:32:02 -07:00
Noah Levitt
5aea76ab6d refactor worker code into worker module 2015-07-15 15:42:40 -07:00
Noah Levitt
7b92ba39c7 avoid printing stack trace on normal youtube_dl unsupported condition (still prints error message unfortunately) 2015-07-15 14:33:22 -07:00
Noah Levitt
9b13f0c34c refactor hq code into hq module 2015-07-15 14:27:21 -07:00
Noah Levitt
9b5da57d7e initial youtube-dl support, including saving youtube-dl derived json with warcprox by sending a PUTMETA request, if new option --enable-warcprox-features is enabled 2015-07-14 18:57:45 -07:00
Noah Levitt
fd0c3322ee update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff 2015-07-13 17:09:39 -07:00
Noah Levitt
3eff099b16 determine if youtube-dl can do something with a url 2015-07-13 16:40:56 -07:00
Noah Levitt
6470a8ef26 sigquit dumps thread traces 2015-07-13 15:57:14 -07:00
Noah Levitt
18ca996216 rudimentary robots.txt support 2015-07-13 15:56:54 -07:00
Noah Levitt
eb74967fed brozzler-worker round-robins sites needing crawling 2015-07-13 12:13:41 -07:00
Noah Levitt
ddd764cac5 brozzler-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox) 2015-07-11 23:07:47 -07:00
Noah Levitt
b0f3b8a5e3 clean shutdown for brozzler-hq 2015-07-11 18:18:54 -07:00
Noah Levitt
384120928c set in_progress=0 for completed url 2015-07-11 13:24:38 -07:00
Noah Levitt
610f9c8cf4 add missing file hq.py, improve some logging, fix little race condition bug 2015-07-11 13:09:45 -07:00
Noah Levitt
bb3561a690 check scope (on hq side), fix buglets 2015-07-11 12:33:19 -07:00
Noah Levitt
1fb336cb2e crawling outlinks not totally working 2015-07-11 02:29:19 -07:00
Noah Levitt
56a7bb7306 submit outlinks to hq 2015-07-10 21:31:41 -07:00
Noah Levitt
fd99764baa brozzler-worker partially working 2015-07-10 21:07:47 -07:00
Noah Levitt
8aa1e6715a feed seed url to the crawl url queue 2015-07-10 20:12:33 -07:00
Noah Levitt
1d068f4f86 starting work on brozzler crawl hq 2015-07-10 18:01:54 -07:00
Noah Levitt
fcc63b6675 fancier prioritization takes into account hops from seed, path depth; and clean shutdown 2015-07-09 22:35:37 -07:00
Noah Levitt
5f3c247e0c trick to avoid crawling same url again too quickly 2015-07-09 21:49:55 -07:00
Noah Levitt
7cc777661d fix dumb bug 2015-07-09 18:54:09 -07:00
Noah Levitt
783794ca37 basics of site/seed crawling with scoping 2015-07-09 18:36:07 -07:00
Noah Levitt
92ea701987 rudimentary crawling in parallel with multiple browsers 2015-07-08 18:50:18 -07:00
Noah Levitt
4022cc0162 simple in-memory frontier with prioritized queues by host 2015-07-08 17:44:38 -07:00
Noah Levitt
4042f22497 rudimentary link extraction and crawling 2015-07-07 16:45:52 -07:00
Noah Levitt
d8a962b29e experimenting with captureScreenshot 2015-06-16 18:42:21 -07:00
Noah Levitt
9053279b4e change default routing key to "urls" 2014-11-03 11:54:59 -08:00
Noah Levitt
2ab767eaa9 make drain-queue output actual json instead of python dict syntax 2014-08-26 23:46:00 +00:00
Noah Levitt
fe1d9e01eb utility queue-json to publish an arbitrary json blob to amqp 2014-08-26 23:45:42 +00:00
Noah Levitt
6306c16698 kill -HUP to immediately close and reopen amqp consumer connection 2014-06-23 17:18:27 -07:00
Noah Levitt
9b32f9a3d1 ugh, it was better with the default width, in spite of the ridiculous behavior.script 2014-06-20 14:40:12 -07:00
Noah Levitt
2cf69bdaff seriously, don't try to wrap any lines, pprint 2014-06-20 14:37:33 -07:00
Noah Levitt
c6fa00812c when dumping state on SIGQUIT, build the whole string before printing to avoid stuff getting intermingled with other logging and stuff 2014-06-20 14:33:01 -07:00
Noah Levitt
ead46d5716 more elaborate dumping of state on SIGQUIT to replace faulthandler 2014-06-20 14:05:33 -07:00
Noah Levitt
ebb14ff889 get rid of chrome_wait straggler 2014-06-18 17:31:28 -07:00
Noah Levitt
025db91dea get rid of --browser-wait and --routing-key in favor of sensible defaults, some other tweaks 2014-06-11 10:58:08 -07:00
Noah Levitt
a78e60f1da wait for a browser to become available and start it up before reading the next url from amqp; ack the message only after completing the browsing process successfully, and requeue if it's not successful; some refactoring to make the timing work for this 2014-06-09 13:15:05 -07:00
Noah Levitt
b2e27b99d2 nice log message when fully shut down 2014-05-30 17:32:01 -07:00
Noah Levitt
c9d503e690 log version number at startup 2014-05-30 15:00:01 -07:00
Noah Levitt
3127e02cbb fancy --version that includes git branch and timestamp of last commit if available 2014-05-29 20:43:00 -07:00
Noah Levitt
9c08be2699 sigterm and sigint both request shutdown, which stops consuming urls and waits for active browsers to finish; a second sigint/sigterm immediately shuts down active browsers 2014-05-24 01:52:22 -07:00
Noah Levitt
2c4ba005b5 make umbra amenable to clustering by using a pool of n browsers and removing the browser-clientId affinity (not useful currently since we start a fresh browser instance for each page browsed), and set prefetch_count=1 on amqp consumers to round-robin incoming urls among umbra instances 2014-05-23 21:59:34 -07:00
Noah Levitt
8d269f4c56 add options --verbose, --exchange, --queue, --routing-key 2014-05-23 13:39:39 -07:00
Noah Levitt
bd3f979b56 capitalize AMQP in description 2014-05-23 13:39:08 -07:00