915 Commits

Author SHA1 Message Date
Noah Levitt
2a7a0b7c30 little fix, tweak 2015-08-05 00:17:43 +00:00
Noah Levitt
b6beac3807 new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file 2015-08-04 19:52:01 +00:00
Noah Levitt
4624f47402 Merge pull request #41 from ldko/add-routing-key: Adds routing_key to Queue creation 2015-08-03 12:39:26 -07:00
Noah Levitt
e6eeca6ae2 handle 420 Reached limit when fetching robots in brozzler-hq 2015-08-01 17:54:29 +00:00
Noah Levitt
511e19ff4d handle 420 "Limit reached" when browser receives it 2015-08-01 01:26:59 +00:00
Noah Levitt
f5acb6c34b make requests library dependency explicit 2015-08-01 01:25:07 +00:00
Noah Levitt
7b98af7d9f handle reached limit response from warcprox 2015-08-01 00:09:57 +00:00
Lauren Ko
d4a783285e Adds routing_key to queue Queue creation 2015-07-31 14:15:18 -05:00
Noah Levitt
11fbbc9d49 change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc 2015-07-31 00:03:13 +00:00
Noah Levitt
8366bd2d66 refactor to simplify run() 2015-07-28 01:12:41 +00:00
Noah Levitt
5c701abb36 reject urls with scheme other than http/https (for now) 2015-07-28 01:11:26 +00:00
Noah Levitt
a0a0b0ff2c use nlevitt brozzler branch of youtube-dl 2015-07-28 01:10:39 +00:00
Noah Levitt
060b796d78 avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true) 2015-07-24 01:40:35 +00:00
Noah Levitt
a04bf04307 keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over 2015-07-23 02:19:25 +00:00
Noah Levitt
4dacc0b087 new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue 2015-07-23 01:21:23 +00:00
Noah Levitt
6e6fd5dc2c don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future) 2015-07-23 00:44:33 +00:00
Noah Levitt
5d5151584c fix another dumb little bug in handling exceptions from youtube_dl 2015-07-23 00:41:26 +00:00
Noah Levitt
85a863b1e3 change argument to --amqp-url for clarity and consistency 2015-07-23 00:39:57 +00:00
Noah Levitt
6a09f2095c handle exceptions in robots.txt fetching/parsing 2015-07-22 00:54:49 +00:00
Noah Levitt
f00571f7bd fix youtube-dl exception handling 2015-07-22 00:53:39 +00:00
Noah Levitt
83a8e7cbe5 fix bug when --extra-header switch is not supplied 2015-07-21 20:39:41 +00:00
Noah Levitt
f9c049a69e navigate to about:blank before the real url to avoid situation where we navigate to the same page that we're currently on, perhaps with a different #fragment, which prevents Page.loadEventFired from happening 2015-07-21 20:39:19 +00:00
Noah Levitt
88f352efea use new fork of youtube-dl with support for extra http headers on every request 2015-07-21 19:23:01 +00:00
Noah Levitt
b5cb94fc8b some additional logging and error handling to avoid mysterious messages 2015-07-21 06:33:02 +00:00
Noah Levitt
1e56bc8686 add only one site at a time, specify settings with command line switches 2015-07-21 06:32:00 +00:00
Noah Levitt
38ddfe498d require my "tweaks" branch of websocket-client 2015-07-20 16:06:47 -07:00
Noah Levitt
dc04048d50 add some info to the readme 2015-07-20 12:00:14 -07:00
Noah Levitt
2f28f00a09 make putmeta requests respect site configured extra_headers 2015-07-17 16:52:06 -07:00
Noah Levitt
2ba5bd4d4b support adding extra http request headers 2015-07-17 13:45:27 -07:00
Noah Levitt
c178ed1950 fix buglet 2015-07-16 18:43:14 -07:00
Noah Levitt
a54e60dbaf change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix 2015-07-16 18:39:29 -07:00
Noah Levitt
d2650a2547 update scope if seed redirects 2015-07-16 18:27:47 -07:00
Noah Levitt
140a441eb5 honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs 2015-07-16 17:19:12 -07:00
Noah Levitt
e04247c3f7 add support for supplying json blob defining site with configuration to brozzler-add-site 2015-07-16 14:48:01 -07:00
Noah Levitt
6b2ee9faee chmod -x worker.py 2015-07-15 18:03:49 -07:00
Noah Levitt
f2bc7ec271 refactor brozzler.hq.Site and brozzler.url.CrawlUrl into new brozzler.site package; fix bugs in robots.txt handling 2015-07-15 18:03:03 -07:00
Noah Levitt
a9c51edd84 robots cache per site, and some so far unused support for site level configuration 2015-07-15 17:44:42 -07:00
Noah Levitt
efa3cd6269 don't set http_proxy environment variable, because it affects things we don't want it to 2015-07-15 17:33:29 -07:00
Noah Levitt
923cd98652 save screenshots as metadata records using warcprox PUTMETA (same format as kenji's wide crawl) 2015-07-15 16:32:02 -07:00
Noah Levitt
5aea76ab6d refactor worker code into worker module 2015-07-15 15:42:40 -07:00
Noah Levitt
7b92ba39c7 avoid printing stack trace on normal youtube_dl unsupported condition (still prints error message unfortunately) 2015-07-15 14:33:22 -07:00
Noah Levitt
9b13f0c34c refactor hq code into hq module 2015-07-15 14:27:21 -07:00
Noah Levitt
4cfb287397 refactor hq code into hq module 2015-07-15 14:26:48 -07:00
Noah Levitt
9b5da57d7e initial youtube-dl support, including saving youtube-dl derived json with warcprox by sending a PUTMETA request, if new option --enable-warcprox-features is enabled 2015-07-14 18:57:45 -07:00
Noah Levitt
fd0c3322ee update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff 2015-07-13 17:09:39 -07:00
Noah Levitt
3eff099b16 determine if youtube-dl can do something with a url 2015-07-13 16:40:56 -07:00
Noah Levitt
6470a8ef26 sigquit dumps thread traces 2015-07-13 15:57:14 -07:00
Noah Levitt
18ca996216 rudimentary robots.txt support 2015-07-13 15:56:54 -07:00
Noah Levitt
eb74967fed brozzler-worker round-robins sites needing crawling 2015-07-13 12:13:41 -07:00
Noah Levitt
ddd764cac5 brozzle-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox) 2015-07-11 23:07:47 -07:00