1530 Commits

Author SHA1 Message Date
Noah Levitt
ddce1cdc71 fix mistakenly removed import; try to shut down chrome in case of unexpected exception 2015-08-19 20:04:46 +00:00
Noah Levitt
2533229fa1 add __all__ to modules 2015-08-19 19:01:28 +00:00
Noah Levitt
b7df0a1f37 make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py 2015-08-19 18:45:19 +00:00
Noah Levitt
b8506a2ab4 rename "db" to "frontier" 2015-08-19 17:47:05 +00:00
Noah Levitt
cd3a644298 switch order of brozzle_count and claimed in priority_by_site index to fix has_outstanding_pages check 2015-08-19 00:04:20 +00:00
Noah Levitt
382c826678 rethinkdb connection per request, to server chosen randomly from list 2015-08-18 23:47:28 +00:00
Noah Levitt
a878730e02 goodbye sqlite and rabbitmq, hello rethinkdb 2015-08-18 21:44:54 +00:00
Noah Levitt
e6fbf0e2e9 rename brozzler-add-site to brozzler-new-site to match brozzler-new-job et al 2015-08-17 22:48:25 +00:00
Noah Levitt
6b6583e63a more notes on choosing a db 2015-08-13 01:01:35 +00:00
Noah Levitt
e68c98e66d brozzle a site for 5 minutes at a time instead of 1 for now 2015-08-11 18:15:16 +00:00
Noah Levitt
fc75e18928 handle "aw snap" or "he's dead jim" from chrome 2015-08-11 18:14:53 +00:00
Noah Levitt
3d70776ce3 some thoughts on distributed database 2015-08-11 18:06:58 +00:00
Noah Levitt
ce154fc3db more robustness improvements 2015-08-10 20:11:46 +00:00
Noah Levitt
e96b16e19a support for max_hops scope rule 2015-08-07 22:36:39 +00:00
Noah Levitt
a47292dab5 thread to read and selectively log output from chrome 2015-08-07 22:36:07 +00:00
Noah Levitt
2a7a0b7c30 little fix, tweak 2015-08-05 00:17:43 +00:00
Noah Levitt
b6beac3807 new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file 2015-08-04 19:52:01 +00:00
Noah Levitt
4624f47402 Merge pull request #41 from ldko/add-routing-key: Adds routing_key to Queue creation 2015-08-03 12:39:26 -07:00
Noah Levitt
e6eeca6ae2 handle 420 Reached limit when fetching robots in brozzler-hq 2015-08-01 17:54:29 +00:00
Noah Levitt
511e19ff4d handle 420 "Limit reached" when browser receives it 2015-08-01 01:26:59 +00:00
Noah Levitt
f5acb6c34b make requests library dependency explicit 2015-08-01 01:25:07 +00:00
Noah Levitt
7b98af7d9f handle reached limit response from warcprox 2015-08-01 00:09:57 +00:00
Lauren Ko
d4a783285e Adds routing_key to queue Queue creation 2015-07-31 14:15:18 -05:00
Noah Levitt
11fbbc9d49 change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc 2015-07-31 00:03:13 +00:00
Noah Levitt
8366bd2d66 refactor to simplify run() 2015-07-28 01:12:41 +00:00
Noah Levitt
5c701abb36 reject urls with scheme other than http/https (for now) 2015-07-28 01:11:26 +00:00
Noah Levitt
a0a0b0ff2c use nlevitt brozzler branch of youtube-dl 2015-07-28 01:10:39 +00:00
Noah Levitt
060b796d78 avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true) 2015-07-24 01:40:35 +00:00
Noah Levitt
a04bf04307 keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over 2015-07-23 02:19:25 +00:00
Noah Levitt
4dacc0b087 new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue 2015-07-23 01:21:23 +00:00
Noah Levitt
6e6fd5dc2c don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future) 2015-07-23 00:44:33 +00:00
Noah Levitt
5d5151584c fix another dumb little bug in handling exceptions from youtube_dl 2015-07-23 00:41:26 +00:00
Noah Levitt
85a863b1e3 change argument to --amqp-url for clarity and consistency 2015-07-23 00:39:57 +00:00
Noah Levitt
6a09f2095c handle exceptions in robots.txt fetching/parsing 2015-07-22 00:54:49 +00:00
Noah Levitt
f00571f7bd fix youtube-dl exception handling 2015-07-22 00:53:39 +00:00
Noah Levitt
83a8e7cbe5 fix bug when --extra-header switch is not supplied 2015-07-21 20:39:41 +00:00
Noah Levitt
f9c049a69e navigate to about:blank before the real url to avoid situation where we navigate to the same page that we're currently on, perhaps with a different #fragment, which prevents Page.loadEventFired from happening 2015-07-21 20:39:19 +00:00
Noah Levitt
88f352efea use new fork of youtube-dl with support for extra http headers on every request 2015-07-21 19:23:01 +00:00
Noah Levitt
b5cb94fc8b some additional logging and error handling to avoid mysterious messages 2015-07-21 06:33:02 +00:00
Noah Levitt
1e56bc8686 add only one site at a time, specify settings with command line switches 2015-07-21 06:32:00 +00:00
Noah Levitt
38ddfe498d require my "tweaks" branch of websocket-client 2015-07-20 16:06:47 -07:00
Noah Levitt
dc04048d50 add some info to the readme 2015-07-20 12:00:14 -07:00
Noah Levitt
2f28f00a09 make putmeta requests respect site configured extra_headers 2015-07-17 16:52:06 -07:00
Noah Levitt
2ba5bd4d4b support adding extra http request headers 2015-07-17 13:45:27 -07:00
Noah Levitt
c178ed1950 fix buglet 2015-07-16 18:43:14 -07:00
Noah Levitt
a54e60dbaf change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix 2015-07-16 18:39:29 -07:00
Noah Levitt
d2650a2547 update scope if seed redirects 2015-07-16 18:27:47 -07:00
Noah Levitt
140a441eb5 honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs 2015-07-16 17:19:12 -07:00
Noah Levitt
e04247c3f7 add support for supplying json blob defining site with configuration to brozzler-add-site 2015-07-16 14:48:01 -07:00
Noah Levitt
6b2ee9faee chmod -x worker.py 2015-07-15 18:03:49 -07:00