484 Commits

Author SHA1 Message Date
Noah Levitt
ee50818dca if database already exists but tables don't, just create them 2015-08-20 21:23:08 +00:00
Noah Levitt
3af1e10e13 make it work again, and list discovered outlinks 2015-08-20 21:22:08 +00:00
Noah Levitt
8b45d7eb69 since I can't figure out what's causing these sporadic errors fetching certain robots.txt through warcprox, stick a retry loop around the fetch 2015-08-19 22:50:04 +00:00
Noah Levitt
ad543e6134 enforce time limits; move scope_and_schedule_outlinks into frontier.py; fix bugs around scoping on seed redirect 2015-08-19 20:16:25 +00:00
Noah Levitt
ddce1cdc71 fix mistakenly removed import; try to shut down chrome in case of unexpected exception 2015-08-19 20:04:46 +00:00
Noah Levitt
2533229fa1 add __all__ to modules 2015-08-19 19:01:28 +00:00
Noah Levitt
b7df0a1f37 make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py 2015-08-19 18:45:19 +00:00
Noah Levitt
b8506a2ab4 rename "db" to "frontier" 2015-08-19 17:47:05 +00:00
Noah Levitt
cd3a644298 switch order of brozzle_count and claimed in priority_by_site index to fix has_outstanding_pages check 2015-08-19 00:04:20 +00:00
Noah Levitt
382c826678 rethinkdb connection per request, to server chosen randomly from list 2015-08-18 23:47:28 +00:00
Noah Levitt
a878730e02 goodbye sqlite and rabbitmq, hello rethinkdb 2015-08-18 21:44:54 +00:00
Noah Levitt
e6fbf0e2e9 rename brozzler-add-site to brozzler-new-site to match brozzler-new-job et al 2015-08-17 22:48:25 +00:00
Noah Levitt
6b6583e63a more notes on choosing a db 2015-08-13 01:01:35 +00:00
Noah Levitt
e68c98e66d brozzle a site for 5 minutes at a time instead of 1 for now 2015-08-11 18:15:16 +00:00
Noah Levitt
fc75e18928 handle "aw snap" or "he's dead jim" from chrome 2015-08-11 18:14:53 +00:00
Noah Levitt
3d70776ce3 some thoughts on distributed database 2015-08-11 18:06:58 +00:00
Noah Levitt
ce154fc3db more robustness improvements 2015-08-10 20:11:46 +00:00
Noah Levitt
e96b16e19a support for max_hops scope rule 2015-08-07 22:36:39 +00:00
Noah Levitt
a47292dab5 thread to read and selectively log output from chrome 2015-08-07 22:36:07 +00:00
Noah Levitt
2a7a0b7c30 little fix, tweak 2015-08-05 00:17:43 +00:00
Noah Levitt
b6beac3807 new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file 2015-08-04 19:52:01 +00:00
Noah Levitt
4624f47402 Merge pull request #41 from ldko/add-routing-key (Adds routing_key to Queue creation) 2015-08-03 12:39:26 -07:00
Noah Levitt
e6eeca6ae2 handle 420 Reached limit when fetching robots in brozzler-hq 2015-08-01 17:54:29 +00:00
Noah Levitt
511e19ff4d handle 420 "Limit reached" when browser receives it 2015-08-01 01:26:59 +00:00
Noah Levitt
f5acb6c34b make requests library dependency explicit 2015-08-01 01:25:07 +00:00
Noah Levitt
7b98af7d9f handle reached limit response from warcprox 2015-08-01 00:09:57 +00:00
Lauren Ko
d4a783285e Adds routing_key to queue Queue creation 2015-07-31 14:15:18 -05:00
Noah Levitt
11fbbc9d49 change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc 2015-07-31 00:03:13 +00:00
Noah Levitt
8366bd2d66 refactor to simplify run() 2015-07-28 01:12:41 +00:00
Noah Levitt
5c701abb36 reject urls with scheme other than http/https (for now) 2015-07-28 01:11:26 +00:00
Noah Levitt
a0a0b0ff2c use nlevitt brozzler branch of youtube-dl 2015-07-28 01:10:39 +00:00
Noah Levitt
060b796d78 avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true) 2015-07-24 01:40:35 +00:00
Noah Levitt
a04bf04307 keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over 2015-07-23 02:19:25 +00:00
Noah Levitt
4dacc0b087 new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue 2015-07-23 01:21:23 +00:00
Noah Levitt
6e6fd5dc2c don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future) 2015-07-23 00:44:33 +00:00
Noah Levitt
5d5151584c fix another dumb little bug in handling exceptions from youtube_dl 2015-07-23 00:41:26 +00:00
Noah Levitt
85a863b1e3 change argument to --amqp-url for clarity and consistency 2015-07-23 00:39:57 +00:00
Noah Levitt
6a09f2095c handle exceptions in robots.txt fetching/parsing 2015-07-22 00:54:49 +00:00
Noah Levitt
f00571f7bd fix youtube-dl exception handling 2015-07-22 00:53:39 +00:00
Noah Levitt
83a8e7cbe5 fix bug when --extra-header switch is not supplied 2015-07-21 20:39:41 +00:00
Noah Levitt
f9c049a69e navigate to about:blank before the real url to avoid situation where we navigate to the same page that we're currently on, perhaps with a different #fragment, which prevents Page.loadEventFired from happening 2015-07-21 20:39:19 +00:00
Noah Levitt
88f352efea use new fork of youtube-dl with support for extra http headers on every request 2015-07-21 19:23:01 +00:00
Noah Levitt
b5cb94fc8b some additional logging and error handling to avoid mysterious messages 2015-07-21 06:33:02 +00:00
Noah Levitt
1e56bc8686 add only one site at a time, specify settings with command line switches 2015-07-21 06:32:00 +00:00
Noah Levitt
38ddfe498d require my "tweaks" branch of websocket-client 2015-07-20 16:06:47 -07:00
Noah Levitt
dc04048d50 add some info to the readme 2015-07-20 12:00:14 -07:00
Noah Levitt
2f28f00a09 make putmeta requests respect site configured extra_headers 2015-07-17 16:52:06 -07:00
Noah Levitt
2ba5bd4d4b support adding extra http request headers 2015-07-17 13:45:27 -07:00
Noah Levitt
c178ed1950 fix buglet 2015-07-16 18:43:14 -07:00
Noah Levitt
a54e60dbaf change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix 2015-07-16 18:39:29 -07:00