1044 Commits

Author SHA1 Message Date
Noah Levitt
5fe2805285 fix bug claiming site, looks like there could be a race condition with other worker claiming the same site 2015-09-04 01:36:29 +00:00
Noah Levitt
3c23aa8fd4 finally, the jobs table 2015-09-03 01:05:03 +00:00
Noah Levitt
6cda4739b8 log exception when thread dies (seems to be dying silently sometimes) 2015-09-03 01:04:41 +00:00
Noah Levitt
839bf6f4ae script to help with starting/restarting/etc in my dev environment 2015-09-03 01:03:19 +00:00
Noah Levitt
f334107b47 support for specifying rethinkdb database name; wrap rethinkdb operations and retry if appropriate (as best as we can tell) 2015-08-28 00:37:26 +00:00
Noah Levitt
cf91fb1377 Revert "use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily"
Ugh.. too much pain, not worth the time to figure out the magic #egg=
incantation.

This reverts commit 78ca0701651c35bda69122ddf652cbb8d95daeb0.
2015-08-26 19:44:04 +00:00
Noah Levitt
78ca070165 use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily 2015-08-26 19:22:59 +00:00
Noah Levitt
efa640c640 refactor to simplify starting new job from code 2015-08-25 19:52:33 +00:00
Noah Levitt
68de85022a there is no hq anymore; database notes can still be found in git history, though there's nothing about rethinkdb 2015-08-21 17:55:29 +00:00
Noah Levitt
231d019659 use nlevitt fork of surt library for less stupid handling of mailto: urls, etc 2015-08-20 21:23:59 +00:00
Noah Levitt
ee50818dca if database already exists but tables don't, just create them 2015-08-20 21:23:08 +00:00
Noah Levitt
3af1e10e13 make it work again, and list discovered outlinks 2015-08-20 21:22:08 +00:00
Noah Levitt
8b45d7eb69 since I can't figure out what's causing these sporadic errors fetching certain robots.txt through warcprox, stick a retry loop around the fetch 2015-08-19 22:50:04 +00:00
Noah Levitt
ad543e6134 enforce time limits; move scope_and_schedule_outlinks into frontier.py; fix bugs around scoping on seed redirect 2015-08-19 20:16:25 +00:00
Noah Levitt
ddce1cdc71 fix mistakenly removed import; try to shut down chrome in case of unexpected exception 2015-08-19 20:04:46 +00:00
Noah Levitt
2533229fa1 add __all__ to modules 2015-08-19 19:01:28 +00:00
Noah Levitt
b7df0a1f37 make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py 2015-08-19 18:45:19 +00:00
Noah Levitt
b8506a2ab4 rename "db" to "frontier" 2015-08-19 17:47:05 +00:00
Noah Levitt
cd3a644298 switch order of brozzle_count and claimed in priority_by_site index to fix has_outstanding_pages check 2015-08-19 00:04:20 +00:00
Noah Levitt
382c826678 rethinkdb connection per request, to server chosen randomly from list 2015-08-18 23:47:28 +00:00
Noah Levitt
a878730e02 goodbye sqlite and rabbitmq, hello rethinkdb 2015-08-18 21:44:54 +00:00
Noah Levitt
e6fbf0e2e9 rename brozzler-add-site to brozzler-new-site to match brozzler-new-job et al 2015-08-17 22:48:25 +00:00
Noah Levitt
6b6583e63a more notes on choosing a db 2015-08-13 01:01:35 +00:00
Noah Levitt
e68c98e66d brozzle a site for 5 minutes at a time instead of 1 for now 2015-08-11 18:15:16 +00:00
Noah Levitt
fc75e18928 handle "aw snap" or "he's dead jim" from chrome 2015-08-11 18:14:53 +00:00
Noah Levitt
3d70776ce3 some thoughts on distributed database 2015-08-11 18:06:58 +00:00
Noah Levitt
ce154fc3db more robustness improvements 2015-08-10 20:11:46 +00:00
Noah Levitt
e96b16e19a support for max_hops scope rule 2015-08-07 22:36:39 +00:00
Noah Levitt
a47292dab5 thread to read and selectively log output from chrome 2015-08-07 22:36:07 +00:00
Noah Levitt
2a7a0b7c30 little fix, tweak 2015-08-05 00:17:43 +00:00
Noah Levitt
b6beac3807 new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file 2015-08-04 19:52:01 +00:00
Noah Levitt
4624f47402 Merge pull request #41 from ldko/add-routing-key
Adds routing_key to Queue creation
2015-08-03 12:39:26 -07:00
Noah Levitt
e6eeca6ae2 handle 420 Reached limit when fetching robots in brozzler-hq 2015-08-01 17:54:29 +00:00
Noah Levitt
511e19ff4d handle 420 "Limit reached" when browser receives it 2015-08-01 01:26:59 +00:00
Noah Levitt
f5acb6c34b make requests library dependency explicit 2015-08-01 01:25:07 +00:00
Noah Levitt
7b98af7d9f handle reached limit response from warcprox 2015-08-01 00:09:57 +00:00
Lauren Ko
d4a783285e Adds routing_key to queue Queue creation 2015-07-31 14:15:18 -05:00
Noah Levitt
11fbbc9d49 change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc 2015-07-31 00:03:13 +00:00
Noah Levitt
8366bd2d66 refactor to simplify run() 2015-07-28 01:12:41 +00:00
Noah Levitt
5c701abb36 reject urls with scheme other than http/https (for now) 2015-07-28 01:11:26 +00:00
Noah Levitt
a0a0b0ff2c use nlevitt brozzler branch of youtube-dl 2015-07-28 01:10:39 +00:00
Noah Levitt
060b796d78 avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true) 2015-07-24 01:40:35 +00:00
Noah Levitt
a04bf04307 keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over 2015-07-23 02:19:25 +00:00
Noah Levitt
4dacc0b087 new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue 2015-07-23 01:21:23 +00:00
Noah Levitt
6e6fd5dc2c don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future) 2015-07-23 00:44:33 +00:00
Noah Levitt
5d5151584c fix another dumb little bug in handling exceptions from youtube_dl 2015-07-23 00:41:26 +00:00
Noah Levitt
85a863b1e3 change argument to --amqp-url for clarity and consistency 2015-07-23 00:39:57 +00:00
Noah Levitt
6a09f2095c handle exceptions in robots.txt fetching/parsing 2015-07-22 00:54:49 +00:00
Noah Levitt
f00571f7bd fix youtube-dl exception handling 2015-07-22 00:53:39 +00:00
Noah Levitt
83a8e7cbe5 fix bug when --extra-header switch is not supplied 2015-07-21 20:39:41 +00:00