1308 Commits

Author SHA1 Message Date
Hunter Stern
5ccc535f51 More changes 2015-09-16 09:23:13 -07:00
Hunter Stern
3467670900 More changes for handling psu24 site 2015-09-15 18:03:08 -07:00
Noah Levitt
5a6cbf01da Dockerfile for brozzler worker 2015-09-15 23:02:37 +00:00
Hunter Stern
ea41653c44 Pulled in changes from https://github.com/nlevitt/umbra/tree/aitfive-451-alt 2015-09-15 11:53:53 -07:00
Noah Levitt
70308c10f4 shouldn't have local paths as requirements 2015-09-15 18:07:47 +00:00
Noah Levitt
b30cc2d68b simpler implementation for https://github.com/internetarchive/umbra/pull/42/files 2015-09-14 17:57:01 -07:00
Noah Levitt
dc9d1a4959 detecting job finish seems to be working now 2015-09-10 01:38:31 +00:00
Noah Levitt
92a288bc35 detect jobs finishing! (not well tested yet) 2015-09-09 22:11:48 +00:00
Noah Levitt
72e72e03c4 brozzler-job-starter.py -> ait-brozzler-boss.py 2015-09-09 22:11:14 +00:00
Noah Levitt
1b94d10723 on reset, mark active jobs as finished 2015-09-08 22:38:39 +00:00
Noah Levitt
290ea433a5 save full size screenshot as jpeg too 2015-09-08 22:37:35 +00:00
Noah Levitt
9698b0f847 create thumbnail of screenshot and send to warcprox 2015-09-07 06:27:21 +00:00
Noah Levitt
565ab5f936 save screenshots with new scheme url screenshot:..., WARC-Type:resource 2015-09-07 00:26:37 +00:00
Noah Levitt
993ae6a833 run ait5 partner webapp; consolidate "status" and "fullstatus" 2015-09-04 21:02:33 +00:00
Noah Levitt
5fe2805285 fix bug claiming site, looks like there could be a race condition with other worker claiming the same site 2015-09-04 01:36:29 +00:00
Noah Levitt
3c23aa8fd4 finally, the jobs table 2015-09-03 01:05:03 +00:00
Noah Levitt
6cda4739b8 log exception when thread dies (seems to be dying silently sometimes) 2015-09-03 01:04:41 +00:00
Noah Levitt
839bf6f4ae script to help with starting/restarting/etc in my dev environment 2015-09-03 01:03:19 +00:00
Noah Levitt
f334107b47 support for specifying rethinkdb database name; wrap rethinkdb operations and retry if appropriate (as best as we can tell) 2015-08-28 00:37:26 +00:00
Noah Levitt
cf91fb1377 Revert "use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily"
Ugh.. too much pain, not worth the time to figure out the magic #egg=
incantation.

This reverts commit 78ca0701651c35bda69122ddf652cbb8d95daeb0.
2015-08-26 19:44:04 +00:00
Noah Levitt
78ca070165 use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily 2015-08-26 19:22:59 +00:00
Noah Levitt
efa640c640 refactor to simplify starting new job from code 2015-08-25 19:52:33 +00:00
Noah Levitt
68de85022a there is no hq anymore; database notes can still be found in git history, though there's nothing about rethinkdb 2015-08-21 17:55:29 +00:00
Noah Levitt
231d019659 use nlevitt fork of surt library for less stupid handling of mailto: urls, etc 2015-08-20 21:23:59 +00:00
Noah Levitt
ee50818dca if database already exists but tables don't, just create them 2015-08-20 21:23:08 +00:00
Noah Levitt
3af1e10e13 make it work again, and list discovered outlinks 2015-08-20 21:22:08 +00:00
Noah Levitt
8b45d7eb69 since I can't figure out what's causing these sporadic errors fetching certain robots.txt through warcprox, stick a retry loop around the fetch 2015-08-19 22:50:04 +00:00
Noah Levitt
ad543e6134 enforce time limits; move scope_and_schedule_outlinks into frontier.py; fix bugs around scoping on seed redirect 2015-08-19 20:16:25 +00:00
Noah Levitt
ddce1cdc71 fix mistakenly removed import; try to shut down chrome in case of unexpected exception 2015-08-19 20:04:46 +00:00
Noah Levitt
2533229fa1 add __all__ to modules 2015-08-19 19:01:28 +00:00
Noah Levitt
b7df0a1f37 make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py 2015-08-19 18:45:19 +00:00
Noah Levitt
b8506a2ab4 rename "db" to "frontier" 2015-08-19 17:47:05 +00:00
Noah Levitt
cd3a644298 switch order of brozzle_count and claimed in priority_by_site index to fix has_outstanding_pages check 2015-08-19 00:04:20 +00:00
Noah Levitt
382c826678 rethinkdb connection per request, to server chosen randomly from list 2015-08-18 23:47:28 +00:00
Noah Levitt
a878730e02 goodbye sqlite and rabbitmq, hello rethinkdb 2015-08-18 21:44:54 +00:00
Noah Levitt
e6fbf0e2e9 rename brozzler-add-site to brozzler-new-site to match brozzler-new-job et al 2015-08-17 22:48:25 +00:00
Noah Levitt
6b6583e63a more notes on choosing a db 2015-08-13 01:01:35 +00:00
Noah Levitt
e68c98e66d brozzle a site for 5 minutes at a time instead of 1 for now 2015-08-11 18:15:16 +00:00
Noah Levitt
fc75e18928 handle "aw snap" or "he's dead jim" from chrome 2015-08-11 18:14:53 +00:00
Noah Levitt
3d70776ce3 some thoughts on distributed database 2015-08-11 18:06:58 +00:00
Noah Levitt
ce154fc3db more robustness improvements 2015-08-10 20:11:46 +00:00
Noah Levitt
e96b16e19a support for max_hops scope rule 2015-08-07 22:36:39 +00:00
Noah Levitt
a47292dab5 thread to read and selectively log output from chrome 2015-08-07 22:36:07 +00:00
Noah Levitt
2a7a0b7c30 little fix, tweak 2015-08-05 00:17:43 +00:00
Noah Levitt
b6beac3807 new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file 2015-08-04 19:52:01 +00:00
Noah Levitt
4624f47402 Merge pull request #41 from ldko/add-routing-key
Adds routing_key to Queue creation
2015-08-03 12:39:26 -07:00
Noah Levitt
e6eeca6ae2 handle 420 Reached limit when fetching robots in brozzler-hq 2015-08-01 17:54:29 +00:00
Noah Levitt
511e19ff4d handle 420 "Limit reached" when browser receives it 2015-08-01 01:26:59 +00:00
Noah Levitt
f5acb6c34b make requests library dependency explicit 2015-08-01 01:25:07 +00:00
Noah Levitt
7b98af7d9f handle reached limit response from warcprox 2015-08-01 00:09:57 +00:00