Noah Levitt
|
70308c10f4
|
shouldn't have local paths as requirements
|
2015-09-15 18:07:47 +00:00 |
|
Noah Levitt
|
b30cc2d68b
|
simpler implementation for https://github.com/internetarchive/umbra/pull/42/files
|
2015-09-14 17:57:01 -07:00 |
|
Noah Levitt
|
dc9d1a4959
|
detecting job finish seems to be working now
|
2015-09-10 01:38:31 +00:00 |
|
Noah Levitt
|
92a288bc35
|
detect jobs finishing! (not well tested yet)
|
2015-09-09 22:11:48 +00:00 |
|
Noah Levitt
|
72e72e03c4
|
brozzler-job-starter.py -> ait-brozzler-boss.py
|
2015-09-09 22:11:14 +00:00 |
|
Noah Levitt
|
1b94d10723
|
on reset, mark active jobs as finished
|
2015-09-08 22:38:39 +00:00 |
|
Noah Levitt
|
290ea433a5
|
save full size screenshot as jpeg too
|
2015-09-08 22:37:35 +00:00 |
|
Noah Levitt
|
9698b0f847
|
create thumbnail of screenshot and send to warcprox
|
2015-09-07 06:27:21 +00:00 |
|
Noah Levitt
|
565ab5f936
|
save screenshots with new scheme url screenshot:..., WARC-Type:resource
|
2015-09-07 00:26:37 +00:00 |
|
Noah Levitt
|
993ae6a833
|
run ait5 partner webapp; consolidate "status" and "fullstatus"
|
2015-09-04 21:02:33 +00:00 |
|
Noah Levitt
|
5fe2805285
|
fix bug claiming site, looks like there could be a race condition with other worker claiming the same site
|
2015-09-04 01:36:29 +00:00 |
|
Noah Levitt
|
3c23aa8fd4
|
finally, the jobs table
|
2015-09-03 01:05:03 +00:00 |
|
Noah Levitt
|
6cda4739b8
|
log exception when thread dies (seems to be dying silently sometimes)
|
2015-09-03 01:04:41 +00:00 |
|
Noah Levitt
|
839bf6f4ae
|
script to help with starting/restarting/etc in my dev environment
|
2015-09-03 01:03:19 +00:00 |
|
Noah Levitt
|
f334107b47
|
support for specifying rethinkdb database name; wrap rethinkdb operations and retry if appropriate (as best as we can tell)
|
2015-08-28 00:37:26 +00:00 |
|
Noah Levitt
|
cf91fb1377
|
Revert "use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily"
Ugh.. too much pain, not worth the time to figure out the magic #egg=
incantation.
This reverts commit 78ca0701651c35bda69122ddf652cbb8d95daeb0.
|
2015-08-26 19:44:04 +00:00 |
|
Noah Levitt
|
78ca070165
|
use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily
|
2015-08-26 19:22:59 +00:00 |
|
Noah Levitt
|
efa640c640
|
refactor to simplify starting new job from code
|
2015-08-25 19:52:33 +00:00 |
|
Noah Levitt
|
68de85022a
|
there is no hq anymore; database notes can still be found in git history, though there's nothing about rethinkdb
|
2015-08-21 17:55:29 +00:00 |
|
Noah Levitt
|
231d019659
|
use nlevitt fork of surt library for less stupid handling of mailto: urls, etc
|
2015-08-20 21:23:59 +00:00 |
|
Noah Levitt
|
ee50818dca
|
if database already exists but tables don't, just create them
|
2015-08-20 21:23:08 +00:00 |
|
Noah Levitt
|
3af1e10e13
|
make it work again, and list discovered outlinks
|
2015-08-20 21:22:08 +00:00 |
|
Noah Levitt
|
8b45d7eb69
|
since I can't figure out what's causing these sporadic errors fetching certain robots.txt through warcprox, stick a retry loop around the fetch
|
2015-08-19 22:50:04 +00:00 |
|
Noah Levitt
|
ad543e6134
|
enforce time limits; move scope_and_schedule_outlinks into frontier.py; fix bugs around scoping on seed redirect
|
2015-08-19 20:16:25 +00:00 |
|
Noah Levitt
|
ddce1cdc71
|
fix mistakenly removed import; try to shut down chrome in case of unexpected exception
|
2015-08-19 20:04:46 +00:00 |
|
Noah Levitt
|
2533229fa1
|
add __all__ to modules
|
2015-08-19 19:01:28 +00:00 |
|
Noah Levitt
|
b7df0a1f37
|
make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py
|
2015-08-19 18:45:19 +00:00 |
|
Noah Levitt
|
b8506a2ab4
|
rename "db" to "frontier"
|
2015-08-19 17:47:05 +00:00 |
|
Noah Levitt
|
cd3a644298
|
switch order of brozzle_count and claimed in priority_by_site index to fix has_outstanding_pages check
|
2015-08-19 00:04:20 +00:00 |
|
Noah Levitt
|
382c826678
|
rethinkdb connection per request, to server chosen randomly from list
|
2015-08-18 23:47:28 +00:00 |
|
Noah Levitt
|
a878730e02
|
goodbye sqlite and rabbitmq, hello rethinkdb
|
2015-08-18 21:44:54 +00:00 |
|
Noah Levitt
|
e6fbf0e2e9
|
rename brozzler-add-site to brozzler-new-site to match brozzler-new-job et al
|
2015-08-17 22:48:25 +00:00 |
|
Noah Levitt
|
6b6583e63a
|
more notes on choosing a db
|
2015-08-13 01:01:35 +00:00 |
|
Noah Levitt
|
e68c98e66d
|
brozzle a site for 5 minutes at a time instead of 1 for now
|
2015-08-11 18:15:16 +00:00 |
|
Noah Levitt
|
fc75e18928
|
handle "aw snap" or "he's dead jim" from chrome
|
2015-08-11 18:14:53 +00:00 |
|
Noah Levitt
|
3d70776ce3
|
some thoughts on distributed database
|
2015-08-11 18:06:58 +00:00 |
|
Noah Levitt
|
ce154fc3db
|
more robustness improvements
|
2015-08-10 20:11:46 +00:00 |
|
Noah Levitt
|
e96b16e19a
|
support for max_hops scope rule
|
2015-08-07 22:36:39 +00:00 |
|
Noah Levitt
|
a47292dab5
|
thread to read and selectively log output from chrome
|
2015-08-07 22:36:07 +00:00 |
|
Noah Levitt
|
2a7a0b7c30
|
little fix, tweak
|
2015-08-05 00:17:43 +00:00 |
|
Noah Levitt
|
b6beac3807
|
new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file
|
2015-08-04 19:52:01 +00:00 |
|
Noah Levitt
|
4624f47402
|
Merge pull request #41 from ldko/add-routing-key
Adds routing_key to Queue creation
|
2015-08-03 12:39:26 -07:00 |
|
Noah Levitt
|
e6eeca6ae2
|
handle 420 Reached limit when fetching robots in brozzler-hq
|
2015-08-01 17:54:29 +00:00 |
|
Noah Levitt
|
511e19ff4d
|
handle 420 "Limit reached" when browser receives it
|
2015-08-01 01:26:59 +00:00 |
|
Noah Levitt
|
f5acb6c34b
|
make requests library dependency explicit
|
2015-08-01 01:25:07 +00:00 |
|
Noah Levitt
|
7b98af7d9f
|
handle reached limit response from warcprox
|
2015-08-01 00:09:57 +00:00 |
|
Lauren Ko
|
d4a783285e
|
Adds routing_key to Queue creation
|
2015-07-31 14:15:18 -05:00 |
|
Noah Levitt
|
11fbbc9d49
|
change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc
|
2015-07-31 00:03:13 +00:00 |
|
Noah Levitt
|
8366bd2d66
|
refactor to simplify run()
|
2015-07-28 01:12:41 +00:00 |
|
Noah Levitt
|
5c701abb36
|
reject urls with scheme other than http/https (for now)
|
2015-07-28 01:11:26 +00:00 |
|