Noah Levitt
|
1b94d10723
|
on reset, mark active jobs as finished
|
2015-09-08 22:38:39 +00:00 |
|
Noah Levitt
|
290ea433a5
|
save full size screenshot as jpeg too
|
2015-09-08 22:37:35 +00:00 |
|
Noah Levitt
|
9698b0f847
|
create thumbnail of screenshot and send to warcprox
|
2015-09-07 06:27:21 +00:00 |
|
Noah Levitt
|
565ab5f936
|
save screenshots with new scheme url screenshot:..., WARC-Type:resource
|
2015-09-07 00:26:37 +00:00 |
|
Noah Levitt
|
993ae6a833
|
run ait5 partner webapp; consolidate "status" and "fullstatus"
|
2015-09-04 21:02:33 +00:00 |
|
Noah Levitt
|
5fe2805285
|
fix bug claiming site, looks like there could be a race condition with other worker claiming the same site
|
2015-09-04 01:36:29 +00:00 |
|
Noah Levitt
|
3c23aa8fd4
|
finally, the jobs table
|
2015-09-03 01:05:03 +00:00 |
|
Noah Levitt
|
6cda4739b8
|
log exception when thread dies (seems to be dying silently sometimes)
|
2015-09-03 01:04:41 +00:00 |
|
Noah Levitt
|
839bf6f4ae
|
script to help with starting/restarting/etc in my dev environment
|
2015-09-03 01:03:19 +00:00 |
|
Noah Levitt
|
f334107b47
|
support for specifying rethinkdb database name; wrap rethinkdb operations and retry if appropriate (as best as we can tell)
|
2015-08-28 00:37:26 +00:00 |
|
Noah Levitt
|
cf91fb1377
|
Revert "use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily"
Ugh.. too much pain, not worth the time to figure out the magic #egg=
incantation.
This reverts commit 78ca0701651c35bda69122ddf652cbb8d95daeb0.
|
2015-08-26 19:44:04 +00:00 |
|
Noah Levitt
|
78ca070165
|
use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily
|
2015-08-26 19:22:59 +00:00 |
|
Noah Levitt
|
efa640c640
|
refactor to simplify starting new job from code
|
2015-08-25 19:52:33 +00:00 |
|
Noah Levitt
|
68de85022a
|
there is no hq anymore; database notes can still be found in git history, though there's nothing about rethinkdb
|
2015-08-21 17:55:29 +00:00 |
|
Noah Levitt
|
231d019659
|
use nlevitt fork of surt library for less stupid handling of mailto: urls, etc
|
2015-08-20 21:23:59 +00:00 |
|
Noah Levitt
|
ee50818dca
|
if database already exists but tables don't, just create them
|
2015-08-20 21:23:08 +00:00 |
|
Noah Levitt
|
3af1e10e13
|
make it work again, and list discovered outlinks
|
2015-08-20 21:22:08 +00:00 |
|
Noah Levitt
|
8b45d7eb69
|
since I can't figure out what's causing these sporadic errors fetching certain robots.txt through warcprox, stick a retry loop around the fetch
|
2015-08-19 22:50:04 +00:00 |
|
Noah Levitt
|
ad543e6134
|
enforce time limits; move scope_and_schedule_outlinks into frontier.py; fix bugs around scoping on seed redirect
|
2015-08-19 20:16:25 +00:00 |
|
Noah Levitt
|
ddce1cdc71
|
fix mistakenly removed import; try to shut down chrome in case of unexpected exception
|
2015-08-19 20:04:46 +00:00 |
|
Noah Levitt
|
2533229fa1
|
add __all__ to modules
|
2015-08-19 19:01:28 +00:00 |
|
Noah Levitt
|
b7df0a1f37
|
make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py
|
2015-08-19 18:45:19 +00:00 |
|
Noah Levitt
|
b8506a2ab4
|
rename "db" to "frontier"
|
2015-08-19 17:47:05 +00:00 |
|
Noah Levitt
|
cd3a644298
|
switch order of brozzle_count and claimed in priority_by_site index to fix has_outstanding_pages check
|
2015-08-19 00:04:20 +00:00 |
|
Noah Levitt
|
382c826678
|
rethinkdb connection per request, to server chosen randomly from list
|
2015-08-18 23:47:28 +00:00 |
|
Noah Levitt
|
a878730e02
|
goodbye sqlite and rabbitmq, hello rethinkdb
|
2015-08-18 21:44:54 +00:00 |
|
Noah Levitt
|
e6fbf0e2e9
|
rename brozzler-add-site to brozzler-new-site to match brozzler-new-job et al
|
2015-08-17 22:48:25 +00:00 |
|
Noah Levitt
|
6b6583e63a
|
more notes on choosing a db
|
2015-08-13 01:01:35 +00:00 |
|
Noah Levitt
|
e68c98e66d
|
brozzle a site for 5 minutes at a time instead of 1 for now
|
2015-08-11 18:15:16 +00:00 |
|
Noah Levitt
|
fc75e18928
|
handle "aw snap" or "he's dead jim" from chrome
|
2015-08-11 18:14:53 +00:00 |
|
Noah Levitt
|
3d70776ce3
|
some thoughts on distributed database
|
2015-08-11 18:06:58 +00:00 |
|
Noah Levitt
|
ce154fc3db
|
more robustness improvements
|
2015-08-10 20:11:46 +00:00 |
|
Noah Levitt
|
e96b16e19a
|
support for max_hops scope rule
|
2015-08-07 22:36:39 +00:00 |
|
Noah Levitt
|
a47292dab5
|
thread to read and selectively log output from chrome
|
2015-08-07 22:36:07 +00:00 |
|
Noah Levitt
|
2a7a0b7c30
|
little fix, tweak
|
2015-08-05 00:17:43 +00:00 |
|
Noah Levitt
|
b6beac3807
|
new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file
|
2015-08-04 19:52:01 +00:00 |
|
Noah Levitt
|
4624f47402
|
Merge pull request #41 from ldko/add-routing-key
Adds routing_key to Queue creation
|
2015-08-03 12:39:26 -07:00 |
|
Noah Levitt
|
e6eeca6ae2
|
handle 420 Reached limit when fetching robots in brozzler-hq
|
2015-08-01 17:54:29 +00:00 |
|
Noah Levitt
|
511e19ff4d
|
handle 420 "Limit reached" when browser receives it
|
2015-08-01 01:26:59 +00:00 |
|
Noah Levitt
|
f5acb6c34b
|
make requests library dependency explicit
|
2015-08-01 01:25:07 +00:00 |
|
Noah Levitt
|
7b98af7d9f
|
handle reached limit response from warcprox
|
2015-08-01 00:09:57 +00:00 |
|
Lauren Ko
|
d4a783285e
|
Adds routing_key to Queue creation
|
2015-07-31 14:15:18 -05:00 |
|
Noah Levitt
|
11fbbc9d49
|
change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc
|
2015-07-31 00:03:13 +00:00 |
|
Noah Levitt
|
8366bd2d66
|
refactor to simplify run()
|
2015-07-28 01:12:41 +00:00 |
|
Noah Levitt
|
5c701abb36
|
reject urls with scheme other than http/https (for now)
|
2015-07-28 01:11:26 +00:00 |
|
Noah Levitt
|
a0a0b0ff2c
|
use nlevitt brozzler branch of youtube-dl
|
2015-07-28 01:10:39 +00:00 |
|
Noah Levitt
|
060b796d78
|
avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true)
|
2015-07-24 01:40:35 +00:00 |
|
Noah Levitt
|
a04bf04307
|
keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over
|
2015-07-23 02:19:25 +00:00 |
|
Noah Levitt
|
4dacc0b087
|
new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue
|
2015-07-23 01:21:23 +00:00 |
|
Noah Levitt
|
6e6fd5dc2c
|
don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future)
|
2015-07-23 00:44:33 +00:00 |
|