Noah Levitt
|
ddce1cdc71
|
fix mistakenly removed import; try to shut down chrome in case of unexpected exception
|
2015-08-19 20:04:46 +00:00 |
|
Noah Levitt
|
2533229fa1
|
add __all__ to modules
|
2015-08-19 19:01:28 +00:00 |
|
Noah Levitt
|
b7df0a1f37
|
make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py
|
2015-08-19 18:45:19 +00:00 |
|
Noah Levitt
|
b8506a2ab4
|
rename "db" to "frontier"
|
2015-08-19 17:47:05 +00:00 |
|
Noah Levitt
|
cd3a644298
|
switch order of brozzle_count and claimed in priority_by_site index to fix has_outstanding_pages check
|
2015-08-19 00:04:20 +00:00 |
|
Noah Levitt
|
382c826678
|
rethinkdb connection per request, to server chosen randomly from list
|
2015-08-18 23:47:28 +00:00 |
|
Noah Levitt
|
a878730e02
|
goodbye sqlite and rabbitmq, hello rethinkdb
|
2015-08-18 21:44:54 +00:00 |
|
Noah Levitt
|
e6fbf0e2e9
|
rename brozzler-add-site to brozzler-new-site to match brozzler-new-job et al
|
2015-08-17 22:48:25 +00:00 |
|
Noah Levitt
|
6b6583e63a
|
more notes on choosing a db
|
2015-08-13 01:01:35 +00:00 |
|
Noah Levitt
|
e68c98e66d
|
brozzle a site for 5 minutes at a time instead of 1 for now
|
2015-08-11 18:15:16 +00:00 |
|
Noah Levitt
|
fc75e18928
|
handle "aw snap" or "he's dead jim" from chrome
|
2015-08-11 18:14:53 +00:00 |
|
Noah Levitt
|
3d70776ce3
|
some thoughts on distributed database
|
2015-08-11 18:06:58 +00:00 |
|
Noah Levitt
|
ce154fc3db
|
more robustness improvements
|
2015-08-10 20:11:46 +00:00 |
|
Noah Levitt
|
e96b16e19a
|
support for max_hops scope rule
|
2015-08-07 22:36:39 +00:00 |
|
Noah Levitt
|
a47292dab5
|
thread to read and selectively log output from chrome
|
2015-08-07 22:36:07 +00:00 |
|
Noah Levitt
|
2a7a0b7c30
|
little fix, tweak
|
2015-08-05 00:17:43 +00:00 |
|
Noah Levitt
|
b6beac3807
|
new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file
|
2015-08-04 19:52:01 +00:00 |
|
Noah Levitt
|
4624f47402
|
Merge pull request #41 from ldko/add-routing-key
Adds routing_key to Queue creation
|
2015-08-03 12:39:26 -07:00 |
|
Noah Levitt
|
e6eeca6ae2
|
handle 420 Reached limit when fetching robots in brozzler-hq
|
2015-08-01 17:54:29 +00:00 |
|
Noah Levitt
|
511e19ff4d
|
handle 420 "Limit reached" when browser receives it
|
2015-08-01 01:26:59 +00:00 |
|
Noah Levitt
|
f5acb6c34b
|
make requests library dependency explicit
|
2015-08-01 01:25:07 +00:00 |
|
Noah Levitt
|
7b98af7d9f
|
handle reached limit response from warcprox
|
2015-08-01 00:09:57 +00:00 |
|
Lauren Ko
|
d4a783285e
|
Adds routing_key to queue Queue creation
|
2015-07-31 14:15:18 -05:00 |
|
Noah Levitt
|
11fbbc9d49
|
change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc
|
2015-07-31 00:03:13 +00:00 |
|
Noah Levitt
|
8366bd2d66
|
refactor to simplify run()
|
2015-07-28 01:12:41 +00:00 |
|
Noah Levitt
|
5c701abb36
|
reject urls with scheme other than http/https (for now)
|
2015-07-28 01:11:26 +00:00 |
|
Noah Levitt
|
a0a0b0ff2c
|
use nlevitt brozzler branch of youtube-dl
|
2015-07-28 01:10:39 +00:00 |
|
Noah Levitt
|
060b796d78
|
avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true)
|
2015-07-24 01:40:35 +00:00 |
|
Noah Levitt
|
a04bf04307
|
keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over
|
2015-07-23 02:19:25 +00:00 |
|
Noah Levitt
|
4dacc0b087
|
new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue
|
2015-07-23 01:21:23 +00:00 |
|
Noah Levitt
|
6e6fd5dc2c
|
don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future)
|
2015-07-23 00:44:33 +00:00 |
|
Noah Levitt
|
5d5151584c
|
fix another dumb little bug in handling exceptions from youtube_dl
|
2015-07-23 00:41:26 +00:00 |
|
Noah Levitt
|
85a863b1e3
|
change argument to --amqp-url for clarity and consistency
|
2015-07-23 00:39:57 +00:00 |
|
Noah Levitt
|
6a09f2095c
|
handle exceptions in robots.txt fetching/parsing
|
2015-07-22 00:54:49 +00:00 |
|
Noah Levitt
|
f00571f7bd
|
fix youtube-dl exception handling
|
2015-07-22 00:53:39 +00:00 |
|
Noah Levitt
|
83a8e7cbe5
|
fix bug when --extra-header switch is not supplied
|
2015-07-21 20:39:41 +00:00 |
|
Noah Levitt
|
f9c049a69e
|
navigate to about:blank before the real url to avoid situation where we navigate to the same page that we're currently on, perhaps with a different #fragment, which prevents Page.loadEventFired from happening
|
2015-07-21 20:39:19 +00:00 |
|
Noah Levitt
|
88f352efea
|
use new fork of youtube-dl with support for extra http headers on every request
|
2015-07-21 19:23:01 +00:00 |
|
Noah Levitt
|
b5cb94fc8b
|
some additional logging and error handling to avoid mysterious messages
|
2015-07-21 06:33:02 +00:00 |
|
Noah Levitt
|
1e56bc8686
|
add only one site at a time, specify settings with command line switches
|
2015-07-21 06:32:00 +00:00 |
|
Noah Levitt
|
38ddfe498d
|
require my "tweaks" branch of websocket-client
|
2015-07-20 16:06:47 -07:00 |
|
Noah Levitt
|
dc04048d50
|
add some info to the readme
|
2015-07-20 12:00:14 -07:00 |
|
Noah Levitt
|
2f28f00a09
|
make putmeta requests respect site configured extra_headers
|
2015-07-17 16:52:06 -07:00 |
|
Noah Levitt
|
2ba5bd4d4b
|
support adding extra http request headers
|
2015-07-17 13:45:27 -07:00 |
|
Noah Levitt
|
c178ed1950
|
fix buglet
|
2015-07-16 18:43:14 -07:00 |
|
Noah Levitt
|
a54e60dbaf
|
change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix
|
2015-07-16 18:39:29 -07:00 |
|
Noah Levitt
|
d2650a2547
|
update scope if seed redirects
|
2015-07-16 18:27:47 -07:00 |
|
Noah Levitt
|
140a441eb5
|
honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs
|
2015-07-16 17:19:12 -07:00 |
|
Noah Levitt
|
e04247c3f7
|
add support for supplying json blob defining site with configuration to brozzler-add-site
|
2015-07-16 14:48:01 -07:00 |
|
Noah Levitt
|
6b2ee9faee
|
chmod -x worker.py
|
2015-07-15 18:03:49 -07:00 |
|