Noah Levitt
|
e6eeca6ae2
|
handle 420 Reached limit when fetching robots in brozzler-hq
|
2015-08-01 17:54:29 +00:00 |
|
Noah Levitt
|
511e19ff4d
|
handle 420 "Limit reached" when browser receives it
|
2015-08-01 01:26:59 +00:00 |
|
Noah Levitt
|
f5acb6c34b
|
make requests library dependency explicit
|
2015-08-01 01:25:07 +00:00 |
|
Noah Levitt
|
7b98af7d9f
|
handle reached limit response from warcprox
|
2015-08-01 00:09:57 +00:00 |
|
Lauren Ko
|
d4a783285e
|
Adds routing_key to queue Queue creation
|
2015-07-31 14:15:18 -05:00 |
|
Noah Levitt
|
11fbbc9d49
|
change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc
|
2015-07-31 00:03:13 +00:00 |
|
Noah Levitt
|
8366bd2d66
|
refactor to simplify run()
|
2015-07-28 01:12:41 +00:00 |
|
Noah Levitt
|
5c701abb36
|
reject urls with scheme other than http/https (for now)
|
2015-07-28 01:11:26 +00:00 |
|
Noah Levitt
|
a0a0b0ff2c
|
use nlevitt brozzler branch of youtube-dl
|
2015-07-28 01:10:39 +00:00 |
|
Noah Levitt
|
060b796d78
|
avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true)
|
2015-07-24 01:40:35 +00:00 |
|
Noah Levitt
|
a04bf04307
|
keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over
|
2015-07-23 02:19:25 +00:00 |
|
Noah Levitt
|
4dacc0b087
|
new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue
|
2015-07-23 01:21:23 +00:00 |
|
Noah Levitt
|
6e6fd5dc2c
|
don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future)
|
2015-07-23 00:44:33 +00:00 |
|
Noah Levitt
|
5d5151584c
|
fix another dumb little bug in handling exceptions from youtube_dl
|
2015-07-23 00:41:26 +00:00 |
|
Noah Levitt
|
85a863b1e3
|
change argument to --amqp-url for clarity and consistency
|
2015-07-23 00:39:57 +00:00 |
|
Noah Levitt
|
6a09f2095c
|
handle exceptions in robots.txt fetching/parsing
|
2015-07-22 00:54:49 +00:00 |
|
Noah Levitt
|
f00571f7bd
|
fix youtube-dl exception handling
|
2015-07-22 00:53:39 +00:00 |
|
Noah Levitt
|
83a8e7cbe5
|
fix bug when --extra-header switch is not supplied
|
2015-07-21 20:39:41 +00:00 |
|
Noah Levitt
|
f9c049a69e
|
navigate to about:blank before the real url to avoid situation where we navigate to the same page that we're currently on, perhaps with a different #fragment, which prevents Page.loadEventFired from happening
|
2015-07-21 20:39:19 +00:00 |
|
Noah Levitt
|
88f352efea
|
use new fork of youtube-dl with support for extra http headers on every request
|
2015-07-21 19:23:01 +00:00 |
|
Noah Levitt
|
b5cb94fc8b
|
some additional logging and error handling to avoid mysterious messages
|
2015-07-21 06:33:02 +00:00 |
|
Noah Levitt
|
1e56bc8686
|
add only one site at a time, specify settings with command line switches
|
2015-07-21 06:32:00 +00:00 |
|
Noah Levitt
|
38ddfe498d
|
require my "tweaks" branch of websocket-client
|
2015-07-20 16:06:47 -07:00 |
|
Noah Levitt
|
dc04048d50
|
add some info to the readme
|
2015-07-20 12:00:14 -07:00 |
|
Noah Levitt
|
2f28f00a09
|
make putmeta requests respect site configured extra_headers
|
2015-07-17 16:52:06 -07:00 |
|
Noah Levitt
|
2ba5bd4d4b
|
support adding extra http request headers
|
2015-07-17 13:45:27 -07:00 |
|
Noah Levitt
|
c178ed1950
|
fix buglet
|
2015-07-16 18:43:14 -07:00 |
|
Noah Levitt
|
a54e60dbaf
|
change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix
|
2015-07-16 18:39:29 -07:00 |
|
Noah Levitt
|
d2650a2547
|
update scope if seed redirects
|
2015-07-16 18:27:47 -07:00 |
|
Noah Levitt
|
140a441eb5
|
honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs
|
2015-07-16 17:19:12 -07:00 |
|
Noah Levitt
|
e04247c3f7
|
add support for supplying json blob defining site with configuration to brozzler-add-site
|
2015-07-16 14:48:01 -07:00 |
|
Noah Levitt
|
6b2ee9faee
|
chmod -x worker.py
|
2015-07-15 18:03:49 -07:00 |
|
Noah Levitt
|
f2bc7ec271
|
refactor brozzler.hq.Site and brozzler.url.CrawlUrl into new brozzler.site package; fix bugs in robots.txt handling
|
2015-07-15 18:03:03 -07:00 |
|
Noah Levitt
|
a9c51edd84
|
robots cache per site, and some so far unused support for site level configuration
|
2015-07-15 17:44:42 -07:00 |
|
Noah Levitt
|
efa3cd6269
|
don't set http_proxy environment variable, because it affects things we don't want it to
|
2015-07-15 17:33:29 -07:00 |
|
Noah Levitt
|
923cd98652
|
save screenshots as metadata records using warcprox PUTMETA (same format as kenji's wide crawl)
|
2015-07-15 16:32:02 -07:00 |
|
Noah Levitt
|
5aea76ab6d
|
refactor worker code into worker module
|
2015-07-15 15:42:40 -07:00 |
|
Noah Levitt
|
7b92ba39c7
|
avoid printing stack trace on normal youtube_dl unsupported condition (still prints error message unfortunately)
|
2015-07-15 14:33:22 -07:00 |
|
Noah Levitt
|
9b13f0c34c
|
refactor hq code into hq module
|
2015-07-15 14:27:21 -07:00 |
|
Noah Levitt
|
4cfb287397
|
refactor hq code into hq module
|
2015-07-15 14:26:48 -07:00 |
|
Noah Levitt
|
9b5da57d7e
|
initial youtube-dl support, including saving youtube-dl derived json with warcprox by sending a PUTMETA request, if new option --enable-warcprox-features is enabled
|
2015-07-14 18:57:45 -07:00 |
|
Noah Levitt
|
fd0c3322ee
|
update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff
|
2015-07-13 17:09:39 -07:00 |
|
Noah Levitt
|
3eff099b16
|
determine if youtube-dl can do something with a url
|
2015-07-13 16:40:56 -07:00 |
|
Noah Levitt
|
6470a8ef26
|
sigquit dumps thread traces
|
2015-07-13 15:57:14 -07:00 |
|
Noah Levitt
|
18ca996216
|
rudimentary robots.txt support
|
2015-07-13 15:56:54 -07:00 |
|
Noah Levitt
|
eb74967fed
|
brozzler-worker round-robins sites needing crawling
|
2015-07-13 12:13:41 -07:00 |
|
Noah Levitt
|
ddd764cac5
|
brozzler-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox)
|
2015-07-11 23:07:47 -07:00 |
|
Noah Levitt
|
b0f3b8a5e3
|
clean shutdown for brozzler-hq
|
2015-07-11 18:18:54 -07:00 |
|
Noah Levitt
|
384120928c
|
set in_progress=0 for completed url
|
2015-07-11 13:24:38 -07:00 |
|
Noah Levitt
|
610f9c8cf4
|
add missing file hq.py, improve some logging, fix little race condition bug
|
2015-07-11 13:09:45 -07:00 |
|