1555 Commits

Author SHA1 Message Date
Noah Levitt
5c701abb36 reject urls with scheme other than http/https (for now) 2015-07-28 01:11:26 +00:00
Noah Levitt
a0a0b0ff2c use nlevitt brozzler branch of youtube-dl 2015-07-28 01:10:39 +00:00
Noah Levitt
060b796d78 avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true) 2015-07-24 01:40:35 +00:00
Noah Levitt
a04bf04307 keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over 2015-07-23 02:19:25 +00:00
Noah Levitt
4dacc0b087 new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue 2015-07-23 01:21:23 +00:00
Noah Levitt
6e6fd5dc2c don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future) 2015-07-23 00:44:33 +00:00
Noah Levitt
5d5151584c fix another dumb little bug in handling exceptions from youtube_dl 2015-07-23 00:41:26 +00:00
Noah Levitt
85a863b1e3 change argument to --amqp-url for clarity and consistency 2015-07-23 00:39:57 +00:00
Noah Levitt
6a09f2095c handle exceptions in robots.txt fetching/parsing 2015-07-22 00:54:49 +00:00
Noah Levitt
f00571f7bd fix youtube-dl exception handling 2015-07-22 00:53:39 +00:00
Noah Levitt
83a8e7cbe5 fix bug when --extra-header switch is not supplied 2015-07-21 20:39:41 +00:00
Noah Levitt
f9c049a69e navigate to about:blank before the real url to avoid situation where we navigate to the same page that we're currently on, perhaps with a different #fragment, which prevents Page.loadEventFired from happening 2015-07-21 20:39:19 +00:00
Noah Levitt
88f352efea use new fork of youtube-dl with support for extra http headers on every request 2015-07-21 19:23:01 +00:00
Noah Levitt
b5cb94fc8b some additional logging and error handling to avoid mysterious messages 2015-07-21 06:33:02 +00:00
Noah Levitt
1e56bc8686 add only one site at a time, specify settings with command line switches 2015-07-21 06:32:00 +00:00
Noah Levitt
38ddfe498d require my "tweaks" branch of websocket-client 2015-07-20 16:06:47 -07:00
Noah Levitt
dc04048d50 add some info to the readme 2015-07-20 12:00:14 -07:00
Noah Levitt
2f28f00a09 make putmeta requests respect site configured extra_headers 2015-07-17 16:52:06 -07:00
Noah Levitt
2ba5bd4d4b support adding extra http request headers 2015-07-17 13:45:27 -07:00
Noah Levitt
c178ed1950 fix buglet 2015-07-16 18:43:14 -07:00
Noah Levitt
a54e60dbaf change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix 2015-07-16 18:39:29 -07:00
Noah Levitt
d2650a2547 update scope if seed redirects 2015-07-16 18:27:47 -07:00
Noah Levitt
140a441eb5 honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs 2015-07-16 17:19:12 -07:00
Noah Levitt
e04247c3f7 add support for supplying json blob defining site with configuration to brozzler-add-site 2015-07-16 14:48:01 -07:00
Noah Levitt
6b2ee9faee chmod -x worker.py 2015-07-15 18:03:49 -07:00
Noah Levitt
f2bc7ec271 refactor brozzler.hq.Site and brozzler.url.CrawlUrl into new brozzler.site package; fix bugs in robots.txt handling 2015-07-15 18:03:03 -07:00
Noah Levitt
a9c51edd84 robots cache per site, and some so far unused support for site level configuration 2015-07-15 17:44:42 -07:00
Noah Levitt
efa3cd6269 don't set http_proxy environment variable, because it affects things we don't want it to 2015-07-15 17:33:29 -07:00
Noah Levitt
923cd98652 save screenshots as metadata records using warcprox PUTMETA (same format as kenji's wide crawl) 2015-07-15 16:32:02 -07:00
Noah Levitt
5aea76ab6d refactor worker code into worker module 2015-07-15 15:42:40 -07:00
Noah Levitt
7b92ba39c7 avoid printing stack trace on normal youtube_dl unsupported condition (still prints error message unfortunately) 2015-07-15 14:33:22 -07:00
Noah Levitt
9b13f0c34c refactor hq code into hq module 2015-07-15 14:27:21 -07:00
Noah Levitt
4cfb287397 refactor hq code into hq module 2015-07-15 14:26:48 -07:00
Noah Levitt
9b5da57d7e initial youtube-dl support, including saving youtube-dl derived json with warcprox by sending a PUTMETA request, if new option --enable-warcprox-features is enabled 2015-07-14 18:57:45 -07:00
Noah Levitt
fd0c3322ee update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff 2015-07-13 17:09:39 -07:00
Noah Levitt
3eff099b16 determine if youtube-dl can do something with a url 2015-07-13 16:40:56 -07:00
Noah Levitt
6470a8ef26 sigquit dumps thread traces 2015-07-13 15:57:14 -07:00
Noah Levitt
18ca996216 rudimentary robots.txt support 2015-07-13 15:56:54 -07:00
Noah Levitt
eb74967fed brozzler-worker round-robins sites needing crawling 2015-07-13 12:13:41 -07:00
Noah Levitt
ddd764cac5 brozzler-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox) 2015-07-11 23:07:47 -07:00
Noah Levitt
b0f3b8a5e3 clean shutdown for brozzler-hq 2015-07-11 18:18:54 -07:00
Noah Levitt
384120928c set in_progress=0 for completed url 2015-07-11 13:24:38 -07:00
Noah Levitt
610f9c8cf4 add missing file hq.py, improve some logging, fix little race condition bug 2015-07-11 13:09:45 -07:00
Noah Levitt
bb3561a690 check scope (on hq side), fix buglets 2015-07-11 12:33:19 -07:00
Noah Levitt
1fb336cb2e crawling outlinks not totally working 2015-07-11 02:29:19 -07:00
Noah Levitt
56a7bb7306 submit outlinks to hq 2015-07-10 21:31:41 -07:00
Noah Levitt
fd99764baa brozzler-worker partially working 2015-07-10 21:07:47 -07:00
Noah Levitt
8aa1e6715a feed seed url to the crawl url queue 2015-07-10 20:12:33 -07:00
Noah Levitt
1d068f4f86 starting work on brozzler crawl hq 2015-07-10 18:01:54 -07:00
Noah Levitt
fcc63b6675 fancier prioritization takes into account hops from seed, path depth; and clean shutdown 2015-07-09 22:35:37 -07:00