brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2025-02-24 16:49:56 -05:00

Author	SHA1	Message	Date
Noah Levitt	e6eeca6ae2	handle 420 Reached limit when fetching robots in brozzler-hq	2015-08-01 17:54:29 +00:00
Noah Levitt	511e19ff4d	handle 420 "Limit reached" when browser receives it	2015-08-01 01:26:59 +00:00
Noah Levitt	f5acb6c34b	make requests library dependency explicit	2015-08-01 01:25:07 +00:00
Noah Levitt	7b98af7d9f	handle reached limit response from warcprox	2015-08-01 00:09:57 +00:00
Lauren Ko	d4a783285e	Adds routing_key to queue Queue creation	2015-07-31 14:15:18 -05:00
Noah Levitt	11fbbc9d49	change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc	2015-07-31 00:03:13 +00:00
Noah Levitt	8366bd2d66	refactor to simplify run()	2015-07-28 01:12:41 +00:00
Noah Levitt	5c701abb36	reject urls with scheme other than http/https (for now)	2015-07-28 01:11:26 +00:00
Noah Levitt	a0a0b0ff2c	use nlevitt brozzler branch of youtube-dl	2015-07-28 01:10:39 +00:00
Noah Levitt	060b796d78	avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true)	2015-07-24 01:40:35 +00:00
Noah Levitt	a04bf04307	keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over	2015-07-23 02:19:25 +00:00
Noah Levitt	4dacc0b087	new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue	2015-07-23 01:21:23 +00:00
Noah Levitt	6e6fd5dc2c	don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future)	2015-07-23 00:44:33 +00:00
Noah Levitt	5d5151584c	fix another dumb little bug in handling exceptions from youtube_dl	2015-07-23 00:41:26 +00:00
Noah Levitt	85a863b1e3	change argument to --amqp-url for clarity and consistency	2015-07-23 00:39:57 +00:00
Noah Levitt	6a09f2095c	handle exceptions in robots.txt fetching/parsing	2015-07-22 00:54:49 +00:00
Noah Levitt	f00571f7bd	fix youtube-dl exception handling	2015-07-22 00:53:39 +00:00
Noah Levitt	83a8e7cbe5	fix bug when --extra-header switch is not supplied	2015-07-21 20:39:41 +00:00
Noah Levitt	f9c049a69e	navigate to about:blank before the real url to avoid situation where we navigate to the same page that we're currently on, perhaps with a different #fragment, which prevents Page.loadEventFired from happening	2015-07-21 20:39:19 +00:00
Noah Levitt	88f352efea	use new fork of youtube-dl with support for extra http headers on every request	2015-07-21 19:23:01 +00:00
Noah Levitt	b5cb94fc8b	some additional logging and error handling to avoid mysterious messages	2015-07-21 06:33:02 +00:00
Noah Levitt	1e56bc8686	add only one site at a time, specify settings with command line switches	2015-07-21 06:32:00 +00:00
Noah Levitt	38ddfe498d	require my "tweaks" branch of websocket-client	2015-07-20 16:06:47 -07:00
Noah Levitt	dc04048d50	add some info to the readme	2015-07-20 12:00:14 -07:00
Noah Levitt	2f28f00a09	make putmeta requests respect site configured extra_headers	2015-07-17 16:52:06 -07:00
Noah Levitt	2ba5bd4d4b	support adding extra http request headers	2015-07-17 13:45:27 -07:00
Noah Levitt	c178ed1950	fix buglet	2015-07-16 18:43:14 -07:00
Noah Levitt	a54e60dbaf	change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix	2015-07-16 18:39:29 -07:00
Noah Levitt	d2650a2547	update scope if seed redirects	2015-07-16 18:27:47 -07:00
Noah Levitt	140a441eb5	honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs	2015-07-16 17:19:12 -07:00
Noah Levitt	e04247c3f7	add support for supplying json blob defining site with configuration to brozzler-add-site	2015-07-16 14:48:01 -07:00
Noah Levitt	6b2ee9faee	chmod -x worker.py	2015-07-15 18:03:49 -07:00
Noah Levitt	f2bc7ec271	refactor brozzler.hq.Site and brozzler.url.CrawlUrl into new brozzler.site package; fix bugs in robots.txt handling	2015-07-15 18:03:03 -07:00
Noah Levitt	a9c51edd84	robots cache per site, and some so far unused support for site level configuration	2015-07-15 17:44:42 -07:00
Noah Levitt	efa3cd6269	don't set http_proxy environment variable, because it affects things we don't want it to	2015-07-15 17:33:29 -07:00
Noah Levitt	923cd98652	save screenshots as metadata records using warcprox PUTMETA (same format as kenji's wide crawl)	2015-07-15 16:32:02 -07:00
Noah Levitt	5aea76ab6d	refactor worker code into worker module	2015-07-15 15:42:40 -07:00
Noah Levitt	7b92ba39c7	avoid printing stack trace on normal youtube_dl unsupported condition (still prints error message unfortunately)	2015-07-15 14:33:22 -07:00
Noah Levitt	9b13f0c34c	refactor hq code into hq module	2015-07-15 14:27:21 -07:00
Noah Levitt	4cfb287397	refactor hq code into hq module	2015-07-15 14:26:48 -07:00
Noah Levitt	9b5da57d7e	initial youtube-dl support, including saving youtube-dl derived json with warcprox by sending a PUTMETA request, if new option --enable-warcprox-features is enabled	2015-07-14 18:57:45 -07:00
Noah Levitt	fd0c3322ee	update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff	2015-07-13 17:09:39 -07:00
Noah Levitt	3eff099b16	determine if youtube-dl can do something with a url	2015-07-13 16:40:56 -07:00
Noah Levitt	6470a8ef26	sigquit dumps thread traces	2015-07-13 15:57:14 -07:00
Noah Levitt	18ca996216	rudimentary robots.txt support	2015-07-13 15:56:54 -07:00
Noah Levitt	eb74967fed	brozzler-worker round-robins sites needing crawling	2015-07-13 12:13:41 -07:00
Noah Levitt	ddd764cac5	brozzler-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox)	2015-07-11 23:07:47 -07:00
Noah Levitt	b0f3b8a5e3	clean shutdown for brozzler-hq	2015-07-11 18:18:54 -07:00
Noah Levitt	384120928c	set in_progress=0 for completed url	2015-07-11 13:24:38 -07:00
Noah Levitt	610f9c8cf4	add missing file hq.py, improve some logging, fix little race condition bug	2015-07-11 13:09:45 -07:00

1162 Commits