brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2025-02-24 08:39:59 -05:00

Author	SHA1	Message	Date
Noah Levitt	5c701abb36	reject urls with scheme other than http/https (for now)	2015-07-28 01:11:26 +00:00
Noah Levitt	a0a0b0ff2c	use nlevitt brozzler branch of youtube-dl	2015-07-28 01:10:39 +00:00
Noah Levitt	060b796d78	avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true)	2015-07-24 01:40:35 +00:00
Noah Levitt	a04bf04307	keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over	2015-07-23 02:19:25 +00:00
Noah Levitt	4dacc0b087	new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue	2015-07-23 01:21:23 +00:00
Noah Levitt	6e6fd5dc2c	don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future)	2015-07-23 00:44:33 +00:00
Noah Levitt	5d5151584c	fix another dumb little bug in handling exceptions from youtube_dl	2015-07-23 00:41:26 +00:00
Noah Levitt	85a863b1e3	change argument to --amqp-url for clarity and consistency	2015-07-23 00:39:57 +00:00
Noah Levitt	6a09f2095c	handle exceptions in robots.txt fetching/parsing	2015-07-22 00:54:49 +00:00
Noah Levitt	f00571f7bd	fix youtube-dl exception handling	2015-07-22 00:53:39 +00:00
Noah Levitt	83a8e7cbe5	fix bug when --extra-header switch is not supplied	2015-07-21 20:39:41 +00:00
Noah Levitt	f9c049a69e	navigate to about:blank before the real url to avoid situation where we navigate to the same page that we're currently on, perhaps with a different #fragment, which prevents Page.loadEventFired from happening	2015-07-21 20:39:19 +00:00
Noah Levitt	88f352efea	use new fork of youtube-dl with support for extra http headers on every request	2015-07-21 19:23:01 +00:00
Noah Levitt	b5cb94fc8b	some additional logging and error handling to avoid mysterious messages	2015-07-21 06:33:02 +00:00
Noah Levitt	1e56bc8686	add only one site at a time, specify settings with command line switches	2015-07-21 06:32:00 +00:00
Noah Levitt	38ddfe498d	require my "tweaks" branch of websocket-client	2015-07-20 16:06:47 -07:00
Noah Levitt	dc04048d50	add some info to the readme	2015-07-20 12:00:14 -07:00
Noah Levitt	2f28f00a09	make putmeta requests respect site configured extra_headers	2015-07-17 16:52:06 -07:00
Noah Levitt	2ba5bd4d4b	support adding extra http request headers	2015-07-17 13:45:27 -07:00
Noah Levitt	c178ed1950	fix buglet	2015-07-16 18:43:14 -07:00
Noah Levitt	a54e60dbaf	change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix	2015-07-16 18:39:29 -07:00
Noah Levitt	d2650a2547	update scope if seed redirects	2015-07-16 18:27:47 -07:00
Noah Levitt	140a441eb5	honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs	2015-07-16 17:19:12 -07:00
Noah Levitt	e04247c3f7	add support for supplying json blob defining site with configuration to brozzler-add-site	2015-07-16 14:48:01 -07:00
Noah Levitt	6b2ee9faee	chmod -x worker.py	2015-07-15 18:03:49 -07:00
Noah Levitt	f2bc7ec271	refactor brozzler.hq.Site and brozzler.url.CrawlUrl into new brozzler.site package; fix bugs in robots.txt handling	2015-07-15 18:03:03 -07:00
Noah Levitt	a9c51edd84	robots cache per site, and some so far unused support for site level configuration	2015-07-15 17:44:42 -07:00
Noah Levitt	efa3cd6269	don't set http_proxy environment variable, because it affects things we don't want it to	2015-07-15 17:33:29 -07:00
Noah Levitt	923cd98652	save screenshots as metadata records using warcprox PUTMETA (same format as kenji's wide crawl)	2015-07-15 16:32:02 -07:00
Noah Levitt	5aea76ab6d	refactor worker code into worker module	2015-07-15 15:42:40 -07:00
Noah Levitt	7b92ba39c7	avoid printing stack trace on normal youtube_dl unsupported condition (still prints error message unfortunately)	2015-07-15 14:33:22 -07:00
Noah Levitt	9b13f0c34c	refactor hq code into hq module	2015-07-15 14:27:21 -07:00
Noah Levitt	4cfb287397	refactor hq code into hq module	2015-07-15 14:26:48 -07:00
Noah Levitt	9b5da57d7e	initial youtube-dl support, including saving youtube-dl derived json with warcprox by sending a PUTMETA request, if new option --enable-warcprox-features is enabled	2015-07-14 18:57:45 -07:00
Noah Levitt	fd0c3322ee	update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff	2015-07-13 17:09:39 -07:00
Noah Levitt	3eff099b16	determine if youtube-dl can do something with a url	2015-07-13 16:40:56 -07:00
Noah Levitt	6470a8ef26	sigquit dumps thread traces	2015-07-13 15:57:14 -07:00
Noah Levitt	18ca996216	rudimentary robots.txt support	2015-07-13 15:56:54 -07:00
Noah Levitt	eb74967fed	brozzler-worker round-robins sites needing crawling	2015-07-13 12:13:41 -07:00
Noah Levitt	ddd764cac5	brozzler-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox)	2015-07-11 23:07:47 -07:00
Noah Levitt	b0f3b8a5e3	clean shutdown for brozzler-hq	2015-07-11 18:18:54 -07:00
Noah Levitt	384120928c	set in_progress=0 for completed url	2015-07-11 13:24:38 -07:00
Noah Levitt	610f9c8cf4	add missing file hq.py, improve some logging, fix little race condition bug	2015-07-11 13:09:45 -07:00
Noah Levitt	bb3561a690	check scope (on hq side), fix buglets	2015-07-11 12:33:19 -07:00
Noah Levitt	1fb336cb2e	crawling outlinks not totally working	2015-07-11 02:29:19 -07:00
Noah Levitt	56a7bb7306	submit outlinks to hq	2015-07-10 21:31:41 -07:00
Noah Levitt	fd99764baa	brozzler-worker partially working	2015-07-10 21:07:47 -07:00
Noah Levitt	8aa1e6715a	feed seed url to the crawl url queue	2015-07-10 20:12:33 -07:00
Noah Levitt	1d068f4f86	starting work on brozzler crawl hq	2015-07-10 18:01:54 -07:00
Noah Levitt	fcc63b6675	fancier prioritization takes into account hops from seed, path depth; and clean shutdown	2015-07-09 22:35:37 -07:00

1555 Commits