brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2025-02-24 08:39:59 -05:00

Author	SHA1	Message	Date
Noah Levitt	2f28f00a09	make putmeta requests respect site configured extra_headers	2015-07-17 16:52:06 -07:00
Noah Levitt	2ba5bd4d4b	support adding extra http request headers	2015-07-17 13:45:27 -07:00
Noah Levitt	c178ed1950	fix buglet	2015-07-16 18:43:14 -07:00
Noah Levitt	a54e60dbaf	change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix	2015-07-16 18:39:29 -07:00
Noah Levitt	d2650a2547	update scope if seed redirects	2015-07-16 18:27:47 -07:00
Noah Levitt	140a441eb5	honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs	2015-07-16 17:19:12 -07:00
Noah Levitt	e04247c3f7	add support for supplying json blob defining site with configuration to brozzler-add-site	2015-07-16 14:48:01 -07:00
Noah Levitt	6b2ee9faee	chmod -x worker.py	2015-07-15 18:03:49 -07:00
Noah Levitt	f2bc7ec271	refactor brozzler.hq.Site and brozzler.url.CrawlUrl into new brozzler.site package; fix bugs in robots.txt handling	2015-07-15 18:03:03 -07:00
Noah Levitt	a9c51edd84	robots cache per site, and some so far unused support for site level configuration	2015-07-15 17:44:42 -07:00
Noah Levitt	efa3cd6269	don't set http_proxy environment variable, because it affects things we don't want it to	2015-07-15 17:33:29 -07:00
Noah Levitt	923cd98652	save screenshots as metadata records using warcprox PUTMETA (same format as kenji's wide crawl)	2015-07-15 16:32:02 -07:00
Noah Levitt	5aea76ab6d	refactor worker code into worker module	2015-07-15 15:42:40 -07:00
Noah Levitt	7b92ba39c7	avoid printing stack trace on normal youtube_dl unsupported condition (still prints error message unfortunately)	2015-07-15 14:33:22 -07:00
Noah Levitt	9b13f0c34c	refactor hq code into hq module	2015-07-15 14:27:21 -07:00
Noah Levitt	4cfb287397	refactor hq code into hq module	2015-07-15 14:26:48 -07:00
Noah Levitt	9b5da57d7e	initial youtube-dl support, including saving youtube-dl derived json with warcprox by sending a PUTMETA request, if new option --enable-warcprox-features is enabled	2015-07-14 18:57:45 -07:00
Noah Levitt	fd0c3322ee	update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff	2015-07-13 17:09:39 -07:00
Noah Levitt	3eff099b16	determine if youtube-dl can do something with a url	2015-07-13 16:40:56 -07:00
Noah Levitt	6470a8ef26	sigquit dumps thread traces	2015-07-13 15:57:14 -07:00
Noah Levitt	18ca996216	rudimentary robots.txt support	2015-07-13 15:56:54 -07:00
Noah Levitt	eb74967fed	brozzler-worker round-robins sites needing crawling	2015-07-13 12:13:41 -07:00
Noah Levitt	ddd764cac5	brozzle-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox)	2015-07-11 23:07:47 -07:00
Noah Levitt	b0f3b8a5e3	clean shutdown for brozzler-hq	2015-07-11 18:18:54 -07:00
Noah Levitt	384120928c	set in_progress=0 for completed url	2015-07-11 13:24:38 -07:00
Noah Levitt	610f9c8cf4	add missing file hq.py, improve some logging, fix little race condition bug	2015-07-11 13:09:45 -07:00
Noah Levitt	bb3561a690	check scope (on hq side), fix buglets	2015-07-11 12:33:19 -07:00
Noah Levitt	1fb336cb2e	crawling outlinks not totally working	2015-07-11 02:29:19 -07:00
Noah Levitt	56a7bb7306	submit outlinks to hq	2015-07-10 21:31:41 -07:00
Noah Levitt	fd99764baa	brozzler-worker partially working	2015-07-10 21:07:47 -07:00
Noah Levitt	8aa1e6715a	feed seed url to the crawl url queue	2015-07-10 20:12:33 -07:00
Noah Levitt	1d068f4f86	starting work on brozzler crawl hq	2015-07-10 18:01:54 -07:00
Noah Levitt	fcc63b6675	fancier prioritization takes into account hops from seed, path depth; and clean shutdown	2015-07-09 22:35:37 -07:00
Noah Levitt	5f3c247e0c	trick to avoid crawling same url again too quickly	2015-07-09 21:49:55 -07:00
Noah Levitt	7cc777661d	fix dumb bug	2015-07-09 18:54:09 -07:00
Noah Levitt	783794ca37	basic of site/seed crawling with scoping	2015-07-09 18:36:07 -07:00
Noah Levitt	92ea701987	rudimentary crawling in parallel with multiple browsers	2015-07-08 18:50:18 -07:00
Noah Levitt	32abfcac8a	fix 'CrawlUrl' object has no attribute 'priority' bug	2015-07-08 17:51:09 -07:00
Noah Levitt	4022cc0162	simple in-memory frontier with prioritized queues by host	2015-07-08 17:44:38 -07:00
Noah Levitt	4042f22497	rudimentary link extraction and crawling	2015-07-07 16:45:52 -07:00
Noah Levitt	d8a962b29e	experimenting with captureScreenshot	2015-06-16 18:42:21 -07:00
Noah Levitt	f254e2eec1	it's been stable, call it 1.0	2015-06-13 11:30:01 -07:00
Hunter	903d2f3107	Merge pull request #39 from nlevitt/simple-behaviors ARI-3775, ARI-3956 Simple behaviors	2015-04-16 15:01:49 -07:00
Noah Levitt	73bbd87d5d	merge in latest from master and adjust config as needed	2015-02-02 14:52:56 -08:00
Noah Levitt	776a6dac68	Merge branch 'master' into simple-behaviors	2015-02-02 14:49:34 -08:00
Noah Levitt	48b8754f40	Merge branch 'master' into simple-behaviors	2015-02-02 14:48:26 -08:00
Noah Levitt	db759f1066	Merge pull request #32 from adam-miller/ARI-3904 ARI-3904 Instagram behavior to scroll past two pages, and click to enla...	2015-02-02 14:47:44 -08:00
Adam Miller	ce47461656	Making scrolling and image loading more tolerant of slow loading.	2015-01-30 16:55:53 -08:00
Noah Levitt	9e5900c61f	ARI-3956 simple behavior for usask.ca slideshows (which also required enhancing the simple behavior logic)	2015-01-27 16:03:58 -08:00
Noah Levitt	0901cac2e0	Merge pull request #38 from nlevitt/bump-browser-timeout increase browser start and stop timeouts, since sometimes we strand brow...	2015-01-26 21:22:18 -08:00

438 Commits