brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2025-02-24 08:39:59 -05:00

Author	SHA1	Message	Date
Noah Levitt	19522aff85	adjusting ansible config for xenial untested because of vagrant problems	2019-03-19 16:37:13 -07:00
Noah Levitt	d4f8bc768f	trying to make this work with xenial for travis see error https://travis-ci.org/internetarchive/brozzler/jobs/508141058	2019-03-18 16:38:23 -07:00
Noah Levitt	f2a9908395	travis only has py 3.7 for xenial	2019-03-18 16:20:54 -07:00
Noah Levitt	d729c8d0d5	use yaml.safe_load() getting new warnings see https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation	2019-03-18 15:49:44 -07:00
Noah Levitt	6f5f090c33	test py 3.7	2019-03-18 15:49:03 -07:00
Noah Levitt	ef981706f4	fix rethinkdb dependency version	2019-03-18 15:08:36 -07:00
Noah Levitt	61274ae994	peg to working doublethink see: https://github.com/internetarchive/doublethink/commit/f7fc7da725c9b	2019-03-14 20:04:09 +00:00
Noah Levitt	7d5bb4b5d4	Merge pull request #148 from vbanos/disk-cache Add disk cache options to Chrome	2019-02-12 14:39:49 -08:00
Vangelis Banos	9c48a6fa11	Use disk cache params only on Chrome.start Use `disk_cache_dir` and `disk_cache_size` only on `Chrome.start` and not on `Chrome.__init__`. Drop `disk_cache_dir` and `disk_cache_size` class attributes.	2019-02-12 20:59:08 +00:00
Vangelis Banos	adeca823dd	Remove stale comment	2019-02-12 07:21:44 +00:00
Vangelis Banos	31e611771e	Improve disk cache options Remove `--disable-cache`, it's not used any more. Rename `disk_cache` to `disk_cache_dir` and use only path (str) argument. Decouple `--disk-cache-size` from `--disk-cache-dir` so it is possible to use either or both.	2019-02-07 07:42:45 +00:00
Vangelis Banos	c288c9ae98	Add disk cache options to Chrome Add `Chrome` options `disk_cache` and `disk_cache_size` which add chromium options `--disk-cache-dir=<DIR>` and `--disk-cache-size=N` (bytes). The default is to use `--disable-cache` (no disk caching). There are two ways to use the new vars, if you just use `Chrome(disk_cache=True)` the chromium cli option `--disable-cache` is NOT used and chromium writes disk cache inside profile dir. If you use `Chrome(disk_cache='/tmp/custom_dir', disk_cache_size=10000)` chromium will use `--disk-cache-dir=/tmp/custom_dir --disk-cache-size=10000`.	2019-02-06 16:22:10 +00:00
Noah Levitt	809ea3885f	Merge pull request #147 from galgeek/bye_simpleclicks no more simpleclicks/mouseovers	2019-01-14 13:48:48 -08:00
Barbara Miller	f6ffb4acea	update (C)	2019-01-10 16:11:24 -08:00
Barbara Miller	9001156b54	rm simpleclicks.js.j2 mouseovers.js.j2	2019-01-10 15:58:38 -08:00
Barbara Miller	770ea6de1e	no more simpleclicks/mouseovers	2019-01-10 15:54:47 -08:00
Barbara Miller	e1ceb87ca2	Merge pull request #146 from nlevitt/https-redirect least surprise on http/https seed redirects	2018-12-21 15:26:04 -08:00
Noah Levitt	a74f46dc53	least surprise on http/https seed redirects if http://foo.com/ redirects to https://foo.com/a/b/c let's also put all of https://foo.com/ in scope	2018-12-21 15:17:31 -08:00
Noah Levitt	6b8e597a43	bump version after merge	2018-12-20 11:30:49 -08:00
Noah Levitt	0a08c01461	Merge pull request #145 from galgeek/no-skipIframes no skipIframes for umbraBehavior	2018-12-20 11:30:28 -08:00
Barbara Miller	047b46bc4e	back out now unnecessary updates	2018-12-20 11:25:06 -08:00
Barbara Miller	d8f97e7b3f	no current need for skipIframes with new try/catch	2018-12-20 11:24:30 -08:00
Noah Levitt	034f7938c4	catch common exception in default behavior	2018-12-20 10:46:05 -08:00
Noah Levitt	2cd64811b3	bump version after merge	2018-12-17 15:10:26 -08:00
Noah Levitt	d8c9dd2ff4	Merge pull request #144 from galgeek/umbraBehavior18q4 fix instagram captures; add skipIframe feature	2018-12-17 15:09:52 -08:00
Barbara Miller	4a0d95277f	update umbraBehavior	2018-12-17 15:04:36 -08:00
Barbara Miller	425d44bf4a	updates for jinja2	2018-12-13 17:27:15 -08:00
Barbara Miller	6c21a9f773	iframe option and other instagram updates	2018-12-13 15:54:10 -08:00
Noah Levitt	15870e6010	avoid IndexError in some cases we receive this event from the browser: {"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}	2018-12-13 15:49:38 -08:00
Noah Levitt	b577fe3c36	log browser uncaught exceptions at debug level didn't realize these weren't showing up as console messages	2018-12-13 15:45:35 -08:00
Noah Levitt	ebcc063fe2	bump version after merge	2018-11-29 14:52:11 -08:00
jkafader	898756690f	Merge pull request #142 from nlevitt/service-worker fetch service worker script with proper headers	2018-11-29 13:42:59 -08:00
jkafader	9c27e829aa	Merge pull request #136 from nlevitt/revert-time-limit change time limit enforcement	2018-11-29 12:29:35 -08:00
Noah Levitt	db62402be8	fix tests	2018-11-27 14:35:00 -08:00
Noah Levitt	f63947cfe9	fetch service worker script with proper headers	2018-11-27 12:35:33 -08:00
Noah Levitt	574af7846e	bump version after merge	2018-11-16 15:10:46 -08:00
Barbara Miller	e2b2542d4a	handle http auth (#138) abort brozzling on interstitial (auth dialog) because we have no other recourse at this point. waiting on Network.requestIntercepted auth challenge support. (didn't work in our latest testing) https://chromedevtools.github.io/devtools-protocol/tot/Network#type-AuthChallengeResponse	2018-11-16 15:10:30 -08:00
Noah Levitt	05fab8b909	change time limit enforcement enforce time limit based on all the time that a site was in active rotation, including time it spent waiting for its turn to be brozzled; this undoes the change from b9640b8a30c934, because now it seems that was the wrong decision (brozzler jobs with many seeds and low max_claimed_sites hanging around forever)	2018-11-12 16:21:38 -08:00
Noah Levitt	15610fa990	fail quickly if browser dies at startup instead of trying to retrieve /json for 600 seconds	2018-11-01 15:57:03 -07:00
Noah Levitt	1073431f76	handle exceptions extracting links like this one: Uncaught DOMException: Blocked a frame with origin "https://www.youtube.com" from accessing a cross-origin frame. at __brzl_compileOutlinks (<anonymous>:4:24) at __brzl_compileOutlinks (<anonymous>:10:29) at <anonymous>:16:1 __brzl_compileOutlinks @ VM194:4 __brzl_compileOutlinks @ VM194:10 not sure exactly why this happens but we just have to handle it	2018-10-29 17:42:25 -07:00
Noah Levitt	af85f28908	fix reported chromium crash by removing argument --single-process https://github.com/internetarchive/brozzler/issues/128	2018-10-22 14:28:31 -07:00
Noah Levitt	20996fa501	bump version after merge	2018-10-12 12:46:09 -07:00
jkafader	8fc800d1ef	Merge pull request #127 from nlevitt/ydl-improvements Ydl improvements	2018-10-12 11:55:47 -07:00
Noah Levitt	65fad5e8bf	remove stray bad logging line	2018-10-12 11:35:47 -07:00
Noah Levitt	7497b7e5ac	tests expect outlinks to be a set	2018-10-12 11:03:54 -07:00
Noah Levitt	054ba6d7a0	tidy up some comments and docs	2018-10-12 00:48:38 -07:00
Noah Levitt	8f9077fbf3	watch pages as outlinks from youtube-dl playlists and bypass downloading metadata about individual videos as well as the videos themselves (for youtube playlists), because even just the metadata can take many minutes or hours in case of thousands of videos	2018-10-12 00:41:16 -07:00
Noah Levitt	9211fb45ec	silence youtube-dl's logging, use only our own because youtube-dl's can be annoyingly verbose, confusing, doesn't tell us the things we're interested in, and doesn't tell us where the messages originate	2018-10-12 00:39:37 -07:00
Noah Levitt	e5536182dc	use a thread-local callback in monkey-patched finish_frag_download, instead of locking around monkey-patching, to allow different threads to youtube-dl concurrently, but still not interfere with each other	2018-10-11 23:28:34 -07:00
Noah Levitt	82cf5c6dbb	skip downloading videos from youtube playlists because we expect to capture videos from individual watch pages, and often processing thousands of videos with youtube-dl before the page is ever opened in the browser is not desired behavior and is a crawling problem	2018-10-11 15:46:30 -07:00

1483 Commits