brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2025-07-01 01:57:34 -04:00

Author	SHA1	Message	Date
Noah Levitt	9db7744f2c	Merge branch 'master' into qa * master: fail quickly if browser dies at startup	2018-11-01 15:57:52 -07:00
Noah Levitt	15610fa990	fail quickly if browser dies at startup instead of trying to retrieve /json for 600 seconds	2018-11-01 15:57:03 -07:00
Noah Levitt	27ba877932	Merge branch 'master' into qa * master: handle exceptions extracting links fix reported chromium crash by removing argument bump version after merge remove stray bad logging line tests expect outlinks to be a set tidy up some comments and docs watch pages as outlinks from youtube-dl playlists silence youtube-dl's logging, use only our own use a thread-local callback in monkey-patched skip downloading videos from youtube playlists trace-level logging for all the chrome output	2018-10-29 17:45:09 -07:00
Noah Levitt	1073431f76	handle exceptions extracting links like this one: Uncaught DOMException: Blocked a frame with origin "https://www.youtube.com" from accessing a cross-origin frame. at __brzl_compileOutlinks (<anonymous>:4:24) at __brzl_compileOutlinks (<anonymous>:10:29) at <anonymous>:16:1 __brzl_compileOutlinks @ VM194:4 __brzl_compileOutlinks @ VM194:10 not sure exactly why this happens but we just have to handle it	2018-10-29 17:42:25 -07:00
Noah Levitt	af85f28908	fix reported chromium crash by removing argument --single-process https://github.com/internetarchive/brozzler/issues/128	2018-10-22 14:28:31 -07:00
Noah Levitt	20996fa501	bump version after merge	2018-10-12 12:46:09 -07:00
Noah Levitt	82cf5c6dbb	skip downloading videos from youtube playlists because we expect to capture videos from individual watch pages, and often processing thousands of videos with youtube-dl before the page is ever opened in the browser is not desired behavior and is a crawling problem	2018-10-11 15:46:30 -07:00
Noah Levitt	16c56fed5a	Merge branch 'master' into qa * master: hopefully fixes lingering ydl concurrency issue	2018-10-11 13:43:06 -07:00
Noah Levitt	1e95441ce7	hopefully fixes lingering ydl concurrency issue which was causing awfulness like this: 2018-09-30 04:39:54,410 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://www.facebook.com/CongresswomanRosaDeLauro/ 2018-09-30 04:39:58,092 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - 1080x607' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8001 with url youtube-dl:00037:https://instagram.com/p/BfJvqhfnQ0C/ 2018-09-30 04:40:05,120 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc107.us.archive.org:8000 with url youtube-dl:00009:https://www.facebook.com/LDS 2018-09-30 04:40:09,450 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc407.us.archive.org:8000 with url youtube-dl:00048:https://www.youtube.com/watch?v=-gH28zrMmAM 2018-09-30 04:40:14,327 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc108.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepTedLieu/status/1010212963897233408 2018-09-30 04:40:23,018 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc048.us.archive.org:8001 with url youtube-dl:00005:https://www.facebook.com/SenDuckworth/ 2018-09-30 04:40:29,553 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8000 with url youtube-dl:00009:http://www.facebook.com/repkathleenrice 2018-09-30 04:40:37,057 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc406.us.archive.org:8000 with url youtube-dl:00023:https://www.youtube.com/watch?v=MaamqVF87mE 2018-09-30 04:40:41,298 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc403.us.archive.org:8000 with url youtube-dl:00039:https://www.youtube.com/watch?v=pRpMp4H8El0 2018-09-30 04:40:45,613 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepKevinCramer/status/999771072206639104 i.e. pushing the same stitched-up video to a bunch of wrong places :(	2018-10-11 13:40:57 -07:00
Noah Levitt	ff64e32bd3	Merge branch 'master' into qa * master: brozzler-worker log version number at startup	2018-10-11 13:31:47 -07:00
Noah Levitt	e519616f8e	brozzler-worker log version number at startup	2018-10-11 13:31:37 -07:00
Noah Levitt	a75632bd95	Merge branch 'master' into qa * master: bump version after merge fix another oversight ugh. oops Revert "add a github PR template for this repo" improve performance of brozzler-new-job	2018-09-28 15:27:51 -07:00
Noah Levitt	362a2347b9	bump version after merge	2018-09-28 15:27:40 -07:00
Noah Levitt	87ec0f2f90	Merge branch 'master' into qa * master: bump doublethink dependency verbiage tweaks safety check and --force for brozzler-purge new command brozzler-purge	2018-09-28 11:12:35 -07:00
Noah Levitt	2386e85a37	bump doublethink dependency	2018-09-27 14:25:49 -07:00
Noah Levitt	174178e02e	new command brozzler-purge	2018-09-25 14:56:26 -07:00
Barbara Miller	60cfd684b2	Merge branch 'pageInterstitialShown' into qa	2018-09-25 10:30:02 -07:00
Noah Levitt	48bf185746	bump version after merge	2018-09-18 11:08:44 -07:00
Neil Minton	3c7fdeae2c	Merge branch 'ari-5777' into qa	2018-09-12 12:07:45 -04:00
Noah Levitt	efb0696833	bump version number after merge	2018-09-06 16:17:59 -07:00
jkafader	8368cd2bcb	Merge pull request #115 from nlevitt/ydl-stitched Ydl stitched	2018-09-06 16:15:52 -07:00
Noah Levitt	c4fdbe578d	Merge branch 'master' into qa * master: oops, back to dev version number wait 20 seconds to claim sites if none were avail- tweak logging why did those tests fail??? (#117) Add screenshots Add screenshots back to dev version 1.4 for pypi explain --warcprox-auto briefly vagrant readme fixes (thanks funkyfuture) update cryptography dep version	2018-09-04 10:54:26 -07:00
Noah Levitt	a4eacb5b8f	oops, back to dev version number	2018-09-04 10:52:34 -07:00
Noah Levitt	88d3d3b310	why did those tests fail??? (#117) 1.4 for pypi	2018-08-22 14:35:39 -07:00
Noah Levitt	2a2952e810	back to dev version	2018-08-21 15:18:18 -07:00
Noah Levitt	b63661ea70	1.4 for pypi	2018-08-21 15:15:38 -07:00
Noah Levitt	eaf7ef74be	explain --warcprox-auto briefly	2018-08-17 12:06:04 -07:00
Noah Levitt	8cdc3dee21	Merge branch 'master' into ydl-stitched * master: vagrant readme fixes (thanks funkyfuture) update cryptography dep version	2018-08-17 10:34:00 -07:00
Noah Levitt	d19e139101	vagrant readme fixes (thanks funkyfuture)	2018-08-17 10:31:01 -07:00
Noah Levitt	ffa8021968	update cryptography dep version github tells me there's a vulnerability <2.3	2018-08-16 14:32:03 -07:00
Noah Levitt	cbeba3a6b9	Merge branch 'ydl-stitched' into qa * ydl-stitched: fix failing tests test for youtube-dl stitch-up add missing imports and fix mimetype issue move youtube-dl code into separate file push youtube-dl's stitched up videos to warcprox	2018-08-16 12:10:44 -07:00
Noah Levitt	418a3ef20c	Merge branch 'master' into qa * master: expose more brozzle-page args update pillow dependency to get rid of github vul- more readme edits reformat readme to 80 columns Copy edits to job-conf readme bump up heartbeat interval (see comment) Copy edits back to dev version version 1.3 (messed up 1.2) setuptools wants README not readme back to dev version number version 1.2 bump dev version after merge is test_time_limit is failing because of timing?	2018-08-16 12:08:48 -07:00
Noah Levitt	3c27132aaa	test for youtube-dl stitch-up	2018-08-15 17:42:53 -07:00
Noah Levitt	39155ebcc5	push youtube-dl's stitched up videos to warcprox (no tests yet)	2018-08-13 15:40:48 -07:00
Noah Levitt	4e398e1da2	expose more brozzle-page args	2018-08-13 15:38:24 -07:00
Noah Levitt	b44a444dc2	update pillow dependency to get rid of github vulnerability warning	2018-07-24 16:37:25 -05:00
Noah Levitt	9d18dc6aeb	bump up heartbeat interval (see comment)	2018-07-03 18:35:08 -05:00
Noah Levitt	783fd0ea87	back to dev version	2018-06-25 19:32:27 +00:00
Noah Levitt	bd63908fb9	version 1.3 (messed up 1.2)	2018-06-25 19:30:39 +00:00
Noah Levitt	2780c92569	setuptools wants README not readme	2018-06-25 19:10:57 +00:00
Noah Levitt	032c7d2898	back to dev version number	2018-06-25 12:33:34 -05:00
Noah Levitt	442d02b26a	version 1.2	2018-06-25 12:21:00 -05:00
Noah Levitt	196cd555ea	bump dev version after merge	2018-06-25 11:44:45 -05:00
Noah Levitt	27bdfb65d2	monkey-patch youtube-dl to short-circuit video extraction using generic extractor in case of very large url (more than 20 mb) that youtube-dl interprets as html, to avoid spinning forever here: Traceback (most recent call first): File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall return _compile(pattern, flags).findall(string) File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract 'uploader': video_uploader, File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract ie_result = self._real_extract(url) File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info ie_result = ie.extract(url) File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl info = ydl.extract_info(str(urlcanon.whatwg(page.url))) File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page self._try_youtube_dl(ydl, site, page)	2018-06-11 11:50:22 -07:00
Noah Levitt	109d05c59a	Merge branch 'master' into qa * master: monkey-patch youtube-dl to short-circuit	2018-06-11 11:11:09 -07:00
Noah Levitt	a90a29968c	monkey-patch youtube-dl to short-circuit video extraction using generic extractor in case of very large url (more than 20 mb) that youtube-dl interprets as html, to avoid spinning forever here: Traceback (most recent call first): File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall return _compile(pattern, flags).findall(string) File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract 'uploader': video_uploader, File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract ie_result = self._real_extract(url) File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info ie_result = ie.extract(url) File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl info = ydl.extract_info(str(urlcanon.whatwg(page.url))) File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page self._try_youtube_dl(ydl, site, page)	2018-06-11 11:08:20 -07:00
Noah Levitt	5c34bd3119	Merge branch 'master' into qa * master: lowercase readme.rst explain brozzler use of warcprox_meta update README copyright date bump dev version after PR #102 these ssurts are strings too fix bad copy/paste ssurts are strings now travis-ci install warcprox from github incorporate urlcanon fix update warcprox dependency to include recent fixes backward compatibility for old scope["surt"] missed a spot where is_permitted_by_robots needs monkeying handle new chrome cookie db schema describe scope rule conditions more explication of scoping update docs to match new seed ssurt behavior ok seriously tests fix more tests for new approach sans scope['surt'] s/max_hops_off_surt/max_hops_off/ new test of max_hops_off rename page.hops_off_surt to page.hops_off doublethink had a bug fix tests for new approach without scope['surt'] tests for new approach without of scope['surt'] WIP add an accept rule instead of modifying surt WIP some words on scoping WIP starting to flesh out "scoping" section WIP some explanation of automatic login WIP documentation!	2018-06-01 16:46:32 -07:00
Noah Levitt	62bb540a11	lowercase readme.rst	2018-05-31 18:46:37 +00:00
Noah Levitt	8906037d82	bump dev version after PR #102	2018-05-16 17:33:52 -07:00
Noah Levitt	e90e7345a5	Merge pull request #102 from nlevitt/docs complete job configuration documentation	2018-05-16 17:31:27 -07:00


398 commits