1262 Commits

Author SHA1 Message Date
Barbara Miller
d8f97e7b3f no current need for skipIframes with new try/catch 2018-12-20 11:24:30 -08:00
Noah Levitt
034f7938c4 catch common exception in default behavior 2018-12-20 10:46:05 -08:00
Noah Levitt
2cd64811b3 bump version after merge 2018-12-17 15:10:26 -08:00
Noah Levitt
d8c9dd2ff4
Merge pull request #144 from galgeek/umbraBehavior18q4
fix instagram captures; add skipIframe feature
2018-12-17 15:09:52 -08:00
Barbara Miller
4a0d95277f update umbraBehavior 2018-12-17 15:04:36 -08:00
Barbara Miller
425d44bf4a updates for jinja2 2018-12-13 17:27:15 -08:00
Barbara Miller
6c21a9f773 iframe option and other instagram updates 2018-12-13 15:54:10 -08:00
Noah Levitt
15870e6010 avoid IndexError
in some cases we receive this event from the browser:
{"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}
2018-12-13 15:49:38 -08:00
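A minimal sketch of the kind of guard commit 15870e6010 describes, assuming the handler previously indexed `versions[0]` unconditionally. The function name `first_worker_version` is hypothetical and is not taken from the brozzler source; only the event payload comes from the commit message.

```python
import json


def first_worker_version(message):
    """Return the first entry of params.versions, or None when the list
    is empty or missing (hypothetical helper, for illustration only)."""
    versions = message.get("params", {}).get("versions") or []
    # Indexing versions[0] directly would raise IndexError on an empty list.
    return versions[0] if versions else None


# The event quoted in the commit message above.
event = json.loads(
    '{"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}'
)
print(first_worker_version(event))
```

With the guard in place, an empty `versions` list is treated as "nothing to do" rather than as an error.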
Noah Levitt
b577fe3c36 log browser uncaught exceptions at debug level
didn't realize these weren't showing up as console messages
2018-12-13 15:45:35 -08:00
Noah Levitt
ebcc063fe2 bump version after merge 2018-11-29 14:52:11 -08:00
jkafader
898756690f
Merge pull request #142 from nlevitt/service-worker
fetch service worker script with proper headers
2018-11-29 13:42:59 -08:00
jkafader
9c27e829aa
Merge pull request #136 from nlevitt/revert-time-limit
change time limit enforcement
2018-11-29 12:29:35 -08:00
Noah Levitt
db62402be8 fix tests 2018-11-27 14:35:00 -08:00
Noah Levitt
f63947cfe9 fetch service worker script with proper headers 2018-11-27 12:35:33 -08:00
Noah Levitt
574af7846e bump version after merge 2018-11-16 15:10:46 -08:00
Barbara Miller
e2b2542d4a handle http auth (#138)
abort brozzling on interstitial (auth dialog)

because we have no other recourse at this point. waiting on Network.requestIntercepted auth challenge support. (didn't work in our latest testing)
https://chromedevtools.github.io/devtools-protocol/tot/Network#type-AuthChallengeResponse
2018-11-16 15:10:30 -08:00
Noah Levitt
05fab8b909 change time limit enforcement
enforce time limit based on all the time that a site was in active
rotation, including time it spent waiting for its turn to be brozzled;
this undoes the change from b9640b8a30c934, because now it seems that
was the wrong decision (brozzler jobs with many seeds and low
max_claimed_sites hanging around forever)
2018-11-12 16:21:38 -08:00
Noah Levitt
15610fa990 fail quickly if browser dies at startup
instead of trying to retrieve /json for 600 seconds
2018-11-01 15:57:03 -07:00
Noah Levitt
1073431f76 handle exceptions extracting links
like this one:
Uncaught DOMException: Blocked a frame with origin "https://www.youtube.com" from accessing a cross-origin frame.
    at __brzl_compileOutlinks (<anonymous>:4:24)
    at __brzl_compileOutlinks (<anonymous>:10:29)
    at <anonymous>:16:1
__brzl_compileOutlinks @ VM194:4
__brzl_compileOutlinks @ VM194:10

not sure exactly why this happens but we just have to handle it
2018-10-29 17:42:25 -07:00
Noah Levitt
af85f28908 fix reported chromium crash by removing argument
--single-process
https://github.com/internetarchive/brozzler/issues/128
2018-10-22 14:28:31 -07:00
Noah Levitt
20996fa501 bump version after merge 2018-10-12 12:46:09 -07:00
jkafader
8fc800d1ef
Merge pull request #127 from nlevitt/ydl-improvements
Ydl improvements
2018-10-12 11:55:47 -07:00
Noah Levitt
65fad5e8bf remove stray bad logging line 2018-10-12 11:35:47 -07:00
Noah Levitt
7497b7e5ac tests expect outlinks to be a set 2018-10-12 11:03:54 -07:00
Noah Levitt
054ba6d7a0 tidy up some comments and docs 2018-10-12 00:48:38 -07:00
Noah Levitt
8f9077fbf3 watch pages as outlinks from youtube-dl playlists
and bypass downloading metadata about individual videos as well as the
videos themselves (for youtube playlists), because even just the
metadata can take many minutes or hours in case of thousands of videos
2018-10-12 00:41:16 -07:00
Noah Levitt
9211fb45ec silence youtube-dl's logging, use only our own
because youtube-dl's can be annoyingly verbose, confusing, doesn't tell
us the things we're interested in, and doesn't tell us where the
messages originate
2018-10-12 00:39:37 -07:00
Noah Levitt
e5536182dc use a thread-local callback in monkey-patched
finish_frag_download, instead of locking around monkey-patching, to
allow different threads to youtube-dl concurrently, but still not
interfere with each other
2018-10-11 23:28:34 -07:00
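The approach in commit e5536182dc (patch once, then dispatch to a per-thread callback) can be sketched roughly as follows. This is an illustration under stated assumptions, not brozzler's code: `Downloader` is a hypothetical stand-in for the third-party class, and only the method name `finish_frag_download` comes from the commit message.

```python
import threading


class Downloader:
    """Hypothetical stand-in for a third-party class; not youtube-dl's API."""

    def finish_frag_download(self, ctx):
        return "original:%s" % ctx


_thread_local = threading.local()
_orig_finish = Downloader.finish_frag_download


def _patched_finish(self, ctx):
    # Call the original method, then the callback registered by this thread.
    result = _orig_finish(self, ctx)
    callback = getattr(_thread_local, "callback", None)
    if callback is not None:
        callback(ctx)
    return result


# Patch once at import time, so no lock is needed around the patching itself.
Downloader.finish_frag_download = _patched_finish


def download_with_callback(ctx, callback):
    """Register a callback visible only to the calling thread, then download."""
    _thread_local.callback = callback
    try:
        return Downloader().finish_frag_download(ctx)
    finally:
        _thread_local.callback = None


def run_in_threads(ctxs):
    """Run one download per thread and record what each callback received."""
    seen = {}

    def work(ctx):
        mine = []
        download_with_callback(ctx, mine.append)
        seen[ctx] = mine

    threads = [threading.Thread(target=work, args=(c,)) for c in ctxs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Because the callback lives in `threading.local()`, each thread only ever sees its own callback, so concurrent downloads do not deliver results to one another.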
Noah Levitt
82cf5c6dbb skip downloading videos from youtube playlists
because we expect to capture videos from individual watch pages, and
often processing thousands of videos with youtube-dl before the page is
ever opened in the browser is not desired behavior and is a crawling
problem
2018-10-11 15:46:30 -07:00
Noah Levitt
e406e42312 trace-level logging for all the chrome output
because the important/unimportant messages are always shifting and we're
not even trying to keep up with it and mostly it's just noise
2018-10-11 15:43:05 -07:00
Noah Levitt
1e95441ce7 hopefully fixes lingering ydl concurrency issue
which was causing awfulness like this:
2018-09-30 04:39:54,410 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://www.facebook.com/CongresswomanRosaDeLauro/
2018-09-30 04:39:58,092 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - 1080x607' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8001 with url youtube-dl:00037:https://instagram.com/p/BfJvqhfnQ0C/
2018-09-30 04:40:05,120 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc107.us.archive.org:8000 with url youtube-dl:00009:https://www.facebook.com/LDS
2018-09-30 04:40:09,450 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc407.us.archive.org:8000 with url youtube-dl:00048:https://www.youtube.com/watch?v=-gH28zrMmAM
2018-09-30 04:40:14,327 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc108.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepTedLieu/status/1010212963897233408
2018-09-30 04:40:23,018 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc048.us.archive.org:8001 with url youtube-dl:00005:https://www.facebook.com/SenDuckworth/
2018-09-30 04:40:29,553 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8000 with url youtube-dl:00009:http://www.facebook.com/repkathleenrice
2018-09-30 04:40:37,057 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc406.us.archive.org:8000 with url youtube-dl:00023:https://www.youtube.com/watch?v=MaamqVF87mE
2018-09-30 04:40:41,298 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc403.us.archive.org:8000 with url youtube-dl:00039:https://www.youtube.com/watch?v=pRpMp4H8El0
2018-09-30 04:40:45,613 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepKevinCramer/status/999771072206639104
i.e. pushing the same stitched-up video to a bunch of wrong places :(
2018-10-11 13:40:57 -07:00
Noah Levitt
e519616f8e brozzler-worker log version number at startup 2018-10-11 13:31:37 -07:00
Noah Levitt
362a2347b9 bump version after merge 2018-09-28 15:27:40 -07:00
jkafader
d2b1843a6d
Merge pull request #122 from nlevitt/new-job-bulk-inserts
improve performance of brozzler-new-job
2018-09-28 15:26:33 -07:00
Noah Levitt
8de3e21103 fix another oversight 2018-09-28 14:45:45 -07:00
Noah Levitt
1ee36c38b9 ugh. oops 2018-09-28 13:44:54 -07:00
Noah Levitt
7137918005 Revert "add a github PR template for this repo"
This reverts commit 83552eb444fc5ef9bd4b9d1772820b62aae67e46.
2018-09-28 13:16:46 -07:00
Noah Levitt
7980b40ee3 improve performance of brozzler-new-job
by inserting pages and sites in bulk
2018-09-28 13:13:22 -07:00
Noah Levitt
2386e85a37 bump doublethink dependency 2018-09-27 14:25:49 -07:00
jkafader
f8ce9858fb
Merge pull request #121 from nlevitt/purge
brozzler-purge - purge crawl state from rethinkdb
2018-09-25 16:45:24 -07:00
Noah Levitt
f4fad934a7 verbiage tweaks 2018-09-25 15:19:33 -07:00
Noah Levitt
560981c1ad safety check and --force for brozzler-purge 2018-09-25 15:17:45 -07:00
Noah Levitt
174178e02e new command brozzler-purge 2018-09-25 14:56:26 -07:00
Noah Levitt
48bf185746 bump version after merge 2018-09-18 11:08:44 -07:00
Noah Levitt
dceee8bdbd
Merge pull request #119 from nlevitt/ydl-stitch-fix
WIP youtube-dl stitching fixes
2018-09-18 11:08:21 -07:00
Noah Levitt
60cd69e2bd send warcprox-meta when pushing stitched up video
also put locking around monkey patching to avoid race condition
2018-09-18 01:07:52 -07:00
Noah Levitt
1ef717fa75 test exposing bug that we don't send warcprox-meta
when pushing stitched-up video with WARCPROX_WRITE_RECORD
2018-09-18 01:05:18 -07:00
Noah Levitt
efb0696833 bump version number after merge 2018-09-06 16:17:59 -07:00
jkafader
8368cd2bcb
Merge pull request #115 from nlevitt/ydl-stitched
Ydl stitched
2018-09-06 16:15:52 -07:00
Noah Levitt
a4eacb5b8f oops, back to dev version number 2018-09-04 10:52:34 -07:00