Commit graph

1587 commits

Author SHA1 Message Date
Barbara Miller
cbd6f0f90a Merge branch 'insta18q4' into qa 2018-12-13 17:29:36 -08:00
Barbara Miller
425d44bf4a updates for jinja2 2018-12-13 17:27:15 -08:00
Barbara Miller
6c21a9f773 iframe option and other instagram updates 2018-12-13 15:54:10 -08:00
Noah Levitt
15870e6010 avoid IndexError
in some cases we receive this event from the browser:
{"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}
2018-12-13 15:49:38 -08:00
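A minimal sketch of the kind of guard this commit message describes, assuming the handler previously indexed the first element of `versions`. The function name `first_worker_version` is hypothetical and is not taken from the brozzler source.

```python
import json


def first_worker_version(message):
    """Return the first entry of params.versions, or None if there is none.

    Hypothetical helper: indexing versions[0] directly would raise
    IndexError when the browser sends an empty list.
    """
    versions = message.get("params", {}).get("versions", [])
    return versions[0] if versions else None


# The event quoted in the commit message above.
event = json.loads(
    '{"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}'
)
print(first_worker_version(event))
```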
Noah Levitt
b577fe3c36 log browser uncaught exceptions at debug level
didn't realize these weren't showing up as console messages
2018-12-13 15:45:35 -08:00
Barbara Miller
c50e9637ae Merge branch 'insta18q4' into qa 2018-12-09 14:26:38 -08:00
Barbara Miller
cb0c0f51ef iframe option and other instagram updates 2018-12-09 14:25:59 -08:00
Noah Levitt
b447063099 Merge branch 'master' into qa
* master:
  bump version after merge
  change time limit enforcement
2018-11-29 14:52:32 -08:00
Noah Levitt
ebcc063fe2 bump version after merge 2018-11-29 14:52:11 -08:00
jkafader
898756690f
Merge pull request #142 from nlevitt/service-worker
fetch service worker script with proper headers
2018-11-29 13:42:59 -08:00
jkafader
9c27e829aa
Merge pull request #136 from nlevitt/revert-time-limit
change time limit enforcement
2018-11-29 12:29:35 -08:00
Noah Levitt
983ed7bc60 Merge branch 'service-worker' into qa
* service-worker:
  fix tests
2018-11-27 16:07:35 -08:00
Noah Levitt
db62402be8 fix tests 2018-11-27 14:35:00 -08:00
Barbara Miller
b204e9aec1 Merge branch 'service-worker' into qa 2018-11-27 12:58:47 -08:00
Noah Levitt
f63947cfe9 fetch service worker script with proper headers 2018-11-27 12:35:33 -08:00
Barbara Miller
bc6a2f4b95 handle http auth (#138)
abort brozzling on interstitial (auth dialog)

because we have no other recourse at this point. waiting on Network.requestIntercepted auth challenge support. (didn't work in our latest testing)
https://chromedevtools.github.io/devtools-protocol/tot/Network#type-AuthChallengeResponse
2018-11-16 22:30:49 -08:00
Noah Levitt
574af7846e bump version after merge 2018-11-16 15:10:46 -08:00
Barbara Miller
e2b2542d4a handle http auth (#138)
abort brozzling on interstitial (auth dialog)

because we have no other recourse at this point. waiting on Network.requestIntercepted auth challenge support. (didn't work in our latest testing)
https://chromedevtools.github.io/devtools-protocol/tot/Network#type-AuthChallengeResponse
2018-11-16 15:10:30 -08:00
Noah Levitt
05fab8b909 change time limit enforcement
enforce time limit based on all the time that a site was in active
rotation, including time it spent waiting for its turn to be brozzled;
this undoes the change from b9640b8a30, because now it seems that
was the wrong decision (brozzler jobs with many seeds and low
max_claimed_sites hanging around forever)
2018-11-12 16:21:38 -08:00
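The difference between the two enforcement policies can be sketched roughly as follows. This is an illustration only: the function and parameter names are hypothetical, and the real brozzler data model may track these times differently.

```python
def limit_reached_by_rotation_time(entered_rotation_at, now, time_limit):
    """Policy described in this commit: count all wall-clock time since the
    site entered active rotation, including time spent waiting its turn."""
    return now - entered_rotation_at >= time_limit


def limit_reached_by_active_time(active_periods, time_limit):
    """Earlier policy (assumed): count only the periods, given as
    (start, stop) pairs, during which the site was actually being brozzled."""
    return sum(stop - start for start, stop in active_periods) >= time_limit


# A site that entered rotation at t=0, was brozzled for only 100 seconds
# in total, and is checked at t=4000 against a 3600 second limit.
periods = [(0, 50), (3000, 3050)]
print(limit_reached_by_rotation_time(0, 4000, 3600))
print(limit_reached_by_active_time(periods, 3600))
```

Under the active-time policy a site that rarely gets a turn may never reach its limit, which matches the "hanging around forever" symptom mentioned in the message.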
Noah Levitt
9db7744f2c Merge branch 'master' into qa
* master:
  fail quickly if browser dies at startup
2018-11-01 15:57:52 -07:00
Noah Levitt
15610fa990 fail quickly if browser dies at startup
instead of trying to retrieve /json for 600 seconds
2018-11-01 15:57:03 -07:00
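One way to fail quickly is to check whether the browser process is still alive on every polling iteration. The sketch below assumes `proc` behaves like a `subprocess.Popen` object and that `probe` is a caller-supplied function returning `None` until the browser answers (for example by fetching `/json`). Both names are hypothetical and this is not the actual brozzler code.

```python
import time


def wait_for_browser(proc, probe, timeout=600, interval=0.5):
    """Poll probe() until it returns a value, but give up immediately
    if the browser process has already exited."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Popen.poll() returns None while the process is still running.
        if proc.poll() is not None:
            raise RuntimeError(
                "browser exited during startup with status %s" % proc.returncode
            )
        result = probe()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError("browser not ready after %s seconds" % timeout)
```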
Noah Levitt
27ba877932 Merge branch 'master' into qa
* master:
  handle exceptions extracting links
  fix reported chromium crash by removing argument
  bump version after merge
  remove stray bad logging line
  tests expect outlinks to be a set
  tidy up some comments and docs
  watch pages as outlinks from youtube-dl playlists
  silence youtube-dl's logging, use only our own
  use a thread-local callback in monkey-patched
  skip downloading videos from youtube playlists
  trace-level logging for all the chrome output
2018-10-29 17:45:09 -07:00
Noah Levitt
1073431f76 handle exceptions extracting links
like this one:
Uncaught DOMException: Blocked a frame with origin "https://www.youtube.com" from accessing a cross-origin frame.
    at __brzl_compileOutlinks (<anonymous>:4:24)
    at __brzl_compileOutlinks (<anonymous>:10:29)
    at <anonymous>:16:1
__brzl_compileOutlinks @ VM194:4
__brzl_compileOutlinks @ VM194:10

not sure exactly why this happens but we just have to handle it
2018-10-29 17:42:25 -07:00
Noah Levitt
af85f28908 fix reported chromium crash by removing argument
--single-process
https://github.com/internetarchive/brozzler/issues/128
2018-10-22 14:28:31 -07:00
Barbara Miller
f3f9505657 Merge branch 'pageInterstitialShown' into qa 2018-10-17 17:24:39 -07:00
Barbara Miller
181e7ab85d page.interstitial exception test 2018-10-17 17:23:50 -07:00
Noah Levitt
20996fa501 bump version after merge 2018-10-12 12:46:09 -07:00
jkafader
8fc800d1ef
Merge pull request #127 from nlevitt/ydl-improvements
Ydl improvements
2018-10-12 11:55:47 -07:00
Noah Levitt
65fad5e8bf remove stray bad logging line 2018-10-12 11:35:47 -07:00
Noah Levitt
7497b7e5ac tests expect outlinks to be a set 2018-10-12 11:03:54 -07:00
Noah Levitt
054ba6d7a0 tidy up some comments and docs 2018-10-12 00:48:38 -07:00
Noah Levitt
8f9077fbf3 watch pages as outlinks from youtube-dl playlists
and bypass downloading metadata about individual videos as well as the
videos themselves (for youtube playlists), because even just the
metadata can take many minutes or hours in case of thousands of videos
2018-10-12 00:41:16 -07:00
Noah Levitt
9211fb45ec silence youtube-dl's logging, use only our own
because youtube-dl's can be annoyingly verbose, confusing, doesn't tell
us the things we're interested in, and doesn't tell us where the
messages originate
2018-10-12 00:39:37 -07:00
Noah Levitt
e5536182dc use a thread-local callback in monkey-patched
finish_frag_download, instead of locking around monkey-patching, to
allow different threads to youtube-dl concurrently, but still not
interfere with each other
2018-10-11 23:28:34 -07:00
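The thread-local approach can be illustrated with a small self-contained sketch. `Downloader` stands in for the third-party class being patched and `download_with_callback` is a hypothetical wrapper. Only the method name `finish_frag_download` comes from the commit message.

```python
import threading


class Downloader:
    """Stand-in for a third-party class whose method we want to hook."""

    def finish_frag_download(self, ctx):
        return "finished %s" % ctx


_thread_local = threading.local()
_orig_finish = Downloader.finish_frag_download


def _patched_finish(self, ctx):
    result = _orig_finish(self, ctx)
    # Each thread sees only the callback it registered itself.
    callback = getattr(_thread_local, "callback", None)
    if callback is not None:
        callback(ctx)
    return result


# Patch once, up front, so no lock is needed around the patching itself.
Downloader.finish_frag_download = _patched_finish


def download_with_callback(ctx, callback):
    _thread_local.callback = callback
    try:
        return Downloader().finish_frag_download(ctx)
    finally:
        _thread_local.callback = None
```

Because the callback lives in `threading.local()` storage, concurrent threads can use the patched method at the same time without seeing each other's callbacks.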
Noah Levitt
82cf5c6dbb skip downloading videos from youtube playlists
because we expect to capture videos from individual watch pages, and
often processing thousands of videos with youtube-dl before the page is
ever opened in the browser is not desired behavior and is a crawling
problem
2018-10-11 15:46:30 -07:00
Noah Levitt
e406e42312 trace-level logging for all the chrome output
because the important/unimportant messages are always shifting and we're
not even trying to keep up with it and mostly it's just noise
2018-10-11 15:43:05 -07:00
Noah Levitt
16c56fed5a Merge branch 'master' into qa
* master:
  hopefully fixes lingering ydl concurrency issue
2018-10-11 13:43:06 -07:00
Noah Levitt
1e95441ce7 hopefully fixes lingering ydl concurrency issue
which was causing awfulness like this:
2018-09-30 04:39:54,410 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://www.facebook.com/CongresswomanRosaDeLauro/
2018-09-30 04:39:58,092 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - 1080x607' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8001 with url youtube-dl:00037:https://instagram.com/p/BfJvqhfnQ0C/
2018-09-30 04:40:05,120 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc107.us.archive.org:8000 with url youtube-dl:00009:https://www.facebook.com/LDS
2018-09-30 04:40:09,450 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc407.us.archive.org:8000 with url youtube-dl:00048:https://www.youtube.com/watch?v=-gH28zrMmAM
2018-09-30 04:40:14,327 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc108.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepTedLieu/status/1010212963897233408
2018-09-30 04:40:23,018 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc048.us.archive.org:8001 with url youtube-dl:00005:https://www.facebook.com/SenDuckworth/
2018-09-30 04:40:29,553 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8000 with url youtube-dl:00009:http://www.facebook.com/repkathleenrice
2018-09-30 04:40:37,057 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc406.us.archive.org:8000 with url youtube-dl:00023:https://www.youtube.com/watch?v=MaamqVF87mE
2018-09-30 04:40:41,298 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc403.us.archive.org:8000 with url youtube-dl:00039:https://www.youtube.com/watch?v=pRpMp4H8El0
2018-09-30 04:40:45,613 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepKevinCramer/status/999771072206639104
i.e. pushing the same stitched-up video to a bunch of wrong places :(
2018-10-11 13:40:57 -07:00
Noah Levitt
ff64e32bd3 Merge branch 'master' into qa
* master:
  brozzler-worker log version number at startup
2018-10-11 13:31:47 -07:00
Noah Levitt
e519616f8e brozzler-worker log version number at startup 2018-10-11 13:31:37 -07:00
Noah Levitt
a75632bd95 Merge branch 'master' into qa
* master:
  bump version after merge
  fix another oversight
  ugh. oops
  Revert "add a github PR template for this repo"
  improve performance of brozzler-new-job
2018-09-28 15:27:51 -07:00
Noah Levitt
362a2347b9 bump version after merge 2018-09-28 15:27:40 -07:00
jkafader
d2b1843a6d
Merge pull request #122 from nlevitt/new-job-bulk-inserts
improve performance of brozzler-new-job
2018-09-28 15:26:33 -07:00
Noah Levitt
8de3e21103 fix another oversight 2018-09-28 14:45:45 -07:00
Noah Levitt
1ee36c38b9 ugh. oops 2018-09-28 13:44:54 -07:00
Noah Levitt
7137918005 Revert "add a github PR template for this repo"
This reverts commit 83552eb444.
2018-09-28 13:16:46 -07:00
Noah Levitt
7980b40ee3 improve performance of brozzler-new-job
by inserting pages and sites in bulk
2018-09-28 13:13:22 -07:00
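Bulk insertion generally means grouping documents into batches and issuing one database call per batch rather than one per document. The sketch below shows the idea with a hypothetical `insert_many` callable standing in for the database client. The batch size of 500 is an arbitrary assumption, not a value from the brozzler source.

```python
def batches(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def insert_in_bulk(insert_many, docs, batch_size=500):
    """Call insert_many once per batch and return the number of calls made."""
    calls = 0
    for batch in batches(docs, batch_size):
        insert_many(batch)
        calls += 1
    return calls


# Usage with a plain list standing in for the database table.
stored = []
calls = insert_in_bulk(stored.extend, list(range(1200)))
print(calls, len(stored))
```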
Noah Levitt
87ec0f2f90 Merge branch 'master' into qa
* master:
  bump doublethink dependency
  verbiage tweaks
  safety check and --force for brozzler-purge
  new command brozzler-purge
2018-09-28 11:12:35 -07:00
Barbara Miller
24fcca4919 Merge branch 'pageInterstitialShown' into qa 2018-09-27 16:31:28 -07:00
Barbara Miller
1054d2d644 return outlinks = [] 2018-09-27 16:22:12 -07:00