422 Commits

Author SHA1 Message Date
Barbara Miller
4ada3e01b7 Merge branch 'typos' into qa 2019-05-17 17:24:19 -07:00
Noah Levitt
aa2d491009 i don't know where pyyaml 5.8 came from 2019-05-16 01:29:05 -07:00
Noah Levitt
42ddfba923
Merge pull request #150 from nlevitt/purge-old
Purge old
2019-05-16 00:29:58 -07:00
Noah Levitt
f8db17ce3d bump version after merge 2019-05-16 00:22:29 -07:00
Barbara Miller
533a5e74ee Merge branch 'requestIntercepted' into qa 2019-05-14 12:00:23 -07:00
Noah Levitt
ee8ef23f0c fix mistake in job-conf.rst 2019-04-30 10:49:48 -07:00
Noah Levitt
411b3f266a bump version after merge 2019-04-09 22:07:51 +00:00
Noah Levitt
06e072a716 update some dependencies 2019-04-02 17:58:35 +00:00
Noah Levitt
d729c8d0d5 use yaml.safe_load()
getting new warnings
see https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
2019-03-18 15:49:44 -07:00
Noah Levitt
ef981706f4 fix rethinkdb dependency version 2019-03-18 15:08:36 -07:00
Noah Levitt
c686fc7443 Merge branch 'master' into qa
* master:
  peg to working doublethink
  Use disk cache params only on Chrome.start
  Remove stale comment
  Improve disk cache options
  Add disk cache options to Chrome
  update (C)
2019-03-14 20:06:12 +00:00
Noah Levitt
61274ae994 peg to working doublethink
see: https://github.com/internetarchive/doublethink/commit/f7fc7da725c9b
2019-03-14 20:04:09 +00:00
Barbara Miller
8b93c078b7 Merge branch 'instaInterval' into qa 2018-12-21 14:41:27 -08:00
Noah Levitt
6b8e597a43 bump version after merge 2018-12-20 11:30:49 -08:00
Barbara Miller
bf8bbfba27 Merge branch 'no-skipIframes' into qa 2018-12-20 11:25:54 -08:00
Noah Levitt
034f7938c4 catch common exception in default behavior 2018-12-20 10:46:05 -08:00
Noah Levitt
2cd64811b3 bump version after merge 2018-12-17 15:10:26 -08:00
Barbara Miller
cbd6f0f90a Merge branch 'insta18q4' into qa 2018-12-13 17:29:36 -08:00
Noah Levitt
15870e6010 avoid IndexError
in some cases we receive this event from the browser:
{"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}
2018-12-13 15:49:38 -08:00
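The empty `versions` list in the event above is what makes a bare `versions[0]` raise. A minimal sketch of the guard, assuming a hypothetical helper named `first_worker_version` (the name and return shape are illustrative and not taken from brozzler):

```python
import json


def first_worker_version(message):
    """Return the first entry of params.versions, or None if there is none.

    Hypothetical helper for illustration; not brozzler's actual handler.
    """
    versions = message.get("params", {}).get("versions") or []
    # Indexing versions[0] directly would raise IndexError on an empty list.
    return versions[0] if versions else None


# The exact event quoted in the commit message.
event = json.loads(
    '{"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}'
)
print(first_worker_version(event))
```

Returning `None` lets the caller skip the event instead of letting the exception escape from the message-handling code.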
Noah Levitt
b577fe3c36 log browser uncaught exceptions at debug level
didn't realize these weren't showing up as console messages
2018-12-13 15:45:35 -08:00
Noah Levitt
b447063099 Merge branch 'master' into qa
* master:
  bump version after merge
  change time limit enforcement
2018-11-29 14:52:32 -08:00
Noah Levitt
ebcc063fe2 bump version after merge 2018-11-29 14:52:11 -08:00
Barbara Miller
b204e9aec1 Merge branch 'service-worker' into qa 2018-11-27 12:58:47 -08:00
Noah Levitt
574af7846e bump version after merge 2018-11-16 15:10:46 -08:00
Noah Levitt
9db7744f2c Merge branch 'master' into qa
* master:
  fail quickly if browser dies at startup
2018-11-01 15:57:52 -07:00
Noah Levitt
15610fa990 fail quickly if browser dies at startup
instead of trying to retrieve /json for 600 seconds
2018-11-01 15:57:03 -07:00
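The idea in this commit can be sketched as a readiness loop that also checks whether the browser process is still alive. This is a sketch under stated assumptions, not brozzler's code: `wait_for_browser` and `is_ready` are hypothetical names, and `is_ready` stands in for "the /json endpoint responded".

```python
import subprocess
import sys
import time


def wait_for_browser(proc, is_ready, timeout=600, poll_interval=0.5):
    """Poll until is_ready() returns True, but give up at once if proc exits.

    proc is a subprocess.Popen; is_ready is a hypothetical zero-argument
    callable standing in for a successful request to /json.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            # The process is gone, so waiting out the timeout is pointless.
            raise RuntimeError(
                "browser exited with status %s during startup" % proc.returncode
            )
        if is_ready():
            return
        time.sleep(poll_interval)
    raise TimeoutError("browser not ready after %s seconds" % timeout)


# Stand-in for a browser that dies immediately at startup.
dead = subprocess.Popen([sys.executable, "-c", "raise SystemExit(3)"])
dead.wait()
try:
    wait_for_browser(dead, is_ready=lambda: False)
except RuntimeError as e:
    print(e)
```

Checking `proc.poll()` on every iteration is what turns a 600 second wait into an immediate failure when the process has already exited.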
Noah Levitt
27ba877932 Merge branch 'master' into qa
* master:
  handle exceptions extracting links
  fix reported chromium crash by removing argument
  bump version after merge
  remove stray bad logging line
  tests expect outlinks to be a set
  tidy up some comments and docs
  watch pages as outlinks from youtube-dl playlists
  silence youtube-dl's logging, use only our own
  use a thread-local callback in monkey-patched
  skip downloading videos from youtube playlists
  trace-level logging for all the chrome output
2018-10-29 17:45:09 -07:00
Noah Levitt
1073431f76 handle exceptions extracting links
like this one:
Uncaught DOMException: Blocked a frame with origin "https://www.youtube.com" from accessing a cross-origin frame.
    at __brzl_compileOutlinks (<anonymous>:4:24)
    at __brzl_compileOutlinks (<anonymous>:10:29)
    at <anonymous>:16:1
__brzl_compileOutlinks @ VM194:4
__brzl_compileOutlinks @ VM194:10

not sure exactly why this happens but we just have to handle it
2018-10-29 17:42:25 -07:00
Noah Levitt
af85f28908 fix reported chromium crash by removing argument
--single-process
https://github.com/internetarchive/brozzler/issues/128
2018-10-22 14:28:31 -07:00
Noah Levitt
20996fa501 bump version after merge 2018-10-12 12:46:09 -07:00
Noah Levitt
82cf5c6dbb skip downloading videos from youtube playlists
because we expect to capture videos from individual watch pages, and
often processing thousands of videos with youtube-dl before the page is
ever opened in the browser is not desired behavior and is a crawling
problem
2018-10-11 15:46:30 -07:00
Noah Levitt
16c56fed5a Merge branch 'master' into qa
* master:
  hopefully fixes lingering ydl concurrency issue
2018-10-11 13:43:06 -07:00
Noah Levitt
1e95441ce7 hopefully fixes lingering ydl concurrency issue
which was causing awfulness like this:
2018-09-30 04:39:54,410 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://www.facebook.com/CongresswomanRosaDeLauro/
2018-09-30 04:39:58,092 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - 1080x607' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8001 with url youtube-dl:00037:https://instagram.com/p/BfJvqhfnQ0C/
2018-09-30 04:40:05,120 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc107.us.archive.org:8000 with url youtube-dl:00009:https://www.facebook.com/LDS
2018-09-30 04:40:09,450 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc407.us.archive.org:8000 with url youtube-dl:00048:https://www.youtube.com/watch?v=-gH28zrMmAM
2018-09-30 04:40:14,327 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc108.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepTedLieu/status/1010212963897233408
2018-09-30 04:40:23,018 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc048.us.archive.org:8001 with url youtube-dl:00005:https://www.facebook.com/SenDuckworth/
2018-09-30 04:40:29,553 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8000 with url youtube-dl:00009:http://www.facebook.com/repkathleenrice
2018-09-30 04:40:37,057 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc406.us.archive.org:8000 with url youtube-dl:00023:https://www.youtube.com/watch?v=MaamqVF87mE
2018-09-30 04:40:41,298 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc403.us.archive.org:8000 with url youtube-dl:00039:https://www.youtube.com/watch?v=pRpMp4H8El0
2018-09-30 04:40:45,613 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepKevinCramer/status/999771072206639104
i.e. pushing the same stitched-up video to a bunch of wrong places :(
2018-10-11 13:40:57 -07:00
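The log above shows one payload (228243844 bytes) being pushed under many unrelated URLs, which is a typical symptom of state shared across threads. The listing does not show the actual fix, so the following is only a generic sketch of keeping per-thread state in `threading.local`; every name in it (`_state`, `set_current_url`, `push_video`, `worker`) is hypothetical.

```python
import threading

# Each thread sees its own independent attributes on this object.
_state = threading.local()


def set_current_url(url):
    """Record the URL the calling thread is working on."""
    _state.url = url


def push_video(payload):
    """Pair the payload with the calling thread's own URL."""
    return (_state.url, payload)


def worker(url, results, barrier):
    set_current_url(url)
    # Wait until every thread has set its URL, so a shared global
    # would have been overwritten by now.
    barrier.wait()
    results[url] = push_video("video-for-" + url)


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
results = {}
barrier = threading.Barrier(len(urls))
threads = [
    threading.Thread(target=worker, args=(u, results, barrier)) for u in urls
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for u in urls:
    print(u, "->", results[u])
```

With a plain module-level variable in place of `_state`, every thread would push under whichever URL was set last; with `threading.local`, each payload stays paired with its own URL.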
Noah Levitt
ff64e32bd3 Merge branch 'master' into qa
* master:
  brozzler-worker log version number at startup
2018-10-11 13:31:47 -07:00
Noah Levitt
e519616f8e brozzler-worker log version number at startup 2018-10-11 13:31:37 -07:00
Noah Levitt
a75632bd95 Merge branch 'master' into qa
* master:
  bump version after merge
  fix another oversight
  ugh. oops
  Revert "add a github PR template for this repo"
  improve performance of brozzler-new-job
2018-09-28 15:27:51 -07:00
Noah Levitt
362a2347b9 bump version after merge 2018-09-28 15:27:40 -07:00
Noah Levitt
87ec0f2f90 Merge branch 'master' into qa
* master:
  bump doublethink dependency
  verbiage tweaks
  safety check and --force for brozzler-purge
  new command brozzler-purge
2018-09-28 11:12:35 -07:00
Noah Levitt
2386e85a37 bump doublethink dependency 2018-09-27 14:25:49 -07:00
Noah Levitt
174178e02e new command brozzler-purge 2018-09-25 14:56:26 -07:00
Barbara Miller
60cfd684b2 Merge branch 'pageInterstitialShown' into qa 2018-09-25 10:30:02 -07:00
Noah Levitt
48bf185746 bump version after merge 2018-09-18 11:08:44 -07:00
Neil Minton
3c7fdeae2c Merge branch 'ari-5777' into qa 2018-09-12 12:07:45 -04:00
Noah Levitt
efb0696833 bump version number after merge 2018-09-06 16:17:59 -07:00
jkafader
8368cd2bcb
Merge pull request #115 from nlevitt/ydl-stitched
Ydl stitched
2018-09-06 16:15:52 -07:00
Noah Levitt
c4fdbe578d Merge branch 'master' into qa
* master:
  oops, back to dev version number
  wait 20 seconds to claim sites if none were avail-
  tweak logging
  why did those tests fail??? (#117)
  Add screenshots
  Add screenshots
  back to dev version
  1.4 for pypi
  explain --warcprox-auto briefly
  vagrant readme fixes (thanks funkyfuture)
  update cryptography dep version
2018-09-04 10:54:26 -07:00
Noah Levitt
a4eacb5b8f oops, back to dev version number 2018-09-04 10:52:34 -07:00
Noah Levitt
88d3d3b310
why did those tests fail??? (#117)
1.4 for pypi
2018-08-22 14:35:39 -07:00
Noah Levitt
2a2952e810 back to dev version 2018-08-21 15:18:18 -07:00
Noah Levitt
b63661ea70 1.4 for pypi 2018-08-21 15:15:38 -07:00