398 Commits

Author SHA1 Message Date
Noah Levitt
9db7744f2c Merge branch 'master' into qa
* master:
  fail quickly if browser dies at startup
2018-11-01 15:57:52 -07:00
Noah Levitt
15610fa990 fail quickly if browser dies at startup
instead of trying to retrieve /json for 600 seconds
2018-11-01 15:57:03 -07:00
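The idea in this commit (stop waiting as soon as the browser process is gone, rather than polling until the full timeout) can be sketched roughly as follows. This is a minimal illustration and not brozzler's actual code: `wait_until_ready`, `BrowserDied`, and the `is_ready` callable are hypothetical names, and the 600-second default is taken only from the commit message.

```python
import subprocess
import sys
import time


class BrowserDied(Exception):
    """Raised when the process exits before it becomes ready (hypothetical name)."""


def wait_until_ready(proc, is_ready, timeout=600.0, poll_interval=0.5):
    """Wait for is_ready() to return True, but give up as soon as proc exits.

    proc is a subprocess.Popen; is_ready is any zero-argument callable,
    for example one that tries to fetch the browser's /json endpoint.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Check the process first: if it has already exited there is
        # no point in waiting out the rest of the timeout.
        returncode = proc.poll()
        if returncode is not None:
            raise BrowserDied("process exited with status %s" % returncode)
        if is_ready():
            return
        time.sleep(poll_interval)
    raise TimeoutError("not ready after %s seconds" % timeout)
```

A caller would pass the `Popen` object for the browser and a readiness check of its own; a dead process then surfaces as `BrowserDied` within one poll interval.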
Noah Levitt
27ba877932 Merge branch 'master' into qa
* master:
  handle exceptions extracting links
  fix reported chromium crash by removing argument
  bump version after merge
  remove stray bad logging line
  tests expect outlinks to be a set
  tidy up some comments and docs
  watch pages as outlinks from youtube-dl playlists
  silence youtube-dl's logging, use only our own
  use a thread-local callback in monkey-patched
  skip downloading videos from youtube playlists
  trace-level logging for all the chrome output
2018-10-29 17:45:09 -07:00
Noah Levitt
1073431f76 handle exceptions extracting links
like this one:
Uncaught DOMException: Blocked a frame with origin "https://www.youtube.com" from accessing a cross-origin frame.
    at __brzl_compileOutlinks (<anonymous>:4:24)
    at __brzl_compileOutlinks (<anonymous>:10:29)
    at <anonymous>:16:1
__brzl_compileOutlinks @ VM194:4
__brzl_compileOutlinks @ VM194:10

not sure exactly why this happens but we just have to handle it
2018-10-29 17:42:25 -07:00
Noah Levitt
af85f28908 fix reported chromium crash by removing argument
--single-process
https://github.com/internetarchive/brozzler/issues/128
2018-10-22 14:28:31 -07:00
Noah Levitt
20996fa501 bump version after merge 2018-10-12 12:46:09 -07:00
Noah Levitt
82cf5c6dbb skip downloading videos from youtube playlists
because we expect to capture videos from individual watch pages, and
often processing thousands of videos with youtube-dl before the page is
ever opened in the browser is not desired behavior and is a crawling
problem
2018-10-11 15:46:30 -07:00
Noah Levitt
16c56fed5a Merge branch 'master' into qa
* master:
  hopefully fixes lingering ydl concurrency issue
2018-10-11 13:43:06 -07:00
Noah Levitt
1e95441ce7 hopefully fixes lingering ydl concurrency issue
which was causing awfulness like this:
2018-09-30 04:39:54,410 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://www.facebook.com/CongresswomanRosaDeLauro/
2018-09-30 04:39:58,092 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - 1080x607' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8001 with url youtube-dl:00037:https://instagram.com/p/BfJvqhfnQ0C/
2018-09-30 04:40:05,120 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc107.us.archive.org:8000 with url youtube-dl:00009:https://www.facebook.com/LDS
2018-09-30 04:40:09,450 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc407.us.archive.org:8000 with url youtube-dl:00048:https://www.youtube.com/watch?v=-gH28zrMmAM
2018-09-30 04:40:14,327 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc108.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepTedLieu/status/1010212963897233408
2018-09-30 04:40:23,018 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc048.us.archive.org:8001 with url youtube-dl:00005:https://www.facebook.com/SenDuckworth/
2018-09-30 04:40:29,553 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8000 with url youtube-dl:00009:http://www.facebook.com/repkathleenrice
2018-09-30 04:40:37,057 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc406.us.archive.org:8000 with url youtube-dl:00023:https://www.youtube.com/watch?v=MaamqVF87mE
2018-09-30 04:40:41,298 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc403.us.archive.org:8000 with url youtube-dl:00039:https://www.youtube.com/watch?v=pRpMp4H8El0
2018-09-30 04:40:45,613 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepKevinCramer/status/999771072206639104
i.e. pushing the same stitched-up video to a bunch of wrong places :(
2018-10-11 13:40:57 -07:00
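The commit message does not say what the fix was. One common cause of this kind of symptom (the same payload going out from several threads to the wrong destinations) is per-job state shared between worker threads, and a later commit in this log mentions a thread-local callback. Assuming that is the relevant technique, a minimal sketch of keeping such state in `threading.local` looks like this; `set_destination` and `push_video` are hypothetical names, not brozzler functions.

```python
import threading

# Each thread sees its own attributes on this object, so one worker
# cannot overwrite another worker's destination.
_state = threading.local()


def set_destination(dest):
    """Record the destination for the calling thread only."""
    _state.destination = dest


def push_video(payload):
    """Return (destination, payload) using the calling thread's destination."""
    dest = getattr(_state, "destination", None)
    if dest is None:
        raise RuntimeError("no destination set for this thread")
    return (dest, payload)
```

With a module-level global in place of `_state`, whichever thread called `set_destination` last would decide where every thread pushes.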
Noah Levitt
ff64e32bd3 Merge branch 'master' into qa
* master:
  brozzler-worker log version number at startup
2018-10-11 13:31:47 -07:00
Noah Levitt
e519616f8e brozzler-worker log version number at startup 2018-10-11 13:31:37 -07:00
Noah Levitt
a75632bd95 Merge branch 'master' into qa
* master:
  bump version after merge
  fix another oversight
  ugh. oops
  Revert "add a github PR template for this repo"
  improve performance of brozzler-new-job
2018-09-28 15:27:51 -07:00
Noah Levitt
362a2347b9 bump version after merge 2018-09-28 15:27:40 -07:00
Noah Levitt
87ec0f2f90 Merge branch 'master' into qa
* master:
  bump doublethink dependency
  verbiage tweaks
  safety check and --force for brozzler-purge
  new command brozzler-purge
2018-09-28 11:12:35 -07:00
Noah Levitt
2386e85a37 bump doublethink dependency 2018-09-27 14:25:49 -07:00
Noah Levitt
174178e02e new command brozzler-purge 2018-09-25 14:56:26 -07:00
Barbara Miller
60cfd684b2 Merge branch 'pageInterstitialShown' into qa 2018-09-25 10:30:02 -07:00
Noah Levitt
48bf185746 bump version after merge 2018-09-18 11:08:44 -07:00
Neil Minton
3c7fdeae2c Merge branch 'ari-5777' into qa 2018-09-12 12:07:45 -04:00
Noah Levitt
efb0696833 bump version number after merge 2018-09-06 16:17:59 -07:00
jkafader
8368cd2bcb
Merge pull request #115 from nlevitt/ydl-stitched
Ydl stitched
2018-09-06 16:15:52 -07:00
Noah Levitt
c4fdbe578d Merge branch 'master' into qa
* master:
  oops, back to dev version number
  wait 20 seconds to claim sites if none were avail-
  tweak logging
  why did those tests fail??? (#117)
  Add screenshots
  Add screenshots
  back to dev version
  1.4 for pypi
  explain --warcprox-auto briefly
  vagrant readme fixes (thanks funkyfuture)
  update cryptography dep version
2018-09-04 10:54:26 -07:00
Noah Levitt
a4eacb5b8f oops, back to dev version number 2018-09-04 10:52:34 -07:00
Noah Levitt
88d3d3b310
why did those tests fail??? (#117)
1.4 for pypi
2018-08-22 14:35:39 -07:00
Noah Levitt
2a2952e810 back to dev version 2018-08-21 15:18:18 -07:00
Noah Levitt
b63661ea70 1.4 for pypi 2018-08-21 15:15:38 -07:00
Noah Levitt
eaf7ef74be explain --warcprox-auto briefly 2018-08-17 12:06:04 -07:00
Noah Levitt
8cdc3dee21 Merge branch 'master' into ydl-stitched
* master:
  vagrant readme fixes (thanks funkyfuture)
  update cryptography dep version
2018-08-17 10:34:00 -07:00
Noah Levitt
d19e139101 vagrant readme fixes (thanks funkyfuture) 2018-08-17 10:31:01 -07:00
Noah Levitt
ffa8021968 update cryptography dep version
github tells me there's a vulnerability <2.3
2018-08-16 14:32:03 -07:00
Noah Levitt
cbeba3a6b9 Merge branch 'ydl-stitched' into qa
* ydl-stitched:
  fix failing tests
  test for youtube-dl stitch-up
  add missing imports and fix mimetype issue
  move youtube-dl code into separate file
  push youtube-dl's stitched up videos to warcprox
2018-08-16 12:10:44 -07:00
Noah Levitt
418a3ef20c Merge branch 'master' into qa
* master:
  expose more brozzle-page args
  update pillow dependency to get rid of github vul-
  more readme edits
  reformat readme to 80 columns
  Copy edits to job-conf readme
  bump up heartbeat interval (see comment)
  Copy edits
  back to dev version
  version 1.3 (messed up 1.2)
  setuptools wants README not readme
  back to dev version number
  version 1.2
  bump dev version after merge
  is test_time_limit is failing because of timing?
2018-08-16 12:08:48 -07:00
Noah Levitt
3c27132aaa test for youtube-dl stitch-up 2018-08-15 17:42:53 -07:00
Noah Levitt
39155ebcc5 push youtube-dl's stitched up videos to warcprox
(no tests yet)
2018-08-13 15:40:48 -07:00
Noah Levitt
4e398e1da2 expose more brozzle-page args 2018-08-13 15:38:24 -07:00
Noah Levitt
b44a444dc2 update pillow dependency to get rid of github vulnerability warning
2018-07-24 16:37:25 -05:00
Noah Levitt
9d18dc6aeb bump up heartbeat interval (see comment) 2018-07-03 18:35:08 -05:00
Noah Levitt
783fd0ea87 back to dev version 2018-06-25 19:32:27 +00:00
Noah Levitt
bd63908fb9 version 1.3 (messed up 1.2) 2018-06-25 19:30:39 +00:00
Noah Levitt
2780c92569 setuptools wants README not readme 2018-06-25 19:10:57 +00:00
Noah Levitt
032c7d2898 back to dev version number 2018-06-25 12:33:34 -05:00
Noah Levitt
442d02b26a version 1.2 2018-06-25 12:21:00 -05:00
Noah Levitt
196cd555ea bump dev version after merge 2018-06-25 11:44:45 -05:00
Noah Levitt
27bdfb65d2 monkey-patch youtube-dl to short-circuit
video extraction using generic extractor in case of very large url (more
than 20 mb) that youtube-dl interprets as html, to avoid spinning
forever here:

Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:50:22 -07:00
Noah Levitt
109d05c59a Merge branch 'master' into qa
* master:
  monkey-patch youtube-dl to short-circuit
2018-06-11 11:11:09 -07:00
Noah Levitt
a90a29968c monkey-patch youtube-dl to short-circuit
video extraction using generic extractor in case of very large url (more
than 20 mb) that youtube-dl interprets as html, to avoid spinning
forever here:

Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:08:20 -07:00
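The general shape of such a monkey-patch (wrap a third-party method so that oversized input is rejected before an expensive regex scan runs) can be sketched as below. This is an illustration under stated assumptions, not the patch from this commit: `GenericExtractor` is a hypothetical stand-in class rather than youtube-dl's API, `TooLarge` and `patch_with_size_limit` are made-up names, and the limit is measured in characters for simplicity.

```python
import functools
import re

# Threshold suggested by the commit message ("more than 20 mb");
# measured here in characters, which is a simplification.
MAX_BODY_LENGTH = 20 * 1024 * 1024


class GenericExtractor:
    """Hypothetical stand-in for a third-party extractor class."""

    def extract(self, body):
        # A regex scan over the whole body, slow on very large input.
        return re.findall(r'src="([^"]+)"', body)


class TooLarge(Exception):
    """Raised instead of scanning a body that exceeds the limit."""


def patch_with_size_limit(cls, method_name, limit=MAX_BODY_LENGTH):
    """Replace cls.method_name with a wrapper that refuses oversized bodies.

    Returns the original function so the patch can be undone.
    """
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def guarded(self, body, *args, **kwargs):
        if len(body) > limit:
            raise TooLarge("body of length %d exceeds %d" % (len(body), limit))
        return original(self, body, *args, **kwargs)

    setattr(cls, method_name, guarded)
    return original
```

Calling `patch_with_size_limit(GenericExtractor, "extract")` once at startup guards every later call; the caller can catch `TooLarge` and skip extraction for that page.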
Noah Levitt
5c34bd3119 Merge branch 'master' into qa
* master:
  lowercase readme.rst
  explain brozzler use of warcprox_meta
  update README copyright date
  bump dev version after PR #102
  these ssurts are strings too
  fix bad copy/paste
  ssurts are strings now
  travis-ci install warcprox from github
  incorporate urlcanon fix
  update warcprox dependency to include recent fixes
  backward compatibility for old scope["surt"]
  missed a spot where is_permitted_by_robots needs monkeying
  handle new chrome cookie db schema
  describe scope rule conditions
  more explication of scoping
  update docs to match new seed ssurt behavior
  ok seriously tests
  fix more tests for new approach sans scope['surt']
  s/max_hops_off_surt/max_hops_off/
  new test of max_hops_off
  rename page.hops_off_surt to page.hops_off
  doublethink had a bug fix
  tests for new approach without scope['surt']
  tests for new approach without of scope['surt']
  WIP add an accept rule instead of modifying surt
  WIP some words on scoping
  WIP starting to flesh out "scoping" section
  WIP some explanation of automatic login
  WIP documentation!
2018-06-01 16:46:32 -07:00
Noah Levitt
62bb540a11 lowercase readme.rst 2018-05-31 18:46:37 +00:00
Noah Levitt
8906037d82 bump dev version after PR #102 2018-05-16 17:33:52 -07:00
Noah Levitt
e90e7345a5
Merge pull request #102 from nlevitt/docs
complete job configuration documentation
2018-05-16 17:31:27 -07:00