368 Commits

Author SHA1 Message Date
Noah Levitt
cbeba3a6b9 Merge branch 'ydl-stitched' into qa
* ydl-stitched:
  fix failing tests
  test for youtube-dl stitch-up
  add missing imports and fix mimetype issue
  move youtube-dl code into separate file
  push youtube-dl's stitched up videos to warcprox
2018-08-16 12:10:44 -07:00
Noah Levitt
418a3ef20c Merge branch 'master' into qa
* master:
  expose more brozzle-page args
  update pillow dependency to get rid of github vulnerability warning
  more readme edits
  reformat readme to 80 columns
  Copy edits to job-conf readme
  bump up heartbeat interval (see comment)
  Copy edits
  back to dev version
  version 1.3 (messed up 1.2)
  setuptools wants README not readme
  back to dev version number
  version 1.2
  bump dev version after merge
  is test_time_limit failing because of timing?
2018-08-16 12:08:48 -07:00
Noah Levitt
3c27132aaa test for youtube-dl stitch-up 2018-08-15 17:42:53 -07:00
Noah Levitt
39155ebcc5 push youtube-dl's stitched up videos to warcprox
(no tests yet)
2018-08-13 15:40:48 -07:00
Noah Levitt
4e398e1da2 expose more brozzle-page args 2018-08-13 15:38:24 -07:00
Noah Levitt
b44a444dc2 update pillow dependency to get rid of github vulnerability warning
2018-07-24 16:37:25 -05:00
Noah Levitt
9d18dc6aeb bump up heartbeat interval (see comment) 2018-07-03 18:35:08 -05:00
Noah Levitt
783fd0ea87 back to dev version 2018-06-25 19:32:27 +00:00
Noah Levitt
bd63908fb9 version 1.3 (messed up 1.2) 2018-06-25 19:30:39 +00:00
Noah Levitt
2780c92569 setuptools wants README not readme 2018-06-25 19:10:57 +00:00
Noah Levitt
032c7d2898 back to dev version number 2018-06-25 12:33:34 -05:00
Noah Levitt
442d02b26a version 1.2 2018-06-25 12:21:00 -05:00
Noah Levitt
196cd555ea bump dev version after merge 2018-06-25 11:44:45 -05:00
Noah Levitt
27bdfb65d2 monkey-patch youtube-dl to short-circuit
video extraction using generic extractor in case of very large url (more
than 20 mb) that youtube-dl interprets as html, to avoid spinning
forever here:

Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:50:22 -07:00
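
The short-circuit described in this commit message can be sketched in general terms as follows. This is not the actual brozzler patch and it does not touch youtube-dl: `SlowExtractor`, `scan`, and `short_circuit_large_bodies` are hypothetical names, and the 20 MB cutoff is taken only from the "more than 20 mb" wording above. The sketch shows the monkey-patching technique: wrap a method on someone else's class so that oversized input is skipped before the expensive scan begins.

```python
import functools

MAX_HTML_BYTES = 20 * 1024 * 1024  # assumed cutoff, from "more than 20 mb"


class SlowExtractor:
    """Hypothetical stand-in for a third-party extractor class."""

    def scan(self, url, body):
        # Imagine an expensive regex scan over the whole body here.
        return {"url": url, "scanned_bytes": len(body)}


def short_circuit_large_bodies(cls, method_name, limit=MAX_HTML_BYTES):
    """Replace cls.method_name with a wrapper that skips oversized bodies."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, url, body):
        if len(body) > limit:
            # Too large to scan safely: report "nothing found" instead.
            return None
        return original(self, url, body)

    setattr(cls, method_name, wrapper)
    return original  # returned so the caller can undo the patch


# Usage: a tiny limit keeps the demonstration cheap.
short_circuit_large_bodies(SlowExtractor, "scan", limit=1024)
```

Patching the class rather than an instance means every extractor created afterwards gets the guard, which suits a case where a library constructs its own objects internally.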
Noah Levitt
109d05c59a Merge branch 'master' into qa
* master:
  monkey-patch youtube-dl to short-circuit
2018-06-11 11:11:09 -07:00
Noah Levitt
a90a29968c monkey-patch youtube-dl to short-circuit
video extraction using generic extractor in case of very large url (more
than 20 mb) that youtube-dl interprets as html, to avoid spinning
forever here:

Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:08:20 -07:00
Noah Levitt
5c34bd3119 Merge branch 'master' into qa
* master:
  lowercase readme.rst
  explain brozzler use of warcprox_meta
  update README copyright date
  bump dev version after PR #102
  these ssurts are strings too
  fix bad copy/paste
  ssurts are strings now
  travis-ci install warcprox from github
  incorporate urlcanon fix
  update warcprox dependency to include recent fixes
  backward compatibility for old scope["surt"]
  missed a spot where is_permitted_by_robots needs monkeying
  handle new chrome cookie db schema
  describe scope rule conditions
  more explication of scoping
  update docs to match new seed ssurt behavior
  ok seriously tests
  fix more tests for new approach sans scope['surt']
  s/max_hops_off_surt/max_hops_off/
  new test of max_hops_off
  rename page.hops_off_surt to page.hops_off
  doublethink had a bug fix
  tests for new approach without scope['surt']
  tests for new approach without scope['surt']
  WIP add an accept rule instead of modifying surt
  WIP some words on scoping
  WIP starting to flesh out "scoping" section
  WIP some explanation of automatic login
  WIP documentation!
2018-06-01 16:46:32 -07:00
Noah Levitt
62bb540a11 lowercase readme.rst 2018-05-31 18:46:37 +00:00
Noah Levitt
8906037d82 bump dev version after PR #102 2018-05-16 17:33:52 -07:00
Noah Levitt
e90e7345a5 Merge pull request #102 from nlevitt/docs
complete job configuration documentation
2018-05-16 17:31:27 -07:00
Noah Levitt
ac735639ff incorporate urlcanon fix 2018-05-16 14:41:49 -07:00
Noah Levitt
338d2e48f9 update warcprox dependency to include recent fixes 2018-05-16 14:26:51 -07:00
Noah Levitt
a8de9b70d1 handle new chrome cookie db schema 2018-05-15 11:41:02 -07:00
Noah Levitt
60f2b99cc0 doublethink had a bug fix 2018-05-14 15:38:28 -07:00
Barbara Miller
d371db8166 Merge branch 'ARI-5617' into qa 2018-03-13 16:34:21 -07:00
Noah Levitt
55701ae373 bump version number after merge 2018-03-08 16:49:28 -08:00
Noah Levitt
f9834ca77d bump after merge 2018-03-02 11:51:50 -08:00
Noah Levitt
3d12daea06 Merge branch 'master' into qa
* master:
  bump up timeout waiting for websocket connection
  try taking screenshot 3 times, proceed on failure
2018-02-14 12:33:53 -08:00
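
The "try taking screenshot 3 times, proceed on failure" item in the merge above describes a bounded retry that gives up quietly. A minimal sketch of that pattern follows, assuming the caller can tolerate a missing result. `try_repeatedly` and its parameters are hypothetical names and are not taken from brozzler.

```python
import logging


def try_repeatedly(action, attempts=3):
    """Call action() up to `attempts` times; return None if every try fails."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            # Log and fall through to the next attempt.
            logging.warning(
                "attempt %d of %d failed", attempt, attempts, exc_info=True)
    # Every attempt failed: let the caller carry on without a result.
    return None
```

Returning `None` instead of re-raising is the "proceed on failure" part: the caller has to check for a missing result, but a flaky step no longer aborts the whole page.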
Noah Levitt
f8c41c5e8d bump up timeout waiting for websocket connection
We've been seeing some of this:

2018-02-14 20:16:44,011 17816 CRITICAL BrozzlingThread:36444 brozzler.worker.BrozzlerWorker.brozzle_site(worker.py:559) unexpected exception
Traceback (most recent call last):
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 528, in brozzle_site
    enable_youtube_dl=not self._skip_youtube_dl)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 385, in brozzle_page
    on_request)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 447, in _browse_page
    cookie_db=site.get('cookie_db'))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 338, in start
    self._wait_for(lambda: self.websock_thread.is_open, timeout=10)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 311, in _wait_for
    elapsed, callback))
brozzler.browser.BrowsingTimeout: timed out after 11.1s waiting for: <function Browser.start.<locals>.<lambda> at 0x7fb2dc772bd8>

Mostly at startup. Now that brozzler claims sites in batches for
brozzling, we have situations where we start up a whole bunch of
browsers at the same time. That's probably why in some cases they are
slow to establish the websocket connection.
2018-02-14 12:29:51 -08:00
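
The traceback in this commit shows a `_wait_for(callback, timeout=10)` call that raises once the callback has stayed false for too long. A polling helper of that general shape might look like the sketch below. It is an assumption about the pattern, not brozzler's code: `wait_for`, `WaitTimeout`, and `poll_interval` are hypothetical names, and only the error message format is modelled on the log line above.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical exception; BrowsingTimeout plays this role in the log."""


def wait_for(callback, timeout, poll_interval=0.05):
    """Poll callback() until it is truthy or `timeout` seconds have passed.

    Returns the number of seconds waited on success.
    """
    start = time.monotonic()
    while True:
        if callback():
            return time.monotonic() - start
        elapsed = time.monotonic() - start
        if elapsed > timeout:
            raise WaitTimeout(
                "timed out after %.1fs waiting for: %s" % (elapsed, callback))
        time.sleep(poll_interval)
```

With a helper like this, bumping up the timeout is a change to a single argument at the call site. The elapsed time in the error can exceed the nominal timeout, as it does in the log (11.1s against a timeout of 10), because the check only happens between polls.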
Noah Levitt
df0717d072 Merge branch 'master' into qa
* master:
  fix attempt for deadlock-ish situation
  fix unclosed file warnings when running python in debug mode
  give vagrant vm a name in virtualbox
  add note to readme about browser version
  check browser version at startup
  Reinstate logging
  Fix typo and block legacy google-analytics.com/ga.js
  Use Network.setBlockedUrls instead of Debugger to block URLs
  bump dev version after PR merge
  back to dev version number
  commit for beta release
  this should fix travis build?
  fix tests
  update brozzler-easy for current warcprox api
  claim sites to brozzle in batches to reduce contention over sites table
  lengthen site session brozzling time to 15 minutes
  fix needs_browsing check
  new test test_needs_browsing
  increase timeout waiting for screenshot
  Use TCP_NODELAY in websocket connection to improve performance
2018-02-13 17:10:10 -08:00
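
One item in the merge above, "Use TCP_NODELAY in websocket connection to improve performance", refers to a standard socket option that disables Nagle's algorithm so that small messages are sent immediately instead of being batched. The sketch below sets it on a plain TCP socket using only the standard library. How brozzler applies the option to its websocket connection is not shown in this log, and `make_low_latency_socket` is a hypothetical name.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value turns TCP_NODELAY on.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The option matters most for chatty protocols that exchange many small request and response frames, where batching delays add up.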
Noah Levitt
0faeaab3ac fix attempt for deadlock-ish situation
see https://github.com/internetarchive/brozzler/issues/91
2018-02-13 17:09:28 -08:00
Noah Levitt
fc000ff515 bump dev version after PR merge 2018-02-06 12:14:53 -08:00
Noah Levitt
95cbfa96e2 back to dev version number 2018-02-02 16:54:29 -08:00
Noah Levitt
2a0ad6d0de commit for beta release 2018-02-02 16:52:42 -08:00
Noah Levitt
8505720c41 fix tests 2018-02-02 15:11:26 -08:00
Noah Levitt
5331aca33f update brozzler-easy for current warcprox api 2018-02-02 14:28:46 -08:00
Noah Levitt
ba8d5a3740 fix needs_browsing check
correctly handle relative url "location" response header
2018-01-26 11:00:46 -08:00
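
A `Location` response header may carry a relative URL, which has to be resolved against the URL that was requested before it can be compared or fetched. The commit does not show its fix, so the following is only a sketch of that resolution step using the standard library. `redirect_target` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def redirect_target(request_url, location_header):
    """Resolve a Location header value against the URL that was requested."""
    return urljoin(request_url, location_header)


print(redirect_target("http://example.com/a/b", "/login"))
# http://example.com/login
print(redirect_target("http://example.com/a/b", "https://other.example/x"))
# https://other.example/x
```

Absolute values pass through unchanged, so the same call handles both kinds of header.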
Noah Levitt
67d5a0e671 increase timeout waiting for screenshot
because we are seeing timeouts on moderately busy machines
2018-01-26 10:19:23 -08:00
Barbara Miller
455014a631 Merge branch 'ARI-5294' into qa 2018-01-23 11:47:57 -08:00
Noah Levitt
c934759852 pass canonicalized url to youtube-dl
avoids this kind of error:
wbgrp-svc294 2018-01-19 21:04:43,973 648 ERROR BrozzlingThread:39295 youtube_dl.to_stderr(YoutubeDL.py:514) ERROR: Unable to download webpage: <urlopen error no host given> (caused by URLError('no host given',))
wbgrp-svc294 2018-01-19 21:04:43,973 648 ERROR BrozzlingThread:39295 root.brozzle_site(worker.py:521) proxy error (site.proxy=wbgrp-svc400.us.archive.org:8002), will try to choose a healthy instance next time site is brozzled: youtube-dl hit apparent proxy error from https:/www.laphil.com/press1718
2018-01-22 14:52:54 -08:00
Noah Levitt
4ddd76f542 pass canonicalized url to youtube-dl
avoids this kind of error:
wbgrp-svc294 2018-01-19 21:04:43,973 648 ERROR BrozzlingThread:39295 youtube_dl.to_stderr(YoutubeDL.py:514) ERROR: Unable to download webpage: <urlopen error no host given> (caused by URLError('no host given',))
wbgrp-svc294 2018-01-19 21:04:43,973 648 ERROR BrozzlingThread:39295 root.brozzle_site(worker.py:521) proxy error (site.proxy=wbgrp-svc400.us.archive.org:8002), will try to choose a healthy instance next time site is brozzled: youtube-dl hit apparent proxy error from https:/www.laphil.com/press1718
2018-01-22 12:47:26 -08:00
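
The "no host given" error in the two commits above comes from a URL with a single slash after the scheme (`https:/www.laphil.com/press1718`). A strict parser sees no host there, while browser-style (WHATWG) parsing treats it as `https://www.laphil.com/press1718`. A traceback elsewhere in this log shows brozzler passing `str(urlcanon.whatwg(page.url))` to youtube-dl. The sketch below does not use urlcanon. It is a simplified, hypothetical stand-in that repairs only the slashes after the scheme, to illustrate why canonicalizing first avoids the error.

```python
import re
from urllib.parse import urlsplit

# Schemes for which browsers expect an authority (host) component.
SPECIAL_SCHEMES = ("http", "https", "ftp", "ws", "wss")


def normalize_authority_slashes(url):
    """Rewrite 'scheme:/host' or 'scheme:///host' to 'scheme://host'."""
    match = re.match(r"^([A-Za-z][A-Za-z0-9+.-]*):[/\\]*(.*)$", url, re.DOTALL)
    if not match or match.group(1).lower() not in SPECIAL_SCHEMES:
        return url  # leave anything else untouched
    return "%s://%s" % (match.group(1).lower(), match.group(2))


raw = "https:/www.laphil.com/press1718"
print(repr(urlsplit(raw).netloc))                         # ''
print(urlsplit(normalize_authority_slashes(raw)).netloc)  # www.laphil.com
```

A real canonicalizer does much more (percent-encoding, host normalization, dot-segment removal), so this helper should be read as an illustration of a single rule, not a replacement for one.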
Noah Levitt
c22e81341a bump version after pull request merge 2018-01-19 15:02:55 -08:00
Noah Levitt
503771d653 set a timeout on warcprox_write_record request 2017-12-27 15:52:55 -08:00
Noah Levitt
cc6297ef60 wait for ack from browser setting request headers
guessing this might fix the issue where some requests are missing the
warcprox-meta header, which results in their being written to the wrong
warc
2017-12-27 14:43:26 -08:00
Noah Levitt
1dea1f3f93 use Accept-Encoding: gzip instead of identity
fixes twitter scrolling, which had been giving "Loading seems to be
taking a while." error message
2017-12-27 14:22:24 -08:00
Noah Levitt
daecb4f59e fix brozzler-list-sites --site=SITE_ID 2017-12-21 17:16:41 -08:00
Noah Levitt
7ff99266ea quiet down the logging 2017-12-15 15:57:36 -08:00
Noah Levitt
df6615cc2c avoid rethinkdb.errors.ReqlDriverError: Query size 2017-12-15 15:55:10 -08:00
Noah Levitt
196cd2c5eb will this fix the travis build? 2017-11-08 17:41:39 -08:00
Noah Levitt
d40390f938 cryptography lib version 2.1.1 is causing problems 2017-10-16 10:52:09 -07:00