Commit graph

190 commits

Author SHA1 Message Date
Barbara Miller
cbd6f0f90a Merge branch 'insta18q4' into qa 2018-12-13 17:29:36 -08:00
Noah Levitt
b577fe3c36 log browser uncaught exceptions at debug level
didn't realize these weren't showing up as console messages
2018-12-13 15:45:35 -08:00
Barbara Miller
b204e9aec1 Merge branch 'service-worker' into qa 2018-11-27 12:58:47 -08:00
Noah Levitt
f63947cfe9 fetch service worker script with proper headers 2018-11-27 12:35:33 -08:00
Barbara Miller
e2b2542d4a handle http auth (#138)
abort brozzling on interstitial (auth dialog)

because we have no other recourse at this point. waiting on Network.requestIntercepted auth challenge support. (didn't work in our latest testing)
https://chromedevtools.github.io/devtools-protocol/tot/Network#type-AuthChallengeResponse
2018-11-16 15:10:30 -08:00
Barbara Miller
60cfd684b2 Merge branch 'pageInterstitialShown' into qa 2018-09-25 10:30:02 -07:00
Barbara Miller
156ec0caa1 tidier, better exception handling? 2018-09-21 17:46:19 -07:00
Barbara Miller
86193a525b raise exception PageInterstitialShown 2018-09-20 17:22:52 -07:00
Barbara Miller
0e867102a9 Browsing Exception for Page.interstitialShown 2018-09-20 15:41:31 -07:00
Neil Minton
3c7fdeae2c Merge branch 'ari-5777' into qa 2018-09-12 12:07:45 -04:00
Neil Minton
b5c213cef9 Extract srcset values for use in crawling. 2018-09-12 12:04:47 -04:00
Noah Levitt
c4fdbe578d Merge branch 'master' into qa
* master:
  oops, back to dev version number
  wait 20 seconds to claim sites if none were avail-
  tweak logging
  why did those tests fail??? (#117)
  Add screenshots
  Add screenshots
  back to dev version
  1.4 for pypi
  explain --warcprox-auto briefly
  vagrant readme fixes (thanks funkyfuture)
  update cryptography dep version
2018-09-04 10:54:26 -07:00
Noah Levitt
d0f5cd7168 tweak logging 2018-08-31 15:23:48 -07:00
Barbara Miller
98c21d9d1f Merge branch 'ARI-5689' into qa 2018-07-03 14:50:51 -07:00
Barbara Miller
687d51de20 Revert "skip login for fb groups"
This reverts commit 5e1c86421e.
2018-07-03 14:49:58 -07:00
Barbara Miller
5e1c86421e skip login for fb groups 2018-06-27 16:51:38 -07:00
Barbara Miller
1fffaa9eee Merge branch 'ARI-5689' into qa 2018-06-27 16:47:46 -07:00
Barbara Miller
76ec00c930 skip login for fb groups 2018-06-27 16:47:32 -07:00
Noah Levitt
5a0700b297 Merge branch 'max-claimed-sites' into qa
* max-claimed-sites:
  new job setting max_claimed_sites
  move time limit enforcement
  Invalid syntax in WebsockReceiverThread._javascript_dialog_open
  Make Browser._wait_for sleep time a variable
  Send more compact JSON to browser
  Remove google safebrowsing flags
  try to get chromium 64? (#92)
  Add chromium CLI flags to improve capture performance
2018-03-01 17:20:17 -08:00
Noah Levitt
b438cdd33e
Merge pull request #94 from vbanos/json-compact
Send more compact JSON to browser
2018-02-21 09:53:16 -08:00
Vangelis Banos
646faa8ab0 Invalid syntax in WebsockReceiverThread._javascript_dialog_open
Fix `)` position
2018-02-21 07:34:36 +00:00
Vangelis Banos
e2128b42f0 Make Browser._wait_for sleep time a variable
Useful to be able to tweak this value in other apps using `Browser`.
2018-02-18 23:08:51 +00:00
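A minimal sketch of a polling wait whose sleep interval is a parameter, in the spirit of the commit above. The names `wait_for`, `WaitTimeout`, and `sleep_time` are hypothetical and are not taken from Brozzler's source; only the general shape (a callback polled until a timeout) is suggested by the tracebacks in this log.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for a browsing timeout exception."""


def wait_for(callback, timeout=None, sleep_time=0.5):
    """Poll callback() until it returns a truthy value.

    sleep_time is the pause between polls; exposing it lets callers
    trade responsiveness against CPU use.
    """
    start = time.monotonic()
    while True:
        if callback():
            return
        elapsed = time.monotonic() - start
        if timeout is not None and elapsed > timeout:
            raise WaitTimeout(
                "timed out after %.1fs waiting for: %s" % (elapsed, callback))
        time.sleep(sleep_time)
```

A caller that needs a tighter loop could pass, for example, `sleep_time=0.01`, while a long-running wait could use a larger value.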
Vangelis Banos
d6c707d941 Send more compact JSON to browser
Use JSON separators without spaces to reduce json size.
It's already used elsewhere in Brozzler but not here.
2018-02-18 19:03:36 +00:00
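The effect of the separators change can be illustrated with the standard `json` module. The message below is a made-up example shaped like a browser command; it is an assumption for illustration, not a message copied from Brozzler.

```python
import json

# Hypothetical command message; the field values are illustrative only.
msg = {"id": 1, "method": "Page.navigate",
       "params": {"url": "http://example.com/"}}

# Default separators are (", ", ": "), which add a space after each
# comma and colon.
default_json = json.dumps(msg)

# Separators without spaces produce the most compact encoding.
compact_json = json.dumps(msg, separators=(",", ":"))
```

Both strings decode to the same object; the compact form simply drops the whitespace between tokens.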
Noah Levitt
3d12daea06 Merge branch 'master' into qa
* master:
  bump up timeout waiting for websocket connection
  try taking screenshot 3 times, proceed on failure
2018-02-14 12:33:53 -08:00
Noah Levitt
f8c41c5e8d bump up timeout waiting for websocket connection
We've been seeing some of this:

2018-02-14 20:16:44,011 17816 CRITICAL BrozzlingThread:36444 brozzler.worker.BrozzlerWorker.brozzle_site(worker.py:559) unexpected exception
Traceback (most recent call last):
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 528, in brozzle_site
    enable_youtube_dl=not self._skip_youtube_dl)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 385, in brozzle_page
    on_request)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 447, in _browse_page
    cookie_db=site.get('cookie_db'))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 338, in start
    self._wait_for(lambda: self.websock_thread.is_open, timeout=10)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 311, in _wait_for
    elapsed, callback))
brozzler.browser.BrowsingTimeout: timed out after 11.1s waiting for: <function Browser.start.<locals>.<lambda> at 0x7fb2dc772bd8>

Mostly at startup. Now that brozzler claims sites in batches for
brozzling, we have situations where we start up a whole bunch of
browsers at the same time. That's probably why in some cases they are
slow to establish the websocket connection.
2018-02-14 12:29:51 -08:00
Noah Levitt
b38fbdcda6 try taking screenshot 3 times, proceed on failure
We've been seeing a lot of this:

2018-02-14 20:06:01,472 13286 CRITICAL BrozzlingThread:44789 brozzler.worker.BrozzlerWorker.brozzle_site(worker.py:559) unexpected exception
Traceback (most recent call last):
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 528, in brozzle_site
    enable_youtube_dl=not self._skip_youtube_dl)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 385, in brozzle_page
    on_request)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 459, in _browse_page
    behavior_timeout=self._behavior_timeout)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 463, in browse_page
    jpeg_bytes = self.screenshot()
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 565, in screenshot
    timeout=timeout)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 311, in _wait_for
    elapsed, callback))
brozzler.browser.BrowsingTimeout: timed out after 90.5s waiting for: <function Browser.screenshot.<locals>.<lambda> at 0x7f5ab0076a68>

Browser bug, maybe? To work around it, reduce timeout to 45 seconds, try
getting screenshot 3 times, and if it fails proceed anyway, don't queue
the page for recrawling.
2018-02-14 12:15:48 -08:00
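The workaround described above (shorter timeout, three attempts, proceed on failure) could be sketched as follows. This is an assumption-laden illustration, not the actual change: `screenshot_with_retries` and `take_screenshot` are hypothetical names, and the built-in `TimeoutError` stands in for `brozzler.browser.BrowsingTimeout`.

```python
import logging


def screenshot_with_retries(take_screenshot, attempts=3, timeout=45):
    """Call take_screenshot up to `attempts` times.

    Returns the screenshot bytes on success, or None if every attempt
    times out, so the caller can proceed without a screenshot instead
    of failing the whole page.
    """
    for attempt in range(1, attempts + 1):
        try:
            return take_screenshot(timeout=timeout)
        except TimeoutError:
            # Hypothetical stand-in for a browser-specific timeout.
            logging.warning(
                "screenshot attempt %d of %d timed out", attempt, attempts)
    return None
```

Returning `None` rather than re-raising is what lets the page be treated as done, matching the "don't queue the page for recrawling" intent.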
Noah Levitt
df0717d072 Merge branch 'master' into qa
* master:
  fix attempt for deadlock-ish situation
  fix unclosed file warnings when running python in debug mode
  give vagrant vm a name in virtualbox
  add note to readme about browser version
  check browser version at startup
  Reinstate logging
  Fix typo and block legacy google-analytics.com/ga.js
  Use Network.setBlockedUrls instead of Debugger to block URLs
  bump dev version after PR merge
  back to dev version number
  commit for beta release
  this should fix travis build?
  fix tests
  update brozzler-easy for current warcprox api
  claim sites to brozzle in batches to reduce contention over sites table
  lengthen site session brozzling time to 15 minutes
  fix needs_browsing check
  new test test_needs_browsing
  increase timeout waiting for screenshot
  Use TCP_NODELAY in websocket connection to improve performance
2018-02-13 17:10:10 -08:00
Vangelis Banos
3b800b583f Reinstate logging 2018-02-06 14:48:30 -08:00
Vangelis Banos
e48ad46a63 Fix typo and block legacy google-analytics.com/ga.js 2018-02-06 14:47:01 -08:00
Vangelis Banos
f54d62ea40 Use Network.setBlockedUrls instead of Debugger to block URLs 2018-02-06 14:47:01 -08:00
Noah Levitt
7962444f09 claim sites to brozzle in batches to reduce contention over sites table 2018-02-02 13:56:24 -08:00
Barbara Miller
54f243cd92 Merge branch 'ARI-5379' into qa 2018-01-31 13:27:43 -08:00
Barbara Miller
647d7e70b3 use custom timeout always 2018-01-31 13:27:25 -08:00
Barbara Miller
d57c45a3be Merge branch 'ARI-5379' into qa 2018-01-31 13:22:20 -08:00
Barbara Miller
96e499887b check for None 2018-01-31 13:21:59 -08:00
Barbara Miller
011cdde7ce Merge branch 'ARI-5379' into qa 2018-01-29 15:27:07 -08:00
Barbara Miller
92c137f402 behavior_timeout_custom (not timeout_from_behavior) 2018-01-29 15:26:23 -08:00
Barbara Miller
d3088c6418 Merge branch 'ARI-5379' into qa 2018-01-26 17:26:25 -08:00
Barbara Miller
70af801da1 configurable behavior timeout 2018-01-26 17:23:51 -08:00
Noah Levitt
4d37f88bcb
Merge pull request #75 from galgeek/pageInterstitialShown
log Page.interstitialShown
2018-01-26 16:18:22 -08:00
Noah Levitt
0e17205e17
Merge pull request #82 from vbanos/websock-tcp-nodely
Use TCP_NODELAY in websocket connection to improve performance
2018-01-26 12:14:44 -08:00
Noah Levitt
67d5a0e671 increase timeout waiting for screenshot
because we are seeing timeouts on moderately busy machines
2018-01-26 10:19:23 -08:00
Vangelis Banos
3b0d1203c3 Use TCP_NODELAY in websocket connection to improve performance 2018-01-25 22:39:32 +00:00
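For context, TCP_NODELAY disables Nagle's algorithm so that small writes, such as short websocket frames, are sent immediately instead of being coalesced. A minimal sketch with the standard `socket` module follows; the helper name `make_nodelay_socket` is hypothetical, and how the option is passed to the websocket library in Brozzler is not shown here.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value turns TCP_NODELAY on for this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The trade-off is more, smaller packets on the wire in exchange for lower latency per message.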
Barbara Miller
2773c4ab6f Merge branch 'brofurb' into qa 2018-01-15 19:52:38 -08:00
Barbara Miller
5901434c2b Merge branch 'pageInterstitialShown' into qa 2018-01-08 08:27:45 -08:00
Barbara Miller
37c5720729 log Page.interstitialShown 2018-01-08 08:26:44 -08:00
Vangelis Banos
dacfba330c Configurable JS templates location
Brozzler hard-codes the locations of its JS templates logic as
``brozzler/behaviors.yaml`` and ``brozzler/js-templates/``. With this change,
the optional ``behaviors_dir`` parameter of ``browser.browse_page`` sets a
custom location, so any JS behaviors can be used.
2018-01-04 17:37:02 +00:00
Noah Levitt
cc6297ef60 wait for ack from browser setting request headers
guessing this might fix the issue where some requests are missing the
warcprox-meta header, which results in their being written to the wrong
warc
2017-12-27 14:43:26 -08:00
Noah Levitt
1dea1f3f93 use Accept-Encoding: gzip instead of identity
fixes twitter scrolling, which had been giving "Loading seems to be
taking a while." error message
2017-12-27 14:22:24 -08:00
Barbara Miller
f77144e2dc Merge branch 'behavior-refactor' into qa 2017-10-04 20:23:57 -07:00