950 Commits

Author SHA1 Message Date
Noah Levitt
b38fbdcda6 try taking screenshot 3 times, proceed on failure
We've been seeing a lot of this:

2018-02-14 20:06:01,472 13286 CRITICAL BrozzlingThread:44789 brozzler.worker.BrozzlerWorker.brozzle_site(worker.py:559) unexpected exception
Traceback (most recent call last):
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 528, in brozzle_site
    enable_youtube_dl=not self._skip_youtube_dl)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 385, in brozzle_page
    on_request)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 459, in _browse_page
    behavior_timeout=self._behavior_timeout)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 463, in browse_page
    jpeg_bytes = self.screenshot()
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 565, in screenshot
    timeout=timeout)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 311, in _wait_for
    elapsed, callback))
brozzler.browser.BrowsingTimeout: timed out after 90.5s waiting for: <function Browser.screenshot.<locals>.<lambda> at 0x7f5ab0076a68>

Possibly a browser bug. To work around it, reduce the screenshot timeout
to 45 seconds and try up to 3 times. If every attempt fails, proceed
anyway instead of queueing the page for recrawling.
2018-02-14 12:15:48 -08:00
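The retry-then-proceed workaround described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `screenshot_with_retries` and the `take_screenshot` callable are hypothetical names, and the local `BrowsingTimeout` class only stands in for `brozzler.browser.BrowsingTimeout`.

```python
import logging


class BrowsingTimeout(Exception):
    """Stand-in for brozzler.browser.BrowsingTimeout (assumption for this sketch)."""


def screenshot_with_retries(take_screenshot, attempts=3, timeout=45):
    """Call take_screenshot(timeout=...) up to `attempts` times.

    Returns the screenshot bytes from the first successful attempt, or None
    if every attempt timed out, so the caller can carry on without one.
    """
    for attempt in range(1, attempts + 1):
        try:
            return take_screenshot(timeout=timeout)
        except BrowsingTimeout as e:
            logging.warning(
                "screenshot attempt %d of %d failed: %s", attempt, attempts, e)
    logging.error("giving up on screenshot after %d attempts", attempts)
    return None
```

Returning None rather than re-raising is what lets the caller finish the page without a screenshot instead of treating the whole page as failed.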
Noah Levitt
0faeaab3ac fix attempt for deadlock-ish situation
see https://github.com/internetarchive/brozzler/issues/91
2018-02-13 17:09:28 -08:00
Noah Levitt
6086bfe4b4 fix unclosed file warnings when running python in debug mode 2018-02-13 17:07:40 -08:00
Noah Levitt
56e01b9078 give vagrant vm a name in virtualbox 2018-02-13 17:05:45 -08:00
Noah Levitt
e13b458bb9
Merge pull request #89 from internetarchive/ARI-5517
umbraBehavior for thejewishnews.com
2018-02-12 16:13:01 -08:00
Barbara Miller
88076595ba comment tweak 2018-02-12 10:22:41 -08:00
Barbara Miller
668e85be9e umbraBehavior for thejewishnews.com 2018-02-08 13:05:18 -08:00
Noah Levitt
057284c2a7
Merge pull request #88 from nlevitt/block-urls
Block google analytics URLs using new Network.setBlockedURLs API
2018-02-06 16:42:24 -08:00
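For context, `Network.setBlockedURLs` is a Chrome DevTools Protocol method that takes a list of URL patterns. The sketch below only builds the JSON message that would be sent over the devtools websocket. The helper name `blocked_urls_message` is hypothetical, and the `analytics.js` pattern is an assumption; only `google-analytics.com/ga.js` is named in this log.

```python
import json


def blocked_urls_message(msg_id, url_patterns):
    """Build a devtools protocol message asking the browser to block URLs.

    `url_patterns` may contain '*' wildcards, as accepted by
    Network.setBlockedURLs.
    """
    return json.dumps({
        "id": msg_id,
        "method": "Network.setBlockedURLs",
        "params": {"urls": list(url_patterns)},
    })


# Example patterns; the first one is an assumption, not taken from the log.
msg = blocked_urls_message(1, [
    "*google-analytics.com/analytics.js",
    "*google-analytics.com/ga.js",
])
```

Sending the message and handling the response depend on the websocket client in use and are left out of this sketch.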
Noah Levitt
791f77d8a6 add note to readme about browser version 2018-02-06 16:00:14 -08:00
Noah Levitt
506ab0ccc2 check browser version at startup 2018-02-06 15:56:50 -08:00
Vangelis Banos
3b800b583f Reinstate logging 2018-02-06 14:48:30 -08:00
Vangelis Banos
e48ad46a63 Fix typo and block legacy google-analytics.com/ga.js 2018-02-06 14:47:01 -08:00
Vangelis Banos
f54d62ea40 Use Network.setBlockedUrls instead of Debugger to block URLs 2018-02-06 14:47:01 -08:00
Noah Levitt
fc000ff515 bump dev version after PR merge 2018-02-06 12:14:53 -08:00
jkafader
07b961efaf
Merge pull request #85 from nlevitt/claim-batches
WIP: claim sites to brozzle in batches to reduce contention over sites table
2018-02-06 12:01:30 -08:00
Noah Levitt
9a0941f1fd Merge branch 'master' into claim-batches
* master:
  back to dev version number
  commit for beta release
  this should fix travis build?
  fix tests
  update brozzler-easy for current warcprox api
  simpleclicks for minutes PDF
2018-02-06 11:46:15 -08:00
Noah Levitt
d36d574e58
Merge pull request #87 from internetarchive/ARI-5294
capture citymedfordwi.civicweb.net minutes PDFs
2018-02-05 13:19:11 -08:00
Noah Levitt
95cbfa96e2 back to dev version number 2018-02-02 16:54:29 -08:00
Noah Levitt
2a0ad6d0de commit for beta release 1.1b12 2018-02-02 16:52:42 -08:00
Noah Levitt
9ba58de292 this should fix travis build? 2018-02-02 16:25:56 -08:00
Noah Levitt
8505720c41 fix tests 2018-02-02 15:11:26 -08:00
Noah Levitt
5331aca33f update brozzler-easy for current warcprox api 2018-02-02 14:28:46 -08:00
Noah Levitt
7962444f09 claim sites to brozzle in batches to reduce contention over sites table 2018-02-02 13:56:24 -08:00
jkafader
a125434563
Merge pull request #83 from nlevitt/fifteen-minutes
lengthen site session brozzling time to 15 minutes
2018-01-29 15:59:16 -08:00
Noah Levitt
64211475c0 lengthen site session brozzling time to 15 minutes
This should reduce contention over the "sites" table, which should help
keep more available browsers busy across the cluster.
2018-01-29 15:34:54 -08:00
Noah Levitt
4d37f88bcb
Merge pull request #75 from galgeek/pageInterstitialShown
log Page.interstitialShown
2018-01-26 16:18:22 -08:00
Noah Levitt
0e17205e17
Merge pull request #82 from vbanos/websock-tcp-nodely
Use TCP_NODELAY in websocket connection to improve performance
2018-01-26 12:14:44 -08:00
Noah Levitt
ba8d5a3740 fix needs_browsing check
correctly handle relative url "location" response header
2018-01-26 11:00:46 -08:00
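A relative "location" header has to be resolved against the URL of the request that produced it. The sketch below shows one way to do that with the standard library. `resolve_location` is a hypothetical helper name, and the actual fix in `needs_browsing` may differ.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Resolve a possibly relative Location header against the request URL."""
    return urljoin(request_url, location)


# Absolute-path redirect: keeps scheme and host, replaces the path.
resolve_location("http://example.com/a/b", "/c")   # http://example.com/c
# Relative-path redirect: resolved against the request URL's directory.
resolve_location("http://example.com/a/b", "c")    # http://example.com/a/c
```

An absolute "location" value passes through `urljoin` unchanged, so the same call handles both cases.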
Noah Levitt
bf5401283e new test test_needs_browsing
currently exposes bug in resolving "location" response header
2018-01-26 10:59:18 -08:00
Noah Levitt
67d5a0e671 increase timeout waiting for screenshot
because we are seeing timeouts on moderately busy machines
2018-01-26 10:19:23 -08:00
Vangelis Banos
3b0d1203c3 Use TCP_NODELAY in websocket connection to improve performance 2018-01-25 22:39:32 +00:00
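TCP_NODELAY disables Nagle's algorithm, so small writes such as individual devtools protocol messages are sent immediately instead of being held back to coalesce with later data. The sketch below sets the option on a plain socket. `make_nodelay_socket` is a hypothetical name, and how brozzler passes the option to its websocket library is not shown in this log.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value turns TCP_NODELAY on for this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

Websocket client libraries typically expose a way to pass the same `(level, optname, value)` tuple through to the underlying socket.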
Barbara Miller
bc21b325d7 simpleclicks for minutes PDF 2018-01-23 11:43:35 -08:00
Noah Levitt
c934759852 pass canonicalized url to youtube-dl
avoids this kind of error:
wbgrp-svc294 2018-01-19 21:04:43,973 648 ERROR BrozzlingThread:39295 youtube_dl.to_stderr(YoutubeDL.py:514) ERROR: Unable to download webpage: <urlopen error no host given> (caused by URLError('no host given',))
wbgrp-svc294 2018-01-19 21:04:43,973 648 ERROR BrozzlingThread:39295 root.brozzle_site(worker.py:521) proxy error (site.proxy=wbgrp-svc400.us.archive.org:8002), will try to choose a healthy instance next time site is brozzled: youtube-dl hit apparent proxy error from https:/www.laphil.com/press1718
2018-01-22 14:52:54 -08:00
Noah Levitt
c22e81341a bump version after pull request merge 2018-01-19 15:02:55 -08:00
Noah Levitt
7f78c335e1
--warcprox-auto distribute assigned sites evenly (#78)
--warcprox-auto distribute assigned sites evenly

When running with --warcprox-auto, choose the instance of warcprox with
the fewest assigned sites, instead of the lowest load in the service
registry. In practice we often start brozzling a whole bunch of sites at
approximately the same time, and because it takes time for that to
affect the "load" reported by warcprox instances, sites end up being
distributed very unevenly.
2018-01-19 14:54:33 -08:00
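The selection rule described in this commit amounts to counting how many sites are already assigned to each warcprox instance and picking the minimum. The sketch below works on in-memory lists; `choose_warcprox` and its arguments are hypothetical, and the real code reads this information from the service registry and the sites table.

```python
from collections import Counter


def choose_warcprox(instances, assigned_proxies):
    """Pick the warcprox instance with the fewest assigned sites.

    `instances` lists the available "host:port" strings; `assigned_proxies`
    holds one entry per active site, naming the proxy that site is using.
    Ties go to the instance listed first.
    """
    counts = Counter(assigned_proxies)
    # Counter returns 0 for instances with no assigned sites.
    return min(instances, key=lambda inst: counts[inst])
```

Because the count changes as soon as a site is assigned, a burst of new sites spreads across instances even before the reported load catches up.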
Noah Levitt
9e80a3b0d3
Merge pull request #71 from internetarchive/brofurb
JS class-based generalized behavior
2018-01-18 12:23:18 -08:00
Barbara Miller
2f3f258856 update copyright dates 2018-01-15 19:39:41 -08:00
Barbara Miller
e52ba4c8ef rm default.js 2018-01-15 19:38:15 -08:00
Barbara Miller
93ceeacfd7 rm obsolete 2018-01-15 19:36:32 -08:00
Barbara Miller
2ce9cf41a1 resolve conflicts 2018-01-15 19:34:47 -08:00
Barbara Miller
9aa670ece5 simple multi-selector test with window.scroll 2018-01-15 17:58:10 -08:00
Barbara Miller
7dccc809d0 use shorter interval 2018-01-15 17:58:10 -08:00
Barbara Miller
06a2b5f817 tidied 2018-01-15 17:58:10 -08:00
Barbara Miller
b979372e85 update copyright 2018-01-15 17:58:10 -08:00
Barbara Miller
93a81a4a37 qa simpleIntervalFunc for now 2018-01-15 17:58:10 -08:00
Barbara Miller
b589324a05 add simplerIntervalFunc... 2018-01-15 17:58:10 -08:00
Barbara Miller
f78e1ff710 minor edits 2018-01-15 17:58:10 -08:00
Barbara Miller
d0203ff9eb tweaks post-troubleshooting ARI-5241 2018-01-15 17:58:10 -08:00
Barbara Miller
dd3b041eec class-based generalized behavior 2018-01-15 17:58:10 -08:00
Barbara Miller
34fb4baf00 WIP: class-based generalized behavior 2018-01-15 17:58:10 -08:00