338 Commits

Author SHA1 Message Date
Noah Levitt
f9834ca77d bump after merge 2018-03-02 11:51:50 -08:00
Noah Levitt
f8c41c5e8d bump up timeout waiting for websocket connection
We've been seeing some of this:

2018-02-14 20:16:44,011 17816 CRITICAL BrozzlingThread:36444 brozzler.worker.BrozzlerWorker.brozzle_site(worker.py:559) unexpected exception
Traceback (most recent call last):
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 528, in brozzle_site
    enable_youtube_dl=not self._skip_youtube_dl)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 385, in brozzle_page
    on_request)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 447, in _browse_page
    cookie_db=site.get('cookie_db'))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 338, in start
    self._wait_for(lambda: self.websock_thread.is_open, timeout=10)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 311, in _wait_for
    elapsed, callback))
brozzler.browser.BrowsingTimeout: timed out after 11.1s waiting for: <function Browser.start.<locals>.<lambda> at 0x7fb2dc772bd8>

Mostly at startup. Now that brozzler claims sites in batches for
brozzling, we have situations where we start up a whole bunch of
browsers at the same time. That's probably why in some cases they are
slow to establish the websocket connection.
2018-02-14 12:29:51 -08:00
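
For illustration, the traceback above points at a polling helper that waits for a condition until a deadline passes. The following is a minimal sketch of that pattern, not brozzler's actual code: the names `wait_for` and `poll_interval` are hypothetical, and `BrowsingTimeout` is redefined locally so the block is self-contained.

```python
import time


class BrowsingTimeout(Exception):
    """Local stand-in for the exception named in the traceback."""


def wait_for(callback, timeout, poll_interval=0.05):
    """Poll callback() until it returns a truthy value or timeout seconds pass.

    Hypothetical helper; the real method may differ in its details.
    """
    start = time.monotonic()
    while True:
        if callback():
            return
        elapsed = time.monotonic() - start
        if elapsed > timeout:
            raise BrowsingTimeout(
                "timed out after %.1fs waiting for: %s" % (elapsed, callback))
        time.sleep(poll_interval)
```

Raising the `timeout` argument at the call site gives slow-starting browsers more time to open the websocket, at the cost of waiting longer before a genuinely stuck browser is reported.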
Noah Levitt
0faeaab3ac fix attempt for deadlock-ish situation
see https://github.com/internetarchive/brozzler/issues/91
2018-02-13 17:09:28 -08:00
Noah Levitt
fc000ff515 bump dev version after PR merge 2018-02-06 12:14:53 -08:00
Noah Levitt
95cbfa96e2 back to dev version number 2018-02-02 16:54:29 -08:00
Noah Levitt
2a0ad6d0de commit for beta release 2018-02-02 16:52:42 -08:00
Noah Levitt
8505720c41 fix tests 2018-02-02 15:11:26 -08:00
Noah Levitt
5331aca33f update brozzler-easy for current warcprox api 2018-02-02 14:28:46 -08:00
Noah Levitt
ba8d5a3740 fix needs_browsing check
correctly handle relative url "location" response header
2018-01-26 11:00:46 -08:00
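
A relative "location" header has to be resolved against the URL that was requested before it can be compared with anything. A minimal sketch using only the standard library follows; the function name `resolve_location` is hypothetical and the actual needs_browsing logic is not shown here.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Resolve a possibly relative Location header against the request URL."""
    return urljoin(request_url, location)
```

For example, `resolve_location("http://example.com/a/b", "/c")` yields `http://example.com/c`, while an absolute `location` value is returned unchanged.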
Noah Levitt
67d5a0e671 increase timeout waiting for screenshot
because we are seeing timeouts on moderately busy machines
2018-01-26 10:19:23 -08:00
Noah Levitt
c934759852 pass canonicalized url to youtube-dl
avoids this kind of error:
wbgrp-svc294 2018-01-19 21:04:43,973 648 ERROR BrozzlingThread:39295 youtube_dl.to_stderr(YoutubeDL.py:514) ERROR: Unable to download webpage: <urlopen error no host given> (caused by URLError('no host given',))
wbgrp-svc294 2018-01-19 21:04:43,973 648 ERROR BrozzlingThread:39295 root.brozzle_site(worker.py:521) proxy error (site.proxy=wbgrp-svc400.us.archive.org:8002), will try to choose a healthy instance next time site is brozzled: youtube-dl hit apparent proxy error from https:/www.laphil.com/press1718
2018-01-22 14:52:54 -08:00
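
The URL in the log above has a single slash after the scheme. The standard library parses such a URL as having no host, which matches the "no host given" error. The sketch below only demonstrates that diagnosis; `has_host` is a hypothetical helper, and it is an assumption that canonicalization repairs this form of URL.

```python
from urllib.parse import urlsplit


def has_host(url):
    """Return True if the URL parses with a non-empty hostname."""
    return bool(urlsplit(url).hostname)
```

With this helper, `has_host("https:/www.laphil.com/press1718")` is false because everything after the scheme is treated as the path, whereas the double-slash form parses with host `www.laphil.com`.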
Noah Levitt
c22e81341a bump version after pull request merge 2018-01-19 15:02:55 -08:00
Noah Levitt
503771d653 set a timeout on warcprox_write_record request 2017-12-27 15:52:55 -08:00
Noah Levitt
cc6297ef60 wait for ack from browser setting request headers
guessing this might fix the issue where some requests are missing the
warcprox-meta header, which results in their being written to the wrong
warc
2017-12-27 14:43:26 -08:00
Noah Levitt
1dea1f3f93 use Accept-Encoding: gzip instead of identity
fixes twitter scrolling, which had been giving "Loading seems to be
taking a while." error message
2017-12-27 14:22:24 -08:00
Noah Levitt
daecb4f59e fix brozzler-list-sites --site=SITE_ID 2017-12-21 17:16:41 -08:00
Noah Levitt
7ff99266ea quiet down the logging 2017-12-15 15:57:36 -08:00
Noah Levitt
df6615cc2c avoid rethinkdb.errors.ReqlDriverError: Query size 2017-12-15 15:55:10 -08:00
Noah Levitt
196cd2c5eb will this fix the travis build? 2017-11-08 17:41:39 -08:00
Noah Levitt
d40390f938 cryptography lib version 2.1.1 is causing problems 2017-10-16 10:52:09 -07:00
Noah Levitt
ec847e48bc fix problem where each hashtag visited causes a page load if page url redirects 2017-09-27 14:11:20 -07:00
Noah Levitt
384c877e9a new test exposing problem where each hashtag visited causes a page load, if page redirects 2017-09-27 14:08:28 -07:00
Noah Levitt
bf250194b4 bump dev version number after some PR merges 2017-08-01 12:04:56 -07:00
Noah Levitt
c77f4e4249 dev version bump 2017-07-06 17:19:53 -07:00
Noah Levitt
051e299a80 fix "local variable 'start' referenced before assignment" 2017-06-27 11:08:51 -07:00
Noah Levitt
b9640b8a30 enforce time limits based on the time a worker has spent actively brozzling the site, to avoid stopping crawls that have had little chance to crawl because the cluster is busy 2017-06-26 18:00:32 -07:00
Noah Levitt
8ef7972ace make sure youtube-dl progress thing can't derail youtube-dl operation 2017-06-26 16:10:40 -07:00
Noah Levitt
caee2787b0 have brozzler-list-sites --active use the index 2017-06-24 01:05:19 +00:00
Noah Levitt
35babeb01b make youtube-dl prefer unsegmented videos 2017-06-23 15:19:30 -07:00
Noah Levitt
405c5725e4 restore reclamation of orphaned, claimed sites, and heartbeat site.last_claimed every 7 minutes during youtube-dl processing, to prevent another brozzler-worker claiming the site 2017-06-23 13:50:49 -07:00
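
A periodic heartbeat of this kind can be sketched with a background thread that calls a function at a fixed interval until it is told to stop. This is a generic illustration, not brozzler's implementation: the `Heartbeat` class and its `beat` callback are hypothetical, and in the scenario described the callback would refresh `site.last_claimed` with an interval of 7 minutes.

```python
import threading


class Heartbeat:
    """Call beat() every `interval` seconds while inside the with block."""

    def __init__(self, beat, interval):
        self._beat = beat
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout, True once stop is set.
        while not self._stop.wait(self._interval):
            self._beat()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        return False  # do not swallow exceptions from the with block
```

Wrapping the long-running download in `with Heartbeat(refresh_claim, 7 * 60):` (where `refresh_claim` is a placeholder) keeps the claim fresh and stops the heartbeat even if the download raises.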
Noah Levitt
6bae53e646 disable the re-claiming of sites that are marked claimed from more than an hour ago, because sometimes pages legitimately take longer than an hour to brozzle; working on a better solution to this issue 2017-06-19 11:21:02 -07:00
Noah Levitt
7ae22381ef bump version number for pull request 2017-06-12 15:42:49 -07:00
Noah Levitt
193ac43797 back to dev version number 2017-06-08 17:33:29 -07:00
Noah Levitt
44f74066cf 1.1b11 2017-06-08 17:30:24 -07:00
Noah Levitt
27040fd8b7 mini fix 2017-06-08 17:29:51 -07:00
Noah Levitt
02e1c88fac oops bump version 2017-06-07 13:08:23 -07:00
Noah Levitt
65adc11d95 oops, should have bumped version number after merging pull requests 2017-06-07 08:51:21 -07:00
Noah Levitt
f2227e6759 have travis-ci test against python 3.5 and 3.6 too 2017-05-26 13:28:00 -07:00
Noah Levitt
bdc0badec3 rewrite frontier.scope_and_schedule_outlinks() to use batch rethinkdb queries, because we have witnessed the method running for hours(!) 2017-05-26 13:24:14 -07:00
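
The general idea of replacing one query per outlink with one query per group of outlinks can be sketched with a small chunking helper. This is an assumption-laden illustration: `batches` is a hypothetical name, and the actual rethinkdb queries used by the frontier are not reproduced here.

```python
def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    items = list(items)
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

A caller would then issue a single lookup or insert per batch, so the number of database round trips grows with the number of batches instead of the number of outlinks.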
Noah Levitt
d904daea9c remove stray logging 2017-05-24 11:36:06 -07:00
Noah Levitt
ac543ee5b6 use "ttl" for updated doublethink svc reg api 2017-05-23 11:33:04 -07:00
Noah Levitt
89e7c8b079 fix exception from ReachedLimit.__repr__ when it has been instantiated implicitly and __init__ was not called 2017-05-16 15:47:18 -07:00
Noah Levitt
31dc6a2d97 improve thread_raise() so that the new tests pass
1. If thread is not currently accepting exceptions, queue it and raise if and
   when it does start accepting them. This fixes problem of thread_raise
   exceptions being ignored when raised just before the target thread starts
   accepting exceptions.
2. Avoid problems caused by raising multiple exceptions in the same
   thread in quick succession by ensuring that only one is actually raised for
   a given `with` block. This type of occurrence had been putting brozzler into
   a borked/frozen state.
2017-05-16 14:20:53 -07:00
Noah Levitt
d514eaec15 even more, better failing tests for thread_raise 2017-05-16 14:00:10 -07:00
Noah Levitt
d2525e2e87 failing test for forthcoming behavior of thread_raise 2017-05-15 16:20:20 -07:00
Noah Levitt
60c5a7c1c4 recognize ConnectionError (of which ConnectionResetError is a subclass) in _warcprox_write_record as a proxy error 2017-05-12 10:03:53 -07:00
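
Catching `ConnectionError` covers its subclasses as well, which is why a single check suffices. A minimal sketch follows; `is_proxy_error` is a hypothetical helper name and not part of brozzler.

```python
def is_proxy_error(exc):
    """Treat any ConnectionError (including its subclasses) as a proxy error."""
    return isinstance(exc, ConnectionError)
```

In Python 3, `ConnectionResetError`, `ConnectionRefusedError`, `ConnectionAbortedError` and `BrokenPipeError` all derive from `ConnectionError`, so each of them passes this check.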
Noah Levitt
b4bf17df9b do a better job of making sure to shut down the browser when brozzle-page is killed 2017-05-03 16:43:31 -07:00
Noah Levitt
9d4cbbf6eb handle another rethinkdb outage corner case 2017-05-01 14:12:43 -07:00
Noah Levitt
389db01458 run BrozzlerWorkerThread separately from MainThread to avoid SIGTERM/SIGINT raising an exception inside rethinkdb code or other sensitive code that BrozzlerWorker.run() calls 2017-05-01 13:46:19 -07:00
Noah Levitt
52433ade78 re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker 2017-05-01 13:00:04 -07:00