Noah Levitt
62bb540a11
lowercase readme.rst
2018-05-31 18:46:37 +00:00
Noah Levitt
8906037d82
bump dev version after PR #102
2018-05-16 17:33:52 -07:00
Noah Levitt
e90e7345a5
Merge pull request #102 from nlevitt/docs
...
complete job configuration documentation
2018-05-16 17:31:27 -07:00
Noah Levitt
ac735639ff
incorporate urlcanon fix
2018-05-16 14:41:49 -07:00
Noah Levitt
338d2e48f9
update warcprox dependency to include recent fixes
2018-05-16 14:26:51 -07:00
Noah Levitt
a8de9b70d1
handle new chrome cookie db schema
2018-05-15 11:41:02 -07:00
Noah Levitt
60f2b99cc0
doublethink had a bug fix
2018-05-14 15:38:28 -07:00
Noah Levitt
55701ae373
bump version number after merge
2018-03-08 16:49:28 -08:00
Noah Levitt
f9834ca77d
bump after merge
2018-03-02 11:51:50 -08:00
Noah Levitt
f8c41c5e8d
bump up timeout waiting for websocket connection
...
We've been seeing some of this:
2018-02-14 20:16:44,011 17816 CRITICAL BrozzlingThread:36444 brozzler.worker.BrozzlerWorker.brozzle_site(worker.py:559) unexpected exception
Traceback (most recent call last):
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 528, in brozzle_site
    enable_youtube_dl=not self._skip_youtube_dl)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 385, in brozzle_page
    on_request)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 447, in _browse_page
    cookie_db=site.get('cookie_db'))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 338, in start
    self._wait_for(lambda: self.websock_thread.is_open, timeout=10)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 311, in _wait_for
    elapsed, callback))
brozzler.browser.BrowsingTimeout: timed out after 11.1s waiting for: <function Browser.start.<locals>.<lambda> at 0x7fb2dc772bd8>
Mostly at startup. Now that brozzler claims sites in batches for
brozzling, we have situations where we start up a whole bunch of
browsers at the same time. That's probably why in some cases they are
slow to establish the websocket connection.
2018-02-14 12:29:51 -08:00
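The timeout in the traceback above comes from a polling wait on the websocket connection. A minimal sketch of that kind of wait loop follows. The names `wait_for` and `poll_interval` are hypothetical, and `BrowsingTimeout` here is a local stand-in that only mirrors the exception name in the traceback; this is not brozzler's actual implementation.

```python
import time


class BrowsingTimeout(Exception):
    """Local stand-in for the exception named in the traceback (assumed)."""


def wait_for(callback, timeout, poll_interval=0.05):
    """Poll callback() until it returns a truthy value or timeout expires.

    Hypothetical helper; poll_interval is an assumption for illustration.
    """
    start = time.monotonic()
    while True:
        if callback():
            return
        elapsed = time.monotonic() - start
        if elapsed >= timeout:
            raise BrowsingTimeout(
                "timed out after %.1fs waiting for: %s" % (elapsed, callback))
        time.sleep(poll_interval)
```

A larger `timeout` only helps when the condition does eventually become true, which matches the situation described above: many browsers starting at once and connecting slowly.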
Noah Levitt
0faeaab3ac
fix attempt for deadlock-ish situation
...
see https://github.com/internetarchive/brozzler/issues/91
2018-02-13 17:09:28 -08:00
Noah Levitt
fc000ff515
bump dev version after PR merge
2018-02-06 12:14:53 -08:00
Noah Levitt
95cbfa96e2
back to dev version number
2018-02-02 16:54:29 -08:00
Noah Levitt
2a0ad6d0de
commit for beta release
2018-02-02 16:52:42 -08:00
Noah Levitt
8505720c41
fix tests
2018-02-02 15:11:26 -08:00
Noah Levitt
5331aca33f
update brozzler-easy for current warcprox api
2018-02-02 14:28:46 -08:00
Noah Levitt
ba8d5a3740
fix needs_browsing check
...
correctly handle relative url "location" response header
2018-01-26 11:00:46 -08:00
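For the relative "location" header case in the commit above, the usual approach is to resolve the header value against the URL that was requested. A small sketch using only the standard library; `resolve_location` is a hypothetical name, not a function from brozzler.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Resolve a Location header value against the requested URL.

    Hypothetical helper: absolute values pass through unchanged,
    relative ones are joined onto request_url.
    """
    return urljoin(request_url, location)
```

With this, a redirect to `/c` from `http://example.com/a/b` resolves to `http://example.com/c`, while an absolute target is returned as given.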
Noah Levitt
67d5a0e671
increase timeout waiting for screenshot
...
because we are seeing timeouts on moderately busy machines
2018-01-26 10:19:23 -08:00
Noah Levitt
c934759852
pass canonicalized url to youtube-dl
...
avoids this kind of error:
wbgrp-svc294 2018-01-19 21:04:43,973 648 ERROR BrozzlingThread:39295 youtube_dl.to_stderr(YoutubeDL.py:514) ERROR: Unable to download webpage: <urlopen error no host given> (caused by URLError('no host given',))
wbgrp-svc294 2018-01-19 21:04:43,973 648 ERROR BrozzlingThread:39295 root.brozzle_site(worker.py:521) proxy error (site.proxy=wbgrp-svc400.us.archive.org:8002), will try to choose a healthy instance next time site is brozzled: youtube-dl hit apparent proxy error from https:/www.laphil.com/press1718
2018-01-22 14:52:54 -08:00
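The "no host given" error in the log above is consistent with the single slash in `https:/www.laphil.com/press1718`. The sketch below uses only the standard library to show how a generic URL parser sees that string. It does not use urlcanon or youtube-dl, and it only illustrates the symptom, not the fix in the commit.

```python
from urllib.parse import urlsplit

raw = "https:/www.laphil.com/press1718"  # URL as it appears in the log above
parts = urlsplit(raw)

# With a single slash there is no authority component, so the host is
# empty and the intended hostname ends up inside the path.
print(repr(parts.netloc))    # ''
print(parts.path)            # /www.laphil.com/press1718
print(parts.hostname)        # None
```

A WHATWG-style canonicalizer treats the missing slash as recoverable for `https` URLs, so the canonicalized form has a proper host, which is presumably why passing the canonicalized URL avoids the error.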
Noah Levitt
c22e81341a
bump version after pull request merge
2018-01-19 15:02:55 -08:00
Noah Levitt
503771d653
set a timeout on warcprox_write_record request
2017-12-27 15:52:55 -08:00
Noah Levitt
cc6297ef60
wait for ack from browser setting request headers
...
guessing this might fix the issue where some requests are missing the
warcprox-meta header, which results in their being written to the wrong
warc
2017-12-27 14:43:26 -08:00
Noah Levitt
1dea1f3f93
use Accept-Encoding: gzip instead of identity
...
fixes twitter scrolling, which had been giving "Loading seems to be
taking a while." error message
2017-12-27 14:22:24 -08:00
Noah Levitt
daecb4f59e
fix brozzler-list-sites --site=SITE_ID
2017-12-21 17:16:41 -08:00
Noah Levitt
7ff99266ea
quiet down the logging
2017-12-15 15:57:36 -08:00
Noah Levitt
df6615cc2c
avoid rethinkdb.errors.ReqlDriverError: Query size
2017-12-15 15:55:10 -08:00
Noah Levitt
196cd2c5eb
will this fix the travis build?
2017-11-08 17:41:39 -08:00
Noah Levitt
d40390f938
cryptography lib version 2.1.1 is causing problems
2017-10-16 10:52:09 -07:00
Noah Levitt
ec847e48bc
fix problem where each hashtag visited causes a page load if page url redirects
2017-09-27 14:11:20 -07:00
Noah Levitt
384c877e9a
new test exposing problem where each hashtag visited causes a page load, if page redirects
2017-09-27 14:08:28 -07:00
Noah Levitt
bf250194b4
bump dev version number after some PR merges
2017-08-01 12:04:56 -07:00
Noah Levitt
c77f4e4249
dev version bump
2017-07-06 17:19:53 -07:00
Noah Levitt
051e299a80
fix "local variable 'start' referenced before assignment"
2017-06-27 11:08:51 -07:00
Noah Levitt
b9640b8a30
enforce time limits based on time claimed by worker actively brozzling, to avoid problem of stopping crawls that haven't had much chance to crawl, because of cluster busy-ness
2017-06-26 18:00:32 -07:00
Noah Levitt
8ef7972ace
make sure youtube-dl progress thing can't derail youtube-dl operation
2017-06-26 16:10:40 -07:00
Noah Levitt
caee2787b0
have brozzler-list-sites --active use the index
2017-06-24 01:05:19 +00:00
Noah Levitt
35babeb01b
make youtube-dl prefer unsegmented videos
2017-06-23 15:19:30 -07:00
Noah Levitt
405c5725e4
restore reclamation of orphaned, claimed sites, and heartbeat site.last_claimed every 7 minutes during youtube-dl processing, to prevent another brozzler-worker claiming the site
2017-06-23 13:50:49 -07:00
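The heartbeat described in the commit above can be sketched as a background thread that fires at a fixed interval while long-running work is in progress. `run_with_heartbeat` and its parameters are hypothetical; in the scenario above, `heartbeat` would update `site.last_claimed` and `interval` would be 7 minutes. This is an illustration, not brozzler's code.

```python
import threading


def run_with_heartbeat(work, heartbeat, interval):
    """Run work() while calling heartbeat() every `interval` seconds.

    Hypothetical helper for illustration only.
    """
    done = threading.Event()

    def beat():
        # Event.wait() returns False on timeout and True once done is set,
        # so the loop exits promptly when the work finishes.
        while not done.wait(interval):
            heartbeat()

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        return work()
    finally:
        done.set()
        thread.join()
```

Waiting on the event instead of sleeping means the heartbeat thread stops as soon as the work is done, without waiting out a full interval.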
Noah Levitt
6bae53e646
disable the re-claiming of sites that are marked claimed from more than an hour ago, because sometimes pages legitimately take longer than an hour to brozzle; working on a better solution to this issue
2017-06-19 11:21:02 -07:00
Noah Levitt
7ae22381ef
bump version number for pull request
2017-06-12 15:42:49 -07:00
Noah Levitt
193ac43797
back to dev version number
2017-06-08 17:33:29 -07:00
Noah Levitt
44f74066cf
1.1b11
2017-06-08 17:30:24 -07:00
Noah Levitt
27040fd8b7
mini fix
2017-06-08 17:29:51 -07:00
Noah Levitt
02e1c88fac
oops bump version
2017-06-07 13:08:23 -07:00
Noah Levitt
65adc11d95
oops, should have bumped version number after merging pull requests
2017-06-07 08:51:21 -07:00
Noah Levitt
f2227e6759
have travis-ci test against python 3.5 and 3.6 too
2017-05-26 13:28:00 -07:00
Noah Levitt
bdc0badec3
rewrite frontier.scope_and_schedule_outlinks() to use batch rethinkdb queries, because we have witnessed the method running for hours(!)
2017-05-26 13:24:14 -07:00
Noah Levitt
d904daea9c
remove stray logging
2017-05-24 11:36:06 -07:00
Noah Levitt
ac543ee5b6
use "ttl" for updated doublethink svc reg api
2017-05-23 11:33:04 -07:00
Noah Levitt
89e7c8b079
fix exception from ReachedLimit.__repr__ when it has been instantiated implicitly and __init__ was not called
2017-05-16 15:47:18 -07:00