Noah Levitt
405c5725e4
restore reclamation of orphaned, claimed sites, and heartbeat site.last_claimed every 7 minutes during youtube-dl processing, to prevent another brozzler-worker claiming the site
2017-06-23 13:50:49 -07:00
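A minimal sketch of the claim heartbeat and orphan reclamation described in the commit above, not brozzler's actual code. The `Site` class, `is_reclaimable()` and `run_with_heartbeat()` are hypothetical names. The one-hour staleness threshold is taken from the earlier re-claim commits in this log, and the 7-minute interval from the message above.

```python
import threading
import time

RECLAIM_AFTER = 60 * 60       # a claim older than one hour counts as orphaned
HEARTBEAT_INTERVAL = 7 * 60   # refresh the claim every 7 minutes


class Site:
    """Hypothetical stand-in for a site record; only the fields used here."""

    def __init__(self):
        self.claimed = False
        self.last_claimed = 0.0


def is_reclaimable(site, now):
    """Unclaimed sites are always claimable; claimed ones only if the claim is stale."""
    return not site.claimed or now - site.last_claimed > RECLAIM_AFTER


def run_with_heartbeat(site, work, interval=HEARTBEAT_INTERVAL, clock=time.time):
    """Run work() while a helper thread refreshes site.last_claimed periodically."""
    done = threading.Event()

    def heartbeat():
        # Event.wait() returns False on timeout and True once `done` is set.
        while not done.wait(interval):
            site.last_claimed = clock()

    site.claimed = True
    site.last_claimed = clock()
    beater = threading.Thread(target=heartbeat, daemon=True)
    beater.start()
    try:
        return work()
    finally:
        done.set()
        beater.join()
```

Because the heartbeat interval is much shorter than the reclaim threshold, a worker that is still alive keeps its claim fresh, while a worker that was killed stops refreshing and its site becomes reclaimable after an hour.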
Noah Levitt
6bae53e646
disable the re-claiming of sites that are marked claimed from more than an hour ago, because sometimes pages legitimately take longer than an hour to brozzle; working on a better solution to this issue
2017-06-19 11:21:02 -07:00
Noah Levitt
7ae22381ef
bump version number for pull request
2017-06-12 15:42:49 -07:00
Noah Levitt
193ac43797
back to dev version number
2017-06-08 17:33:29 -07:00
Noah Levitt
44f74066cf
1.1b11
2017-06-08 17:30:24 -07:00
Noah Levitt
27040fd8b7
mini fix
2017-06-08 17:29:51 -07:00
Noah Levitt
02e1c88fac
oops bump version
2017-06-07 13:08:23 -07:00
Noah Levitt
65adc11d95
oops, should have bumped version number after merging pull requests
2017-06-07 08:51:21 -07:00
Noah Levitt
f2227e6759
have travis-ci test against python 3.5 and 3.6 too
2017-05-26 13:28:00 -07:00
Noah Levitt
bdc0badec3
rewrite frontier.scope_and_schedule_outlinks() to use batch rethinkdb queries, because we have witnessed the method running for hours(!)
2017-05-26 13:24:14 -07:00
Noah Levitt
d904daea9c
remove stray logging
2017-05-24 11:36:06 -07:00
Noah Levitt
ac543ee5b6
use "ttl" for updated doublethink svc reg api
2017-05-23 11:33:04 -07:00
Noah Levitt
89e7c8b079
fix exception from ReachedLimit.__repr__ when it has been instantiated implicitly and __init__ was not called
2017-05-16 15:47:18 -07:00
Noah Levitt
31dc6a2d97
improve thread_raise() so that the new tests pass
1. If the thread is not currently accepting exceptions, queue the exception
and raise it if and when the thread does start accepting them. This fixes the
problem of thread_raise exceptions being ignored when raised just before the
target thread starts accepting exceptions.
2. Avoid problems caused by raising multiple exceptions in the same
thread in quick succession by ensuring that only one is actually raised for
a given `with` block. This type of occurrence had been putting brozzler into
a borked/frozen state.
2017-05-16 14:20:53 -07:00
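The two behaviors listed in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not brozzler's implementation: `ExceptionGate` and its method names are hypothetical, and the only real API relied on is CPython's `PyThreadState_SetAsyncExc`, reached through `ctypes`. The `deliver` parameter exists only so the queuing logic can be exercised without raising anything asynchronously.

```python
import ctypes
import threading


def _async_raise(thread, exc_type):
    """Ask CPython to raise exc_type asynchronously in `thread`."""
    ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread.ident), ctypes.py_object(exc_type))


class ExceptionGate:
    """Hypothetical helper; the names are illustrative, not brozzler's API."""

    def __init__(self, thread, deliver=_async_raise):
        self.thread = thread
        self._deliver = deliver
        self._lock = threading.Lock()
        self._accepting = False
        self._queued = None    # exception type waiting for the next `with` block
        self._raised = False   # True once the current block has had one raised

    def __enter__(self):
        with self._lock:
            queued, self._queued = self._queued, None
            if queued is None:
                self._accepting = True
                self._raised = False
        if queued is not None:
            # Point 1: an exception requested before the block began is raised now.
            raise queued
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._accepting = False
        return False  # never swallow exceptions

    def thread_raise(self, exc_type):
        with self._lock:
            if not self._accepting:
                if self._queued is None:
                    self._queued = exc_type
                    return "queued"
                return "ignored"
            if self._raised:
                # Point 2: at most one exception per `with` block.
                return "ignored"
            self._raised = True
            self._deliver(self.thread, exc_type)
            return "raised"
```

The target thread wraps interruptible work in `with gate:`, and other threads call `gate.thread_raise(SomeException)`. An asynchronous exception may still arrive slightly after the block exits, so a real implementation needs more care than this sketch shows.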
Noah Levitt
d514eaec15
even more, better failing tests for thread_raise
2017-05-16 14:00:10 -07:00
Noah Levitt
d2525e2e87
failing test for forthcoming behavior of thread_raise
2017-05-15 16:20:20 -07:00
Noah Levitt
60c5a7c1c4
recognize ConnectionError (of which ConnectionResetError is a subclass) in _warcprox_write_record as a proxy error
2017-05-12 10:03:53 -07:00
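The subclass relationship mentioned in the commit above means a single `except ConnectionError` clause covers reset, refused, and aborted connections. A small sketch, assuming a local stand-in `ProxyError` class and a hypothetical `write_record()` wrapper rather than brozzler's own `_warcprox_write_record`:

```python
class ProxyError(Exception):
    """Local stand-in for brozzler.ProxyError so the sketch is self-contained."""


def write_record(send):
    """Call send() and report any connection-level failure as ProxyError."""
    try:
        return send()
    except ConnectionError as e:
        # Also catches ConnectionResetError, ConnectionRefusedError,
        # ConnectionAbortedError and BrokenPipeError, which all subclass
        # the built-in ConnectionError.
        raise ProxyError("proxy connection failed: %r" % e) from e
```

Other exceptions raised by `send()` pass through unchanged, so only connection failures are reclassified as proxy errors.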
Noah Levitt
b4bf17df9b
do a better job of making sure to shut down the browser when brozzle-page is killed
2017-05-03 16:43:31 -07:00
Noah Levitt
9d4cbbf6eb
handle another rethinkdb outage corner case
2017-05-01 14:12:43 -07:00
Noah Levitt
389db01458
make BrozzlerWorkerThread separate from MainThread to avoid SIGTERM/SIGINT raising an exception inside some rethinkdb code or other sensitive code that BrozzlerWorker.run() calls
2017-05-01 13:46:19 -07:00
Noah Levitt
52433ade78
re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker
2017-05-01 13:00:04 -07:00
Noah Levitt
dcf4811470
Merge branch 'master' into safe-thread-raise
2017-04-24 20:06:37 -07:00
Noah Levitt
f140e5bdbd
allow this stupid test to fail
2017-04-21 12:17:11 -07:00
Noah Levitt
ba519d7288
improve messaging when brozzler-stop-crawl is passed nonexistent seed/job id
2017-04-20 18:04:17 -07:00
Noah Levitt
7706bab8b8
safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such
2017-04-20 17:08:16 -07:00
Noah Levitt
426916a238
need warcprox in python path for travis tests now
2017-04-18 18:10:18 -07:00
Noah Levitt
8256a34b4f
implement resilience to warcprox outage, i.e. deal with brozzler.ProxyError in brozzler-worker
2017-04-18 17:54:12 -07:00
Noah Levitt
5603ff5380
have _warcprox_write_record also raise ProxyError when appropriate, and test this
2017-04-18 16:58:51 -07:00
Noah Levitt
ac972d399f
fix robots.txt proxy down test by setting site.id (cached robots is stored by site.id, and other tests that ran earlier with no site.id were interfering); and test another kind of connection error, for whatever that's worth
2017-04-18 12:00:23 -07:00
Noah Levitt
dc43794363
raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch
2017-04-17 18:15:22 -07:00
Noah Levitt
349b41ab32
raise new exception brozzler.ProxyError in case of proxy error browsing a page
2017-04-17 18:14:02 -07:00
Noah Levitt
87a7301f4d
make brozzle-page respect --proxy (no test for this!)
2017-04-17 18:11:09 -07:00
Noah Levitt
0e90950de2
oops, version bump for previous commit
2017-04-17 18:10:56 -07:00
Noah Levitt
df7734f2ca
new command line utility brozzler-stop-crawl, with tests
2017-04-14 18:06:15 -07:00
Noah Levitt
fae60e9960
parameterize command line entry points and add tests of --version, a rudimentary check that the commands at least run
2017-04-14 11:46:26 -07:00
Noah Levitt
b3cf746f53
stupid version number bump
2017-04-05 17:01:52 -07:00
Noah Levitt
62917a6f1a
Revert "bump version number for last pull request"
This reverts commit d192fc269eddeb8b06888e95bb6e4a6639e34415.
2017-04-05 17:01:06 -07:00
Noah Levitt
d192fc269e
bump version number for last pull request
2017-04-05 16:15:24 -07:00
Noah Levitt
5bcd10c228
extract area/@href links, and add test for outlink extraction
2017-04-05 12:09:48 -07:00
Noah Levitt
d4d3ef4fd3
ugh fix version number
2017-03-30 17:53:36 -07:00
Noah Levitt
125d77b8c4
consolidate job.py and site.py into model.py, and let Job and Site share the elapsed() method by way of a mixin
2017-03-29 18:49:04 -07:00
Noah Levitt
3d47805ec1
new model for crawling hashtags, each one is no longer a top-level page
2017-03-27 12:15:49 -07:00
Noah Levitt
a836269e95
remove some vestiges of old proxy stuff
2017-03-24 16:04:43 -07:00
Noah Levitt
a826fdc7ef
new test of frontier.seed_page
2017-03-24 15:45:40 -07:00
Noah Levitt
0e35de43b6
actually respect --proxy and --warcprox-auto options to brozzler-worker
2017-03-24 22:27:52 +00:00
Noah Levitt
934190084c
Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see https://github.com/internetarchive/warcprox/commit/8caae0d7d3 ), and enables warcprox features if so.
2017-03-24 13:55:23 -07:00
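A rough sketch of the option handling described in the commit above, using only the standard library. The function names and the parser's `prog` are hypothetical. The shape of the `/status` response is an assumption: the sketch guesses that warcprox answers with JSON containing a `"role"` field equal to `"warcprox"`, which the commit message itself does not state.

```python
import argparse
import json
import urllib.request


def build_parser():
    """Parser with --proxy and --warcprox-auto as mutually exclusive options."""
    parser = argparse.ArgumentParser(prog="brozzler-worker-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--proxy", metavar="HOST:PORT")
    group.add_argument("--warcprox-auto", action="store_true")
    return parser


def _fetch_status(proxy_address):
    """Fetch the raw body of http://{proxy_address}/status."""
    url = "http://%s/status" % proxy_address
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def is_warcprox(proxy_address, fetch=_fetch_status):
    """Assumption: warcprox answers /status with JSON whose "role" is "warcprox"."""
    try:
        status = json.loads(fetch(proxy_address).decode("utf-8"))
    except (OSError, ValueError):
        # Unreachable proxy or a non-JSON answer: treat it as a plain proxy.
        return False
    return isinstance(status, dict) and status.get("role") == "warcprox"
```

With this split, `--warcprox-auto` would always enable warcprox features, while `--proxy HOST:PORT` would enable them only when `is_warcprox("HOST:PORT")` returns true.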
Noah Levitt
9a2f181eb6
back to a dev version number
2017-03-22 16:12:39 -07:00
Noah Levitt
613dca29dc
1.1b10 since 1.1b9 has bugs :(
2017-03-22 16:11:26 -07:00
Noah Levitt
4ba25db684
ugh, avoid infinite recursion
2017-03-22 15:53:58 -07:00
Noah Levitt
34bb64297f
fix frontier tests now that enable_warcprox_features is simply omitted by default
2017-03-22 15:46:12 -07:00