Noah Levitt
f26d711a89
new job setting max_claimed_sites
Puts a cap on the number of sites belonging to a given job that can be brozzled
simultaneously across the cluster. Addresses the problem of a job with many
seeds starving out other jobs. For AITFIVE-1578.
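A minimal sketch of how such a per-job cap could work when claiming sites, assuming plain in-memory data. The function name `claimable_sites` and its parameters are hypothetical illustrations, not brozzler's actual code, which works against the sites table:

```python
def claimable_sites(candidates, claimed_per_job, max_claimed_by_job, limit):
    """Pick up to `limit` sites, skipping jobs already at their cap.

    candidates: list of dicts with "id" and "job_id" (hypothetical shape)
    claimed_per_job: job_id -> number of sites currently claimed
    max_claimed_by_job: job_id -> max_claimed_sites; missing means no cap
    """
    claimed = dict(claimed_per_job)  # copy so the caller's dict is untouched
    picked = []
    for site in candidates:
        if len(picked) >= limit:
            break
        job = site["job_id"]
        cap = max_claimed_by_job.get(job)
        if cap is not None and claimed.get(job, 0) >= cap:
            continue  # this job is at its cap; leave the site for later
        claimed[job] = claimed.get(job, 0) + 1
        picked.append(site)
    return picked


candidates = [
    {"id": 1, "job_id": "a"},
    {"id": 2, "job_id": "a"},
    {"id": 3, "job_id": "a"},
    {"id": 4, "job_id": "b"},
]
picked = claimable_sites(candidates, {"a": 1}, {"a": 2}, limit=10)
picked_ids = [site["id"] for site in picked]
```

Here job "a" already has one claimed site and a cap of two, so only one more of its sites is picked and the site from job "b" is not starved.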
2018-03-01 17:17:54 -08:00
Noah Levitt
d7512fbeb6
move time limit enforcement
now it's next to stop request enforcement which makes more sense and
supports more timely action
2018-03-01 11:28:30 -08:00
Noah Levitt
9a0941f1fd
Merge branch 'master' into claim-batches
* master:
back to dev version number
commit for beta release
this should fix travis build?
fix tests
update brozzler-easy for current warcprox api
simpleclicks for minutes PDF
2018-02-06 11:46:15 -08:00
Noah Levitt
8505720c41
fix tests
2018-02-02 15:11:26 -08:00
Noah Levitt
7962444f09
claim sites to brozzle in batches to reduce contention over sites table
2018-02-02 13:56:24 -08:00
Noah Levitt
bf5401283e
new test test_needs_browsing
currently exposes bug in resolving "location" response header
2018-01-26 10:59:18 -08:00
Noah Levitt
7f78c335e1
--warcprox-auto distribute assigned sites evenly ( #78 )
When running with --warcprox-auto, choose the instance of warcprox with
the least number of assigned sites, instead of the lowest load in the
service registry. In practice we often start brozzling a whole bunch of
sites at approximately the same time, and because it takes time for that
to affect the "load" reported by warcprox instances, sites end up being
distributed very unevenly.
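The selection rule described above can be sketched as follows. This is an illustration under stated assumptions: `choose_warcprox`, `instances`, and `assigned_sites` are hypothetical names standing in for the service registry entries and the sites table, not brozzler's API:

```python
from collections import Counter


def choose_warcprox(instances, assigned_sites):
    """Return the instance id with the fewest assigned sites.

    instances: list of instance ids (stand-in for service registry entries)
    assigned_sites: list of instance ids, one entry per site currently
        assigned to that instance (stand-in for a query on the sites table)
    """
    counts = Counter(assigned_sites)
    # Counter yields 0 for instances with no sites yet; on a tie, min()
    # returns the first instance in list order.
    return min(instances, key=lambda inst: counts[inst])


instances = ["wp1:8000", "wp2:8000", "wp3:8000"]
assigned = ["wp1:8000", "wp1:8000", "wp2:8000"]
chosen = choose_warcprox(instances, assigned)
```

Because the count of assigned sites changes as soon as a site is assigned, it does not lag the way a reported load figure does.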
2018-01-19 14:54:33 -08:00
Noah Levitt
b81cc4eb0a
remove stray pdb line
2017-11-08 17:03:54 -08:00
Noah Levitt
133726e942
test a real-ish mpd
2017-11-08 17:01:27 -08:00
Barbara Miller
e8fdf84db8
add test--not a Video
2017-11-07 17:23:51 -08:00
Daniel Bicho
c4fa612547
fix some errors in test_resume_job
2017-10-17 10:33:26 +01:00
Daniel Bicho
bb98a43c8c
fix and test both job stop request and site stop requests
2017-10-16 11:46:35 +01:00
Daniel Bicho
8aa10962bc
test resume_job adding a simulation of a crawl job stopped and then resumed.
2017-10-15 19:11:46 +01:00
Daniel Bicho
378c097c29
add verification change to test_resume_job
2017-10-13 12:13:51 +01:00
Noah Levitt
384c877e9a
new test exposing problem where each hashtag visited causes a page load, if page redirects
2017-09-27 14:08:28 -07:00
Noah Levitt
3385d727ac
minimally update test_time_limit for new time accounting
2017-06-26 17:57:50 -07:00
Noah Levitt
405c5725e4
restore reclamation of orphaned, claimed sites, and heartbeat site.last_claimed every 7 minutes during youtube-dl processing, to prevent another brozzler-worker claiming the site
2017-06-23 13:50:49 -07:00
Noah Levitt
6bae53e646
disable the re-claiming of sites that are marked claimed from more than an hour ago, because sometimes pages legitimately take longer than an hour to brozzle; working on a better solution to this issue
2017-06-19 11:21:02 -07:00
Noah Levitt
d514eaec15
even more, better failing tests for thread_raise
2017-05-16 14:00:10 -07:00
Noah Levitt
d2525e2e87
failing test for forthcoming behavior of thread_raise
2017-05-15 16:20:20 -07:00
Noah Levitt
52433ade78
re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker
2017-05-01 13:00:04 -07:00
Noah Levitt
dcf4811470
Merge branch 'master' into safe-thread-raise
2017-04-24 20:06:37 -07:00
Noah Levitt
0953e6972e
refactor thread_raise safety to use a context manager
2017-04-24 19:51:51 -07:00
Noah Levitt
f140e5bdbd
allow this stupid test to fail
2017-04-21 12:17:11 -07:00
Noah Levitt
ba519d7288
improve messaging when brozzler-stop-crawl is passed nonexistent seed/job id
2017-04-20 18:04:17 -07:00
Noah Levitt
7706bab8b8
safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such
2017-04-20 17:08:16 -07:00
Noah Levitt
8256a34b4f
implement resilience to warcprox outage, i.e. deal with brozzler.ProxyError in brozzler-worker
2017-04-18 17:54:12 -07:00
Noah Levitt
5603ff5380
have _warcprox_write_record also raise ProxyError when appropriate, and test this
2017-04-18 16:58:51 -07:00
Noah Levitt
ac972d399f
fix robots.txt proxy down test by setting site.id (cached robots is stored by site.id, and other tests that ran earlier with no site.id were interfering); and test another kind of connection error, for whatever that's worth
2017-04-18 12:00:23 -07:00
Noah Levitt
dc43794363
raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch
2017-04-17 18:15:22 -07:00
Noah Levitt
349b41ab32
raise new exception brozzler.ProxyError in case of proxy error browsing a page
2017-04-17 18:14:02 -07:00
Noah Levitt
0884b4cd56
bubble up proxy errors fetching robots.txt, with unit test, and documentation
2017-04-17 16:47:05 -07:00
Noah Levitt
df7734f2ca
new command line utility brozzler-stop-crawl, with tests
2017-04-14 18:06:15 -07:00
Noah Levitt
fae60e9960
parameterize command line entry points and add tests of --version, a rudimentary check that the commands at least run
2017-04-14 11:46:26 -07:00
Noah Levitt
5bcd10c228
extract area/@href links, and add test for outlink extraction
2017-04-05 12:09:48 -07:00
Noah Levitt
3d47805ec1
new model for crawling hashtags, each one is no longer a top-level page
2017-03-27 12:15:49 -07:00
Noah Levitt
a836269e95
remove some vestiges of old proxy stuff
2017-03-24 16:04:43 -07:00
Noah Levitt
a826fdc7ef
new test of frontier.seed_page
2017-03-24 15:45:40 -07:00
Noah Levitt
934190084c
Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see https://github.com/internetarchive/warcprox/commit/8caae0d7d3 ), and enables warcprox features if so.
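A minimal sketch of mutually exclusive options using the standard library's argparse. Only the two flag names come from the commit message; the parser name, metavar, and help text are assumptions, and the real brozzler-worker argument handling may differ:

```python
import argparse


def build_parser():
    # Hypothetical parser; only --proxy and --warcprox-auto are taken
    # from the commit message above.
    parser = argparse.ArgumentParser(prog="brozzler-worker-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "--proxy", metavar="HOST:PORT", default=None,
        help="use this proxy for all requests")
    group.add_argument(
        "--warcprox-auto", action="store_true",
        help="find a warcprox instance in the service registry")
    return parser


args = build_parser().parse_args(["--warcprox-auto"])
```

Passing both options makes argparse print a usage error and exit, which enforces the exclusivity at the command line.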
2017-03-24 13:55:23 -07:00
Noah Levitt
34bb64297f
fix frontier tests now that enable_warcprox_features is simply omitted by default
2017-03-22 15:46:12 -07:00
Noah Levitt
eeee523b18
three-value "brozzled" parameter for frontier.site_pages(); fix thing where every Site got a list of all the seeds from the job; and some more frontier tests to catch these kinds of things
2017-03-20 17:28:16 -07:00
Noah Levitt
0e9f4a0c26
forgot to add the new test data
2017-03-20 12:33:52 -07:00
Noah Levitt
e9c7606318
oops remove pdb call
2017-03-20 12:14:11 -07:00
Noah Levitt
13130bd9d9
save info about embedded videos in page document in rethinkdb
2017-03-20 11:49:11 -07:00
Noah Levitt
0685c77d01
always save outlinks info on rethinkdb page object, get rid of 'remember_outlinks' option, to keep config simple, and because it's not a very expensive thing
2017-03-17 10:04:10 -07:00
Noah Levitt
6c81b40e28
if parent page has a redirect_url, check scope rules both with the parent_page original url and with the redirect url, with automated tests
2017-03-16 12:12:33 -07:00
Noah Levitt
12fb9eaa15
use urlcanon library for canonicalization, surtification, scope match rules
2017-03-15 14:59:51 -07:00
Noah Levitt
479f0f7e09
more automated tests of frontier stuff
2017-03-15 14:54:16 -07:00
Noah Levitt
9e1e002a71
turns out we want populate_defaults to happen in __init__, fix so things work right
2017-03-07 17:52:38 -08:00
Noah Levitt
01653c01d7
use updated doublethink library populate_defaults() to avoid problem where under certain circumstances field values from the database would be overwritten by defaults
2017-03-07 13:19:56 -08:00