34 Commits

Author SHA1 Message Date
Noah Levitt
e23fa68d65 fix bug clobbering own changes to parent_page
and some other tweaks (python 3.5+, pytest logging config, ...)
2019-10-17 13:47:54 -07:00
Noah Levitt
05fab8b909 change time limit enforcement
enforce time limit based on all the time that a site was in active
rotation, including time it spent waiting for its turn to be brozzled;
this undoes the change from b9640b8a30c934, because now it seems that
was the wrong decision (brozzler jobs with many seeds and low
max_claimed_sites hanging around forever)
2018-11-12 16:21:38 -08:00
Noah Levitt
5bb392ec7c ssurts are strings now
because they're friendlier that way in rethinkdb
2018-05-16 16:43:10 -07:00
Noah Levitt
1572fd3ed6 missed a spot where is_permitted_by_robots needs monkeying 2018-05-15 16:52:48 -07:00
Noah Levitt
85a4757527 s/max_hops_off_surt/max_hops_off/ 2018-05-14 15:38:28 -07:00
Noah Levitt
5ebd2fb709 new test of max_hops_off 2018-05-14 15:38:28 -07:00
Noah Levitt
b83d3cb9df rename page.hops_off_surt to page.hops_off 2018-05-14 15:38:28 -07:00
Noah Levitt
245e27a21a tests for new approach without scope['surt']
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
f26d711a89 new job setting max_claimed_sites
Puts a cap on the number of sites belonging to a given job that can be brozzled
simultaneously across the cluster. Addresses the problem of a job with many
seeds starving out other jobs. For AITFIVE-1578.
2018-03-01 17:17:54 -08:00
Noah Levitt
d7512fbeb6 move time limit enforcement
now it's next to stop request enforcement which makes more sense and
supports more timely action
2018-03-01 11:28:30 -08:00
Noah Levitt
7962444f09 claim sites to brozzle in batches to reduce contention over sites table 2018-02-02 13:56:24 -08:00
Noah Levitt
7f78c335e1 --warcprox-auto distribute assigned sites evenly (#78)
When running with --warcprox-auto, choose the instance of warcprox with
the fewest assigned sites, instead of the one with the lowest load in the
service registry. In practice we often start brozzling a whole bunch of
sites at approximately the same time, and because it takes time for that
to affect the "load" reported by warcprox instances, sites end up being
distributed very unevenly.
2018-01-19 14:54:33 -08:00
Daniel Bicho
c4fa612547 fix some errors in test_resume_job 2017-10-17 10:33:26 +01:00
Daniel Bicho
bb98a43c8c fix and test both job stop request and site stop requests 2017-10-16 11:46:35 +01:00
Daniel Bicho
8aa10962bc test resume_job adding a simulation of a crawl job stopped and then resumed. 2017-10-15 19:11:46 +01:00
Daniel Bicho
378c097c29 add verification change to test_resume_job 2017-10-13 12:13:51 +01:00
Noah Levitt
3385d727ac minimally update test_time_limit for new time accounting 2017-06-26 17:57:50 -07:00
Noah Levitt
405c5725e4 restore reclamation of orphaned, claimed sites, and heartbeat site.last_claimed every 7 minutes during youtube-dl processing, to prevent another brozzler-worker claiming the site 2017-06-23 13:50:49 -07:00
Noah Levitt
6bae53e646 disable the re-claiming of sites that are marked claimed from more than an hour ago, because sometimes pages legitimately take longer than an hour to brozzle; working on a better solution to this issue 2017-06-19 11:21:02 -07:00
Noah Levitt
52433ade78 re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker 2017-05-01 13:00:04 -07:00
Noah Levitt
df7734f2ca new command line utility brozzler-stop-crawl, with tests 2017-04-14 18:06:15 -07:00
Noah Levitt
3d47805ec1 new model for crawling hashtags, each one is no longer a top-level page 2017-03-27 12:15:49 -07:00
Noah Levitt
a826fdc7ef new test of frontier.seed_page 2017-03-24 15:45:40 -07:00
Noah Levitt
934190084c Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see https://github.com/internetarchive/warcprox/commit/8caae0d7d3), and enables warcprox features if so. 2017-03-24 13:55:23 -07:00
Noah Levitt
34bb64297f fix frontier tests now that enable_warcprox_features is simply omitted by default 2017-03-22 15:46:12 -07:00
Noah Levitt
eeee523b18 three-value "brozzled" parameter for frontier.site_pages(); fix thing where every Site got a list of all the seeds from the job; and some more frontier tests to catch these kinds of things 2017-03-20 17:28:16 -07:00
Noah Levitt
0685c77d01 always save outlinks info on rethinkdb page object, get rid of 'remember_outlinks' option, to keep config simple, and because it's not a very expensive thing 2017-03-17 10:04:10 -07:00
Noah Levitt
6c81b40e28 if parent page has a redirect_url, check scope rules both with the parent_page original url and with the redirect url, with automated tests 2017-03-16 12:12:33 -07:00
Noah Levitt
479f0f7e09 more automated tests of frontier stuff 2017-03-15 14:54:16 -07:00
Noah Levitt
9e1e002a71 turns out we want populate_defaults to happen in __init__, fix so things work right 2017-03-07 17:52:38 -08:00
Noah Levitt
01653c01d7 use updated doublethink library populate_defaults() to avoid problem where under certain circumstances field values from the database would be overwritten by defaults 2017-03-07 13:19:56 -08:00
Noah Levitt
569af05b11 rethinkstuff is now "doublethink" 2017-03-02 12:48:45 -08:00
Noah Levitt
14e312e4c4 make sure site is not "claimed" when it's finished 2017-02-03 16:40:15 -08:00
Noah Levitt
a60878c5a7 support for resuming jobs, keeping track of each start and stop time, used to enforce time limits correctly 2017-02-03 14:56:12 -08:00
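
The start/stop time accounting mentioned in a60878c5a7 can be illustrated with a minimal sketch. This is not the project's actual code: the record layout (a list of dicts with "start" and "stop" keys) and the function names elapsed_seconds and time_limit_reached are assumptions made for illustration only. It shows one way that time spent across several start/stop intervals of a resumed job could be summed and compared against a time limit.

```python
from datetime import datetime, timedelta, timezone


def elapsed_seconds(starts_and_stops, now=None):
    """Sum the time covered by each start/stop interval.

    Hypothetical layout: each interval is a dict with "start" and "stop"
    datetimes; "stop" is None while the interval is still running, in
    which case it is counted up to `now`.
    """
    now = now or datetime.now(timezone.utc)
    total = timedelta()
    for interval in starts_and_stops:
        stop = interval["stop"] or now
        total += stop - interval["start"]
    return total.total_seconds()


def time_limit_reached(starts_and_stops, time_limit, now=None):
    """Return True if a time limit (in seconds) is set and has been used up."""
    if time_limit is None:
        return False
    return elapsed_seconds(starts_and_stops, now) >= time_limit
```

Because only the recorded intervals are summed, time during which a job was stopped does not count against its limit, which is the property resuming needs.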