124 Commits

Author SHA1 Message Date
Noah Levitt
b81cc4eb0a remove stray pdb line 2017-11-08 17:03:54 -08:00
Noah Levitt
133726e942 test a real-ish mpd 2017-11-08 17:01:27 -08:00
Barbara Miller
e8fdf84db8 add test--not a Video 2017-11-07 17:23:51 -08:00
Daniel Bicho
c4fa612547 fix some errors in test_resume_job 2017-10-17 10:33:26 +01:00
Daniel Bicho
bb98a43c8c fix and test both job stop request and site stop requests 2017-10-16 11:46:35 +01:00
Daniel Bicho
8aa10962bc test resume_job adding a simulation of a crawl job stopped and then resumed. 2017-10-15 19:11:46 +01:00
Daniel Bicho
378c097c29 add verification change to test_resume_job 2017-10-13 12:13:51 +01:00
Noah Levitt
384c877e9a new test exposing problem where each hashtag visited causes a page load, if page redirects 2017-09-27 14:08:28 -07:00
Noah Levitt
3385d727ac minimally update test_time_limit for new time accounting 2017-06-26 17:57:50 -07:00
Noah Levitt
405c5725e4 restore reclamation of orphaned, claimed sites, and heartbeat site.last_claimed every 7 minutes during youtube-dl processing, to prevent another brozzler-worker claiming the site 2017-06-23 13:50:49 -07:00
Noah Levitt
6bae53e646 disable the re-claiming of sites that are marked claimed from more than an hour ago, because sometimes pages legitimately take longer than an hour to brozzle; working on a better solution to this issue 2017-06-19 11:21:02 -07:00
Noah Levitt
d514eaec15 even more, better failing tests for thread_raise 2017-05-16 14:00:10 -07:00
Noah Levitt
d2525e2e87 failing test for forthcoming behavior of thread_raise 2017-05-15 16:20:20 -07:00
Noah Levitt
52433ade78 re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker 2017-05-01 13:00:04 -07:00
Noah Levitt
dcf4811470 Merge branch 'master' into safe-thread-raise 2017-04-24 20:06:37 -07:00
Noah Levitt
0953e6972e refactor thread_raise safety to use a context manager 2017-04-24 19:51:51 -07:00
Noah Levitt
f140e5bdbd allow this stupid test to fail 2017-04-21 12:17:11 -07:00
Noah Levitt
ba519d7288 improve messaging when brozzler-stop-crawl is passed nonexistent seed/job id 2017-04-20 18:04:17 -07:00
Noah Levitt
7706bab8b8 safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such 2017-04-20 17:08:16 -07:00
Noah Levitt
8256a34b4f implement resilience to warcprox outage, i.e. deal with brozzler.ProxyError in brozzler-worker 2017-04-18 17:54:12 -07:00
Noah Levitt
5603ff5380 have _warcprox_write_record also raise ProxyError when appropriate, and test this 2017-04-18 16:58:51 -07:00
Noah Levitt
ac972d399f fix robots.txt proxy down test by setting site.id (cached robots is stored by site.id, and other tests that ran earlier with no site.id were interfering); and test another kind of connection error, for whatever that's worth 2017-04-18 12:00:23 -07:00
Noah Levitt
dc43794363 raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch 2017-04-17 18:15:22 -07:00
Noah Levitt
349b41ab32 raise new exception brozzler.ProxyError in case of proxy error browsing a page 2017-04-17 18:14:02 -07:00
Noah Levitt
0884b4cd56 bubble up proxy errors fetching robots.txt, with unit test, and documentation 2017-04-17 16:47:05 -07:00
Noah Levitt
df7734f2ca new command line utility brozzler-stop-crawl, with tests 2017-04-14 18:06:15 -07:00
Noah Levitt
fae60e9960 parameterize command line entry points and add tests of --version, a rudimentary check that the commands at least run 2017-04-14 11:46:26 -07:00
Noah Levitt
5bcd10c228 extract area/@href links, and add test for outlink extraction 2017-04-05 12:09:48 -07:00
Noah Levitt
3d47805ec1 new model for crawling hashtags, each one is no longer a top-level page 2017-03-27 12:15:49 -07:00
Noah Levitt
a836269e95 remove some vestiges of old proxy stuff 2017-03-24 16:04:43 -07:00
Noah Levitt
a826fdc7ef new test of frontier.seed_page 2017-03-24 15:45:40 -07:00
Noah Levitt
934190084c Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see https://github.com/internetarchive/warcprox/commit/8caae0d7d3), and enables warcprox features if so. 2017-03-24 13:55:23 -07:00
Noah Levitt
34bb64297f fix frontier tests now that enable_warcprox_features is simply omitted by default 2017-03-22 15:46:12 -07:00
Noah Levitt
eeee523b18 three-value "brozzled" parameter for frontier.site_pages(); fix thing where every Site got a list of all the seeds from the job; and some more frontier tests to catch these kinds of things 2017-03-20 17:28:16 -07:00
Noah Levitt
0e9f4a0c26 forgot to add the new test data 2017-03-20 12:33:52 -07:00
Noah Levitt
e9c7606318 oops remove pdb call 2017-03-20 12:14:11 -07:00
Noah Levitt
13130bd9d9 save info about embedded videos in page document in rethinkdb 2017-03-20 11:49:11 -07:00
Noah Levitt
0685c77d01 always save outlinks info on rethinkdb page object, get rid of 'remember_outlinks' option, to keep config simple, and because it's not a very expensive thing 2017-03-17 10:04:10 -07:00
Noah Levitt
6c81b40e28 if parent page has a redirect_url, check scope rules both with the parent_page original url and with the redirect url, with automated tests 2017-03-16 12:12:33 -07:00
Noah Levitt
12fb9eaa15 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 14:59:51 -07:00
Noah Levitt
479f0f7e09 more automated tests of frontier stuff 2017-03-15 14:54:16 -07:00
Noah Levitt
9e1e002a71 turns out we want populate_defaults to happen in __init__, fix so things work right 2017-03-07 17:52:38 -08:00
Noah Levitt
01653c01d7 use updated doublethink library populate_defaults() to avoid problem where under certain circumstances field values from the database would be overwritten by defaults 2017-03-07 13:19:56 -08:00
Noah Levitt
242ff51ec7 fix bug with seed redirects where scope change was applied too late to affect scoping of outlinks from the seed (with automated tests) 2017-03-06 15:13:40 -08:00
Noah Levitt
40bbbb3524 add tests of backwards compatibility handling of start/stop times and fix a bug or two 2017-03-02 16:53:24 -08:00
Noah Levitt
569af05b11 rethinkstuff is now "doublethink" 2017-03-02 12:48:45 -08:00
Noah Levitt
2398031010 let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port 2017-02-22 12:44:19 -08:00
Noah Levitt
b409e49cfa deprecate current scope rule syntax and create new syntax with slightly different semantics (to be documented), and add parent_url_regex scope rule; unit test for scoping 2017-02-15 16:46:45 -08:00
Noah Levitt
14e312e4c4 make sure site is not "claimed" when it's finished 2017-02-03 16:40:15 -08:00
Noah Levitt
a60878c5a7 support for resuming jobs, keeping track of each start and stop time, used to enforce time limits correctly 2017-02-03 14:56:12 -08:00