910 Commits

Author SHA1 Message Date
Noah Levitt
0021a9d5f0 add the new urlcanon.MatchRule conditions to job_schema.yaml 2017-03-15 17:08:27 -07:00
Noah Levitt
12fb9eaa15 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 14:59:51 -07:00
Noah Levitt
479f0f7e09 more automated tests of frontier stuff 2017-03-15 14:54:16 -07:00
Noah Levitt
9e1e002a71 turns out we want populate_defaults to happen in __init__, fix so things work right 2017-03-07 17:52:38 -08:00
Noah Levitt
01653c01d7 use updated doublethink library populate_defaults() to avoid problem where under certain circumstances field values from the database would be overwritten by defaults 2017-03-07 13:19:56 -08:00
Noah Levitt
242ff51ec7 fix bug with seed redirects where scope change was applied too late to affect scoping of outlinks from the seed (with automated tests) 2017-03-06 15:13:40 -08:00
Noah Levitt
40bbbb3524 add tests of backwards compatibility handling of start/stop times and fix a bug or two 2017-03-02 16:53:24 -08:00
Noah Levitt
569af05b11 rethinkstuff is now "doublethink" 2017-03-02 12:48:45 -08:00
Noah Levitt
700b08b7d7 use new rethinkstuff ORM 2017-02-28 16:12:50 -08:00
Noah Levitt
a1f1681cad fix issue where use of YoutubeDLSpy caused youtube-dl connections to remote servers to be kept open 2017-02-24 11:15:17 -08:00
Noah Levitt
b4f19e2594 fix typo 2017-02-23 10:47:04 -08:00
Noah Levitt
7417310d57 more pywb monkey-patching to get at least some youtube videos captured by brozzler to play back 2017-02-23 10:43:07 -08:00
Noah Levitt
2398031010 let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port 2017-02-22 12:44:19 -08:00
Noah Levitt
3c4ab834da handle errors from extract-outlinks.js, which happens on polyvore.com because it changes the definition of Set 😭 2017-02-22 10:57:11 -08:00
Noah Levitt
0d0da22613 brozzler-list-jobs --yaml 2017-02-16 10:20:36 -08:00
Noah Levitt
f02d4ed40e missed this in the last commit 2017-02-15 23:20:47 -08:00
Noah Levitt
b409e49cfa deprecate current scope rule syntax and create new syntax with slightly different semantics (to be documented), and add parent_url_regex scope rule; unit test for scoping 2017-02-15 16:46:45 -08:00
Noah Levitt
c0057e591a add --yaml option to brozzler-list-* commands 2017-02-15 23:13:09 +00:00
Noah Levitt
1054e8e3cb take screenshot before running behavior (but after login) - thanks danielbicho 2017-02-15 09:13:44 -08:00
Noah Levitt
e58f4b7c44 logging tweaks 2017-02-10 15:19:28 -08:00
Noah Levitt
09fa41f959 fix TypeError: not all arguments converted during string formatting 2017-02-03 17:24:47 -08:00
Noah Levitt
14e312e4c4 make sure site is not "claimed" when it's finished 2017-02-03 16:40:15 -08:00
Noah Levitt
a60878c5a7 support for resuming jobs, keeping track of each start and stop time, used to enforce time limits correctly 2017-02-03 14:56:12 -08:00
Noah Levitt
5a0301ac12 let rethinkdb generate job.id if not supplied in configuration 2017-02-03 14:53:50 -08:00
Noah Levitt
129a1e8f47 use underscore convention 2017-02-02 11:52:19 -08:00
Noah Levitt
5f4c5190da improve TRACE level logging 2017-02-02 11:41:40 -08:00
Noah Levitt
ed2d58d87d stopgap fix for problem where an attempt to save a screenshot of a url with a hash tag containing spaces or non-ascii characters would fail, causing the whole brozzle of the page to fail, and end up in a retry loop (better handling of hash tags is planned which will obviate this change) 2017-02-01 22:39:12 +00:00
Noah Levitt
5c684779e5 pywb support for thumbnail: and screenshot: urls 2017-01-31 10:26:38 -08:00
Noah Levitt
8f5003b784 fix oops 2017-01-30 23:47:39 -08:00
Noah Levitt
4b6831b464 new flag Page.blocked_by_robots 2017-01-30 10:43:25 -08:00
Noah Levitt
a8b564f100 be more patient to avoid spurious warnings waiting for browser to start up 2017-01-24 10:06:37 -08:00
Noah Levitt
d22cc075e0 restore ping_timeout argument to WebSocketApp.run_forever to fix problem of leaking websocket receiver threads hanging forever on select() 2017-01-24 09:55:56 -08:00
Noah Levitt
5375b819dd missed a spot 2017-01-20 23:59:31 -08:00
Noah Levitt
c3b637d244 improve brozzler-dashboard logging; fix default wayback baseurl in brozzler dashboard (https://github.com/internetarchive/brozzler/issues/31); tweak arg parsing related stuff 2017-01-20 23:41:59 -08:00
Noah Levitt
095456aa27 avoid js errors in case site or job is not configured to keep stats 2017-01-20 23:36:23 -08:00
Noah Levitt
65f818e901 add travis-ci slack notification to internetarchive/brozzler channel 2017-01-16 12:44:12 -08:00
Noah Levitt
037723fe2b support for BROZZLER_RETHINKDB_SERVERS and BROZZLER_RETHINKDB_DB environment variables, honored by all the brozzler-* commands 2017-01-13 20:27:09 +00:00
Noah Levitt
77c4dc1116 adapt to exception message from newer versions of chromium (e.g. 57.0.2981.0) 2017-01-13 12:08:00 -08:00
Noah Levitt
011d814ee2 tests for dismissal of javascript dialogs (alert, prompt, confirm) 2017-01-13 11:46:42 -08:00
Noah Levitt
d2ed6b97a2 dismiss alerts from the page being browsed (avoids hanging) 2017-01-13 10:27:37 -08:00
Noah Levitt
766441e65c simpleclicks - only click if element is visible, fixes spinning on moma.org sites 2017-01-12 23:23:46 -08:00
Noah Levitt
38d9eee68d implement brozzler-list-pages 2017-01-12 08:22:45 +00:00
Noah Levitt
184612332e new cli utils brozzler-list-jobs and brozzler-list-sites 2017-01-12 07:50:58 +00:00
Noah Levitt
64a0ea879a implement sha1 lookup and url prefix lookup for brozzler-list-captures 2017-01-12 01:26:09 +00:00
Noah Levitt
32097a8f8b catch exceptions parsing funky urls when scoping and extracting outlinks 2017-01-09 15:18:19 -08:00
Noah Levitt
2486768830 fix bug where login form would not be detected in some cases when there was a non-login form earlier on the page 2017-01-09 11:40:30 -08:00
Noah Levitt
d0022fe7bf reset browser shutdown flag when starting up 2017-01-06 17:57:11 -08:00
Noah Levitt
76b658747e fix oversight including username/password in site config when starting a new job 2017-01-06 13:03:09 -08:00
Noah Levitt
c2704b18be restore BrozzlerWorker built-in support for managing its own thread 2017-01-04 14:57:34 -08:00
Noah Levitt
70b67942a5 restore handling of 420 Reached limit, with a rudimentary test 2016-12-22 13:44:09 -08:00