470 Commits

Author SHA1 Message Date
Noah Levitt
d4d3ef4fd3 ugh fix version number 2017-03-30 17:53:36 -07:00
Noah Levitt
125d77b8c4 consolidate job.py and site.py into model.py, and let Job and Site share the elapsed() method by way of a mixin 2017-03-29 18:49:04 -07:00
Noah Levitt
3d47805ec1 new model for crawling hashtags, each one is no longer a top-level page 2017-03-27 12:15:49 -07:00
Noah Levitt
a836269e95 remove some vestiges of old proxy stuff 2017-03-24 16:04:43 -07:00
Noah Levitt
a826fdc7ef new test of frontier.seed_page 2017-03-24 15:45:40 -07:00
Noah Levitt
0e35de43b6 actually respect --proxy and --warcprox-auto options to brozzler-worker 2017-03-24 22:27:52 +00:00
Noah Levitt
934190084c Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see https://github.com/internetarchive/warcprox/commit/8caae0d7d3), and enables warcprox features if so. 2017-03-24 13:55:23 -07:00
Noah Levitt
9a2f181eb6 back to a dev version number 2017-03-22 16:12:39 -07:00
Noah Levitt
613dca29dc 1.1b10 since 1.1b9 has bugs :( 2017-03-22 16:11:26 -07:00
Noah Levitt
4ba25db684 ugh, avoid infinite recursion 2017-03-22 15:53:58 -07:00
Noah Levitt
34bb64297f fix frontier tests now that enable_warcprox_features is simply omitted by default 2017-03-22 15:46:12 -07:00
Noah Levitt
4aa611af52 i dub thee 1.1b9 2017-03-22 15:25:55 -07:00
Noah Levitt
aae810cc6e fix brozzler-easy so that warcprox features are enabled automatically (feature was already there but broken) 2017-03-22 15:15:07 -07:00
Noah Levitt
603956ec41 restore accidentally deleted line of code 2017-03-21 13:08:18 -07:00
Noah Levitt
95ba334b89 initialize page.videos correctly in all cases 2017-03-21 11:10:57 -07:00
Noah Levitt
eeee523b18 three-value "brozzled" parameter for frontier.site_pages(); fix thing where every Site got a list of all the seeds from the job; and some more frontier tests to catch these kinds of things 2017-03-20 17:28:16 -07:00
Noah Levitt
0e9f4a0c26 forgot to add the new test data 2017-03-20 12:33:52 -07:00
Noah Levitt
e9c7606318 oops remove pdb call 2017-03-20 12:14:11 -07:00
Noah Levitt
13130bd9d9 save info about embedded videos in page document in rethinkdb 2017-03-20 11:49:11 -07:00
Noah Levitt
94ba56dca5 actually implement the brozzler-list-jobs --job option 2017-03-17 11:14:45 -07:00
Noah Levitt
0685c77d01 always save outlinks info on rethinkdb page object, get rid of 'remember_outlinks' option, to keep config simple, and because it's not a very expensive thing 2017-03-17 10:04:10 -07:00
Noah Levitt
701f7654a8 make brozzler-list-* a little more intuitive, maybe 2017-03-16 13:01:41 -07:00
Noah Levitt
6c81b40e28 if parent page has a redirect_url, check scope rules both with the parent_page original url and with the redirect url, with automated tests 2017-03-16 12:12:33 -07:00
Noah Levitt
0021a9d5f0 add the new urlcanon.MatchRule conditions to job_schema.yaml 2017-03-15 17:08:27 -07:00
Noah Levitt
12fb9eaa15 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 14:59:51 -07:00
Noah Levitt
479f0f7e09 more automated tests of frontier stuff 2017-03-15 14:54:16 -07:00
Noah Levitt
9e1e002a71 turns out we want populate_defaults to happen in __init__, fix so things work right 2017-03-07 17:52:38 -08:00
Noah Levitt
01653c01d7 use updated doublethink library populate_defaults() to avoid problem where under certain circumstances field values from the database would be overwritten by defaults 2017-03-07 13:19:56 -08:00
Noah Levitt
242ff51ec7 fix bug with seed redirects where scope change was applied too late to affect scoping of outlinks from the seed (with automated tests) 2017-03-06 15:13:40 -08:00
Noah Levitt
569af05b11 rethinkstuff is now "doublethink" 2017-03-02 12:48:45 -08:00
Noah Levitt
700b08b7d7 use new rethinkstuff ORM 2017-02-28 16:12:50 -08:00
Noah Levitt
a1f1681cad fix issue where use of YoutubeDLSpy caused youtube-dl connections to remote servers to be kept open 2017-02-24 11:15:17 -08:00
Noah Levitt
b4f19e2594 fix typo 2017-02-23 10:47:04 -08:00
Noah Levitt
7417310d57 more pywb monkey-patching to get at least some youtube videos captured by brozzler to play back 2017-02-23 10:43:07 -08:00
Noah Levitt
2398031010 let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port 2017-02-22 12:44:19 -08:00
Noah Levitt
3c4ab834da handle errors from extract-outlinks.js, which happens on polyvore.com because it changes the definition of Set 😭 2017-02-22 10:57:11 -08:00
Noah Levitt
0d0da22613 brozzler-list-jobs --yaml 2017-02-16 10:20:36 -08:00
Noah Levitt
f02d4ed40e missed this in the last commit 2017-02-15 23:20:47 -08:00
Noah Levitt
b409e49cfa deprecate current scope rule syntax and create new syntax with slightly different semantics (to be documented), and add parent_url_regex scope rule; unit test for scoping 2017-02-15 16:46:45 -08:00
Noah Levitt
c0057e591a add --yaml option to brozzler-list-* commands 2017-02-15 23:13:09 +00:00
Noah Levitt
1054e8e3cb take screenshot before running behavior (but after login) - thanks danielbicho 2017-02-15 09:13:44 -08:00
Noah Levitt
e58f4b7c44 logging tweaks 2017-02-10 15:19:28 -08:00
Noah Levitt
09fa41f959 fix TypeError: not all arguments converted during string formatting 2017-02-03 17:24:47 -08:00
Noah Levitt
14e312e4c4 make sure site is not "claimed" when it's finished 2017-02-03 16:40:15 -08:00
Noah Levitt
a60878c5a7 support for resuming jobs, keeping track of each start and stop time, used to enforce time limits correctly 2017-02-03 14:56:12 -08:00
Noah Levitt
5a0301ac12 let rethinkdb generate job.id if not supplied in configuration 2017-02-03 14:53:50 -08:00
Noah Levitt
129a1e8f47 use underscore convention 2017-02-02 11:52:19 -08:00
Noah Levitt
5f4c5190da improve TRACE level logging 2017-02-02 11:41:40 -08:00
Noah Levitt
ed2d58d87d stopgap fix for problem where an attempt to save a screenshot of a url with a hash tag containing spaces or non-ascii characters would fail, causing the whole brozzle of the page to fail, and end up in a retry loop (better handling of hash tags is planned which will obviate this change) 2017-02-01 22:39:12 +00:00
Noah Levitt
5c684779e5 pywb support for thumbnail: and screenshot: urls 2017-01-31 10:26:38 -08:00