281 Commits

Author SHA1 Message Date
Noah Levitt
ac972d399f fix robots.txt proxy down test by setting site.id (cached robots is stored by site.id, and other tests that ran earlier with no site.id were interfering); and test another kind of connection error, for whatever that's worth 2017-04-18 12:00:23 -07:00
Noah Levitt
dc43794363 raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch 2017-04-17 18:15:22 -07:00
Noah Levitt
349b41ab32 raise new exception brozzler.ProxyError in case of proxy error browsing a page 2017-04-17 18:14:02 -07:00
Noah Levitt
87a7301f4d make brozzle-page respect --proxy (no test for this!) 2017-04-17 18:11:09 -07:00
Noah Levitt
0e90950de2 oops, version bump for previous commit 2017-04-17 18:10:56 -07:00
Noah Levitt
df7734f2ca new command line utility brozzler-stop-crawl, with tests 2017-04-14 18:06:15 -07:00
Noah Levitt
fae60e9960 parameterize command line entry points and add tests of --version, a rudimentary check that the commands at least run 2017-04-14 11:46:26 -07:00
Noah Levitt
b3cf746f53 stupid version number bump 2017-04-05 17:01:52 -07:00
Noah Levitt
62917a6f1a Revert "bump version number for last pull request"
This reverts commit d192fc269eddeb8b06888e95bb6e4a6639e34415.
2017-04-05 17:01:06 -07:00
Noah Levitt
d192fc269e bump version number for last pull request 2017-04-05 16:15:24 -07:00
Noah Levitt
5bcd10c228 extract area/@href links, and add test for outlink extraction 2017-04-05 12:09:48 -07:00
Noah Levitt
d4d3ef4fd3 ugh fix version number 2017-03-30 17:53:36 -07:00
Noah Levitt
125d77b8c4 consolidate job.py and site.py into model.py, and let Job and Site share the elapsed() method by way of a mixin 2017-03-29 18:49:04 -07:00
Noah Levitt
3d47805ec1 new model for crawling hashtags, each one is no longer a top-level page 2017-03-27 12:15:49 -07:00
Noah Levitt
a836269e95 remove some vestiges of old proxy stuff 2017-03-24 16:04:43 -07:00
Noah Levitt
a826fdc7ef new test of frontier.seed_page 2017-03-24 15:45:40 -07:00
Noah Levitt
0e35de43b6 actually respect --proxy and --warcprox-auto options to brozzler-worker 2017-03-24 22:27:52 +00:00
Noah Levitt
934190084c Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see https://github.com/internetarchive/warcprox/commit/8caae0d7d3), and enables warcprox features if so. 2017-03-24 13:55:23 -07:00
Noah Levitt
9a2f181eb6 back to a dev version number 2017-03-22 16:12:39 -07:00
Noah Levitt
613dca29dc 1.1b10 since 1.1b9 has bugs :( 2017-03-22 16:11:26 -07:00
Noah Levitt
4ba25db684 ugh, avoid infinite recursion 2017-03-22 15:53:58 -07:00
Noah Levitt
34bb64297f fix frontier tests now that enable_warcprox_features is simply omitted by default 2017-03-22 15:46:12 -07:00
Noah Levitt
4aa611af52 i dub thee 1.1b9 2017-03-22 15:25:55 -07:00
Noah Levitt
aae810cc6e fix brozzler-easy so that warcprox features are enabled automatically (feature was already there but broken) 2017-03-22 15:15:07 -07:00
Noah Levitt
603956ec41 restore accidentally deleted line of code 2017-03-21 13:08:18 -07:00
Noah Levitt
95ba334b89 initialize page.videos correctly in all cases 2017-03-21 11:10:57 -07:00
Noah Levitt
eeee523b18 three-value "brozzled" parameter for frontier.site_pages(); fix thing where every Site got a list of all the seeds from the job; and some more frontier tests to catch these kinds of things 2017-03-20 17:28:16 -07:00
Noah Levitt
0e9f4a0c26 forgot to add the new test data 2017-03-20 12:33:52 -07:00
Noah Levitt
e9c7606318 oops remove pdb call 2017-03-20 12:14:11 -07:00
Noah Levitt
13130bd9d9 save info about embedded videos in page document in rethinkdb 2017-03-20 11:49:11 -07:00
Noah Levitt
94ba56dca5 actually implement the brozzler-list-jobs --job option 2017-03-17 11:14:45 -07:00
Noah Levitt
0685c77d01 always save outlinks info on rethinkdb page object, get rid of 'remember_outlinks' option, to keep config simple, and because it's not a very expensive thing 2017-03-17 10:04:10 -07:00
Noah Levitt
701f7654a8 make brozzler-list-* a little more intuitive, maybe 2017-03-16 13:01:41 -07:00
Noah Levitt
6c81b40e28 if parent page has a redirect_url, check scope rules both with the parent_page original url and with the redirect url, with automated tests 2017-03-16 12:12:33 -07:00
Noah Levitt
0021a9d5f0 add the new urlcanon.MatchRule conditions to job_schema.yaml 2017-03-15 17:08:27 -07:00
Noah Levitt
12fb9eaa15 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 14:59:51 -07:00
Noah Levitt
479f0f7e09 more automated tests of frontier stuff 2017-03-15 14:54:16 -07:00
Noah Levitt
9e1e002a71 turns out we want populate_defaults to happen in __init__, fix so things work right 2017-03-07 17:52:38 -08:00
Noah Levitt
01653c01d7 use updated doublethink library populate_defaults() to avoid problem where under certain circumstances field values from the database would be overwritten by defaults 2017-03-07 13:19:56 -08:00
Noah Levitt
242ff51ec7 fix bug with seed redirects where scope change was applied too late to affect scoping of outlinks from the seed (with automated tests) 2017-03-06 15:13:40 -08:00
Noah Levitt
569af05b11 rethinkstuff is now "doublethink" 2017-03-02 12:48:45 -08:00
Noah Levitt
700b08b7d7 use new rethinkstuff ORM 2017-02-28 16:12:50 -08:00
Noah Levitt
a1f1681cad fix issue where use of YoutubeDLSpy caused youtube-dl connections to remote servers to be kept open 2017-02-24 11:15:17 -08:00
Noah Levitt
b4f19e2594 fix typo 2017-02-23 10:47:04 -08:00
Noah Levitt
7417310d57 more pywb monkey-patching to get at least some youtube videos captured by brozzler to play back 2017-02-23 10:43:07 -08:00
Noah Levitt
2398031010 let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port 2017-02-22 12:44:19 -08:00
Noah Levitt
3c4ab834da handle errors from extract-outlinks.js, which happens on polyvore.com because it changes the definition of Set 😭 2017-02-22 10:57:11 -08:00
Noah Levitt
0d0da22613 brozzler-list-jobs --yaml 2017-02-16 10:20:36 -08:00
Noah Levitt
f02d4ed40e missed this in the last commit 2017-02-15 23:20:47 -08:00
Noah Levitt
b409e49cfa deprecate current scope rule syntax and create new syntax with slightly different semantics (to be documented), and add parent_url_regex scope rule; unit test for scoping 2017-02-15 16:46:45 -08:00