Noah Levitt
|
331d07fe88
|
these ssurts are strings too
|
2018-05-16 17:11:08 -07:00 |
|
Noah Levitt
|
5bb392ec7c
|
ssurts are strings now
because they're friendlier that way in rethinkdb
|
2018-05-16 16:43:10 -07:00 |
|
Noah Levitt
|
fc05cac338
|
ok seriously tests
|
2018-05-14 15:38:28 -07:00 |
|
Noah Levitt
|
05f8ab3495
|
fix more tests for new approach sans scope['surt']
|
2018-05-14 15:38:28 -07:00 |
|
Noah Levitt
|
d7512fbeb6
|
move time limit enforcement
now it's next to stop request enforcement which makes more sense and
supports more timely action
|
2018-03-01 11:28:30 -08:00 |
|
Noah Levitt
|
8505720c41
|
fix tests
|
2018-02-02 15:11:26 -08:00 |
|
Noah Levitt
|
384c877e9a
|
new test exposing problem where each hashtag visited causes a page load, if page redirects
|
2017-09-27 14:08:28 -07:00 |
|
Noah Levitt
|
8256a34b4f
|
implement resilience to warcprox outage, i.e. deal with brozzler.ProxyError in brozzler-worker
|
2017-04-18 17:54:12 -07:00 |
|
Noah Levitt
|
df7734f2ca
|
new command line utility brozzler-stop-crawl, with tests
|
2017-04-14 18:06:15 -07:00 |
|
Noah Levitt
|
3d47805ec1
|
new model for crawling hashtags, each one is no longer a top-level page
|
2017-03-27 12:15:49 -07:00 |
|
Noah Levitt
|
a836269e95
|
remove some vestiges of old proxy stuff
|
2017-03-24 16:04:43 -07:00 |
|
Noah Levitt
|
934190084c
|
Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see https://github.com/internetarchive/warcprox/commit/8caae0d7d3), and enables warcprox features if so.
|
2017-03-24 13:55:23 -07:00 |
|
Noah Levitt
|
242ff51ec7
|
fix bug with seed redirects where scope change was applied too late to affect scoping of outlinks from the seed (with automated tests)
|
2017-03-06 15:13:40 -08:00 |
|
Noah Levitt
|
569af05b11
|
rethinkstuff is now "doublethink"
|
2017-03-02 12:48:45 -08:00 |
|
Noah Levitt
|
5c684779e5
|
pywb support for thumbnail: and screenshot: urls
|
2017-01-31 10:26:38 -08:00 |
|
Noah Levitt
|
4b6831b464
|
new flag Page.blocked_by_robots
|
2017-01-30 10:43:25 -08:00 |
|
Noah Levitt
|
86ac48d6c3
|
generalized support for login, with automatic detection of the login form on a page
|
2016-12-19 17:30:09 -08:00 |
|
Noah Levitt
|
72816d1058
|
don't check robots.txt when scheduling a new site to be crawled, but mark the seed Page as needs_robots_check, and delegate the robots check to brozzler-worker; new test of robots.txt adherence
|
2016-11-16 12:23:59 -08:00 |
|
Noah Levitt
|
5ac8994a24
|
rename webconsole to dashboard
|
2016-11-04 17:46:23 -07:00 |
|
Mouse Reeve
|
2215aaab21
|
Use warcprox if enable_warcprox_features is true
|
2016-10-18 17:39:33 -07:00 |
|
Noah Levitt
|
a370e7b987
|
tiny fix, and now the test passes for me
|
2016-10-14 19:21:26 -07:00 |
|
Noah Levitt
|
27452990ee
|
toward getting initial tests to pass
|
2016-10-14 18:26:48 -07:00 |
|
Noah Levitt
|
56e651baeb
|
working on basic integration tests
|
2016-10-13 17:12:35 -07:00 |
|
Noah Levitt
|
c864499a64
|
starting to create a framework for testing
|
2016-09-14 17:06:49 -07:00 |
|