Noah Levitt
|
df7734f2ca
|
new command line utility brozzler-stop-crawl, with tests
|
2017-04-14 18:06:15 -07:00 |
|
Noah Levitt
|
3d47805ec1
|
new model for crawling hashtags, each one is no longer a top-level page
|
2017-03-27 12:15:49 -07:00 |
|
Noah Levitt
|
a826fdc7ef
|
new test of frontier.seed_page
|
2017-03-24 15:45:40 -07:00 |
|
Noah Levitt
|
934190084c
|
Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see https://github.com/internetarchive/warcprox/commit/8caae0d7d3), and enables warcprox features if so.
|
2017-03-24 13:55:23 -07:00 |
|
Noah Levitt
|
34bb64297f
|
fix frontier tests now that enable_warcprox_features is simply omitted by default
|
2017-03-22 15:46:12 -07:00 |
|
Noah Levitt
|
eeee523b18
|
three-value "brozzled" parameter for frontier.site_pages(); fix thing where every Site got a list of all the seeds from the job; and some more frontier tests to catch these kinds of things
|
2017-03-20 17:28:16 -07:00 |
|
Noah Levitt
|
0685c77d01
|
always save outlinks info on rethinkdb page object, get rid of 'remember_outlinks' option, to keep config simple, and because it's not a very expensive thing
|
2017-03-17 10:04:10 -07:00 |
|
Noah Levitt
|
6c81b40e28
|
if parent page has a redirect_url, check scope rules both with the parent_page original url and with the redirect url, with automated tests
|
2017-03-16 12:12:33 -07:00 |
|
Noah Levitt
|
479f0f7e09
|
more automated tests of frontier stuff
|
2017-03-15 14:54:16 -07:00 |
|
Noah Levitt
|
9e1e002a71
|
turns out we want populate_defaults to happen in __init__, fix so things work right
|
2017-03-07 17:52:38 -08:00 |
|
Noah Levitt
|
01653c01d7
|
use updated doublethink library populate_defaults() to avoid problem where under certain circumstances field values from the database would be overwritten by defaults
|
2017-03-07 13:19:56 -08:00 |
|
Noah Levitt
|
569af05b11
|
rethinkstuff is now "doublethink"
|
2017-03-02 12:48:45 -08:00 |
|
Noah Levitt
|
14e312e4c4
|
make sure site is not "claimed" when it's finished
|
2017-02-03 16:40:15 -08:00 |
|
Noah Levitt
|
a60878c5a7
|
support for resuming jobs, keeping track of each start and stop time, used to enforce time limits correctly
|
2017-02-03 14:56:12 -08:00 |
|