Barbara Miller
e2b2542d4a
handle http auth (#138)
...
abort brozzling on interstitial (auth dialog)
because we have no other recourse at this point. waiting on Network.requestIntercepted auth challenge support. (didn't work in our latest testing)
https://chromedevtools.github.io/devtools-protocol/tot/Network#type-AuthChallengeResponse
2018-11-16 15:10:30 -08:00
Noah Levitt
f4fad934a7
verbiage tweaks
2018-09-25 15:19:33 -07:00
Noah Levitt
560981c1ad
safety check and --force for brozzler-purge
2018-09-25 15:17:45 -07:00
Noah Levitt
174178e02e
new command brozzler-purge
2018-09-25 14:56:26 -07:00
Noah Levitt
d0f5cd7168
tweak logging
2018-08-31 15:23:48 -07:00
Noah Levitt
4e398e1da2
expose more brozzle-page args
2018-08-13 15:38:24 -07:00
Noah Levitt
506ab0ccc2
check browser version at startup
2018-02-06 15:56:50 -08:00
Noah Levitt
7f78c335e1
--warcprox-auto distribute assigned sites evenly (#78)
...
--warcprox-auto distribute assigned sites evenly
When running with --warcprox-auto, choose the instance of warcprox with
the least number of assigned sites, instead of the lowest load in the
service registry. In practice we often start brozzling a whole bunch of
sites at approximately the same time, and because it takes time for that
to affect the "load" reported by warcprox instances, sites end up being
distributed very unevenly.
2018-01-19 14:54:33 -08:00
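A minimal sketch of the selection rule described in the commit above: pick the warcprox instance with the fewest assigned sites rather than the lowest reported load. The function name and the data shapes (a list of proxy address strings, and site records carrying a "proxy" key) are assumptions made for illustration, not brozzler's actual code.

```python
def choose_warcprox(instances, sites):
    """Return the proxy address with the fewest assigned sites.

    instances: list of proxy address strings (hypothetical shape)
    sites: list of dicts with an optional "proxy" key (hypothetical shape)
    """
    # Start every known instance at zero so idle instances are candidates.
    counts = {addr: 0 for addr in instances}
    for site in sites:
        proxy = site.get("proxy")
        if proxy in counts:
            counts[proxy] += 1
    # min() keeps the first instance on ties, so the result is deterministic.
    return min(instances, key=lambda addr: counts[addr])
```

Because the count changes as soon as a site is assigned, a burst of new sites spreads across instances even before their reported load catches up.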
Noah Levitt
daecb4f59e
fix brozzler-list-sites --site=SITE_ID
2017-12-21 17:16:41 -08:00
Barbara Miller
e6bb6791af
skip unnecessary assignment
2017-09-29 14:53:24 -07:00
Barbara Miller
5e7b3b73dd
skip_youtube_dl
2017-09-29 14:33:23 -07:00
Vangelis Banos
8019eb4b5f
Hide the options using argparse.SUPPRESS
2017-07-06 06:25:04 +00:00
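For reference, a small sketch of how argparse.SUPPRESS hides an option from the help output while still accepting it on the command line. That the hidden options are the two --skip-* flags from the neighboring commits is an assumption here, as are the parser layout and the build_parser name.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the help=argparse.SUPPRESS usage is the point.
    parser = argparse.ArgumentParser(prog="brozzle-page")
    parser.add_argument("url", help="page url")
    parser.add_argument(
        "--skip-extract-outlinks", dest="skip_extract_outlinks",
        action="store_true", help=argparse.SUPPRESS)
    parser.add_argument(
        "--skip-visit-hashtags", dest="skip_visit_hashtags",
        action="store_true", help=argparse.SUPPRESS)
    return parser
```

The suppressed options are omitted from both the usage line and the option list, but parse_args() still recognizes them.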
Vangelis Banos
475ddd329c
add skip cli options to brozzle-page
...
Add --skip-extract-outlinks --skip-visit-hashtags options to
`brozzle-page` command.
2017-07-05 07:31:14 +00:00
Vangelis Banos
89877670a4
--skip-extract-outlinks, --skip-visit-hashtags
...
Brozzler always did these actions. We make it possible to skip them with
this MR. Options are passed to `brozzler-worker`.
This feature is useful for tasks where we just need to retrieve a specific
page and we don't need to extract outlinks to continue crawling.
2017-07-04 21:50:05 +00:00
Noah Levitt
caee2787b0
have brozzler-list-sites --active use the index
2017-06-24 01:05:19 +00:00
Noah Levitt
4d7f4518b5
use %r instead of calling repr()
2017-06-07 13:07:42 -07:00
Noah Levitt
b4bf17df9b
do a better job of making sure to shut down the browser when brozzle-page is killed
2017-05-03 16:43:31 -07:00
Noah Levitt
389db01458
run BrozzlerWorkerThread separately from MainThread, to avoid SIGTERM/SIGINT raising an exception inside rethinkdb code or other sensitive code that BrozzlerWorker.run() calls
2017-05-01 13:46:19 -07:00
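A rough sketch of the pattern the commit above describes. Python delivers signal handlers in the main thread only, so moving the work into a separate thread means a handler can never interrupt it mid-call. All names below (worker_loop, start_worker, handle_signal, main) are hypothetical stand-ins, not brozzler's actual classes.

```python
import signal
import threading
import time

stop_requested = threading.Event()


def worker_loop():
    # Stand-in for the real work; it only observes the event and is never
    # interrupted by a signal handler, since handlers run in the main thread.
    while not stop_requested.is_set():
        time.sleep(0.05)


def handle_signal(signum, frame):
    # Runs in the main thread; just asks the worker to finish cleanly.
    stop_requested.set()


def start_worker():
    th = threading.Thread(target=worker_loop, name="WorkerThread")
    th.start()
    return th


def main():
    signal.signal(signal.SIGTERM, handle_signal)
    signal.signal(signal.SIGINT, handle_signal)
    th = start_worker()
    # Join with a timeout so the main thread stays responsive to signals.
    while th.is_alive():
        th.join(timeout=0.1)
```

Calling main() blocks until SIGTERM or SIGINT arrives, after which the worker exits at its next check of the event.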
Noah Levitt
ba519d7288
improve messaging when brozzler-stop-crawl is passed nonexistent seed/job id
2017-04-20 18:04:17 -07:00
Noah Levitt
87a7301f4d
make brozzle-page respect --proxy (no test for this!)
2017-04-17 18:11:09 -07:00
Noah Levitt
df7734f2ca
new command line utility brozzler-stop-crawl, with tests
2017-04-14 18:06:15 -07:00
Noah Levitt
fae60e9960
parameterize command line entry points and add tests of --version, a rudimentary check that the commands at least run
2017-04-14 11:46:26 -07:00
Noah Levitt
62917a6f1a
Revert "bump version number for last pull request"
...
This reverts commit d192fc269e.
2017-04-05 17:01:06 -07:00
Noah Levitt
d192fc269e
bump version number for last pull request
2017-04-05 16:15:24 -07:00
Noah Levitt
125d77b8c4
consolidate job.py and site.py into model.py, and let Job and Site share the elapsed() method by way of a mixin
2017-03-29 18:49:04 -07:00
Noah Levitt
a836269e95
remove some vestiges of old proxy stuff
2017-03-24 16:04:43 -07:00
Noah Levitt
0e35de43b6
actually respect --proxy and --warcprox-auto options to brozzler-worker
2017-03-24 22:27:52 +00:00
Noah Levitt
934190084c
Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see 8caae0d7d3), and enables warcprox features if so.
2017-03-24 13:55:23 -07:00
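A hedged sketch of the --proxy detection step described above. Only the http://{proxy_address}/status URL comes from the commit message; the assumption that the response is JSON with a "role" field equal to "warcprox" is illustrative, as are the function names and the timeout.

```python
import http.client
import json
import urllib.request


def status_url(proxy_address):
    return "http://%s/status" % proxy_address


def looks_like_warcprox(status_body):
    # Assumed response shape: a JSON object with "role": "warcprox".
    try:
        status = json.loads(status_body.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return False
    return isinstance(status, dict) and status.get("role") == "warcprox"


def proxy_is_warcprox(proxy_address, timeout=10):
    # Any connection or protocol failure is treated as "not warcprox".
    try:
        url = status_url(proxy_address)
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return looks_like_warcprox(response.read())
    except (OSError, http.client.HTTPException):
        return False
```

Keeping the parsing in looks_like_warcprox() separate from the network call makes the decision easy to test without a running proxy.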
Noah Levitt
aae810cc6e
fix brozzler-easy so that warcprox features are enabled automatically (feature was already there but broken)
2017-03-22 15:15:07 -07:00
Noah Levitt
94ba56dca5
actually implement the brozzler-list-jobs --job option
2017-03-17 11:14:45 -07:00
Noah Levitt
701f7654a8
make brozzler-list-* a little more intuitive, maybe
2017-03-16 13:01:41 -07:00
Noah Levitt
12fb9eaa15
use urlcanon library for canonicalization, surtification, scope match rules
2017-03-15 14:59:51 -07:00
Noah Levitt
569af05b11
rethinkstuff is now "doublethink"
2017-03-02 12:48:45 -08:00
Noah Levitt
700b08b7d7
use new rethinkstuff ORM
2017-02-28 16:12:50 -08:00
Noah Levitt
0d0da22613
brozzler-list-jobs --yaml
2017-02-16 10:20:36 -08:00
Noah Levitt
c0057e591a
add --yaml option to brozzler-list-* commands
2017-02-15 23:13:09 +00:00
Noah Levitt
5f4c5190da
improve TRACE level logging
2017-02-02 11:41:40 -08:00
Noah Levitt
8f5003b784
fix oops
2017-01-30 23:47:39 -08:00
Noah Levitt
c3b637d244
improve brozzler-dashboard logging; fix default wayback baseurl in brozzler dashboard ( https://github.com/internetarchive/brozzler/issues/31 ); tweak arg parsing related stuff
2017-01-20 23:41:59 -08:00
Noah Levitt
037723fe2b
support for BROZZLER_RETHINKDB_SERVERS and BROZZLER_RETHINKDB_DB environment variables, honored by all the brozzler-* commands
2017-01-13 20:27:09 +00:00
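One way the environment variables named above could feed shared command-line defaults is sketched below. The variable names come from the commit message; the option names, the fallback values, and the add_rethinkdb_options helper are assumptions for illustration.

```python
import argparse
import os


def add_rethinkdb_options(parser, environ=None):
    # environ is injectable for testing; defaults to the process environment.
    if environ is None:
        environ = os.environ
    parser.add_argument(
        "--rethinkdb-servers", dest="rethinkdb_servers",
        default=environ.get("BROZZLER_RETHINKDB_SERVERS", "localhost"),
        help="comma-separated list of rethinkdb servers")
    parser.add_argument(
        "--rethinkdb-db", dest="rethinkdb_db",
        default=environ.get("BROZZLER_RETHINKDB_DB", "brozzler"),
        help="rethinkdb database name")
```

With this arrangement an explicit command-line option takes precedence, the environment variable comes next, and the hard-coded fallback applies last.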
Noah Levitt
38d9eee68d
implement brozzler-list-pages
2017-01-12 08:22:45 +00:00
Noah Levitt
184612332e
new cli utils brozzler-list-jobs and brozzler-list-sites
2017-01-12 07:50:58 +00:00
Noah Levitt
64a0ea879a
implement sha1 lookup and url prefix lookup for brozzler-list-captures
2017-01-12 01:26:09 +00:00
Noah Levitt
86ac48d6c3
generalized support for login, with automatic detection of the login form on a page
2016-12-19 17:30:09 -08:00
Noah Levitt
c71854127d
major refactoring of browsing code to make it easier to add functionality
2016-12-15 16:42:45 -08:00
Noah Levitt
3c43fdaced
new utility brozzler-list-captures for looking up entries in the "captures" table
2016-11-30 00:52:14 +00:00
Noah Levitt
72816d1058
don't check robots.txt when scheduling a new site to be crawled, but mark the seed Page as needs_robots_check, and delegate the robots check to brozzler-worker; new test of robots.txt adherence
2016-11-16 12:23:59 -08:00
Noah Levitt
8e115b44fa
add --behavior-parameters argument to brozzler-new-site
2016-11-09 13:12:36 -08:00
Noah Levitt
9d66f294ec
move behavior_parameters into top level of site configuration
2016-11-07 18:16:04 -08:00
Barbara Miller
6c7f88c171
initial login additions
2016-11-02 16:04:18 -07:00