Noah Levitt
|
d2525e2e87
|
failing test for forthcoming behavior of thread_raise
|
2017-05-15 16:20:20 -07:00 |
|
Noah Levitt
|
60c5a7c1c4
|
recognize ConnectionError (of which ConnectionResetError is a subclass) in _warcprox_write_record as a proxy error
|
2017-05-12 10:03:53 -07:00 |
|
Barbara Miller
|
054625b8a5
|
Merge pull request #40 from BitBaron/ari-4960
Crawl Google Calendar for fortstjames.ca
|
2017-05-09 14:12:48 -07:00 |
|
Noah Levitt
|
b4bf17df9b
|
do a better job of making sure to shut down the browser when brozzle-page is killed
|
2017-05-03 16:43:31 -07:00 |
|
Noah Levitt
|
9d4cbbf6eb
|
handle another rethinkdb outage corner case
|
2017-05-01 14:12:43 -07:00 |
|
Noah Levitt
|
389db01458
|
BrozzlerWorkerThread separate from MainThread to avoid SIGTERM/SIGINT raising an exception inside some rethinkdb code or other sensitive code that BrozzlerWorker.run() calls
|
2017-05-01 13:46:19 -07:00 |
|
Noah Levitt
|
52433ade78
|
re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker
|
2017-05-01 13:00:04 -07:00 |
|
Noah Levitt
|
000d40c4dc
|
Merge pull request #39 from bnewbold/bnewbold-pr-template
add a github PR template for this repo
|
2017-04-26 14:34:32 -07:00 |
|
bnewbold
|
83552eb444
|
add a github PR template for this repo
|
2017-04-26 14:10:24 -07:00 |
|
Noah Levitt
|
d972919db0
|
Merge pull request #36 from nlevitt/safe-thread-raise
safen up brozzler.thread_raise() to avoid interrupting rethinkdb tran…
|
2017-04-26 11:15:02 -07:00 |
|
Noah Levitt
|
27ee8d53f8
|
Merge pull request #38 from ato/headless-doc
update headless chrome instructions for regular chrome builds
|
2017-04-25 09:39:43 -07:00 |
|
Alex Osborne
|
69aba8b762
|
update headless chrome instructions for regular chrome builds
Also make it clearer that this hasn't been tested much.
|
2017-04-25 15:00:25 +10:00 |
|
Noah Levitt
|
dcf4811470
|
Merge branch 'master' into safe-thread-raise
|
2017-04-24 20:06:37 -07:00 |
|
Noah Levitt
|
d916b68ab9
|
use the new api with brozzler.thread_accept_exceptions()
|
2017-04-24 20:02:34 -07:00 |
|
Noah Levitt
|
0953e6972e
|
refactor thread_raise safety to use a context manager
|
2017-04-24 19:51:51 -07:00 |
|
Noah Levitt
|
f140e5bdbd
|
allow this stupid test to fail
|
2017-04-21 12:17:11 -07:00 |
|
Noah Levitt
|
ba519d7288
|
improve messaging when brozzler-stop-crawl is passed nonexistent seed/job id
|
2017-04-20 18:04:17 -07:00 |
|
Noah Levitt
|
7706bab8b8
|
safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such
|
2017-04-20 17:08:16 -07:00 |
|
Noah Levitt
|
b3fa7a4e39
|
quote that shell meta character
|
2017-04-18 18:46:59 -07:00 |
|
Noah Levitt
|
426916a238
|
need warcprox in python path for travis tests now
|
2017-04-18 18:10:18 -07:00 |
|
Noah Levitt
|
8256a34b4f
|
implement resilience to warcprox outage, i.e. deal with brozzler.ProxyError in brozzler-worker
|
2017-04-18 17:54:12 -07:00 |
|
Noah Levitt
|
5603ff5380
|
have _warcprox_write_record also raise ProxyError when appropriate, and test this
|
2017-04-18 16:58:51 -07:00 |
|
Neil Minton
|
f541dce5c3
|
Crawl Google Calendar for fortstjames.ca
|
2017-04-18 15:22:33 -07:00 |
|
Noah Levitt
|
ac972d399f
|
fix robots.txt proxy down test by setting site.id (cached robots is stored by site.id, and other tests that ran earlier with no site.id were interfering); and test another kind of connection error, for whatever that's worth
|
2017-04-18 12:00:23 -07:00 |
|
Noah Levitt
|
dc43794363
|
raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch
|
2017-04-17 18:15:22 -07:00 |
|
Noah Levitt
|
349b41ab32
|
raise new exception brozzler.ProxyError in case of proxy error browsing a page
|
2017-04-17 18:14:02 -07:00 |
|
Noah Levitt
|
87a7301f4d
|
make brozzle-page respect --proxy (no test for this!)
|
2017-04-17 18:11:09 -07:00 |
|
Noah Levitt
|
0e90950de2
|
oops, version bump for previous commit
|
2017-04-17 18:10:56 -07:00 |
|
Noah Levitt
|
0884b4cd56
|
bubble up proxy errors fetching robots.txt, with unit test, and documentation
|
2017-04-17 16:47:05 -07:00 |
|
Noah Levitt
|
df7734f2ca
|
new command line utility brozzler-stop-crawl, with tests
|
2017-04-14 18:06:15 -07:00 |
|
Noah Levitt
|
fae60e9960
|
parameterize command line entry points and add tests of --version, a rudimentary check that the commands at least run
|
2017-04-14 11:46:26 -07:00 |
|
Noah Levitt
|
b3cf746f53
|
stupid version number bump
|
2017-04-05 17:01:52 -07:00 |
|
Noah Levitt
|
62917a6f1a
|
Revert "bump version number for last pull request"
This reverts commit d192fc269eddeb8b06888e95bb6e4a6639e34415.
|
2017-04-05 17:01:06 -07:00 |
|
Noah Levitt
|
d192fc269e
|
bump version number for last pull request
|
2017-04-05 16:15:24 -07:00 |
|
Barbara Miller
|
537eb1cf7f
|
Merge pull request #34 from galgeek/ARI-5193
mouseover for ky.gov sites
|
2017-04-05 16:13:57 -07:00 |
|
Noah Levitt
|
5bcd10c228
|
extract area/@href links, and add test for outlink extraction
|
2017-04-05 12:09:48 -07:00 |
|
Barbara Miller
|
847b68eaf4
|
add JIRA info
|
2017-04-04 15:52:03 -07:00 |
|
Barbara Miller
|
901321199c
|
mouseover for ky.gov sites
|
2017-03-31 15:48:01 -07:00 |
|
Noah Levitt
|
d4d3ef4fd3
|
ugh fix version number
|
2017-03-30 17:53:36 -07:00 |
|
Noah Levitt
|
125d77b8c4
|
consolidate job.py and site.py into model.py, and let Job and Site share the elapsed() method by way of a mixin
|
2017-03-29 18:49:04 -07:00 |
|
Noah Levitt
|
3d47805ec1
|
new model for crawling hashtags, each one is no longer a top-level page
|
2017-03-27 12:15:49 -07:00 |
|
Noah Levitt
|
a836269e95
|
remove some vestiges of old proxy stuff
|
2017-03-24 16:04:43 -07:00 |
|
Noah Levitt
|
a826fdc7ef
|
new test of frontier.seed_page
|
2017-03-24 15:45:40 -07:00 |
|
Noah Levitt
|
0e35de43b6
|
actually respect --proxy and --warcprox-auto options to brozzler-worker
|
2017-03-24 22:27:52 +00:00 |
|
Noah Levitt
|
934190084c
|
Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see https://github.com/internetarchive/warcprox/commit/8caae0d7d3), and enables warcprox features if so.
|
2017-03-24 13:55:23 -07:00 |
|
Noah Levitt
|
9a2f181eb6
|
back to a dev version number
|
2017-03-22 16:12:39 -07:00 |
|
Noah Levitt
|
613dca29dc
|
1.1b10 since 1.1b9 has bugs :(
1.1b10
|
2017-03-22 16:11:26 -07:00 |
|
Noah Levitt
|
4ba25db684
|
ugh, avoid infinite recursion
|
2017-03-22 15:53:58 -07:00 |
|
Noah Levitt
|
34bb64297f
|
fix frontier tests now that enable_warcprox_features is simply omitted by default
|
2017-03-22 15:46:12 -07:00 |
|
Noah Levitt
|
4aa611af52
|
i dub thee 1.1b9
1.1b9
|
2017-03-22 15:25:55 -07:00 |
|