779 Commits

Author SHA1 Message Date
Noah Levitt
bdc0badec3 rewrite frontier.scope_and_schedule_outlinks() to use batch rethinkdb queries, because we have witnessed the method running for hours(!) 2017-05-26 13:24:14 -07:00
Noah Levitt
d904daea9c remove stray logging 2017-05-24 11:36:06 -07:00
Noah Levitt
ac543ee5b6 use "ttl" for updated doublethink svc reg api 2017-05-23 11:33:04 -07:00
Noah Levitt
89e7c8b079 fix exception from ReachedLimit.__repr__ when it has been instantiated implicitly and __init__ was not called 2017-05-16 15:47:18 -07:00
Noah Levitt
31dc6a2d97 improve thread_raise() so that the new tests pass
1. If thread is not currently accepting exceptions, queue it and raise if and
   when it does start accepting them. This fixes problem of thread_raise
   exceptions being ignored when raised just before the target thread starts
   accepting exceptions.
2. Avoid problems caused by raising multiple exceptions in the same
   thread in quick succession by ensuring that only one is actually raised for
   a given `with` block. This type of occurrence had been putting brozzler into
   a borked/frozen state.
2017-05-16 14:20:53 -07:00
Noah Levitt
d514eaec15 even more, better failing tests for thread_raise 2017-05-16 14:00:10 -07:00
Noah Levitt
d2525e2e87 failing test for forthcoming behavior of thread_raise 2017-05-15 16:20:20 -07:00
Noah Levitt
60c5a7c1c4 recognize ConnectionError (of which ConnectionResetError is a subclass) in _warcprox_write_record as a proxy error 2017-05-12 10:03:53 -07:00
Barbara Miller
054625b8a5 Merge pull request #40 from BitBaron/ari-4960
Crawl Google Calendar for fortstjames.ca
2017-05-09 14:12:48 -07:00
Noah Levitt
b4bf17df9b do a better job of making sure to shut down the browser when brozzle-page is killed 2017-05-03 16:43:31 -07:00
Noah Levitt
9d4cbbf6eb handle another rethinkdb outage corner case 2017-05-01 14:12:43 -07:00
Noah Levitt
389db01458 BrozzlerWorkerThread separate from MainThread to avoid SIGTERM/SIGINT raising exception inside of some rethinkdb code or other sensitive code that BrozzlerWorker.run() calls 2017-05-01 13:46:19 -07:00
Noah Levitt
52433ade78 re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker 2017-05-01 13:00:04 -07:00
Noah Levitt
000d40c4dc Merge pull request #39 from bnewbold/bnewbold-pr-template
add a github PR template for this repo
2017-04-26 14:34:32 -07:00
bnewbold
83552eb444 add a github PR template for this repo 2017-04-26 14:10:24 -07:00
Noah Levitt
d972919db0 Merge pull request #36 from nlevitt/safe-thread-raise
safen up brozzler.thread_raise() to avoid interrupting rethinkdb tran…
2017-04-26 11:15:02 -07:00
Noah Levitt
27ee8d53f8 Merge pull request #38 from ato/headless-doc
update headless chrome instructions for regular chrome builds
2017-04-25 09:39:43 -07:00
Alex Osborne
69aba8b762 update headless chrome instructions for regular chrome builds
Also make it clearer that this hasn't been tested much.
2017-04-25 15:00:25 +10:00
Noah Levitt
dcf4811470 Merge branch 'master' into safe-thread-raise 2017-04-24 20:06:37 -07:00
Noah Levitt
d916b68ab9 use the new api with brozzler.thread_accept_exceptions() 2017-04-24 20:02:34 -07:00
Noah Levitt
0953e6972e refactor thread_raise safety to use a context manager 2017-04-24 19:51:51 -07:00
Noah Levitt
f140e5bdbd allow this stupid test to fail 2017-04-21 12:17:11 -07:00
Noah Levitt
ba519d7288 improve messaging when brozzler-stop-crawl is passed nonexistent seed/job id 2017-04-20 18:04:17 -07:00
Noah Levitt
7706bab8b8 safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such 2017-04-20 17:08:16 -07:00
Noah Levitt
b3fa7a4e39 quote that shell meta character 2017-04-18 18:46:59 -07:00
Noah Levitt
426916a238 need warcprox in python path for travis tests now 2017-04-18 18:10:18 -07:00
Noah Levitt
8256a34b4f implement resilience to warcprox outage, i.e. deal with brozzler.ProxyError in brozzler-worker 2017-04-18 17:54:12 -07:00
Noah Levitt
5603ff5380 have _warcprox_write_record also raise ProxyError when appropriate, and test this 2017-04-18 16:58:51 -07:00
Neil Minton
f541dce5c3 Crawl Google Calendar for fortstjames.ca 2017-04-18 15:22:33 -07:00
Noah Levitt
ac972d399f fix robots.txt proxy down test by setting site.id (cached robots is stored by site.id, and other tests that ran earlier with no site.id were interfering); and test another kind of connection error, for whatever that's worth 2017-04-18 12:00:23 -07:00
Noah Levitt
dc43794363 raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch 2017-04-17 18:15:22 -07:00
Noah Levitt
349b41ab32 raise new exception brozzler.ProxyError in case of proxy error browsing a page 2017-04-17 18:14:02 -07:00
Noah Levitt
87a7301f4d make brozzle-page respect --proxy (no test for this!) 2017-04-17 18:11:09 -07:00
Noah Levitt
0e90950de2 oops, version bump for previous commit 2017-04-17 18:10:56 -07:00
Noah Levitt
0884b4cd56 bubble up proxy errors fetching robots.txt, with unit test, and documentation 2017-04-17 16:47:05 -07:00
Noah Levitt
df7734f2ca new command line utility brozzler-stop-crawl, with tests 2017-04-14 18:06:15 -07:00
Noah Levitt
fae60e9960 parameterize command line entry points and add tests of --version, a rudimentary check that the commands at least run 2017-04-14 11:46:26 -07:00
Noah Levitt
b3cf746f53 stupid version number bump 2017-04-05 17:01:52 -07:00
Noah Levitt
62917a6f1a Revert "bump version number for last pull request"
This reverts commit d192fc269eddeb8b06888e95bb6e4a6639e34415.
2017-04-05 17:01:06 -07:00
Noah Levitt
d192fc269e bump version number for last pull request 2017-04-05 16:15:24 -07:00
Barbara Miller
537eb1cf7f Merge pull request #34 from galgeek/ARI-5193
mouseover for ky.gov sites
2017-04-05 16:13:57 -07:00
Noah Levitt
5bcd10c228 extract area/@href links, and add test for outlink extraction 2017-04-05 12:09:48 -07:00
Barbara Miller
847b68eaf4 add JIRA info 2017-04-04 15:52:03 -07:00
Barbara Miller
901321199c mouseover for ky.gov sites 2017-03-31 15:48:01 -07:00
Noah Levitt
d4d3ef4fd3 ugh fix version number 2017-03-30 17:53:36 -07:00
Noah Levitt
125d77b8c4 consolidate job.py and site.py into model.py, and let Job and Site share the elapsed() method by way of a mixin 2017-03-29 18:49:04 -07:00
Noah Levitt
3d47805ec1 new model for crawling hashtags, each one is no longer a top-level page 2017-03-27 12:15:49 -07:00
Noah Levitt
a836269e95 remove some vestiges of old proxy stuff 2017-03-24 16:04:43 -07:00
Noah Levitt
a826fdc7ef new test of frontier.seed_page 2017-03-24 15:45:40 -07:00
Noah Levitt
0e35de43b6 actually respect --proxy and --warcprox-auto options to brozzler-worker 2017-03-24 22:27:52 +00:00