28 Commits

Author SHA1 Message Date
Adam Miller
0f72233f3b Adding support for hop path information to be stored and passed along to warcprox 2021-08-31 19:44:55 +00:00
Noah Levitt
7915220ab7 consider page completed after 3 failures
https://github.com/internetarchive/brozzler/pull/183#issuecomment-560562807

"We've had a number of cases where a page kept failing for one reason or
another, and it's bad. We can end up with tons of duplicate captures,
the crawl is not able to make progress, and the overall performance of
the cluster is impacted in cases like yours, where a browser is sitting
there doing nothing for five minutes."
2019-12-04 12:38:22 -08:00
Noah Levitt
d729c8d0d5 use yaml.safe_load()
getting new warnings
see https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
2019-03-18 15:49:44 -07:00
Noah Levitt
a74f46dc53 least surprise on http/https seed redirects
if http://foo.com/ redirects to https://foo.com/a/b/c let's also
put all of https://foo.com/ in scope
2018-12-21 15:17:31 -08:00
Noah Levitt
db62402be8 fix tests 2018-11-27 14:35:00 -08:00
Noah Levitt
e7d2273856 fix failing tests 2018-08-16 11:40:54 -07:00
Noah Levitt
c52c16c260 fix bug in test, add another one 2018-06-22 16:10:23 -05:00
Noah Levitt
aeb7c3f825 treat any error fetching robots.txt as "allow all" 2018-06-22 14:50:57 -05:00
Noah Levitt
05f8ab3495 fix more tests for new approach sans scope['surt'] 2018-05-14 15:38:28 -07:00
Noah Levitt
bf5401283e new test test_needs_browsing
currently exposes bug in resolving "location" response header
2018-01-26 10:59:18 -08:00
Noah Levitt
d514eaec15 even more, better failing tests for thread_raise 2017-05-16 14:00:10 -07:00
Noah Levitt
d2525e2e87 failing test for forthcoming behavior of thread_raise 2017-05-15 16:20:20 -07:00
Noah Levitt
0953e6972e refactor thread_raise safety to use a context manager 2017-04-24 19:51:51 -07:00
Noah Levitt
7706bab8b8 safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such 2017-04-20 17:08:16 -07:00
Noah Levitt
5603ff5380 have _warcprox_write_record also raise ProxyError when appropriate, and test this 2017-04-18 16:58:51 -07:00
Noah Levitt
ac972d399f fix robots.txt proxy down test by setting site.id (cached robots is stored by site.id, and other tests that ran earlier with no site.id were interfering); and test another kind of connection error, for whatever that's worth 2017-04-18 12:00:23 -07:00
Noah Levitt
dc43794363 raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch 2017-04-17 18:15:22 -07:00
Noah Levitt
0884b4cd56 bubble up proxy errors fetching robots.txt, with unit test, and documentation 2017-04-17 16:47:05 -07:00
Noah Levitt
12fb9eaa15 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 14:59:51 -07:00
Noah Levitt
40bbbb3524 add tests of backwards compatibility handling of start/stop times and fix a bug or two 2017-03-02 16:53:24 -08:00
Noah Levitt
569af05b11 rethinkstuff is now "doublethink" 2017-03-02 12:48:45 -08:00
Noah Levitt
2398031010 let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port 2017-02-22 12:44:19 -08:00
Noah Levitt
b409e49cfa deprecate current scope rule syntax and create new syntax with slightly different semantics (to be documented), and add parent_url_regex scope rule; unit test for scoping 2017-02-15 16:46:45 -08:00
Noah Levitt
86d6060a2d loosen the find_available_port test slightly, since it seems to be not 100% predictable for reasons i haven't investigated 2016-12-20 17:52:21 -08:00
Noah Levitt
9bcec54f4b fix _find_available_port and its unit test 2016-12-07 14:08:34 -08:00
Noah Levitt
eed8b9ec30 little fixes 2016-12-07 11:20:10 -08:00
Noah Levitt
ce03381b92 move _find_available_ports to chrome.py, changing the way it works so that browser:9200 doesn't get stuck at 9201 forever, which pushes 9201 to 9202 etc, and add a unit test 2016-12-06 17:12:20 -08:00
Noah Levitt
3aead6de93 monkey-patch reppy to support substring user-agent matching 2016-11-16 11:41:34 -08:00
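
Note on 2398031010 ("let the OS pick an available port"): the general technique that message describes can be sketched as below. This is a minimal illustration, not brozzler's actual code. The helper name `find_available_port` and the default host are assumptions made for this example. Binding to port 0 asks the operating system to assign a currently unused port, so the caller does not have to scan a port range itself.

```python
import socket


def find_available_port(host="127.0.0.1"):
    """Return a TCP port the OS reports as free (hypothetical helper).

    Binding to port 0 tells the OS to choose any unused port.
    The socket is closed before returning, so another process could
    in principle claim the port before the caller uses it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (address, port) for AF_INET sockets
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(find_available_port())
```

Because of the small window between closing the probe socket and reusing the port, it is usually safer, where the launched program supports it, to let that program bind port 0 itself and report the port it was given.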