Noah Levitt
7915220ab7
consider page completed after 3 failures
...
https://github.com/internetarchive/brozzler/pull/183#issuecomment-560562807
"We've had a number of cases where a page kept failing for one reason or
another, and it's bad. We can end up with tons of duplicate captures,
the crawl is not able to make progress, and the overall performance of
the cluster is impacted in cases like yours, where a browser is sitting
there doing nothing for five minutes."
2019-12-04 12:38:22 -08:00
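A minimal sketch of the "completed after 3 failures" rule described in this commit. The dict-based page record, the field names `failed_attempts` and `completed`, and the helper `record_failure` are hypothetical illustrations, not brozzler's actual data model.

```python
MAX_FAILED_ATTEMPTS = 3  # threshold taken from the commit subject


def record_failure(page):
    """Count one failed attempt; mark the page completed at the limit.

    `page` is a plain dict here (an assumption made for illustration).
    """
    page["failed_attempts"] = page.get("failed_attempts", 0) + 1
    if page["failed_attempts"] >= MAX_FAILED_ATTEMPTS:
        # Stop retrying so the crawl can move on to other pages.
        page["completed"] = True
    return page


if __name__ == "__main__":
    page = {"url": "http://example.com/"}
    for _ in range(3):
        record_failure(page)
    print(page)
```

Capping retries this way trades a possibly missed capture for steady crawl progress, which matches the motivation quoted in the commit body.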
Noah Levitt
d729c8d0d5
use yaml.safe_load()
...
getting new warnings
see https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
2019-03-18 15:49:44 -07:00
Noah Levitt
a74f46dc53
least surprise on http/https seed redirects
...
if http://foo.com/ redirects to https://foo.com/a/b/c let's also
put all of https://foo.com/ in scope
2018-12-21 15:17:31 -08:00
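One possible reading of the rule in this commit, sketched with the standard library only. The function name `extra_scope_prefix` and the exact matching conditions (same host, scheme switched between http and https) are assumptions; the real change works on brozzler's scope rules, which are not shown in this log.

```python
from urllib.parse import urlsplit


def extra_scope_prefix(seed_url, redirect_url):
    """Return an additional in-scope URL prefix, or None.

    Assumed rule: if the seed redirects to the same host under the other
    of http/https, also accept the seed's own path under the new scheme.
    """
    seed = urlsplit(seed_url)
    target = urlsplit(redirect_url)
    same_host = seed.hostname == target.hostname
    scheme_switched = {seed.scheme, target.scheme} == {"http", "https"}
    if same_host and scheme_switched:
        return f"{target.scheme}://{target.netloc}{seed.path or '/'}"
    return None


if __name__ == "__main__":
    print(extra_scope_prefix("http://foo.com/", "https://foo.com/a/b/c"))
```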
Noah Levitt
db62402be8
fix tests
2018-11-27 14:35:00 -08:00
Noah Levitt
e7d2273856
fix failing tests
2018-08-16 11:40:54 -07:00
Noah Levitt
c52c16c260
fix bug in test, add another one
2018-06-22 16:10:23 -05:00
Noah Levitt
aeb7c3f825
treat any error fetching robots.txt as "allow all"
2018-06-22 14:50:57 -05:00
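A rough sketch of the "allow all on error" policy, using the standard library's `urllib.robotparser` rather than the reppy library mentioned elsewhere in this log. The function `is_permitted` and its injected `fetch_robots_txt` callable are hypothetical; they exist only to keep the example free of network access.

```python
import urllib.robotparser


def is_permitted(url, user_agent, fetch_robots_txt):
    """Check robots.txt, treating any fetch error as "allow all".

    `fetch_robots_txt` is a caller-supplied callable that returns the
    robots.txt body as text (an assumption made for this sketch).
    """
    try:
        body = fetch_robots_txt()
    except Exception:
        # Any error fetching robots.txt means everything is allowed.
        return True
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(body.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    def broken_fetch():
        raise ConnectionError("robots.txt unreachable")

    print(is_permitted("http://example.com/private/x", "bot", broken_fetch))
```

Earlier entries in this log raise brozzler.ProxyError for proxy failures. Whether that case is exempt from the allow-all rule is not stated here, and the sketch does not model it.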
Noah Levitt
05f8ab3495
fix more tests for new approach sans scope['surt']
2018-05-14 15:38:28 -07:00
Noah Levitt
bf5401283e
new test test_needs_browsing
...
currently exposes bug in resolving "location" response header
2018-01-26 10:59:18 -08:00
Noah Levitt
d514eaec15
even more, better failing tests for thread_raise
2017-05-16 14:00:10 -07:00
Noah Levitt
d2525e2e87
failing test for forthcoming behavior of thread_raise
2017-05-15 16:20:20 -07:00
Noah Levitt
0953e6972e
refactor thread_raise safety to use a context manager
2017-04-24 19:51:51 -07:00
Noah Levitt
7706bab8b8
safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such
2017-04-20 17:08:16 -07:00
Noah Levitt
5603ff5380
have _warcprox_write_record also raise ProxyError when appropriate, and test this
2017-04-18 16:58:51 -07:00
Noah Levitt
ac972d399f
fix robots.txt proxy down test by setting site.id (cached robots is stored by site.id, and other tests that ran earlier with no site.id were interfering); and test another kind of connection error, for whatever that's worth
2017-04-18 12:00:23 -07:00
Noah Levitt
dc43794363
raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch
2017-04-17 18:15:22 -07:00
Noah Levitt
0884b4cd56
bubble up proxy errors fetching robots.txt, with unit test, and documentation
2017-04-17 16:47:05 -07:00
Noah Levitt
12fb9eaa15
use urlcanon library for canonicalization, surtification, scope match rules
2017-03-15 14:59:51 -07:00
Noah Levitt
40bbbb3524
add tests of backwards compatibility handling of start/stop times and fix a bug or two
2017-03-02 16:53:24 -08:00
Noah Levitt
569af05b11
rethinkstuff is now "doublethink"
2017-03-02 12:48:45 -08:00
Noah Levitt
2398031010
let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port
2017-02-22 12:44:19 -08:00
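Letting the OS pick a port usually means binding to port 0 and reading back the assigned number. The helper below is a generic sketch of that technique with a hypothetical name, `pick_free_port`; it is not the code from this commit.

```python
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 tells the OS to choose
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(pick_free_port())
```

The port is released when the socket closes, so another process could still claim it before the browser starts. This narrows the race compared with scanning a fixed range, but does not remove it.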
Noah Levitt
b409e49cfa
deprecate current scope rule syntax and create new syntax with slightly different semantics (to be documented), and add parent_url_regex scope rule; unit test for scoping
2017-02-15 16:46:45 -08:00
Noah Levitt
86d6060a2d
loosen the find_available_port test slightly, since it seems to be not 100% predictable for reasons i haven't investigated
2016-12-20 17:52:21 -08:00
Noah Levitt
9bcec54f4b
fix _find_available_port and its unit test
2016-12-07 14:08:34 -08:00
Noah Levitt
eed8b9ec30
little fixes
2016-12-07 11:20:10 -08:00
Noah Levitt
ce03381b92
move _find_available_ports to chrome.py, changing the way it works so that browser:9200 doesn't get stuck at 9201 forever, which pushes 9201 to 9202 etc, and add a unit test
2016-12-06 17:12:20 -08:00
Noah Levitt
3aead6de93
monkey-patch reppy to support substring user-agent matching
2016-11-16 11:41:34 -08:00