58 Commits

Author SHA1 Message Date
Barbara Miller
6a0b0b058d updates post-walkthru 2024-09-23 18:59:33 -07:00
Alex Dempsey
8b23430a87 Use black, enforce with GitHub Actions 2024-02-08 12:07:41 -08:00
Noah Levitt
7915220ab7 consider page completed after 3 failures
https://github.com/internetarchive/brozzler/pull/183#issuecomment-560562807

"We've had a number of cases where a page kept failing for one reason or
another, and it's bad. We can end up with tons of duplicate captures,
the crawl is not able to make progress, and the overall performance of
the cluster is impacted in cases like yours, where a browser is sitting
there doing nothing for five minutes."
2019-12-04 12:38:22 -08:00
Noah Levitt
433b201b52 use logging.warning() to quiet py37 warnings 2019-04-09 01:43:38 -07:00
Noah Levitt
d729c8d0d5 use yaml.safe_load()
getting new warnings
see https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
2019-03-18 15:49:44 -07:00
Barbara Miller
e2b2542d4a handle http auth (#138)
abort brozzling on interstitial (auth dialog)

because we have no other recourse at this point. waiting on Network.requestIntercepted auth challenge support. (didn't work in our latest testing)
https://chromedevtools.github.io/devtools-protocol/tot/Network#type-AuthChallengeResponse
2018-11-16 15:10:30 -08:00
Noah Levitt
d0f5cd7168 tweak logging 2018-08-31 15:23:48 -07:00
Barbara Miller
745e6cc942 log behavior params better 2018-03-19 16:28:14 -07:00
jkafader
7d61673d3e Merge pull request #97 from nlevitt/max-claimed-sites
Max claimed sites
2018-03-08 16:48:31 -08:00
Noah Levitt
d7512fbeb6 move time limit enforcement
now it's next to stop request enforcement which makes more sense and
supports more timely action
2018-03-01 11:28:30 -08:00
Vangelis Banos
ce473897a3 Disable Jinja2 template auto_reload for higher performance
Every time we run a JS behavior, we load a Jinja2 template.
By default, Jinja2 has option `auto_reload=True`. This means that
every time a template is requested the loader checks if the source file changed
and if yes, it will reload the template. For higher performance it’s possible
to disable that.

Also note that Jinja caches 400 templates by default.

Ref: http://jinja.pocoo.org/docs/2.10/api/

In Brozzler, we don't make changes to JS templates while the system is
running. So, there is no point in having auto_reload=True.
2018-02-25 20:24:25 +00:00
Noah Levitt
0faeaab3ac fix attempt for deadlock-ish situation
see https://github.com/internetarchive/brozzler/issues/91
2018-02-13 17:09:28 -08:00
Vangelis Banos
3984ca017f Replace cwd var with d 2018-01-09 06:33:03 +00:00
Vangelis Banos
dacfba330c Configurable JS templates location
Brozzler hard-codes the JS template locations as ``brozzler/behaviors.yaml``
and ``brozzler/js-templates/``. With this change, you can pass the optional
``behaviors_dir`` parameter to ``browser.browse_page`` to set a custom
location and load JS behaviors from there.
2018-01-04 17:37:02 +00:00
Noah Levitt
4d7f4518b5 use %r instead of calling repr() 2017-06-07 13:07:42 -07:00
Noah Levitt
d904daea9c remove stray logging 2017-05-24 11:36:06 -07:00
Noah Levitt
89e7c8b079 fix exception from ReachedLimit.__repr__ when it has been instantiated implicitly and __init__ was not called 2017-05-16 15:47:18 -07:00
Noah Levitt
31dc6a2d97 improve thread_raise() so that the new tests pass
1. If thread is not currently accepting exceptions, queue it and raise if and
   when it does start accepting them. This fixes problem of thread_raise
   exceptions being ignored when raised just before the target thread starts
   accepting exceptions.
2. Avoid problems caused by raising multiple exceptions in the same
   thread in quick succession by ensuring that only one is actually raised for
   a given `with` block. This type of occurrence had been putting brozzler into
   a borked/frozen state.
2017-05-16 14:20:53 -07:00
Noah Levitt
0953e6972e refactor thread_raise safety to use a context manager 2017-04-24 19:51:51 -07:00
Noah Levitt
7706bab8b8 safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such 2017-04-20 17:08:16 -07:00
Noah Levitt
349b41ab32 raise new exception brozzler.ProxyError in case of proxy error browsing a page 2017-04-17 18:14:02 -07:00
Noah Levitt
df7734f2ca new command line utility brozzler-stop-crawl, with tests 2017-04-14 18:06:15 -07:00
Noah Levitt
125d77b8c4 consolidate job.py and site.py into model.py, and let Job and Site share the elapsed() method by way of a mixin 2017-03-29 18:49:04 -07:00
Noah Levitt
eeee523b18 three-value "brozzled" parameter for frontier.site_pages(); fix thing where every Site got a list of all the seeds from the job; and some more frontier tests to catch these kinds of things 2017-03-20 17:28:16 -07:00
Noah Levitt
12fb9eaa15 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 14:59:51 -07:00
Noah Levitt
700b08b7d7 use new rethinkstuff ORM 2017-02-28 16:12:50 -08:00
Noah Levitt
129a1e8f47 use underscore convention 2017-02-02 11:52:19 -08:00
Noah Levitt
5f4c5190da improve TRACE level logging 2017-02-02 11:41:40 -08:00
Noah Levitt
ed2d58d87d stopgap fix for problem where an attempt to save a screenshot of a url with a hash tag containing spaces or non-ascii characters would fail, causing the whole brozzle of the page to fail, and end up in a retry loop (better handling of hash tags is planned which will obviate this change) 2017-02-01 22:39:12 +00:00
Noah Levitt
f7427219cf restore handling of "aw snap" or "he's dead jim" 2016-12-21 14:21:20 -08:00
Noah Levitt
a0b61408b9 convert behaviors to jinja2, move them to new subdir js-templates, along with javascript previously stored as a string in browser.py 2016-12-20 16:33:25 -08:00
Noah Levitt
86ac48d6c3 generalized support for login, with automatic detection of the login form on a page 2016-12-19 17:30:09 -08:00
Noah Levitt
c71854127d major refactoring of browsing code to make it easier to add functionality 2016-12-15 16:42:45 -08:00
Noah Levitt
5bd4908e1d punycode host part of url to avoid errors doing WARCPROX_WRITE_RECORD 2016-10-26 13:50:23 -07:00
Noah Levitt
c902a70450 tweak thread names 2016-07-19 14:33:57 -05:00
Noah Levitt
479713e25b --trace level logging 2016-06-29 18:29:45 -05:00
Noah Levitt
df61e55b6b add license headers 2016-04-25 20:02:11 +00:00
Noah Levitt
b06381790c honor crawl job stop requests 2016-03-08 00:18:54 +00:00
Adam Miller
20bde1c482 uncommented init imports, removed required job_id in Frontier.finished 2015-10-22 22:29:24 +00:00
Noah Levitt
a94dfd27f8 oops, set brozzler.__version__ 2015-09-24 00:34:51 +00:00
Noah Levitt
8c69ca3b39 giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing 2015-09-24 00:17:33 +00:00
Noah Levitt
40522ef5a5 fix some rethinkdb related stuff; most notably r.desc() and related stuff don't currently work correctly if r is a Rethinker, so use rethinkdb directly in that case 2015-09-23 01:53:05 +00:00
Noah Levitt
c682627aec Rethinker moved to pyrethink library 2015-09-16 19:24:17 +00:00
Noah Levitt
92a288bc35 detect jobs finishing! (not well tested yet) 2015-09-09 22:11:48 +00:00
Noah Levitt
5fe2805285 fix bug claiming site, looks like there could be a race condition with other worker claiming the same site 2015-09-04 01:36:29 +00:00
Noah Levitt
3c23aa8fd4 finally, the jobs table 2015-09-03 01:05:03 +00:00
Noah Levitt
f334107b47 support for specifying rethinkdb database name; wrap rethinkdb operations and retry if appropriate (as best as we can tell) 2015-08-28 00:37:26 +00:00
Noah Levitt
efa640c640 refactor to simplify starting new job from code 2015-08-25 19:52:33 +00:00
Noah Levitt
b8506a2ab4 rename "db" to "frontier" 2015-08-19 17:47:05 +00:00
Noah Levitt
a878730e02 goodbye sqlite and rabbitmq, hello rethinkdb 2015-08-18 21:44:54 +00:00