Barbara Miller
6a0b0b058d
updates post-walkthru
2024-09-23 18:59:33 -07:00
Alex Dempsey
8b23430a87
Use black, enforce with GitHub Actions
2024-02-08 12:07:41 -08:00
Noah Levitt
7915220ab7
consider page completed after 3 failures
...
https://github.com/internetarchive/brozzler/pull/183#issuecomment-560562807
"We've had a number of cases where a page kept failing for one reason or
another, and it's bad. We can end up with tons of duplicate captures,
the crawl is not able to make progress, and the overall performance of
the cluster is impacted in cases like yours, where a browser is sitting
there doing nothing for five minutes."
2019-12-04 12:38:22 -08:00
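A minimal sketch of the rule introduced by 7915220ab7 (treat a page as completed once it has failed 3 times, so the crawl can move on). The dict fields `failed_attempts` and `completed` and the helper `record_failure` are hypothetical names used for illustration; they are not taken from brozzler's code.

```python
MAX_FAILED_ATTEMPTS = 3  # threshold named in the commit subject


def record_failure(page):
    """Count one failed attempt; return True once the page should be
    treated as completed (hypothetical helper, not brozzler's API)."""
    page["failed_attempts"] = page.get("failed_attempts", 0) + 1
    if page["failed_attempts"] >= MAX_FAILED_ATTEMPTS:
        # Stop retrying so the crawl can make progress on other pages.
        page["completed"] = True
    return page.get("completed", False)


page = {"url": "http://example.com/flaky"}
outcomes = [record_failure(page) for _ in range(3)]
print(outcomes, page["failed_attempts"])
```

Capping retries this way trades a possibly missed capture for the guarantee that one bad page cannot stall a site indefinitely.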
Noah Levitt
433b201b52
use logging.warning() to quiet py37 warnings
2019-04-09 01:43:38 -07:00
Noah Levitt
d729c8d0d5
use yaml.safe_load()
...
getting new warnings
see https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
2019-03-18 15:49:44 -07:00
Barbara Miller
e2b2542d4a
handle http auth (#138)
...
abort brozzling on interstitial (auth dialog)
because we have no other recourse at this point. waiting on Network.requestIntercepted auth challenge support. (didn't work in our latest testing)
https://chromedevtools.github.io/devtools-protocol/tot/Network#type-AuthChallengeResponse
2018-11-16 15:10:30 -08:00
Noah Levitt
d0f5cd7168
tweak logging
2018-08-31 15:23:48 -07:00
Barbara Miller
745e6cc942
log behavior params better
2018-03-19 16:28:14 -07:00
jkafader
7d61673d3e
Merge pull request #97 from nlevitt/max-claimed-sites
...
Max claimed sites
2018-03-08 16:48:31 -08:00
Noah Levitt
d7512fbeb6
move time limit enforcement
...
now it's next to stop request enforcement which makes more sense and
supports more timely action
2018-03-01 11:28:30 -08:00
Vangelis Banos
ce473897a3
Disable Jinja2 template auto_reload for higher performance
...
Every time we run a JS behavior, we load a Jinja2 template.
By default, Jinja2 has option `auto_reload=True`. This means that
every time a template is requested the loader checks if the source file changed
and if yes, it will reload the template. For higher performance it’s possible
to disable that.
Also note that Jinja caches 400 templates by default.
Ref: http://jinja.pocoo.org/docs/2.10/api/
In Brozzler, we don't make changes to JS templates while the system is
running. So, there is no point in having auto_reload=True.
2018-02-25 20:24:25 +00:00
Noah Levitt
0faeaab3ac
fix attempt for deadlock-ish situation
...
see https://github.com/internetarchive/brozzler/issues/91
2018-02-13 17:09:28 -08:00
Vangelis Banos
3984ca017f
Replace cwd var with d
2018-01-09 06:33:03 +00:00
Vangelis Banos
dacfba330c
Configurable JS templates location
...
Brozzler has hard-coded the JS templates logic in ``brozzler/behaviors.yaml``
and ``brozzler/js-templates/`` locations. With this change, you can use
the optional ``behaviors_dir`` ``browser.browse_page`` parameter to set a
custom location and use any potential JS behaviors.
2018-01-04 17:37:02 +00:00
Noah Levitt
4d7f4518b5
use %r instead of calling repr()
2017-06-07 13:07:42 -07:00
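To illustrate the change in 4d7f4518b5: passing the object with a `%r` placeholder lets the logging module apply `repr()` itself, and only when the record is actually emitted. The logger name, the `ListHandler` class, and the URL below are assumptions made for this sketch.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in a list so the result can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("percent_r_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

url = "http://example.com/a b#frag"

# Instead of: logger.info("brozzling %s", repr(url))
logger.info("brozzling %r", url)

# This call is below the logger's level, so repr(url) is never computed.
logger.debug("skipped %r", url)

print(handler.messages)
```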
Noah Levitt
d904daea9c
remove stray logging
2017-05-24 11:36:06 -07:00
Noah Levitt
89e7c8b079
fix exception from ReachedLimit.__repr__ when it has been instantiated implicitly and __init__ was not called
2017-05-16 15:47:18 -07:00
Noah Levitt
31dc6a2d97
improve thread_raise() so that the new tests pass
...
1. If thread is not currently accepting exceptions, queue it and raise if and
when it does start accepting them. This fixes problem of thread_raise
exceptions being ignored when raised just before the target thread starts
accepting exceptions.
2. Avoid problems caused by raising multiple exceptions in the same
thread in quick succession by ensuring that only one is actually raised for
a given `with` block. This type of occurrence had been putting brozzler into
a borked/frozen state.
2017-05-16 14:20:53 -07:00
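The two rules listed in 31dc6a2d97 can be modeled with a small amount of bookkeeping. The sketch below is a toy, not brozzler's implementation: the class `ExceptionGate` and its methods are hypothetical, and delivery is modeled by an explicit `check()` call in the target code rather than by asynchronously injecting the exception into another thread.

```python
import threading


class ExceptionGate:
    """Toy model: queue exceptions requested too early, and hand over
    at most one exception per `with` block."""

    def __init__(self):
        self._lock = threading.Lock()
        self._accepting = False
        self._pending = None    # requested while not accepting
        self._to_raise = None   # accepted for the current block
        self._delivered = False

    def __enter__(self):
        with self._lock:
            queued, self._pending = self._pending, None
            if queued is None:
                self._accepting = True
                self._delivered = False
                self._to_raise = None
        if queued is not None:
            # Rule 1: raise the queued exception once the thread starts
            # accepting exceptions.
            raise queued
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._accepting = False
        return False  # never swallow exceptions

    def request(self, exc):
        """Ask for `exc` to be raised in the guarded code."""
        with self._lock:
            if not self._accepting:
                self._pending = exc
                return "queued"
            if self._delivered:
                # Rule 2: only one exception per `with` block.
                return "ignored"
            self._delivered = True
            self._to_raise = exc
            return "accepted"

    def check(self):
        """Stand-in for asynchronous delivery in this toy model."""
        with self._lock:
            exc, self._to_raise = self._to_raise, None
        if exc is not None:
            raise exc


gate = ExceptionGate()
log = [gate.request(RuntimeError("early"))]
try:
    with gate:
        log.append("entered")
except RuntimeError:
    log.append("raised on entry")

with gate:
    log.append(gate.request(ValueError("first")))
    log.append(gate.request(ValueError("second")))
    try:
        gate.check()
    except ValueError as e:
        log.append(str(e))
print(log)
```

The demo is single-threaded for determinism; the lock is there only to show where concurrent callers of `request()` would need to be serialized.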
Noah Levitt
0953e6972e
refactor thread_raise safety to use a context manager
2017-04-24 19:51:51 -07:00
Noah Levitt
7706bab8b8
safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such
2017-04-20 17:08:16 -07:00
Noah Levitt
349b41ab32
raise new exception brozzler.ProxyError in case of proxy error browsing a page
2017-04-17 18:14:02 -07:00
Noah Levitt
df7734f2ca
new command line utility brozzler-stop-crawl, with tests
2017-04-14 18:06:15 -07:00
Noah Levitt
125d77b8c4
consolidate job.py and site.py into model.py, and let Job and Site share the elapsed() method by way of a mixin
2017-03-29 18:49:04 -07:00
Noah Levitt
eeee523b18
three-value "brozzled" parameter for frontier.site_pages(); fix thing where every Site got a list of all the seeds from the job; and some more frontier tests to catch these kinds of things
2017-03-20 17:28:16 -07:00
Noah Levitt
12fb9eaa15
use urlcanon library for canonicalization, surtification, scope match rules
2017-03-15 14:59:51 -07:00
Noah Levitt
700b08b7d7
use new rethinkstuff ORM
2017-02-28 16:12:50 -08:00
Noah Levitt
129a1e8f47
use underscore convention
2017-02-02 11:52:19 -08:00
Noah Levitt
5f4c5190da
improve TRACE level logging
2017-02-02 11:41:40 -08:00
Noah Levitt
ed2d58d87d
stopgap fix for problem where an attempt to save a screenshot of a url with a hash tag containing spaces or non-ascii characters would fail, causing the whole brozzle of the page to fail, and end up in a retry loop (better handling of hash tags is planned which will obviate this change)
2017-02-01 22:39:12 +00:00
Noah Levitt
f7427219cf
restore handling of "aw snap" or "he's dead jim"
2016-12-21 14:21:20 -08:00
Noah Levitt
a0b61408b9
convert behaviors to jinja2, move them to new subdir js-templates, along with javascript previously stored as a string in browser.py
2016-12-20 16:33:25 -08:00
Noah Levitt
86ac48d6c3
generalized support for login doing automatic detection of login form on a page
2016-12-19 17:30:09 -08:00
Noah Levitt
c71854127d
major refactoring of browsing code to make it easier to add functionality
2016-12-15 16:42:45 -08:00
Noah Levitt
5bd4908e1d
punycode host part of url to avoid errors doing WARCPROX_WRITE_RECORD
2016-10-26 13:50:23 -07:00
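One way to punycode only the host part of a URL, as 5bd4908e1d describes, is Python's built-in `idna` codec. This is a sketch under assumptions: the function name `punycode_host` and the example URL are hypothetical, and brozzler's own code may do this differently.

```python
from urllib.parse import urlsplit, urlunsplit


def punycode_host(url):
    """Return `url` with its host converted to ASCII (IDNA/punycode).

    Path, query, and fragment are left untouched."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        return url
    ascii_host = host.encode("idna").decode("ascii")
    # Preserve any userinfo ("user:pass@") and port around the host.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + ascii_host
    if parts.port is not None:
        netloc += ":%d" % parts.port
    return urlunsplit(parts._replace(netloc=netloc))


print(punycode_host("http://bücher.example:8080/path?q=1#frag"))
```

Note that `urlsplit` lowercases the hostname, so the result also normalizes host case. The built-in `idna` codec implements the older IDNA 2003 rules.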
Noah Levitt
c902a70450
tweak thread names
2016-07-19 14:33:57 -05:00
Noah Levitt
479713e25b
--trace level logging
2016-06-29 18:29:45 -05:00
Noah Levitt
df61e55b6b
add license headers
2016-04-25 20:02:11 +00:00
Noah Levitt
b06381790c
honor crawl job stop requests
2016-03-08 00:18:54 +00:00
Adam Miller
20bde1c482
uncommented init imports, removed required job_id in Frontier.finished
2015-10-22 22:29:24 +00:00
Noah Levitt
a94dfd27f8
oops, set brozzler.__version__
2015-09-24 00:34:51 +00:00
Noah Levitt
8c69ca3b39
giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing
2015-09-24 00:17:33 +00:00
Noah Levitt
40522ef5a5
fix some rethinkdb related stuff; most notably r.desc() and related stuff don't currently work correctly if r is a Rethinker, so use rethinkdb directly in that case
2015-09-23 01:53:05 +00:00
Noah Levitt
c682627aec
Rethinker moved to pyrethink library
2015-09-16 19:24:17 +00:00
Noah Levitt
92a288bc35
detect jobs finishing! (not well tested yet)
2015-09-09 22:11:48 +00:00
Noah Levitt
5fe2805285
fix bug claiming site, looks like there could be a race condition with other worker claiming the same site
2015-09-04 01:36:29 +00:00
Noah Levitt
3c23aa8fd4
finally, the jobs table
2015-09-03 01:05:03 +00:00
Noah Levitt
f334107b47
support for specifying rethinkdb database name; wrap rethinkdb operations and retry if appropriate (as best as we can tell)
2015-08-28 00:37:26 +00:00
Noah Levitt
efa640c640
refactor to simplify starting new job from code
2015-08-25 19:52:33 +00:00
Noah Levitt
b8506a2ab4
rename "db" to "frontier"
2015-08-19 17:47:05 +00:00
Noah Levitt
a878730e02
goodbye sqlite and rabbitmq, hello rethinkdb
2015-08-18 21:44:54 +00:00