Noah Levitt
9c658cddf7
fix a couple of svc definitions
2019-03-24 16:06:36 -07:00
Noah Levitt
48bb03418f
daemontools
2019-03-23 00:26:39 -07:00
Noah Levitt
18b4a26db6
porting ansible config to xenial
...
no more upstart, switch to daemontools, among other things
2019-03-22 23:50:46 -07:00
Noah Levitt
19522aff85
adjusting ansible config for xenial
...
untested because of vagrant problems
2019-03-19 16:37:13 -07:00
Noah Levitt
d4f8bc768f
trying to make this work with xenial for travis
...
see error https://travis-ci.org/internetarchive/brozzler/jobs/508141058
2019-03-18 16:38:23 -07:00
Noah Levitt
f2a9908395
travis only has py 3.7 for xenial
2019-03-18 16:20:54 -07:00
Noah Levitt
d729c8d0d5
use yaml.safe_load()
...
getting new warnings
see https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
2019-03-18 15:49:44 -07:00
Noah Levitt
6f5f090c33
test py 3.7
2019-03-18 15:49:03 -07:00
Noah Levitt
ef981706f4
fix rethinkdb dependency version
2019-03-18 15:08:36 -07:00
Noah Levitt
61274ae994
peg to working doublethink
...
see: https://github.com/internetarchive/doublethink/commit/f7fc7da725c9b
2019-03-14 20:04:09 +00:00
Noah Levitt
7d5bb4b5d4
Merge pull request #148 from vbanos/disk-cache
...
Add disk cache options to Chrome
2019-02-12 14:39:49 -08:00
Vangelis Banos
9c48a6fa11
Use disk cache params only on Chrome.start
...
Use `disk_cache_dir` and `disk_cache_size` only on `Chrome.start` and
not on `Chrome.__init__`.
Drop `disk_cache_dir` and `disk_cache_size` class attributes.
2019-02-12 20:59:08 +00:00
Vangelis Banos
adeca823dd
Remove stale comment
2019-02-12 07:21:44 +00:00
Vangelis Banos
31e611771e
Improve disk cache options
...
Remove `--disable-cache`; it's not used any more.
Rename `disk_cache` to `disk_cache_dir` and use only path (str)
argument.
Decouple `--disk-cache-size` from `--disk-cache-dir` so it is possible
to use either or both.
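
A minimal sketch of the decoupling described above. The helper name `cache_args` is hypothetical (it is not taken from the brozzler source); only the two Chromium flags come from this message.

```python
def cache_args(disk_cache_dir=None, disk_cache_size=None):
    """Hypothetical helper: build Chromium cache flags, each option independent."""
    args = []
    if disk_cache_dir is not None:
        # path (str) only, per the rename from disk_cache to disk_cache_dir
        args.append("--disk-cache-dir=%s" % disk_cache_dir)
    if disk_cache_size is not None:
        # size in bytes; does not require disk_cache_dir to be set
        args.append("--disk-cache-size=%s" % disk_cache_size)
    return args
```

With neither option set, the helper adds no cache flags at all, which matches dropping `--disable-cache`.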
2019-02-07 07:42:45 +00:00
Vangelis Banos
c288c9ae98
Add disk cache options to Chrome
...
Add `Chrome` options `disk_cache` and `disk_cache_size` which add chromium
options `--disk-cache-dir=<DIR>` and `--disk-cache-size=N` (bytes).
The default is to use `--disable-cache` (no disk caching).
There are two ways to use the new vars. If you just use
`Chrome(disk_cache=True)`, the chromium cli option `--disable-cache` is
NOT used and chromium writes its disk cache inside the profile dir.
If you use `Chrome(disk_cache='/tmp/custom_dir', disk_cache_size=10000)`
chromium will use `--disk-cache-dir=/tmp/custom_dir
--disk-cache-size=10000`.
2019-02-06 16:22:10 +00:00
Noah Levitt
809ea3885f
Merge pull request #147 from galgeek/bye_simpleclicks
...
no more simpleclicks/mouseovers
2019-01-14 13:48:48 -08:00
Barbara Miller
f6ffb4acea
update (C)
2019-01-10 16:11:24 -08:00
Barbara Miller
9001156b54
rm simpleclicks.js.j2 mouseovers.js.j2
2019-01-10 15:58:38 -08:00
Barbara Miller
770ea6de1e
no more simpleclicks/mouseovers
2019-01-10 15:54:47 -08:00
Barbara Miller
e1ceb87ca2
Merge pull request #146 from nlevitt/https-redirect
...
least surprise on http/https seed redirects
2018-12-21 15:26:04 -08:00
Noah Levitt
a74f46dc53
least surprise on http/https seed redirects
...
if http://foo.com/ redirects to https://foo.com/a/b/c let's also
put all of https://foo.com/ in scope
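
A rough sketch of the rule, assuming the seed is a site root and the redirect stays on the same host. The function name `extra_scope_prefix` is hypothetical and this is not the actual scope code.

```python
from urllib.parse import urlsplit

def extra_scope_prefix(seed_url, redirect_url):
    """Hypothetical: return an extra https scope prefix, or None if the rule does not apply."""
    seed = urlsplit(seed_url)
    dest = urlsplit(redirect_url)
    # only the http -> https upgrade on the same host is treated specially here
    if (seed.scheme == "http" and dest.scheme == "https"
            and seed.hostname == dest.hostname):
        return "https://%s/" % dest.netloc
    return None
```

A redirect to a different host returns None in this sketch and would be left to whatever scope rules already apply.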
2018-12-21 15:17:31 -08:00
Noah Levitt
6b8e597a43
bump version after merge
2018-12-20 11:30:49 -08:00
Noah Levitt
0a08c01461
Merge pull request #145 from galgeek/no-skipIframes
...
no skipIframes for umbraBehavior
2018-12-20 11:30:28 -08:00
Barbara Miller
047b46bc4e
back out now unnecessary updates
2018-12-20 11:25:06 -08:00
Barbara Miller
d8f97e7b3f
no current need for skipIframes with new try/catch
2018-12-20 11:24:30 -08:00
Noah Levitt
034f7938c4
catch common exception in default behavior
2018-12-20 10:46:05 -08:00
Noah Levitt
2cd64811b3
bump version after merge
2018-12-17 15:10:26 -08:00
Noah Levitt
d8c9dd2ff4
Merge pull request #144 from galgeek/umbraBehavior18q4
...
fix instagram captures; add skipIframe feature
2018-12-17 15:09:52 -08:00
Barbara Miller
4a0d95277f
update umbraBehavior
2018-12-17 15:04:36 -08:00
Barbara Miller
425d44bf4a
updates for jinja2
2018-12-13 17:27:15 -08:00
Barbara Miller
6c21a9f773
iframe option and other instagram updates
2018-12-13 15:54:10 -08:00
Noah Levitt
15870e6010
avoid IndexError
...
in some cases we receive this event from the browser:
{"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}
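
A minimal sketch of the guard, assuming a hypothetical handler `first_worker_version` that previously indexed `versions[0]` unconditionally. The event payload is the one quoted above.

```python
import json

def first_worker_version(message):
    """Hypothetical handler: first entry of params.versions, or None when the list is empty."""
    versions = message.get("params", {}).get("versions") or []
    if not versions:
        # the browser sometimes sends an empty list; versions[0] would raise IndexError
        return None
    return versions[0]

event = json.loads(
    '{"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}'
)
```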
2018-12-13 15:49:38 -08:00
Noah Levitt
b577fe3c36
log browser uncaught exceptions at debug level
...
didn't realize these weren't showing up as console messages
2018-12-13 15:45:35 -08:00
Noah Levitt
ebcc063fe2
bump version after merge
2018-11-29 14:52:11 -08:00
jkafader
898756690f
Merge pull request #142 from nlevitt/service-worker
...
fetch service worker script with proper headers
2018-11-29 13:42:59 -08:00
jkafader
9c27e829aa
Merge pull request #136 from nlevitt/revert-time-limit
...
change time limit enforcement
2018-11-29 12:29:35 -08:00
Noah Levitt
db62402be8
fix tests
2018-11-27 14:35:00 -08:00
Noah Levitt
f63947cfe9
fetch service worker script with proper headers
2018-11-27 12:35:33 -08:00
Noah Levitt
574af7846e
bump version after merge
2018-11-16 15:10:46 -08:00
Barbara Miller
e2b2542d4a
handle http auth ( #138 )
...
abort brozzling on interstitial (auth dialog)
because we have no other recourse at this point. Waiting on Network.requestIntercepted auth challenge support (it didn't work in our latest testing).
https://chromedevtools.github.io/devtools-protocol/tot/Network#type-AuthChallengeResponse
2018-11-16 15:10:30 -08:00
Noah Levitt
05fab8b909
change time limit enforcement
...
enforce time limit based on all the time that a site was in active
rotation, including time it spent waiting for its turn to be brozzled;
this undoes the change from b9640b8a30c934, because now it seems that
was the wrong decision (brozzler jobs with many seeds and low
max_claimed_sites hanging around forever)
2018-11-12 16:21:38 -08:00
Noah Levitt
15610fa990
fail quickly if browser dies at startup
...
instead of trying to retrieve /json for 600 seconds
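
A sketch of the idea, not the actual change: while waiting for the browser to become ready, also check whether the process has already exited. The names `wait_for_ready` and `is_ready` are hypothetical; `proc` is assumed to behave like `subprocess.Popen`.

```python
import time

def wait_for_ready(proc, is_ready, timeout=600, poll_interval=0.5):
    """Poll is_ready() until it is true, the process exits, or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            # process is gone; no point waiting out the rest of the timeout
            raise RuntimeError(
                "browser exited with status %s before becoming ready" % proc.returncode)
        if is_ready():
            return
        time.sleep(poll_interval)
    raise TimeoutError("browser not ready after %s seconds" % timeout)
```

Here `is_ready` stands in for whatever check retrieves /json from the browser.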
2018-11-01 15:57:03 -07:00
Noah Levitt
1073431f76
handle exceptions extracting links
...
like this one:
Uncaught DOMException: Blocked a frame with origin "https://www.youtube.com" from accessing a cross-origin frame.
at __brzl_compileOutlinks (<anonymous>:4:24)
at __brzl_compileOutlinks (<anonymous>:10:29)
at <anonymous>:16:1
__brzl_compileOutlinks @ VM194:4
__brzl_compileOutlinks @ VM194:10
not sure exactly why this happens but we just have to handle it
2018-10-29 17:42:25 -07:00
Noah Levitt
af85f28908
fix reported chromium crash by removing argument
...
--single-process
https://github.com/internetarchive/brozzler/issues/128
2018-10-22 14:28:31 -07:00
Noah Levitt
20996fa501
bump version after merge
2018-10-12 12:46:09 -07:00
jkafader
8fc800d1ef
Merge pull request #127 from nlevitt/ydl-improvements
...
Ydl improvements
2018-10-12 11:55:47 -07:00
Noah Levitt
65fad5e8bf
remove stray bad logging line
2018-10-12 11:35:47 -07:00
Noah Levitt
7497b7e5ac
tests expect outlinks to be a set
2018-10-12 11:03:54 -07:00
Noah Levitt
054ba6d7a0
tidy up some comments and docs
2018-10-12 00:48:38 -07:00
Noah Levitt
8f9077fbf3
watch pages as outlinks from youtube-dl playlists
...
and bypass downloading metadata about individual videos as well as the
videos themselves (for youtube playlists), because even just the
metadata can take many minutes or hours in case of thousands of videos
2018-10-12 00:41:16 -07:00