1483 Commits

Author SHA1 Message Date
Noah Levitt
19522aff85 adjusting ansible config for xenial
untested because of vagrant problems
2019-03-19 16:37:13 -07:00
Noah Levitt
d4f8bc768f trying to make this work with xenial for travis
see error https://travis-ci.org/internetarchive/brozzler/jobs/508141058
2019-03-18 16:38:23 -07:00
Noah Levitt
f2a9908395 travis only has py 3.7 for xenial 2019-03-18 16:20:54 -07:00
Noah Levitt
d729c8d0d5 use yaml.safe_load()
getting new warnings
see https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
2019-03-18 15:49:44 -07:00
Noah Levitt
6f5f090c33 test py 3.7 2019-03-18 15:49:03 -07:00
Noah Levitt
ef981706f4 fix rethinkdb dependency version 2019-03-18 15:08:36 -07:00
Noah Levitt
61274ae994 peg to working doublethink
see: https://github.com/internetarchive/doublethink/commit/f7fc7da725c9b
2019-03-14 20:04:09 +00:00
Noah Levitt
7d5bb4b5d4
Merge pull request #148 from vbanos/disk-cache
Add disk cache options to Chrome
2019-02-12 14:39:49 -08:00
Vangelis Banos
9c48a6fa11 Use disk cache params only on Chrome.start
Use `disk_cache_dir` and `disk_cache_size` only on `Chrome.start` and
not on `Chrome.__init__`.

Drop `disk_cache_dir` and `disk_cache_size` class attributes.
2019-02-12 20:59:08 +00:00
Vangelis Banos
adeca823dd Remove stale comment 2019-02-12 07:21:44 +00:00
Vangelis Banos
31e611771e Improve disk cache options
Remove `--disable-cache`, it's not used any more.

Rename `disk_cache` to `disk_cache_dir` and use only path (str)
argument.

Decouple `--disk-cache-size` from `--disk-cache-dir` so it is possible
to use either or both.
2019-02-07 07:42:45 +00:00
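
The decoupling described in 31e611771e can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper name `disk_cache_args` is hypothetical, and only the two chromium flags and the two parameter names come from the commit messages.

```python
def disk_cache_args(disk_cache_dir=None, disk_cache_size=None):
    """Build chromium command line flags for the disk cache.

    Hypothetical helper: each option is handled independently, so a
    caller can pass either one, both, or neither.
    """
    args = []
    if disk_cache_dir:
        # path (str) where chromium should keep its disk cache
        args.append('--disk-cache-dir=%s' % disk_cache_dir)
    if disk_cache_size:
        # cache size limit in bytes
        args.append('--disk-cache-size=%s' % disk_cache_size)
    return args


both = disk_cache_args('/tmp/custom_dir', 10000)
size_only = disk_cache_args(disk_cache_size=10000)
```

With neither argument the helper returns an empty list, which matches the removal of `--disable-cache`: no cache flag is passed at all.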
Vangelis Banos
c288c9ae98 Add disk cache options to Chrome
Add `Chrome` options `disk_cache` and `disk_cache_size` which add chromium
options `--disk-cache-dir=<DIR>` and `--disk-cache-size=N` (bytes).
The default is to use `--disable-cache` (no disk caching).

There are two ways to use the new vars. If you just use
`Chrome(disk_cache=True)` the chromium cli option `--disable-cache` is
NOT used and chromium writes disk cache inside profile dir.

If you use `Chrome(disk_cache='/tmp/custom_dir', disk_cache_size=10000)`
chromium will use `--disk-cache-dir=/tmp/custom_dir
--disk-cache-size=10000`.
2019-02-06 16:22:10 +00:00
Noah Levitt
809ea3885f
Merge pull request #147 from galgeek/bye_simpleclicks
no more simpleclicks/mouseovers
2019-01-14 13:48:48 -08:00
Barbara Miller
f6ffb4acea update (C) 2019-01-10 16:11:24 -08:00
Barbara Miller
9001156b54 rm simpleclicks.js.j2 mouseovers.js.j2 2019-01-10 15:58:38 -08:00
Barbara Miller
770ea6de1e no more simpleclicks/mouseovers 2019-01-10 15:54:47 -08:00
Barbara Miller
e1ceb87ca2
Merge pull request #146 from nlevitt/https-redirect
least surprise on http/https seed redirects
2018-12-21 15:26:04 -08:00
Noah Levitt
a74f46dc53 least surprise on http/https seed redirects
if http://foo.com/ redirects to https://foo.com/a/b/c let's also
put all of https://foo.com/ in scope
2018-12-21 15:17:31 -08:00
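
The rule in a74f46dc53 might be sketched like this. It is an assumption-laden illustration using only the standard library: the function name `extra_scope_prefix` is hypothetical, ports are ignored, and the real scoping code in the repository may be implemented quite differently.

```python
from urllib.parse import urlsplit


def extra_scope_prefix(seed_url, redirect_url):
    """Return an extra in-scope URL prefix for an http -> https seed redirect.

    Hypothetical helper: if the seed is http and it redirects to https on
    the same host, return the https root of that host; otherwise None.
    """
    seed = urlsplit(seed_url)
    dest = urlsplit(redirect_url)
    if (seed.scheme == 'http' and dest.scheme == 'https'
            and seed.hostname == dest.hostname):
        return 'https://%s/' % dest.hostname
    return None


prefix = extra_scope_prefix('http://foo.com/', 'https://foo.com/a/b/c')
```

For the example in the commit message, `prefix` is `https://foo.com/`, so everything under the https root is in scope rather than only `/a/b/c`.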
Noah Levitt
6b8e597a43 bump version after merge 2018-12-20 11:30:49 -08:00
Noah Levitt
0a08c01461
Merge pull request #145 from galgeek/no-skipIframes
no skipIframes for umbraBehavior
2018-12-20 11:30:28 -08:00
Barbara Miller
047b46bc4e back out now unnecessary updates 2018-12-20 11:25:06 -08:00
Barbara Miller
d8f97e7b3f no current need for skipIframes with new try/catch 2018-12-20 11:24:30 -08:00
Noah Levitt
034f7938c4 catch common exception in default behavior 2018-12-20 10:46:05 -08:00
Noah Levitt
2cd64811b3 bump version after merge 2018-12-17 15:10:26 -08:00
Noah Levitt
d8c9dd2ff4
Merge pull request #144 from galgeek/umbraBehavior18q4
fix instagram captures; add skipIframe feature
2018-12-17 15:09:52 -08:00
Barbara Miller
4a0d95277f update umbraBehavior 2018-12-17 15:04:36 -08:00
Barbara Miller
425d44bf4a updates for jinja2 2018-12-13 17:27:15 -08:00
Barbara Miller
6c21a9f773 iframe option and other instagram updates 2018-12-13 15:54:10 -08:00
Noah Levitt
15870e6010 avoid IndexError
in some cases we receive this event from the browser:
{"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}
2018-12-13 15:49:38 -08:00
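
A defensive way to read the event quoted in 15870e6010 could look like the sketch below. The function name `first_worker_version` is hypothetical and the actual fix in the commit may differ; only the event payload is taken from the commit message.

```python
import json


def first_worker_version(message):
    """Return the first entry of params.versions, or None if it is empty.

    Hypothetical helper: indexing versions[0] directly raises IndexError
    when the browser sends an empty list, so check before indexing.
    """
    versions = message.get('params', {}).get('versions') or []
    return versions[0] if versions else None


# the event from the commit message, with an empty versions list
event = json.loads(
    '{"method":"ServiceWorker.workerVersionUpdated","params":{"versions":[]}}')
version = first_worker_version(event)  # None instead of an IndexError
```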
Noah Levitt
b577fe3c36 log browser uncaught exceptions at debug level
didn't realize these weren't showing up as console messages
2018-12-13 15:45:35 -08:00
Noah Levitt
ebcc063fe2 bump version after merge 2018-11-29 14:52:11 -08:00
jkafader
898756690f
Merge pull request #142 from nlevitt/service-worker
fetch service worker script with proper headers
2018-11-29 13:42:59 -08:00
jkafader
9c27e829aa
Merge pull request #136 from nlevitt/revert-time-limit
change time limit enforcement
2018-11-29 12:29:35 -08:00
Noah Levitt
db62402be8 fix tests 2018-11-27 14:35:00 -08:00
Noah Levitt
f63947cfe9 fetch service worker script with proper headers 2018-11-27 12:35:33 -08:00
Noah Levitt
574af7846e bump version after merge 2018-11-16 15:10:46 -08:00
Barbara Miller
e2b2542d4a handle http auth (#138)
abort brozzling on interstitial (auth dialog)

because we have no other recourse at this point. waiting on Network.requestIntercepted auth challenge support. (didn't work in our latest testing)
https://chromedevtools.github.io/devtools-protocol/tot/Network#type-AuthChallengeResponse
2018-11-16 15:10:30 -08:00
Noah Levitt
05fab8b909 change time limit enforcement
enforce time limit based on all the time that a site was in active
rotation, including time it spent waiting for its turn to be brozzled;
this undoes the change from b9640b8a30c934, because now it seems that
was the wrong decision (brozzler jobs with many seeds and low
max_claimed_sites hanging around forever)
2018-11-12 16:21:38 -08:00
Noah Levitt
15610fa990 fail quickly if browser dies at startup
instead of trying to retrieve /json for 600 seconds
2018-11-01 15:57:03 -07:00
Noah Levitt
1073431f76 handle exceptions extracting links
like this one:
Uncaught DOMException: Blocked a frame with origin "https://www.youtube.com" from accessing a cross-origin frame.
    at __brzl_compileOutlinks (<anonymous>:4:24)
    at __brzl_compileOutlinks (<anonymous>:10:29)
    at <anonymous>:16:1
__brzl_compileOutlinks @ VM194:4
__brzl_compileOutlinks @ VM194:10

not sure exactly why this happens but we just have to handle it
2018-10-29 17:42:25 -07:00
Noah Levitt
af85f28908 fix reported chromium crash by removing argument
--single-process
https://github.com/internetarchive/brozzler/issues/128
2018-10-22 14:28:31 -07:00
Noah Levitt
20996fa501 bump version after merge 2018-10-12 12:46:09 -07:00
jkafader
8fc800d1ef
Merge pull request #127 from nlevitt/ydl-improvements
Ydl improvements
2018-10-12 11:55:47 -07:00
Noah Levitt
65fad5e8bf remove stray bad logging line 2018-10-12 11:35:47 -07:00
Noah Levitt
7497b7e5ac tests expect outlinks to be a set 2018-10-12 11:03:54 -07:00
Noah Levitt
054ba6d7a0 tidy up some comments and docs 2018-10-12 00:48:38 -07:00
Noah Levitt
8f9077fbf3 watch pages as outlinks from youtube-dl playlists
and bypass downloading metadata about individual videos as well as the
videos themselves (for youtube playlists), because even just the
metadata can take many minutes or hours in case of thousands of videos
2018-10-12 00:41:16 -07:00
Noah Levitt
9211fb45ec silence youtube-dl's logging, use only our own
because youtube-dl's can be annoyingly verbose, confusing, doesn't tell
us the things we're interested in, and doesn't tell us where the
messages originate
2018-10-12 00:39:37 -07:00
Noah Levitt
e5536182dc use a thread-local callback in monkey-patched
finish_frag_download, instead of locking around monkey-patching, to
allow different threads to youtube-dl concurrently, but still not
interfere with each other
2018-10-11 23:28:34 -07:00
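
The idea in e5536182dc, patching once and keeping the per-thread part in thread-local storage, can be sketched as below. This is not the repository's code: `FragmentDownloader` is a hypothetical stand-in for the youtube-dl class being patched, and `download_with_callback` is an invented helper.

```python
import threading


class FragmentDownloader:
    """Hypothetical stand-in for the library class that gets patched."""

    def finish_frag_download(self, ctx):
        return 'finished %s' % ctx


_thread_local = threading.local()
_orig_finish = FragmentDownloader.finish_frag_download


def _patched_finish(self, ctx):
    # call the original method, then the callback registered by the
    # current thread, if any; other threads' callbacks are never seen
    result = _orig_finish(self, ctx)
    callback = getattr(_thread_local, 'callback', None)
    if callback is not None:
        callback(ctx)
    return result


# patch once at import time, so no lock is needed around the patching
FragmentDownloader.finish_frag_download = _patched_finish


def download_with_callback(ctx, callback):
    """Run a download with a callback visible only to the calling thread."""
    _thread_local.callback = callback
    try:
        return FragmentDownloader().finish_frag_download(ctx)
    finally:
        _thread_local.callback = None
```

Because the patch is installed once and only the callback varies per thread, several threads can run downloads concurrently without interfering with each other.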
Noah Levitt
82cf5c6dbb skip downloading videos from youtube playlists
because we expect to capture videos from individual watch pages, and
often processing thousands of videos with youtube-dl before the page is
ever opened in the browser is not desired behavior and is a crawling
problem
2018-10-11 15:46:30 -07:00