1535 Commits

Author SHA1 Message Date
Martin Czygan
8e670ca814 readme: remove proxy from job configuration
It has been removed in 934190084c73699747cf3f4c4d2ee7e268927eae.
2020-07-28 22:21:05 +02:00
Barbara Miller
e3a067cf60 youtube-dl option noplaylist: True 2020-07-24 16:22:50 -07:00
jkafader
1b9ebca13c
Merge pull request #202 from galgeek/limit_downloadThroughput
configurable limit for Chromium download throughput
2020-07-23 14:14:20 -07:00
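A minimal sketch of how a configurable download throughput limit might be expressed, assuming the limit is applied through the Chrome DevTools Protocol method `Network.emulateNetworkConditions`. The commits above do not say which mechanism is used, and the helper name `throughput_limit_message` is hypothetical.

```python
import json


def throughput_limit_message(msg_id, download_throughput=-1):
    """Build a DevTools message that caps download throughput.

    Hypothetical helper, not taken from the repository.
    download_throughput is in bytes per second; -1 disables throttling.
    """
    return json.dumps({
        "id": msg_id,
        "method": "Network.emulateNetworkConditions",
        "params": {
            "offline": False,
            "latency": 0,
            "downloadThroughput": download_throughput,
            "uploadThroughput": -1,  # leave uploads unthrottled
        },
    })
```

The resulting string would be sent over the browser's DevTools websocket; keeping the default at -1 means no throttling unless a limit is configured.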
Barbara Miller
739d09294e make configurable 2020-07-14 10:12:28 -07:00
Barbara Miller
36b4f80350 try SPN2 downloadThroughput limit 2020-07-14 10:12:28 -07:00
Barbara Miller
03594413f9
Merge pull request #200 from NGTmeaty/fix-test
Merging for the current fixes—thanks, @NGTmeaty!
2020-06-18 13:22:41 -07:00
NGTmeaty
25313a97de
Fix tests:
Update the RethinkDB pubkey location and repo location based on their guide https://rethinkdb.com/docs/install/ubuntu/
NumPy no longer supports Python 3.5, so on 3.5 we should install an earlier version of NumPy to maintain compatibility.
2020-06-02 03:26:33 -04:00
Neil Minton
3c5d1f24e0
Merge pull request #199 from galgeek/ARI-6097
instagram selector update
2020-05-26 17:11:35 -04:00
Barbara Miller
8da3ae9274 instagram update 2020-05-07 17:55:26 -07:00
jkafader
212111f581
Merge pull request #196 from galgeek/no-cache-dir-ydl
youtube-dl cache_dir: False
2020-04-30 15:05:00 -07:00
Barbara Miller
926de9c853 cache_dir: False 2020-04-30 11:11:38 -07:00
Barbara Miller
5b2381ef1f
bump version after merge 2020-04-22 10:54:06 -07:00
jkafader
17f173f12a
Merge pull request #162 from galgeek/ARI-5980
capture onclick links...
2020-04-22 10:07:04 -07:00
Barbara Miller
4df280a9b6
Merge pull request #194 from NGTmeaty/improve-login
Expanding Brozzler's logging in capabilities...

Thanks, @NGTmeaty and @vbanos! 

A couple of qa test crawls show the new code works as advertised.
2020-04-18 19:30:40 -07:00
Barbara Miller
04fba79d34 faster regex match 2020-04-16 18:09:03 -07:00
Jake L
09f938410a
Reduce the number of times querySelectorAll is called.
Fix formatting issues.
2020-04-15 17:26:24 -04:00
Jake L
78365c9f35
Expanding Brozzler's logging in capabilities
Some sites don't allow you to log in without clicking on a button to open a retracted modal.

This update to the login code allows Brozzler to click on all elements that we think are related to opening a login modal.

Then, if there isn't a regular form, we will attempt to fill out abnormal form schemes.

The test_try_login test has been expanded for the new type of login form we are supporting.
2020-04-14 17:19:53 -04:00
Barbara Miller
973af2c16e
bump version after merge, update copyright 2020-04-14 09:44:20 -07:00
Barbara Miller
a8734bcc11
Merge pull request #193 from vbanos/login-tests
Thanks, @vbanos!
2020-04-14 09:42:20 -07:00
Vangelis Banos
041feaf426 Add missing super().do_POST() 2020-04-14 09:39:48 +00:00
Barbara Miller
ae7248fff0 add dblclick (and fix typo) 2020-04-13 19:38:18 -07:00
Vangelis Banos
782aab3048 Add unit tests for try_login behavior
Add unit tests for the code that detects and tries to use login forms
automatically (`Browser.try_login`).

Add `htdocs/favicon.ico` because it is loaded automatically when the
browser tries to use the test web server and it causes a "missing"
warning.

Create a new dir `tests/htdocs/site11` which is used for login related
test html files.
2020-04-13 19:16:10 +00:00
Barbara Miller
a3b70fcb27 audio, too 2020-04-07 11:27:32 -07:00
jkafader
e22d80b9a4
Merge pull request #192 from galgeek/ss-fix
ss['stop'] not always set here
2020-04-07 11:12:12 -07:00
Barbara Miller
d2b8171fb0 logging 2020-04-07 09:27:21 -07:00
Barbara Miller
f4f0c02064 ss['stop'] not always set 2020-04-06 18:37:13 -07:00
Barbara Miller
401ba7293c
Merge pull request #191 from galgeek/skip-behaviors-on-error
bump version after merge
2020-04-02 14:24:56 -07:00
Barbara Miller
3647939af5 bump version after merge 2020-04-02 12:37:56 -07:00
Barbara Miller
ffea189d15
Merge pull request #190 from vbanos/skip-behaviors-on-error
Thank you, @vbanos!
2020-04-01 20:27:22 -07:00
Vangelis Banos
80341b9106 Add option simpler404 to enable this behavior
It is disabled by default.
2020-04-01 16:08:43 +00:00
Barbara Miller
cebdb20972
Merge pull request #188 from galgeek/xfail-interstitial
xfail test — didn't we already merge this simple update?
2020-03-27 16:35:59 -07:00
Vangelis Banos
140c27abe8 Skip running behaviors when page is 4xx or 5xx
Currently, when we run `Browser.browse_page`, we run JS behaviors after
we navigate to a page regardless of its status.
Maybe the page wasn't found (4xx) or unreachable for any reason (5xx).
In that case, we could skip running behaviors to save time and
resources.

With this PR, we add a new var to store navigated page HTTP status in
`WebsockReceiverThread.page_status`. We use this in
`Browser.browse_page` to skip behaviors, outlink and hashtag extraction
when page status is 4xx/5xx.

Note that we don't skip screenshots as it could be useful to have a
picture of an error page in some cases.
2020-03-23 16:21:57 +00:00
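The rule described in the commit above can be sketched roughly as follows. This is an illustration only: `is_error_status` and `steps_to_run` are hypothetical names, and the real logic lives in `Browser.browse_page` using `WebsockReceiverThread.page_status`.

```python
def is_error_status(page_status):
    """True when the navigated page returned a 4xx or 5xx status."""
    return page_status is not None and 400 <= page_status <= 599


def steps_to_run(page_status):
    """Return the post-navigation steps for a page (hypothetical helper).

    Screenshots are always taken, since a picture of an error page can
    be useful; behaviors, outlink and hashtag extraction are skipped
    when the page status is 4xx/5xx.
    """
    steps = ["screenshot"]
    if not is_error_status(page_status):
        steps += ["behaviors", "outlinks", "hashtags"]
    return steps
```

For example, `steps_to_run(404)` keeps only the screenshot step, while `steps_to_run(200)` runs everything.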
jkafader
3b249333a4
Merge pull request #189 from galgeek/ARI-6041
icaew.com behavior
2020-03-12 15:07:49 -07:00
Barbara Miller
4c0785fbfc
Merge pull request #187 from internetarchive/optimizes-rethinkdb-load-query
With the last commit, the only test failure is unrelated test_brozzling.py::test_page_interstitial_exception (already marked xfail in qa).
2020-03-11 21:29:01 -07:00
Barbara Miller
c4beeefe01 address var 2020-03-11 20:56:52 -07:00
Barbara Miller
2dfe3632f5 xfail test 2020-03-11 20:37:30 -07:00
James Kafader
313cec3139 coerce to dict not list 2020-03-11 19:31:02 -07:00
James Kafader
b9c5e4b66c fix output format 2020-03-11 19:15:57 -07:00
James Kafader
3defd49677 new selection function, based on optimized query 2020-03-11 16:09:16 -07:00
jkafader
1d9a95dfc2
Merge pull request #186 from galgeek/simpler_choose_warcprox
Simpler choose warcprox
2020-03-11 14:16:57 -07:00
Barbara Miller
f8f7aa1dca maybe fewer warcproxes 2020-03-11 14:08:34 -07:00
Barbara Miller
d190122a6d random.choice 2020-03-11 14:00:07 -07:00
Barbara Miller
af39b8cc6f skip active_sites query 2020-03-11 13:40:37 -07:00
Barbara Miller
414d1579fc icaew.com behavior 2020-03-03 11:47:34 -08:00
Noah Levitt
c2a1ca018a bump version after merge 2019-12-10 10:43:01 -08:00
Noah Levitt
558a0dd615
Merge pull request #184 from nlevitt/limit-failures
consider page completed after 3 failures
2019-12-10 10:42:43 -08:00
Noah Levitt
597f2b5b33 reveal bad value when job conf validation fails 2019-12-04 15:11:53 -08:00
Noah Levitt
7915220ab7 consider page completed after 3 failures
https://github.com/internetarchive/brozzler/pull/183#issuecomment-560562807

"We've had a number of cases where a page kept failing for one reason or
another, and it's bad. We can end up with tons of duplicate captures,
the crawl is not able to make progress, and the overall performance of
the cluster is impacted in cases like yours, where a browser is sitting
there doing nothing for five minutes."
2019-12-04 12:38:22 -08:00
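The idea in the commit above, giving up on a page after repeated failures so the crawl can move on, might look something like this. The `Page` class, its fields, and `record_failure` are assumptions made for illustration, not the project's actual data model; only the limit of 3 comes from the commit subject.

```python
from dataclasses import dataclass

MAX_FAILED_ATTEMPTS = 3  # limit taken from the commit subject


@dataclass
class Page:
    """Hypothetical stand-in for a crawl frontier entry."""
    url: str
    failed_attempts: int = 0
    completed: bool = False


def record_failure(page):
    """Count a failed attempt and mark the page completed at the limit.

    Returns True once the page should no longer be retried.
    """
    page.failed_attempts += 1
    if page.failed_attempts >= MAX_FAILED_ATTEMPTS:
        page.completed = True
    return page.completed
```

Treating the page as completed, rather than retrying indefinitely, avoids the duplicate captures and stalled browsers described in the quoted comment.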
Noah Levitt
060adaffd0
Merge pull request #182 from CorentinB/patch-2
Fix Facebook ads variant selector
2019-11-27 16:10:55 -08:00
Noah Levitt
5aeaf47b6b bump version after merge 2019-11-27 12:41:16 -08:00