1358 Commits

Author SHA1 Message Date
Barbara Miller
3647939af5 bump version after merge 2020-04-02 12:37:56 -07:00
Barbara Miller
ffea189d15
Merge pull request #190 from vbanos/skip-behaviors-on-error
Thank you, @vbanos!
2020-04-01 20:27:22 -07:00
Vangelis Banos
80341b9106 Add option simpler404 to enable this behavior
It is disabled by default.
2020-04-01 16:08:43 +00:00
Barbara Miller
cebdb20972
Merge pull request #188 from galgeek/xfail-interstitial
xfail test — didn't we already merge this simple update?
2020-03-27 16:35:59 -07:00
Vangelis Banos
140c27abe8 Skip running behaviors when page is 4xx or 5xx
Currently, when we run `Browser.browse_page`, we run JS behaviors after
we navigate to a page regardless of its status.
Maybe the page wasn't found (4xx) or was unreachable for some reason (5xx).
In that case, we could skip running behaviors to save time and
resources.

With this PR, we add a new var to store navigated page HTTP status in
`WebsockReceiverThread.page_status`. We use this in
`Browser.browse_page` to skip behaviors, outlink and hashtag extraction
when page status is 4xx/5xx.

Note that we don't skip screenshots as it could be useful to have a
picture of an error page in some cases.
2020-03-23 16:21:57 +00:00
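The status check described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper names `should_skip_behaviors` and `browse_page_steps` are hypothetical, and the real logic lives inside `Browser.browse_page` and reads the status from `WebsockReceiverThread.page_status`.

```python
from typing import List, Optional


def should_skip_behaviors(page_status: Optional[int]) -> bool:
    """Return True when the navigated page answered with a 4xx or 5xx status.

    Hypothetical helper; an unknown status (None) is treated as "do not skip".
    """
    return page_status is not None and 400 <= page_status <= 599


def browse_page_steps(page_status: Optional[int]) -> List[str]:
    """List the post-navigation steps that would run for a given status."""
    # Screenshots are kept even for error pages, as the commit message notes.
    steps = ["screenshot"]
    if not should_skip_behaviors(page_status):
        # Behaviors, outlink extraction and hashtag extraction are skipped
        # together when the page status is 4xx/5xx.
        steps += ["behaviors", "outlinks", "hashtags"]
    return steps
```

Under these assumptions, `browse_page_steps(404)` yields only the screenshot step, while a 200 response runs every step.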
jkafader
3b249333a4
Merge pull request #189 from galgeek/ARI-6041
icaew.com behavior
2020-03-12 15:07:49 -07:00
Barbara Miller
4c0785fbfc
Merge pull request #187 from internetarchive/optimizes-rethinkdb-load-query
With the last commit, the only test failure is unrelated test_brozzling.py::test_page_interstitial_exception (already marked xfail in qa).
2020-03-11 21:29:01 -07:00
Barbara Miller
c4beeefe01 address var 2020-03-11 20:56:52 -07:00
Barbara Miller
2dfe3632f5 xfail test 2020-03-11 20:37:30 -07:00
James Kafader
313cec3139 coerce to dict not list 2020-03-11 19:31:02 -07:00
James Kafader
b9c5e4b66c fix output format 2020-03-11 19:15:57 -07:00
James Kafader
3defd49677 new selection function, based on optimized query 2020-03-11 16:09:16 -07:00
jkafader
1d9a95dfc2
Merge pull request #186 from galgeek/simpler_choose_warcprox
Simpler choose warcprox
2020-03-11 14:16:57 -07:00
Barbara Miller
f8f7aa1dca maybe fewer warcproxes 2020-03-11 14:08:34 -07:00
Barbara Miller
d190122a6d random.choice 2020-03-11 14:00:07 -07:00
Barbara Miller
af39b8cc6f skip active_sites query 2020-03-11 13:40:37 -07:00
Barbara Miller
414d1579fc icaew.com behavior 2020-03-03 11:47:34 -08:00
Noah Levitt
c2a1ca018a bump version after merge 2019-12-10 10:43:01 -08:00
Noah Levitt
558a0dd615
Merge pull request #184 from nlevitt/limit-failures
consider page completed after 3 failures
2019-12-10 10:42:43 -08:00
Noah Levitt
597f2b5b33 reveal bad value when job conf validation fails 2019-12-04 15:11:53 -08:00
Noah Levitt
7915220ab7 consider page completed after 3 failures
https://github.com/internetarchive/brozzler/pull/183#issuecomment-560562807

"We've had a number of cases where a page kept failing for one reason or
another, and it's bad. We can end up with tons of duplicate captures,
the crawl is not able to make progress, and the overall performance of
the cluster is impacted in cases like yours, where a browser is sitting
there doing nothing for five minutes."
2019-12-04 12:38:22 -08:00
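A rough sketch of the "completed after 3 failures" rule, assuming a page is represented as a plain dict. The field names `failed_attempts` and `completed` and the function `record_failure` are hypothetical and chosen only for illustration; the commit itself may store and check this differently.

```python
MAX_FAILED_ATTEMPTS = 3  # threshold taken from the commit subject


def record_failure(page: dict) -> dict:
    """Count one failed attempt and mark the page completed at the limit.

    `page` is a plain dict standing in for a page record (an assumption).
    """
    page["failed_attempts"] = page.get("failed_attempts", 0) + 1
    if page["failed_attempts"] >= MAX_FAILED_ATTEMPTS:
        # Stop retrying so the crawl can make progress on other pages.
        page["completed"] = True
    return page
```

Capping retries this way trades a possibly missed capture for avoiding the duplicate captures and stalled browsers described in the quoted comment.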
Noah Levitt
060adaffd0
Merge pull request #182 from CorentinB/patch-2
Fix Facebook ads variant selector
2019-11-27 16:10:55 -08:00
Noah Levitt
5aeaf47b6b bump version after merge 2019-11-27 12:41:16 -08:00
Noah Levitt
d6ac80af93
Merge pull request #181 from vbanos/no-sandbox
Enable running in docker / k8s
2019-11-27 12:40:42 -08:00
Vangelis Banos
3bc2f434ef Split extra chrome args on whitespace
This is in case multiple args are used.
2019-11-27 20:18:41 +00:00
Noah Levitt
64da843dc8
fix travis badge 2019-11-25 16:04:13 -08:00
Vangelis Banos
62cb051f93 Pass extra CLI params to chrome using ENV variable
If ENV var `BROZZLER_EXTRA_CHROME_ARGS` is set, pass its contents as
extra chromium cli options.

Remove `--no-sandbox` option. It's not good from a security point of
view.
2019-11-25 20:44:25 +00:00
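The environment-variable mechanism can be sketched as below, including the whitespace split that a follow-up commit adds for multiple arguments. Only the variable name `BROZZLER_EXTRA_CHROME_ARGS` comes from the commit message; the function names and the shape of the command list are assumptions made for illustration.

```python
import os
from typing import List, Mapping, Optional


def extra_chrome_args(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Read extra chromium options from BROZZLER_EXTRA_CHROME_ARGS.

    The value is split on whitespace so several options can be passed at once.
    An unset or empty variable yields an empty list.
    """
    environ = os.environ if environ is None else environ
    return environ.get("BROZZLER_EXTRA_CHROME_ARGS", "").split()


def build_chrome_command(
    chrome_exe: str,
    base_args: List[str],
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Append the extra options to a base command line (hypothetical helper)."""
    return [chrome_exe, *base_args, *extra_chrome_args(environ)]
```

With this approach a container deployment could opt in to `--no-sandbox` by setting the variable, while the default command line stays sandboxed.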
Corentin Barreau
ff523b3bba
Fix: facebook ads variant selector 2019-11-25 17:48:33 +01:00
Noah Levitt
5094267ae8 bump version after merge 2019-11-15 20:38:05 -08:00
Noah Levitt
dcba6c58e3
Merge pull request #168 from CorentinB/facebook
Implement facebook.js with behaviors.yaml
2019-11-15 20:37:31 -08:00
Corentin Barreau
0c7e93c941
Remove custom interval 2019-11-16 02:11:05 +01:00
Noah Levitt
0cf3a5c12a bump version after merge 2019-11-15 11:08:57 -08:00
Noah Levitt
3136eefb77
Merge pull request #180 from galgeek/UmbraBFB
scroll down, and down, then scroll up
2019-11-15 11:08:29 -08:00
Vangelis Banos
35c5fa482f Enable running in docker / k8s
When trying to run Brozzler in docker, we get the following error:
```
Failed to move to new namespace: PID namespaces supported, Network
namespace supported, but failed: errno = Operation not permitted
Trace/breakpoint trap
```
This happens because chromium uses sandboxing for increased security by
default and it's not supported when running in a container.
Adding chromium option `--no-sandbox` fixes the problem.

This issue is common; I found various reports about it, such as this one:
https://github.com/Zenika/alpine-chrome/issues/33
2019-11-15 13:20:30 +00:00
Barbara Miller
9001449c70 prioritize scrolling down 2019-11-14 17:46:34 -08:00
Barbara Miller
ef70907040
Merge pull request #179 from CorentinB/fix-fb-ads-variants
Fix Facebook Ads Library variants selector
2019-11-13 13:08:47 -08:00
Corentin Barreau
beb80da7d2 Fix ads variant selector 2019-11-13 18:11:48 +01:00
Noah Levitt
395ff69f0a bump version after merge 2019-11-06 13:28:45 -08:00
Noah Levitt
802fbff986
Merge pull request #178 from galgeek/ARI-5995-tidied
ARI-5995 instagram capture updates
2019-11-06 13:26:56 -08:00
Corentin Barreau
06fba51b7f Restore 500ms interval speed 2019-11-06 14:11:19 +01:00
Barbara Miller
ac4a3f9914 simpler check, interval; 500 2019-11-05 17:23:01 -08:00
Noah Levitt
754b92cb96 bump version after merge 2019-11-04 15:20:58 -08:00
Noah Levitt
5bbd262144
Merge pull request #177 from CorentinB/fb-ads-variants
Add capture of Facebook ads variants
2019-11-04 15:20:42 -08:00
Corentin Barreau
ea021ab568 Add: capture of variant ads 2019-11-03 13:35:09 +01:00
Corentin Barreau
e414658056 Add: childSelector 2019-11-03 13:21:23 +01:00
Corentin Barreau
9b54723802 Change interval speed 2019-10-31 23:05:54 +01:00
Corentin Barreau
c3e4597d1a Revert "Change interval speed"
This reverts commit 473fd9e3936be5b179bf7a0b6091bef91fade0c0.
2019-10-31 23:04:50 +01:00
Corentin Barreau
473fd9e393 Change interval speed 2019-10-31 23:02:59 +01:00
Noah Levitt
a85d95e145 bump version after merge 2019-10-31 15:00:59 -07:00
Noah Levitt
3c85cb34c3
Merge pull request #175 from vbanos/whatwg-outlinks
Use urlcanon.whatwg in extracted outlinks
2019-10-31 15:00:31 -07:00