Vangelis Banos
140c27abe8
Skip running behaviors when page is 4xx or 5xx
...
Currently, when we run `Browser.browse_page`, we run JS behaviors after
we navigate to a page regardless of its status.
Maybe the page wasn't found (4xx) or unreachable for any reason (5xx).
In that case, we could skip running behaviors to save time and
resources.
With this PR, we add a new var to store navigated page HTTP status in
`WebsockReceiverThread.page_status`. We use this in
`Browser.browse_page` to skip behaviors, outlink and hashtag extraction
when page status is 4xx/5xx.
Note that we don't skip screenshots as it could be useful to have a
picture of an error page in some cases.
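The status check described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the helper name `should_run_behaviors` is hypothetical, and treating an unknown status (`None`) as "run behaviors" is an assumption.

```python
def should_run_behaviors(page_status):
    """Decide whether behaviors and outlink/hashtag extraction should run.

    page_status stands in for the value stored in
    WebsockReceiverThread.page_status: an int HTTP status, or None
    when no status was recorded (assumption: None means "run anyway").
    """
    if page_status is None:
        return True
    # Skip only client (4xx) and server (5xx) error responses.
    return not (400 <= page_status <= 599)


# Screenshots are taken regardless of this check, per the note above.
for status in (200, 301, 404, 503, None):
    print(status, should_run_behaviors(status))
```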
2020-03-23 16:21:57 +00:00
jkafader
3b249333a4
Merge pull request #189 from galgeek/ARI-6041
...
icaew.com behavior
2020-03-12 15:07:49 -07:00
Barbara Miller
4c0785fbfc
Merge pull request #187 from internetarchive/optimizes-rethinkdb-load-query
...
With the last commit, the only test failure is unrelated test_brozzling.py::test_page_interstitial_exception (already marked xfail in qa).
2020-03-11 21:29:01 -07:00
Barbara Miller
c4beeefe01
address var
2020-03-11 20:56:52 -07:00
Barbara Miller
2dfe3632f5
xfail test
2020-03-11 20:37:30 -07:00
James Kafader
313cec3139
coerce to dict not list
2020-03-11 19:31:02 -07:00
James Kafader
b9c5e4b66c
fix output format
2020-03-11 19:15:57 -07:00
James Kafader
3defd49677
new selection function, based on optimized query
2020-03-11 16:09:16 -07:00
jkafader
1d9a95dfc2
Merge pull request #186 from galgeek/simpler_choose_warcprox
...
Simpler choose warcprox
2020-03-11 14:16:57 -07:00
Barbara Miller
f8f7aa1dca
maybe fewer warcproxes
2020-03-11 14:08:34 -07:00
Barbara Miller
d190122a6d
random.choice
2020-03-11 14:00:07 -07:00
Barbara Miller
af39b8cc6f
skip active_sites query
2020-03-11 13:40:37 -07:00
Barbara Miller
414d1579fc
icaew.com behavior
2020-03-03 11:47:34 -08:00
Noah Levitt
c2a1ca018a
bump version after merge
2019-12-10 10:43:01 -08:00
Noah Levitt
558a0dd615
Merge pull request #184 from nlevitt/limit-failures
...
consider page completed after 3 failures
2019-12-10 10:42:43 -08:00
Noah Levitt
597f2b5b33
reveal bad value when job conf validation fails
2019-12-04 15:11:53 -08:00
Noah Levitt
7915220ab7
consider page completed after 3 failures
...
https://github.com/internetarchive/brozzler/pull/183#issuecomment-560562807
"We've had a number of cases where a page kept failing for one reason or
another, and it's bad. We can end up with tons of duplicate captures,
the crawl is not able to make progress, and the overall performance of
the cluster is impacted in cases like yours, where a browser is sitting
there doing nothing for five minutes."
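A rough sketch of the "completed after 3 failures" rule, for illustration only. The dictionary keys `failed_attempts` and `completed`, the constant `MAX_PAGE_FAILURES`, and the function `record_failure` are hypothetical names, not taken from this commit.

```python
MAX_PAGE_FAILURES = 3  # threshold named in the commit subject


def record_failure(page):
    """Count one failed attempt and mark the page completed at the limit.

    `page` is a plain dict standing in for a page record (assumption).
    """
    page["failed_attempts"] = page.get("failed_attempts", 0) + 1
    if page["failed_attempts"] >= MAX_PAGE_FAILURES:
        # Stop retrying so the crawl can make progress on other pages.
        page["completed"] = True
    return page


page = {"url": "http://example.com/"}
for _ in range(3):
    record_failure(page)
print(page)
```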
2019-12-04 12:38:22 -08:00
Noah Levitt
060adaffd0
Merge pull request #182 from CorentinB/patch-2
...
Fix Facebook ads variant selector
2019-11-27 16:10:55 -08:00
Noah Levitt
5aeaf47b6b
bump version after merge
2019-11-27 12:41:16 -08:00
Noah Levitt
d6ac80af93
Merge pull request #181 from vbanos/no-sandbox
...
Enable running in docker / k8s
2019-11-27 12:40:42 -08:00
Vangelis Banos
3bc2f434ef
Split extra chrome args on whitespace
...
This is in case multiple args are used.
2019-11-27 20:18:41 +00:00
Noah Levitt
64da843dc8
fix travis badge
2019-11-25 16:04:13 -08:00
Vangelis Banos
62cb051f93
Pass extra CLI params to chrome using ENV variable
...
If ENV var `BROZZLER_EXTRA_CHROME_ARGS` is set, pass its contents as
extra chromium cli options.
Remove `--no-sandbox` option. It's not good from a security point of
view.
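A minimal sketch of reading the variable, assuming the value is split on whitespace as the later commit 3bc2f434ef describes. The function name `extra_chrome_args` and its `environ` parameter are hypothetical, and the flags in the usage lines are only sample values.

```python
import os


def extra_chrome_args(environ=None):
    """Return extra chromium CLI options from BROZZLER_EXTRA_CHROME_ARGS.

    `environ` defaults to os.environ; passing a dict makes this easy to test.
    """
    environ = os.environ if environ is None else environ
    # str.split() with no argument splits on any run of whitespace and
    # returns an empty list for an empty or unset value.
    return environ.get("BROZZLER_EXTRA_CHROME_ARGS", "").split()


sample_env = {"BROZZLER_EXTRA_CHROME_ARGS": "--no-sandbox  --disable-gpu"}
print(extra_chrome_args(sample_env))
```

With this approach, a container deployment can opt in to `--no-sandbox` through its environment while the default command line stays sandboxed.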
2019-11-25 20:44:25 +00:00
Corentin Barreau
ff523b3bba
Fix: facebook ads variant selector
2019-11-25 17:48:33 +01:00
Noah Levitt
5094267ae8
bump version after merge
2019-11-15 20:38:05 -08:00
Noah Levitt
dcba6c58e3
Merge pull request #168 from CorentinB/facebook
...
Implement facebook.js with behaviors.yaml
2019-11-15 20:37:31 -08:00
Corentin Barreau
0c7e93c941
Remove custom interval
2019-11-16 02:11:05 +01:00
Noah Levitt
0cf3a5c12a
bump version after merge
2019-11-15 11:08:57 -08:00
Noah Levitt
3136eefb77
Merge pull request #180 from galgeek/UmbraBFB
...
scroll down, and down, then scroll up
2019-11-15 11:08:29 -08:00
Vangelis Banos
35c5fa482f
Enable running in docker / k8s
...
When trying to run Brozzler in docker, we get the following error:
```
Failed to move to new namespace: PID namespaces supported, Network
namespace supported, but failed: errno = Operation not permitted
Trace/breakpoint trap
```
This happens because chromium uses sandboxing for increased security by
default and it's not supported when running in a container.
Adding chromium option `--no-sandbox` fixes the problem.
This issue is common; I found various reports of it, such as this one:
https://github.com/Zenika/alpine-chrome/issues/33
2019-11-15 13:20:30 +00:00
Barbara Miller
9001449c70
prioritize scrolling down
2019-11-14 17:46:34 -08:00
Barbara Miller
ef70907040
Merge pull request #179 from CorentinB/fix-fb-ads-variants
...
Fix Facebook Ads Library variants selector
2019-11-13 13:08:47 -08:00
Corentin Barreau
beb80da7d2
Fix ads variant selector
2019-11-13 18:11:48 +01:00
Noah Levitt
395ff69f0a
bump version after merge
2019-11-06 13:28:45 -08:00
Noah Levitt
802fbff986
Merge pull request #178 from galgeek/ARI-5995-tidied
...
ARI-5995 instagram capture updates
2019-11-06 13:26:56 -08:00
Corentin Barreau
06fba51b7f
Restore 500ms interval speed
2019-11-06 14:11:19 +01:00
Barbara Miller
ac4a3f9914
simpler check, interval; 500
2019-11-05 17:23:01 -08:00
Noah Levitt
754b92cb96
bump version after merge
2019-11-04 15:20:58 -08:00
Noah Levitt
5bbd262144
Merge pull request #177 from CorentinB/fb-ads-variants
...
Add capture of Facebook ads variants
2019-11-04 15:20:42 -08:00
Corentin Barreau
ea021ab568
Add: capture of variant ads
2019-11-03 13:35:09 +01:00
Corentin Barreau
e414658056
Add: childSelector
2019-11-03 13:21:23 +01:00
Corentin Barreau
9b54723802
Change interval speed
2019-10-31 23:05:54 +01:00
Corentin Barreau
c3e4597d1a
Revert "Change interval speed"
...
This reverts commit 473fd9e3936be5b179bf7a0b6091bef91fade0c0.
2019-10-31 23:04:50 +01:00
Corentin Barreau
473fd9e393
Change interval speed
2019-10-31 23:02:59 +01:00
Noah Levitt
a85d95e145
bump version after merge
2019-10-31 15:00:59 -07:00
Noah Levitt
3c85cb34c3
Merge pull request #175 from vbanos/whatwg-outlinks
...
Use urlcanon.whatwg in extracted outlinks
2019-10-31 15:00:31 -07:00
Vangelis Banos
33b7a7f564
Use urlcanon.whatwg in extracted outlinks
...
The aim is to improve outlink quality.
2019-10-31 21:27:55 +00:00
Barbara Miller
37e1c7ed55
rmSelector to remove() login div
2019-10-17 18:03:12 -07:00
Noah Levitt
8beb96817e
bump version after merge
2019-10-17 13:48:23 -07:00
Noah Levitt
e23fa68d65
fix bug clobbering own changes to parent_page
...
and some other tweaks (python 3.5+, pytest logging config, ...)
2019-10-17 13:47:54 -07:00