1384 Commits

Author SHA1 Message Date
Vangelis Banos
3bc2f434ef Split extra chrome args on whitespace
This is in case multiple args are used.
2019-11-27 20:18:41 +00:00
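A minimal sketch of the whitespace split this commit describes. The environment variable name `BROZZLER_EXTRA_CHROME_ARGS` comes from the commit messages here; the helper name `extra_chrome_args` and its `environ` parameter are hypothetical and are not taken from the Brozzler source.

```python
import os


def extra_chrome_args(environ=None):
    """Return extra Chrome CLI args as a list (hypothetical helper).

    Reads BROZZLER_EXTRA_CHROME_ARGS and splits it on any run of
    whitespace, so several options can be passed in one variable.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("BROZZLER_EXTRA_CHROME_ARGS", "")
    # str.split() with no argument splits on runs of whitespace and
    # returns [] for an empty or all-whitespace string.
    return raw.split()


if __name__ == "__main__":
    env = {"BROZZLER_EXTRA_CHROME_ARGS": "--no-sandbox  --disable-gpu"}
    print(extra_chrome_args(env))
```

Note that a plain whitespace split does not honor shell-style quoting; if an argument value could contain spaces, `shlex.split` would be one alternative.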
Noah Levitt
64da843dc8
fix travis badge 2019-11-25 16:04:13 -08:00
Vangelis Banos
62cb051f93 Pass extra CLI params to chrome using ENV variable
If ENV var `BROZZLER_EXTRA_CHROME_ARGS` is set, pass its contents as
extra chromium cli options.

Remove `--no-sandbox` option. It's not good from a security point of
view.
2019-11-25 20:44:25 +00:00
Corentin Barreau
ff523b3bba
Fix: facebook ads variant selector 2019-11-25 17:48:33 +01:00
Noah Levitt
5094267ae8 bump version after merge 2019-11-15 20:38:05 -08:00
Noah Levitt
dcba6c58e3
Merge pull request #168 from CorentinB/facebook
Implement facebook.js with behaviors.yaml
2019-11-15 20:37:31 -08:00
Corentin Barreau
0c7e93c941
Remove custom interval 2019-11-16 02:11:05 +01:00
Noah Levitt
0cf3a5c12a bump version after merge 2019-11-15 11:08:57 -08:00
Noah Levitt
3136eefb77
Merge pull request #180 from galgeek/UmbraBFB
scroll down, and down, then scroll up
2019-11-15 11:08:29 -08:00
Vangelis Banos
35c5fa482f Enable running in docker / k8s
When trying to run Brozzler in docker, we get the following error:
```
Failed to move to new namespace: PID namespaces supported, Network
namespace supported, but failed: errno = Operation not permitted
Trace/breakpoint trap
```
This happens because chromium uses sandboxing for increased security by
default, and it's not supported when running in a container.
Adding chromium option `--no-sandbox` fixes the problem.

This issue is common; I found various reports about it, such as this one:
https://github.com/Zenika/alpine-chrome/issues/33
2019-11-15 13:20:30 +00:00
Barbara Miller
9001449c70 prioritize scrolling down 2019-11-14 17:46:34 -08:00
Barbara Miller
ef70907040
Merge pull request #179 from CorentinB/fix-fb-ads-variants
Fix Facebook Ads Library variants selector
2019-11-13 13:08:47 -08:00
Corentin Barreau
beb80da7d2 Fix ads variant selector 2019-11-13 18:11:48 +01:00
Noah Levitt
395ff69f0a bump version after merge 2019-11-06 13:28:45 -08:00
Noah Levitt
802fbff986
Merge pull request #178 from galgeek/ARI-5995-tidied
ARI-5995 instagram capture updates
2019-11-06 13:26:56 -08:00
Corentin Barreau
06fba51b7f Restore 500ms interval speed 2019-11-06 14:11:19 +01:00
Barbara Miller
ac4a3f9914 simpler check, interval; 500 2019-11-05 17:23:01 -08:00
Noah Levitt
754b92cb96 bump version after merge 2019-11-04 15:20:58 -08:00
Noah Levitt
5bbd262144
Merge pull request #177 from CorentinB/fb-ads-variants
Add capture of Facebook ads variants
2019-11-04 15:20:42 -08:00
Corentin Barreau
ea021ab568 Add: capture of variant ads 2019-11-03 13:35:09 +01:00
Corentin Barreau
e414658056 Add: childSelector 2019-11-03 13:21:23 +01:00
Corentin Barreau
9b54723802 Change interval speed 2019-10-31 23:05:54 +01:00
Corentin Barreau
c3e4597d1a Revert "Change interval speed"
This reverts commit 473fd9e3936be5b179bf7a0b6091bef91fade0c0.
2019-10-31 23:04:50 +01:00
Corentin Barreau
473fd9e393 Change interval speed 2019-10-31 23:02:59 +01:00
Noah Levitt
a85d95e145 bump version after merge 2019-10-31 15:00:59 -07:00
Noah Levitt
3c85cb34c3
Merge pull request #175 from vbanos/whatwg-outlinks
Use urlcanon.whatwg in extracted outlinks
2019-10-31 15:00:31 -07:00
Vangelis Banos
33b7a7f564 Use urlcanon.whatwg in extracted outlinks
The aim is to improve outlink quality.
2019-10-31 21:27:55 +00:00
Barbara Miller
37e1c7ed55 rmSelector to remove() login div 2019-10-17 18:03:12 -07:00
Noah Levitt
8beb96817e bump version after merge 2019-10-17 13:48:23 -07:00
Noah Levitt
e23fa68d65 fix bug clobbering own changes to parent_page
and some other tweaks (python 3.5+, pytest logging config, ...)
2019-10-17 13:47:54 -07:00
Noah Levitt
ba85917f70
Merge pull request #172 from vbanos/block-more-analytics
Block more google-analytics URLs
2019-10-16 10:49:48 -07:00
Barbara Miller
ddf19121fd limit=1 not firstMatchOnly plus nextAction 2019-10-15 15:59:12 -07:00
Barbara Miller
66a29dc8fe update first close selector 2019-10-15 14:35:54 -07:00
Barbara Miller
c62c9f9063 delay instagram youtube-dl captures; collapse if block 2019-10-15 14:35:33 -07:00
Vangelis Banos
f23f49108b Block more google-analytics URLs
After analysing capture logs, we see that we didn't block many
google-analytics related URLs which are used for web statistics. We add
these to the blocked URLs.

In addition, we improve existing block rules. We used to block
`*google-analytics.com/analytics.js` but many sites append some kind of
param at the end, so these URLs weren't blocked. We add `*` at the end of
the existing rules to block these cases as well.
2019-10-11 10:45:23 +00:00
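The effect of the trailing `*` can be illustrated with a small sketch. It uses Python's `fnmatch` as a stand-in for whatever wildcard matching the browser applies to blocked URL patterns, which is an assumption; the helper `is_blocked` and the sample URL are hypothetical.

```python
from fnmatch import fnmatchcase

# Rule as it was before the commit, and with the trailing wildcard added.
OLD_RULE = "*google-analytics.com/analytics.js"
NEW_RULE = "*google-analytics.com/analytics.js*"


def is_blocked(url, rules):
    """Return True if the URL matches any wildcard rule (hypothetical helper)."""
    return any(fnmatchcase(url, rule) for rule in rules)


if __name__ == "__main__":
    # Hypothetical URL with a query parameter appended to the script name.
    url = "https://www.google-analytics.com/analytics.js?v=2"
    print(is_blocked(url, [OLD_RULE]))  # the old rule misses the parameter
    print(is_blocked(url, [NEW_RULE]))  # the new rule covers it
```

With glob-style matching, the old rule only matches URLs that end exactly in `analytics.js`, while the new rule also matches anything that follows it.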
Noah Levitt
1bda52d4c9 bump version 2019-10-09 16:28:58 -07:00
Noah Levitt
65c7ccdcff brozzle-page --screenshot-full-page option 2019-10-09 16:28:26 -07:00
Noah Levitt
e5a3ada349
Merge pull request #171 from vbanos/screenshot-full-screen
Add option to capture full page screenshot
2019-10-09 16:27:05 -07:00
Vangelis Banos
ba901e3a99 Fix JPEG thumbnail problems
Because we run JS behaviors before we capture the screenshot, the
browser may be scrolled down the page. When we don't capture the full
page, we may get a screenshot of the bottom part of the page and not
the top. To fix that, we run `window.scroll(0, 0)` before capturing the
screenshot.

We change method `BrozzlerWorker.full_and_thumb_jpegs` to
`BrozzlerWorker.thumb_jpeg`. That's because we already get a JPEG now
from the browser after our changes at `Browser.screenshot`.

`thumb_jpeg` only returns a thumbnail now. There is no need to read PNG
and convert to JPEG. This means that screenshots will be a bit faster
now :)
2019-10-09 13:34:38 +00:00
Vangelis Banos
674da4aa99 Use JPEG quality: 95 for screenshots 2019-10-09 11:57:18 +00:00
Vangelis Banos
544222b021 Moved screenshot code right after run_behavior
There were some weird screenshots when invoking `try_screenshot` at the end,
after `visit_hashtags` and `extract_outlinks`. The screenshot was
distorted.
2019-10-09 11:39:32 +00:00
Vangelis Banos
c007cda87e Capture screenshot after running behaviors
This is necessary to load all images before taking the screenshot.
2019-10-09 11:05:58 +00:00
Barbara Miller
30cbd3b13d add pop urls using regex for better match 2019-10-08 17:15:01 -07:00
Vangelis Banos
34d8f87fb5 Add option to capture full page screenshot
Add option `full_page` to `Browser.screenshot`. The default behavior
remains the same.
We get inspiration from puppeteer to capture a screenshot of the full
page:
https://github.com/GoogleChrome/puppeteer/blob/master/lib/Page.js#L898

Add option `screenshot_full_page=False` to `Browser.browse_page` to use
the new feature when capturing a page.
2019-10-08 10:55:10 +00:00
Noah Levitt
464562461c
Merge pull request #170 from danielbicho/master
Add option to specify port and interface binding on brozzler-dashboard
2019-10-03 14:43:53 -07:00
Daniel Bicho
4feede08e4 Add option to specify port and interface binding on brozzler-dashboard 2019-10-03 15:20:03 +01:00
Noah Levitt
8a51f28c3d fix dishonest travis badge 2019-10-02 15:02:56 -07:00
Corentin Barreau
f5ed91de6e Replace facebook.js with behaviors.yaml 2019-09-27 21:57:35 +02:00
Noah Levitt
85e6027838 bump version after merge 2019-09-27 10:40:59 -07:00
Noah Levitt
996070b35c
Merge pull request #167 from vbanos/console-debug-only
Enable Console and Runtime outputs only when debugging
2019-09-27 10:40:17 -07:00