1218 Commits

Author SHA1 Message Date
Barbara Miller
9001449c70 prioritize scrolling down 2019-11-14 17:46:34 -08:00
Barbara Miller
ef70907040
Merge pull request #179 from CorentinB/fix-fb-ads-variants
Fix Facebook Ads Library variants selector
2019-11-13 13:08:47 -08:00
Corentin Barreau
beb80da7d2 Fix ads variant selector 2019-11-13 18:11:48 +01:00
Noah Levitt
395ff69f0a bump version after merge 2019-11-06 13:28:45 -08:00
Noah Levitt
802fbff986
Merge pull request #178 from galgeek/ARI-5995-tidied
ARI-5995 instagram capture updates
2019-11-06 13:26:56 -08:00
Barbara Miller
ac4a3f9914 simpler check, interval; 500 2019-11-05 17:23:01 -08:00
Noah Levitt
754b92cb96 bump version after merge 2019-11-04 15:20:58 -08:00
Noah Levitt
5bbd262144
Merge pull request #177 from CorentinB/fb-ads-variants
Add capture of Facebook ads variants
2019-11-04 15:20:42 -08:00
Corentin Barreau
ea021ab568 Add: capture of variant ads 2019-11-03 13:35:09 +01:00
Corentin Barreau
e414658056 Add: childSelector 2019-11-03 13:21:23 +01:00
Noah Levitt
a85d95e145 bump version after merge 2019-10-31 15:00:59 -07:00
Noah Levitt
3c85cb34c3
Merge pull request #175 from vbanos/whatwg-outlinks
Use urlcanon.whatwg in extracted outlinks
2019-10-31 15:00:31 -07:00
Vangelis Banos
33b7a7f564 Use urlcanon.whatwg in extracted outlinks
The aim is to improve outlink quality.
2019-10-31 21:27:55 +00:00
Barbara Miller
37e1c7ed55 rmSelector to remove() login div 2019-10-17 18:03:12 -07:00
Noah Levitt
8beb96817e bump version after merge 2019-10-17 13:48:23 -07:00
Noah Levitt
e23fa68d65 fix bug clobbering own changes to parent_page
and some other tweaks (python 3.5+, pytest logging config, ...)
2019-10-17 13:47:54 -07:00
Noah Levitt
ba85917f70
Merge pull request #172 from vbanos/block-more-analytics
Block more google-analytics URLs
2019-10-16 10:49:48 -07:00
Barbara Miller
ddf19121fd limit=1 not firstMatchOnly plus nextAction 2019-10-15 15:59:12 -07:00
Barbara Miller
66a29dc8fe update first close selector 2019-10-15 14:35:54 -07:00
Barbara Miller
c62c9f9063 delay instagram youtube-dl captures; collapse if block 2019-10-15 14:35:33 -07:00
Vangelis Banos
f23f49108b Block more google-analytics URLs
After analysing capture logs, we see that we didn't block many
google-analytics related URLs which are used for web statistics. We add
these to the blocked URLs.

In addition, we improve existing block rules. We used to block
`*google-analytics.com/analytics.js` but many sites append some kind of
parameter at the end, so these URLs weren't blocked. We add `*` at the end of
the existing rules to block these cases as well.
2019-10-11 10:45:23 +00:00
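
The effect of the trailing `*` described in f23f49108b can be illustrated with a small sketch. It assumes shell-style wildcard semantics and uses Python's `fnmatch` as a stand-in for whatever matcher the browser applies to blocked URL patterns; `is_blocked` and the sample URLs are hypothetical and are not part of brozzler.

```python
from fnmatch import fnmatchcase

# Rule as it was before the commit, and the same rule with a trailing wildcard.
OLD_RULE = "*google-analytics.com/analytics.js"
NEW_RULE = "*google-analytics.com/analytics.js*"


def is_blocked(url, rules):
    """Return True if the URL matches any wildcard rule (hypothetical helper)."""
    return any(fnmatchcase(url, rule) for rule in rules)


# Illustrative URLs: one bare, one with a query parameter appended.
url_plain = "https://www.google-analytics.com/analytics.js"
url_param = "https://www.google-analytics.com/analytics.js?v=2"

old_plain = is_blocked(url_plain, [OLD_RULE])
old_param = is_blocked(url_param, [OLD_RULE])
new_plain = is_blocked(url_plain, [NEW_RULE])
new_param = is_blocked(url_param, [NEW_RULE])
```

Under these assumptions the old rule misses the URL with a query parameter, while the new rule catches both forms.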
Noah Levitt
1bda52d4c9 bump version 2019-10-09 16:28:58 -07:00
Noah Levitt
65c7ccdcff brozzle-page --screenshot-full-page option 2019-10-09 16:28:26 -07:00
Noah Levitt
e5a3ada349
Merge pull request #171 from vbanos/screenshot-full-screen
Add option to capture full page screenshot
2019-10-09 16:27:05 -07:00
Vangelis Banos
ba901e3a99 Fix JPEG thumbnail problems
Due to the fact that we run JS behaviors before we capture the
screenshot, the browser could be scrolled down in the page. When we
don't capture the full page, we may get a screenshot of the bottom part of
the page and not the top. To fix that we run `window.scroll(0, 0)`
before capturing the screenshot.

We change method `BrozzlerWorker.full_and_thumb_jpegs` to
`BrozzlerWorker.thumb_jpeg`. That's because we already get a JPEG now
from the browser after our changes at `Browser.screenshot`.

`thumb_jpeg` only returns a thumbnail now. There is no need to read PNG
and convert to JPEG. This means that screenshots will be a bit faster
now :)
2019-10-09 13:34:38 +00:00
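
As a rough sketch of the order of operations described in ba901e3a99, the snippet below builds the two DevTools protocol messages involved: scroll to the top, then capture a JPEG. `Runtime.evaluate` and `Page.captureScreenshot` are real protocol methods, but sending the scroll through `Runtime.evaluate`, the helper names, and the message ids are assumptions for illustration, not brozzler's actual code. The quality value 95 is taken from commit 674da4aa99.

```python
import itertools

# Hypothetical message-id counter; a real client tracks ids per websocket.
_ids = itertools.count(1)


def cdp_message(method, **params):
    """Build one DevTools protocol request as a plain dict."""
    return {"id": next(_ids), "method": method, "params": params}


def viewport_screenshot_messages(jpeg_quality=95):
    """Messages for a viewport screenshot taken from the top of the page."""
    return [
        # Behaviors may have scrolled the page down, so reset the position.
        cdp_message("Runtime.evaluate", expression="window.scroll(0, 0)"),
        # Ask the browser for a JPEG directly, so no PNG conversion is needed.
        cdp_message("Page.captureScreenshot", format="jpeg", quality=jpeg_quality),
    ]


messages = viewport_screenshot_messages()
```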
Vangelis Banos
674da4aa99 Use JPEG quality: 95 for screenshots 2019-10-09 11:57:18 +00:00
Vangelis Banos
544222b021 Moved screenshot code right after run_behavior
There were some weird screenshots when invoking `try_screenshot` at the end
after `visit_hashtags` and `extract_outlinks`. The screenshot was
distorted.
2019-10-09 11:39:32 +00:00
Vangelis Banos
c007cda87e Capture screenshot after running behaviors
This is necessary to load all images before taking the screenshot.
2019-10-09 11:05:58 +00:00
Vangelis Banos
34d8f87fb5 Add option to capture full page screenshot
Add option `full_page` to `Browser.screenshot`. The default behavior
remains the same.
We get inspiration from puppeteer to capture a screenshot of the full
page:
https://github.com/GoogleChrome/puppeteer/blob/master/lib/Page.js#L898

Add option `screenshot_full_page=False` to `Browser.browse_page` to use
the new feature when capturing a page.
2019-10-08 10:55:10 +00:00
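
The puppeteer approach referenced in 34d8f87fb5 is, roughly, to measure the page content and resize the emulated viewport to that size before capturing. The sketch below shows only the sizing step, assuming the `contentSize` field of a `Page.getLayoutMetrics` response and the required parameters of `Emulation.setDeviceMetricsOverride`. The function name, the sample metrics, and the fixed `deviceScaleFactor` of 1 are assumptions, not brozzler's implementation.

```python
import math


def full_page_override(layout_metrics):
    """Derive Emulation.setDeviceMetricsOverride params from layout metrics."""
    size = layout_metrics["contentSize"]
    return {
        # Round up so a fractional pixel of content is not cut off.
        "width": math.ceil(size["width"]),
        "height": math.ceil(size["height"]),
        "deviceScaleFactor": 1,  # assumed; a real client may reuse its own value
        "mobile": False,
    }


# Illustrative response fragment, not captured from a real page.
sample_metrics = {"contentSize": {"x": 0, "y": 0, "width": 1280, "height": 4321.5}}
override = full_page_override(sample_metrics)
```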
Noah Levitt
464562461c
Merge pull request #170 from danielbicho/master
Add option to specify port and interface binding on brozzler-dashboard
2019-10-03 14:43:53 -07:00
Daniel Bicho
4feede08e4 Add option to specify port and interface binding on brozzler-dashboard 2019-10-03 15:20:03 +01:00
Noah Levitt
8a51f28c3d fix dishonest travis badge 2019-10-02 15:02:56 -07:00
Noah Levitt
85e6027838 bump version after merge 2019-09-27 10:40:59 -07:00
Noah Levitt
996070b35c
Merge pull request #167 from vbanos/console-debug-only
Enable Console and Runtime outputs only when debugging
2019-09-27 10:40:17 -07:00
Vangelis Banos
fed5e6b741 Enable Console and Runtime outputs only when debugging
When capturing a page, we receive a LOT of messages from chrome.
Examining these messages, we see that we can reduce them a bit to speed
up Brozzler.

We always use `Console.enable` which returns all browser console output.
Also, we always use `Runtime.enable`. Doc says:
https://chromedevtools.github.io/devtools-protocol/1-3/Runtime#method-enable

Enables reporting of execution contexts creation by means of
executionContextCreated event. When the reporting gets enabled the event
will be sent immediately for each existing execution context.

These outputs are useful when debugging but not in production.
If we disable them, we reduce the websocket traffic and improve
performance. With this PR, we enable them only when the current logging
level is `DEBUG`.

Counting the number of messages before and after the change, we see
improvements like:

https://www.gnome.org/technologies/ 220 -> 202 messages.

https://www.whitehouse.gov/issues/budget-spending/  203 -> 189 messages
2019-09-27 13:24:06 +00:00
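
The gating described in fed5e6b741 can be sketched as follows. `Console.enable` and `Runtime.enable` are the protocol methods named in the commit message; the always-on list (`Network.enable`, `Page.enable`), the function name, and the logger names are assumptions for illustration only.

```python
import logging


def domains_to_enable(logger):
    """Return the protocol `enable` methods to send for a new browser session."""
    # Assumed baseline; the commit message does not list the always-on domains.
    methods = ["Network.enable", "Page.enable"]
    if logger.isEnabledFor(logging.DEBUG):
        # Console and execution-context events are only useful when debugging.
        methods += ["Console.enable", "Runtime.enable"]
    return methods


quiet = logging.getLogger("sketch.quiet")
quiet.setLevel(logging.INFO)
verbose = logging.getLogger("sketch.verbose")
verbose.setLevel(logging.DEBUG)

quiet_methods = domains_to_enable(quiet)
verbose_methods = domains_to_enable(verbose)
```

At `INFO` the two debugging domains are never enabled, so the browser does not send their events over the websocket.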
Noah Levitt
7273c7c3a2
Merge pull request #166 from CorentinB/facebook-ads-lib
Add support for Facebook ads library and fix closing
2019-09-26 14:13:47 -07:00
Corentin Barreau
e701e3f101 Add: break after closing the first visible element 2019-09-26 21:44:25 +02:00
Corentin Barreau
101f7f2e4a Remove: useless comment 2019-09-25 19:48:38 +02:00
Corentin Barreau
fb30fb9aa3 Add: isVisible check for close selectors
Modify: doTarget - Revert to initial code
2019-09-25 16:19:41 +02:00
Corentin Barreau
5c5743ea11 Fix: closeSelector not being clicked
Add: support for facebook.com/ads/library - Open and close metrics for ads
2019-09-25 16:10:59 +02:00
Noah Levitt
efa185a8dc
Merge pull request #160 from vbanos/behavior-timeout
More accurate JS behavior timeout
2019-09-24 12:11:37 -07:00
Noah Levitt
eb30ba0c33
Merge pull request #165 from vbanos/stderr-stdout-exception-handling
Improve exception handling when reading STDIN/STDERR
2019-09-24 12:03:06 -07:00
Vangelis Banos
f42ff08da1 Improve exception handling when reading STDIN/STDERR
When the chrome process dies and we try to read STDIN/STDERR, we get
`ValueError: I/O operation on closed file` or
`OSError: [Errno 9] Bad file descriptor`.

We modify `readline_nonblock` method to return the buffer it read up to
this point.
2019-09-19 20:08:55 +00:00
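
A minimal sketch of the behavior described in f42ff08da1: read one line without blocking and, if the stream has been closed underneath the reader, return whatever was read so far instead of raising. This is a standalone illustration that assumes a POSIX system (where `select` works on pipes); the signature, the `poll_seconds` parameter, and the pipe-based demo are assumptions rather than brozzler's actual method.

```python
import os
import select


def readline_nonblock(f, poll_seconds=0.5):
    """Read up to one line from f, returning early if no data is ready."""
    buf = b""
    try:
        while not buf.endswith(b"\n") and select.select([f], [], [], poll_seconds)[0]:
            chunk = f.read(1)
            if not chunk:  # end of file
                break
            buf += chunk
    except (ValueError, OSError):
        # The stream was closed (for example because the process died);
        # fall through and return what has been read up to this point.
        pass
    return buf


# Demo with a pipe standing in for the browser's output stream.
read_fd, write_fd = os.pipe()
reader = os.fdopen(read_fd, "rb", buffering=0)
os.write(write_fd, b"line one\npartial")
first = readline_nonblock(reader)
second = readline_nonblock(reader, poll_seconds=0.1)
reader.close()
after_close = readline_nonblock(reader)
os.close(write_fd)
```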
Vangelis Banos
0b28a4a57f More accurate JS behavior timeout
If you use a JS behavior timeout smaller than 7 sec, the JS behavior
will always need 7 sec because `sleep(7)` is hard-coded there.

We make a minor addition to use `min(timeout, 7)` for sleep so it will
finish faster when using a smaller JS behavior timeout.
2019-08-22 21:15:44 +00:00
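
The `min(timeout, 7)` change in 0b28a4a57f can be sketched with a simple polling loop. The loop structure, the function name, and the injectable `sleep` argument are assumptions made so the example is self-contained; only the 7 second interval and the `min(timeout, 7)` expression come from the commit message.

```python
import time

POLL_SECONDS = 7  # the hard-coded interval mentioned in the commit message


def run_with_timeout(is_finished, timeout, sleep=time.sleep):
    """Poll is_finished() until it is true or `timeout` is used up.

    Returns the total number of seconds spent sleeping.
    """
    waited = 0
    while waited < timeout:
        # Never sleep longer than the requested timeout.
        step = min(timeout, POLL_SECONDS)
        sleep(step)
        waited += step
        if is_finished():
            break
    return waited


# Record the sleeps instead of actually waiting.
short_sleeps = []
short_total = run_with_timeout(lambda: False, 3, sleep=short_sleeps.append)
long_sleeps = []
long_total = run_with_timeout(lambda: False, 20, sleep=long_sleeps.append)
```

With a 3 second timeout the loop sleeps once for 3 seconds rather than the full 7.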
Noah Levitt
16f886259d
Merge pull request #158 from galgeek/aitfive-1668-soundcoud
capture soundcloud user page before capturing tracks
2019-08-15 15:46:55 -07:00
Noah Levitt
94cd6cacb6 bump version after merge 2019-07-18 11:07:27 -07:00
Noah Levitt
726c6effed
Merge pull request #157 from vbanos/block-amp-analytics
Block AMP analytics JS script
2019-07-18 11:07:09 -07:00
Barbara Miller
9cc60449d7 skip downloading tracks from soundcloud user page 2019-07-17 17:45:02 -07:00
Vangelis Banos
6bd4fd6532 Block AMP analytics JS script
AMP analytics is part of Google analytics. We need to block it for
similar reasons.

AMP analytics reference:

https://developers.google.com/analytics/devguides/collection/amp-analytics/
2019-06-26 21:19:35 +00:00
Noah Levitt
8107abd804
Merge pull request #154 from vbanos/fix-brozzling-test
Fix test_brozzling::httpd fixture
1.5.6
2019-05-16 14:23:04 -07:00