1200 Commits

Author SHA1 Message Date
Noah Levitt
e23fa68d65 fix bug clobbering own changes to parent_page
and some other tweaks (python 3.5+, pytest logging config, ...)
2019-10-17 13:47:54 -07:00
Noah Levitt
ba85917f70
Merge pull request #172 from vbanos/block-more-analytics
Block more google-analytics URLs
2019-10-16 10:49:48 -07:00
Vangelis Banos
f23f49108b Block more google-analytics URLs
After analysing capture logs, we see that we didn't block many
google-analytics-related URLs that are used for web statistics. We add
these to the blocked URLs.

In addition, we improve existing block rules. We used to block
`*google-analytics.com/analytics.js` but many sites append some kind of
parameter at the end, so these URLs weren't blocked. We add `*` at the end
of the existing rules to block these cases as well.
2019-10-11 10:45:23 +00:00
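The effect of the trailing `*` described in the commit above can be illustrated with a minimal sketch. It uses Python's `fnmatch` as a stand-in for the browser's own wildcard matching, which is an assumption; the helper `is_blocked` and the sample URLs are hypothetical and not part of brozzler.

```python
from fnmatch import fnmatchcase

# Rule as quoted in the commit message, and the same rule with a trailing "*".
OLD_RULE = "*google-analytics.com/analytics.js"
NEW_RULE = "*google-analytics.com/analytics.js*"


def is_blocked(url, rules):
    """Return True if the URL matches any glob-style rule (hypothetical helper)."""
    return any(fnmatchcase(url, rule) for rule in rules)


# Sample URLs, made up for illustration.
plain = "https://www.google-analytics.com/analytics.js"
with_param = "https://www.google-analytics.com/analytics.js?v=2"

print(is_blocked(plain, [OLD_RULE]))       # the bare script URL matches the old rule
print(is_blocked(with_param, [OLD_RULE]))  # a query string defeats the old rule
print(is_blocked(with_param, [NEW_RULE]))  # the trailing "*" covers it
```

Without the trailing wildcard the pattern must match through the end of the URL, so any appended parameter slips past the rule.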
Noah Levitt
1bda52d4c9 bump version 2019-10-09 16:28:58 -07:00
Noah Levitt
65c7ccdcff brozzle-page --screenshot-full-page option 2019-10-09 16:28:26 -07:00
Noah Levitt
e5a3ada349
Merge pull request #171 from vbanos/screenshot-full-screen
Add option to capture full page screenshot
2019-10-09 16:27:05 -07:00
Vangelis Banos
ba901e3a99 Fix JPEG thumbnail problems
Because we run JS behaviors before we capture the screenshot, the
browser may be scrolled down in the page. When we don't capture the full
page, we may get a screenshot of the bottom part of the page and not the
top. To fix that, we run `window.scroll(0, 0)` before capturing the
screenshot.

We change method `BrozzlerWorker.full_and_thumb_jpegs` to
`BrozzlerWorker.thumb_jpeg`. That's because we already get a JPEG now
from the browser after our changes at `Browser.screenshot`.

`thumb_jpeg` only returns a thumbnail now. There is no need to read PNG
and convert to JPEG. This means that screenshots will be a bit faster
now :)
2019-10-09 13:34:38 +00:00
Vangelis Banos
674da4aa99 Use JPEG quality: 95 for screenshots 2019-10-09 11:57:18 +00:00
Vangelis Banos
544222b021 Moved screenshot code right after run_behavior
There were some weird screenshots when invoking `try_screenshot` at the end,
after `visit_hashtags` and `extract_outlinks`. The screenshot was
distorted.
2019-10-09 11:39:32 +00:00
Vangelis Banos
c007cda87e Capture screenshot after running behaviors
This is necessary to load all images before taking the screenshot.
2019-10-09 11:05:58 +00:00
Vangelis Banos
34d8f87fb5 Add option to capture full page screenshot
Add option `full_page` to `Browser.screenshot`. The default behavior
remains the same.
We get inspiration from puppeteer to capture a screenshot of the full
page:
https://github.com/GoogleChrome/puppeteer/blob/master/lib/Page.js#L898

Add option `screenshot_full_page=False` to `Browser.browse_page` to use
the new feature when capturing a page.
2019-10-08 10:55:10 +00:00
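A rough sketch of the puppeteer-style approach mentioned in the commit above: resize the emulated viewport to the full content size, capture, then restore. The DevTools Protocol method names are real, but the function name, the message numbering, and the default quality are assumptions for illustration, not brozzler's actual `Browser.screenshot` code. The content size is assumed to come from an earlier `Page.getLayoutMetrics` call.

```python
import json


def full_page_screenshot_commands(content_width, content_height, jpeg_quality=95):
    """Build the DevTools Protocol messages for a full-page capture.

    Hypothetical helper: it only builds JSON strings; sending them over the
    browser's websocket is left out.
    """
    commands = [
        # Make the viewport as large as the whole page content.
        {"method": "Emulation.setDeviceMetricsOverride",
         "params": {"width": int(content_width),
                    "height": int(content_height),
                    "deviceScaleFactor": 1,
                    "mobile": False}},
        # Capture what is now visible, i.e. the full page.
        {"method": "Page.captureScreenshot",
         "params": {"format": "jpeg", "quality": jpeg_quality}},
        # Restore the original viewport.
        {"method": "Emulation.clearDeviceMetricsOverride"},
    ]
    return [json.dumps(dict(cmd, id=i)) for i, cmd in enumerate(commands, start=1)]


for message in full_page_screenshot_commands(1280, 4200):
    print(message)
```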
Noah Levitt
464562461c
Merge pull request #170 from danielbicho/master
Add option to specify port and interface binding on brozzler-dashboard
2019-10-03 14:43:53 -07:00
Daniel Bicho
4feede08e4 Add option to specify port and interface binding on brozzler-dashboard 2019-10-03 15:20:03 +01:00
Noah Levitt
8a51f28c3d fix dishonest travis badge 2019-10-02 15:02:56 -07:00
Noah Levitt
85e6027838 bump version after merge 2019-09-27 10:40:59 -07:00
Noah Levitt
996070b35c
Merge pull request #167 from vbanos/console-debug-only
Enable Console and Runtime outputs only when debugging
2019-09-27 10:40:17 -07:00
Vangelis Banos
fed5e6b741 Enable Console and Runtime outputs only when debugging
When capturing a page, we receive a LOT of messages from chrome.
Examining these messages, we see that we can reduce them a bit to speed
up Brozzler.

We always use `Console.enable` which returns all browser console output.
Also, we always use `Runtime.enable`. Doc says:
https://chromedevtools.github.io/devtools-protocol/1-3/Runtime#method-enable

Enables reporting of execution contexts creation by means of
executionContextCreated event. When the reporting gets enabled the event
will be sent immediately for each existing execution context.

These outputs are useful when debugging but not in production.
If we disable them, we reduce the websocket traffic and improve
performance. With this PR, we enable them only when the current logging
level is `DEBUG`.

Counting the number of messages before and after the change, we see
improvements like:

https://www.gnome.org/technologies/ 220 -> 202 messages.

https://www.whitehouse.gov/issues/budget-spending/  203 -> 189 messages
2019-09-27 13:24:06 +00:00
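The idea in the commit above, enabling the noisy protocol domains only when debugging, can be sketched as follows. The function name and the baseline list of always-enabled domains are assumptions for illustration; only `Console.enable` and `Runtime.enable` are named in the commit message.

```python
import logging


def domains_to_enable(logger):
    """Return the DevTools Protocol 'enable' methods to send (hypothetical helper)."""
    # Assumed baseline; the real set of always-enabled domains may differ.
    methods = ["Network.enable", "Page.enable"]
    if logger.isEnabledFor(logging.DEBUG):
        # Console and Runtime events are only useful when debugging.
        methods += ["Console.enable", "Runtime.enable"]
    return methods


log = logging.getLogger("sketch.browser")
log.setLevel(logging.INFO)
print(domains_to_enable(log))   # production-like level: no Console/Runtime
log.setLevel(logging.DEBUG)
print(domains_to_enable(log))   # debugging: Console/Runtime included
```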
Noah Levitt
7273c7c3a2
Merge pull request #166 from CorentinB/facebook-ads-lib
Add support for Facebook ads library and fix closing
2019-09-26 14:13:47 -07:00
Corentin Barreau
e701e3f101 Add: break after closing the first visible element 2019-09-26 21:44:25 +02:00
Corentin Barreau
101f7f2e4a Remove: useless comment 2019-09-25 19:48:38 +02:00
Corentin Barreau
fb30fb9aa3 Add: isVisible check for close selectors
Modify: doTarget - Revert to initial code
2019-09-25 16:19:41 +02:00
Corentin Barreau
5c5743ea11 Fix: closeSelector not being clicked
Add: support for facebook.com/ads/library - Open and close metrics for ads
2019-09-25 16:10:59 +02:00
Noah Levitt
efa185a8dc
Merge pull request #160 from vbanos/behavior-timeout
More accurate JS behavior timeout
2019-09-24 12:11:37 -07:00
Noah Levitt
eb30ba0c33
Merge pull request #165 from vbanos/stderr-stdout-exception-handling
Improve exception handling when reading STDIN/STDERR
2019-09-24 12:03:06 -07:00
Vangelis Banos
f42ff08da1 Improve exception handling when reading STDIN/STDERR
When the chrome process dies and we try to read STDIN/STDERR, we get
`ValueError: I/O operation on closed file` or
`OSError: [Errno 9] Bad file descriptor`.

We modify `readline_nonblock` method to return the buffer it read up to
this point.
2019-09-19 20:08:55 +00:00
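A minimal sketch of the behavior described in the commit above: a non-blocking line reader that returns whatever it has buffered when the underlying file turns out to be closed. This is not the project's actual `readline_nonblock`; the byte-at-a-time loop and the use of `select` (which works on pipes on POSIX systems only) are assumptions.

```python
import os
import select


def readline_nonblock(f):
    """Read up to one line from f without blocking.

    If the file is closed underneath us, return the bytes read so far
    instead of raising.
    """
    buf = b""
    try:
        # Poll with a zero timeout so we never block waiting for data.
        while select.select([f], [], [], 0)[0]:
            chunk = os.read(f.fileno(), 1)
            if not chunk:          # EOF
                break
            buf += chunk
            if chunk == b"\n":     # got a full line
                break
    except (ValueError, OSError):
        # ValueError: I/O operation on closed file
        # OSError: [Errno 9] Bad file descriptor
        pass
    return buf


# Usage with a pipe standing in for the browser process's output stream.
r, w = os.pipe()
reader = os.fdopen(r, "rb", buffering=0)
os.write(w, b"first line\npartial")
print(readline_nonblock(reader))
print(readline_nonblock(reader))
reader.close()
print(readline_nonblock(reader))   # closed file: no exception is raised
os.close(w)
```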
Vangelis Banos
0b28a4a57f More accurate JS behavior timeout
If you use a JS behavior timeout smaller than 7 sec, the JS behavior
will always need 7 sec because `sleep(7)` is hard-coded there.

We make a minor addition to use `min(timeout, 7)` for sleep so it will
finish faster when using a smaller JS behavior timeout.
2019-08-22 21:15:44 +00:00
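The `min(timeout, 7)` change in the commit above amounts to capping the sleep at the behavior timeout. A minimal sketch, with a hypothetical function name and an injectable `sleep` so the example runs instantly:

```python
import time


def sleep_for_behavior(behavior_timeout, max_interval=7, sleep=time.sleep):
    """Sleep for at most max_interval seconds, and never longer than the timeout."""
    interval = min(behavior_timeout, max_interval)
    sleep(interval)
    return interval


# With a 3-second timeout the wait is 3 seconds rather than the hard-coded 7.
print(sleep_for_behavior(3, sleep=lambda seconds: None))
print(sleep_for_behavior(30, sleep=lambda seconds: None))
```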
Noah Levitt
16f886259d
Merge pull request #158 from galgeek/aitfive-1668-soundcoud
capture soundcloud user page before capturing tracks
2019-08-15 15:46:55 -07:00
Noah Levitt
94cd6cacb6 bump version after merge 2019-07-18 11:07:27 -07:00
Noah Levitt
726c6effed
Merge pull request #157 from vbanos/block-amp-analytics
Block AMP analytics JS script
2019-07-18 11:07:09 -07:00
Barbara Miller
9cc60449d7 skip downloading tracks from soundcloud user page 2019-07-17 17:45:02 -07:00
Vangelis Banos
6bd4fd6532 Block AMP analytics JS script
AMP analytics is part of Google analytics. We need to block it for
similar reasons.

AMP analytics reference:

https://developers.google.com/analytics/devguides/collection/amp-analytics/
2019-06-26 21:19:35 +00:00
Noah Levitt
8107abd804
Merge pull request #154 from vbanos/fix-brozzling-test
Fix test_brozzling::httpd fixture
1.5.6
2019-05-16 14:23:04 -07:00
Noah Levitt
5fdb2dd39c documentation tweak 2019-05-16 14:03:43 -07:00
Noah Levitt
aa2d491009 i don't know where pyyaml 5.8 came from 2019-05-16 01:29:05 -07:00
Noah Levitt
42ddfba923
Merge pull request #150 from nlevitt/purge-old
Purge old
2019-05-16 00:29:58 -07:00
Noah Levitt
40331f02ba
Merge pull request #153 from vbanos/warn-deprecated
logging.warn is deprecated and replaced by logging.warning
2019-05-16 00:27:22 -07:00
Noah Levitt
f8db17ce3d bump version after merge 2019-05-16 00:22:29 -07:00
Noah Levitt
eb34bebb91
Merge pull request #149 from nlevitt/travis-py37
trying to make this work with xenial for travis
2019-05-16 00:22:08 -07:00
Noah Levitt
c651bcdd18 remove some travis-ci debugging stuff 2019-05-16 00:21:28 -07:00
Noah Levitt
0a1360ab25 don't use localhost for test http server...
... because apparently sometimes chromium bypasses the proxy for local
addresses
2019-05-15 18:49:18 -07:00
Noah Levitt
f8165dc02b work around pytest issue until fix is out
https://github.com/pytest-dev/pytest/issues/5257
2019-05-15 18:46:21 -07:00
Vangelis Banos
a1f9122317 Fix test_brozzling::httpd fixture
We used `self.headers.getheader` which no longer works. We replace it
with `self.headers.get`.

We change the code to write binary data to `self.wfile` because we get
an exception for writing str and/or None.
2019-05-14 16:29:52 +00:00
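The two fixes in the commit above, reading headers with `self.headers.get` and writing bytes to `self.wfile`, can be illustrated with a small standalone handler. This is a sketch, not the project's test fixture; the handler class, the `X-Test` header, and the `fetch_echo` helper are hypothetical.

```python
import http.server
import threading
import urllib.request


class EchoHeaderHandler(http.server.BaseHTTPRequestHandler):
    """Echo back the value of a (hypothetical) X-Test request header."""

    def do_GET(self):
        # Python 3: headers.get() replaces the Python 2 getheader();
        # the default avoids passing None further down.
        value = self.headers.get("X-Test", "")
        body = value.encode("utf-8")   # wfile expects bytes, not str
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_echo(header_value):
    """Start the server on an ephemeral port, make one request, shut down."""
    server = http.server.HTTPServer(("127.0.0.1", 0), EchoHeaderHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        request = urllib.request.Request(url, headers={"X-Test": header_value})
        # Ignore any proxy settings from the environment for this local request.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=10) as response:
            return response.read()
    finally:
        server.shutdown()
        server.server_close()


print(fetch_echo("hello"))
```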
Vangelis Banos
a2ac3a0374 logging.warn is deprecated and replaced by logging.warning
We replace it everywhere in the code base.
2019-05-14 12:10:59 +00:00
Noah Levitt
ee8ef23f0c fix mistake in job-conf.rst 2019-04-30 10:49:48 -07:00
Noah Levitt
411b3f266a bump version after merge 2019-04-09 22:07:51 +00:00
Noah Levitt
d4386491df
Merge pull request #151 from nlevitt/no-cerberus-normalize
don't attempt cerberus normalization
2019-04-09 15:06:17 -07:00
Noah Levitt
5385232b40 don't attempt cerberus normalization
which encumbers the validation with additional requirements,
specifically makes it difficult to validate a subclass of `dict` because
it expects a constructor that works like dict.__init__()
2019-04-09 01:45:37 -07:00
Noah Levitt
8dfd92cf7f fix this utility 2019-04-09 01:44:14 -07:00
Noah Levitt
433b201b52 use logging.warning() to quiet py37 warnings 2019-04-09 01:43:38 -07:00
Noah Levitt
dfd9d9ecdd omfg 2019-04-04 17:22:15 -07:00