1384 Commits

Author SHA1 Message Date
Vangelis Banos
fed5e6b741 Enable Console and Runtime outputs only when debugging
When capturing a page, we receive a LOT of messages from chrome.
Examining these messages, we see that we can reduce them a bit to speed
up Brozzler.

We always use `Console.enable` which returns all browser console output.
Also, we always use `Runtime.enable`. Doc says:
https://chromedevtools.github.io/devtools-protocol/1-3/Runtime#method-enable

Enables reporting of execution contexts creation by means of
executionContextCreated event. When the reporting gets enabled the event
will be sent immediately for each existing execution context.

These outputs are useful when debugging but not in production.
If we disable them, we reduce the websocket traffic and improve
performance. With this PR, we enable them only when the current logging
level is `DEBUG`.

Counting the number of messages before and after the change, we see
improvements like:

https://www.gnome.org/technologies/ 220 -> 202 messages.

https://www.whitehouse.gov/issues/budget-spending/  203 -> 189 messages
2019-09-27 13:24:06 +00:00
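
The gating described in fed5e6b741 could be sketched roughly as follows. This is an illustration only, not brozzler's actual code: `methods_to_enable` is a hypothetical helper, and the assumption that `Network.enable` and `Page.enable` are always sent is made up for the example. Only `Console.enable` and `Runtime.enable` are named in the commit message.

```python
import logging

# Assumption for illustration: methods sent on every page capture.
ALWAYS_ENABLED = ["Network.enable", "Page.enable"]
# Methods named in the commit: their output is only useful when debugging.
DEBUG_ONLY = ["Console.enable", "Runtime.enable"]


def methods_to_enable(logger):
    """Return the DevTools protocol methods to send for this logger's level."""
    methods = list(ALWAYS_ENABLED)
    if logger.isEnabledFor(logging.DEBUG):
        # Extra websocket traffic is acceptable while debugging.
        methods.extend(DEBUG_ONLY)
    return methods
```

Checking the level once, before the messages are sent, keeps the production path free of the extra console and execution-context events.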
Noah Levitt
7273c7c3a2
Merge pull request #166 from CorentinB/facebook-ads-lib
Add support for Facebook ads library and fix closing
2019-09-26 14:13:47 -07:00
Corentin Barreau
e701e3f101 Add: break after closing the first visible element 2019-09-26 21:44:25 +02:00
Corentin Barreau
101f7f2e4a Remove: useless comment 2019-09-25 19:48:38 +02:00
Corentin Barreau
fb30fb9aa3 Add: isVisible check for close selectors
Modify: doTarget - Revert to initial code
2019-09-25 16:19:41 +02:00
Corentin Barreau
5c5743ea11 Fix: closeSelector not being clicked
Add: support for facebook.com/ads/library - Open and close metrics for ads
2019-09-25 16:10:59 +02:00
Noah Levitt
efa185a8dc
Merge pull request #160 from vbanos/behavior-timeout
More accurate JS behavior timeout
2019-09-24 12:11:37 -07:00
Noah Levitt
eb30ba0c33
Merge pull request #165 from vbanos/stderr-stdout-exception-handling
Improve exception handling when reading STDIN/STDERR
2019-09-24 12:03:06 -07:00
Vangelis Banos
f42ff08da1 Improve exception handling when reading STDIN/STDERR
When the chrome process dies and we try to read STDIN/STDERR, we get
`ValueError: I/O operation on closed file` or
`OSError: [Errno 9] Bad file descriptor`.

We modify the `readline_nonblock` method to return the buffer it has read up
to this point.
2019-09-19 20:08:55 +00:00
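
A minimal sketch of the behaviour described in f42ff08da1, assuming a byte-at-a-time reader built on `select`. The function name comes from the commit message; the body is a reconstruction for illustration and may differ from brozzler's implementation. It assumes a POSIX platform, where `select` works on pipes.

```python
import select


def readline_nonblock(f):
    """Read up to one line from f without blocking; return what was read."""
    buf = b""
    try:
        # select with a zero timeout only reports data that is ready now.
        while select.select([f], [], [], 0)[0]:
            chunk = f.read(1)
            buf += chunk
            if chunk in (b"", b"\n"):  # EOF or end of line
                break
    except (ValueError, OSError):
        # The pipe was closed (for example, the child process died):
        # stop reading and fall through to return the partial buffer.
        pass
    return buf
```

Returning the partial buffer instead of raising means the last output of a dying process is still available to the caller.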
Vangelis Banos
0b28a4a57f More accurate JS behavior timeout
If you use a JS behavior timeout smaller than 7 sec, the JS behavior
will always need 7 sec because `sleep(7)` is hard-coded there.

We make a minor addition to use `min(timeout, 7)` for sleep so it will
finish faster when using a smaller JS behavior timeout.
2019-08-22 21:15:44 +00:00
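
The change in 0b28a4a57f can be illustrated with a small polling loop. This is a sketch under assumptions: `wait_for_behavior`, `sleep_interval` and the `is_finished` callback are hypothetical names, and only the `min(timeout, 7)` expression comes from the commit message.

```python
import time

HARD_CODED_SLEEP = 7  # seconds; the fixed interval mentioned in the commit


def sleep_interval(timeout):
    """Never sleep longer than the configured behavior timeout."""
    return min(timeout, HARD_CODED_SLEEP)


def wait_for_behavior(is_finished, timeout, sleep=time.sleep):
    """Poll is_finished() until it returns True or timeout seconds pass."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        sleep(sleep_interval(timeout))
        if is_finished():
            return True
    return False
```

With a 3 second timeout the loop now sleeps 3 seconds rather than 7; for timeouts of 7 seconds or more the interval is unchanged.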
Noah Levitt
16f886259d
Merge pull request #158 from galgeek/aitfive-1668-soundcoud
capture soundcloud user page before capturing tracks
2019-08-15 15:46:55 -07:00
Noah Levitt
94cd6cacb6 bump version after merge 2019-07-18 11:07:27 -07:00
Noah Levitt
726c6effed
Merge pull request #157 from vbanos/block-amp-analytics
Block AMP analytics JS script
2019-07-18 11:07:09 -07:00
Barbara Miller
9cc60449d7 skip downloading tracks from soundcloud user page 2019-07-17 17:45:02 -07:00
Vangelis Banos
6bd4fd6532 Block AMP analytics JS script
AMP analytics is part of Google analytics. We need to block it for
similar reasons.

AMP analytics reference:

https://developers.google.com/analytics/devguides/collection/amp-analytics/
2019-06-26 21:19:35 +00:00
Noah Levitt
8107abd804
Merge pull request #154 from vbanos/fix-brozzling-test
Fix test_brozzling::httpd fixture
1.5.6
2019-05-16 14:23:04 -07:00
Noah Levitt
5fdb2dd39c documentation tweak 2019-05-16 14:03:43 -07:00
Noah Levitt
aa2d491009 i don't know where pyyaml 5.8 came from 2019-05-16 01:29:05 -07:00
Noah Levitt
42ddfba923
Merge pull request #150 from nlevitt/purge-old
Purge old
2019-05-16 00:29:58 -07:00
Noah Levitt
40331f02ba
Merge pull request #153 from vbanos/warn-deprecated
logging.warn is deprecated and replaced by logging.warning
2019-05-16 00:27:22 -07:00
Noah Levitt
f8db17ce3d bump version after merge 2019-05-16 00:22:29 -07:00
Noah Levitt
eb34bebb91
Merge pull request #149 from nlevitt/travis-py37
trying to make this work with xenial for travis
2019-05-16 00:22:08 -07:00
Noah Levitt
c651bcdd18 remove some travis-ci debugging stuff 2019-05-16 00:21:28 -07:00
Noah Levitt
0a1360ab25 don't use localhost for test http server...
... because apparently sometimes chromium bypasses the proxy for local
addresses
2019-05-15 18:49:18 -07:00
Noah Levitt
f8165dc02b work around pytest issue until fix is out
https://github.com/pytest-dev/pytest/issues/5257
2019-05-15 18:46:21 -07:00
Vangelis Banos
a1f9122317 Fix test_brozzling::httpd fixture
We used `self.headers.getheader` which no longer works. We replace it
with `self.headers.get`.

We change the code to write binary data to `self.wfile` because we get
an exception for writing str and/or None.
2019-05-14 16:29:52 +00:00
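
The two fixes in a1f9122317 correspond to how Python 3's `http.server` works: request headers are read with `self.headers.get`, and `self.wfile` accepts only bytes. The handler below is a hedged sketch, not the fixture from the test suite; the class name and the `X-Example` header are invented for illustration.

```python
import http.server


class EchoHeaderHandler(http.server.BaseHTTPRequestHandler):
    """Reply with the value of the (hypothetical) X-Example request header."""

    def do_GET(self):
        # headers.get() replaces the Python 2 getheader(); the default
        # avoids ending up with None when the header is absent.
        value = self.headers.get("X-Example", "")
        body = value.encode("utf-8")  # wfile.write() requires bytes
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet
```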
Vangelis Banos
a2ac3a0374 logging.warn is deprecated and replaced by logging.warning
We replace it everywhere in the code base.
2019-05-14 12:10:59 +00:00
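
For reference, the replacement made in a2ac3a0374 looks like this at a call site. The logger name, function and message are hypothetical; only the `warn` to `warning` rename comes from the commit.

```python
import logging

logger = logging.getLogger("warning-sketch")  # hypothetical logger name


def report_slow_page(url, seconds):
    # Before: logger.warn(...), a deprecated alias.
    # After: logger.warning(...), the supported spelling.
    logger.warning("page %s took %.1f seconds", url, seconds)
```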
Noah Levitt
ee8ef23f0c fix mistake in job-conf.rst 2019-04-30 10:49:48 -07:00
Noah Levitt
411b3f266a bump version after merge 2019-04-09 22:07:51 +00:00
Noah Levitt
d4386491df
Merge pull request #151 from nlevitt/no-cerberus-normalize
don't attempt cerberus normalization
2019-04-09 15:06:17 -07:00
Noah Levitt
5385232b40 don't attempt cerberus normalization
which encumbers the validation with additional requirements,
specifically makes it difficult to validate a subclass of `dict` because
it expects a constructor that works like dict.__init__()
2019-04-09 01:45:37 -07:00
Noah Levitt
8dfd92cf7f fix this utility 2019-04-09 01:44:14 -07:00
Noah Levitt
433b201b52 use logging.warning() to quiet py37 warnings 2019-04-09 01:43:38 -07:00
Noah Levitt
dfd9d9ecdd omfg 2019-04-04 17:22:15 -07:00
Noah Levitt
fd0fe811e9 so little output from chromium-browser :(
https://travis-ci.org/internetarchive/brozzler/jobs/515942434

could it be problems running as this other user?
2019-04-04 16:09:21 -07:00
Noah Levitt
55541be9e9 let's see chromium output inside brozzler-worker
using --trace, because chromium seems to be working ok when we just run
it
2019-04-04 15:11:24 -07:00
Noah Levitt
58d1d1c429 chromium-browser with no args isn't dying at start
what about with all the args?
2019-04-04 14:38:29 -07:00
Noah Levitt
473e891fb4 not sure if --disable-extensions did something 2019-04-04 13:34:45 -07:00
Noah Levitt
6d145c87c8 chromium-browser --disable-extensions ? 2019-04-04 13:24:12 -07:00
Noah Levitt
0d46d8ce19 still trying to figure out what's up with chromium 2019-04-04 13:15:17 -07:00
Noah Levitt
45ac12117a maybe Xvnc.log will tell us something 2019-04-04 13:09:02 -07:00
Noah Levitt
8303fd3ab3 guessing DISPLAY was the issue here
https://travis-ci.org/internetarchive/brozzler/jobs/515882174#L610
2019-04-04 12:50:50 -07:00
Noah Levitt
899794f2da debug what's going on with chromium in travis
see https://travis-ci.org/internetarchive/brozzler/jobs/514858838
(unroll "sudo cat /var/log/brozzler-worker.log")

2019-04-02 20:16:01,792 18595 CRITICAL BrozzlingThread:42073 brozzler.worker.BrozzlerWorker.brozzle_site(worker.py:412) unexpected exception
Traceback (most recent call last):
  File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/worker.py", line 379, in brozzle_site
    enable_youtube_dl=not self._skip_youtube_dl)
  File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/worker.py", line 215, in brozzle_page
    browser, site, page, on_screenshot, on_request)
  File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/worker.py", line 292, in _browse_page
    cookie_db=site.get('cookie_db'))
  File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/browser.py", line 341, in start
    self.websock_url = self.chrome.start(**kwargs)
  File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/chrome.py", line 200, in start
    return self._websocket_url()
  File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/chrome.py", line 247, in _websocket_url
    raise e
Exception: chrome process died with status 1
2019-04-04 12:38:46 -07:00
Noah Levitt
9459ed40d0 fix typo 2019-04-04 12:38:41 -07:00
Noah Levitt
68ce9eac76 debugging travis-ci is a slow process 2019-04-02 13:05:36 -07:00
Noah Levitt
85c6ac0ab2 fix next travis-ci problem 2019-04-02 12:05:08 -07:00
Noah Levitt
06e072a716 update some dependencies 2019-04-02 17:58:35 +00:00
Noah Levitt
8b6e5cbfb9 new option brozzler-purge --finished-before=... 2019-04-02 17:58:13 +00:00
Noah Levitt
9c658cddf7 fix a couple of svc definitions 2019-03-24 16:06:36 -07:00
Noah Levitt
48bb03418f daemontools 2019-03-23 00:26:39 -07:00