After analysing capture logs, we found that many google-analytics
related URLs, which are used for web statistics, were not blocked. We add
these to the blocked URLs.
In addition, we improve the existing block rules. We used to block
`*google-analytics.com/analytics.js`, but many sites append some kind of
parameter at the end, so those URLs weren't blocked. We add `*` at the
end of the existing rules to block these cases as well.
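
The effect of the trailing `*` can be illustrated with a minimal sketch. It assumes that shell-style wildcard matching from Python's `fnmatch` module is a close enough stand-in for how the browser matches blocked URL patterns; the `is_blocked` helper and the `?v=2` parameter are hypothetical, chosen only for illustration.

```python
from fnmatch import fnmatchcase

OLD_RULE = "*google-analytics.com/analytics.js"
NEW_RULE = "*google-analytics.com/analytics.js*"


def is_blocked(url, rules):
    """Return True if any wildcard rule matches the whole URL."""
    return any(fnmatchcase(url, rule) for rule in rules)


# Illustrative URLs; the query parameter is made up.
plain = "https://www.google-analytics.com/analytics.js"
with_param = "https://www.google-analytics.com/analytics.js?v=2"

print(is_blocked(plain, [OLD_RULE]))       # the old rule matches the bare URL
print(is_blocked(with_param, [OLD_RULE]))  # but not the URL with a parameter
print(is_blocked(with_param, [NEW_RULE]))  # the new rule matches both
```
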
Because we run JS behaviors before we capture the screenshot, the
browser may be scrolled down the page. When we don't capture the full
page, we may get a screenshot of the bottom part of the page and not the
top. To fix that, we run `window.scroll(0, 0)` before capturing the
screenshot.
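
As a rough sketch of the ordering, the scroll can be sent as a `Runtime.evaluate` call right before `Page.captureScreenshot`. The `cdp_message` and `screenshot_messages` helpers below are hypothetical and only build the DevTools protocol messages; they are not Brozzler's actual code.

```python
import itertools
import json

_ids = itertools.count(1)


def cdp_message(method, **params):
    """Serialize one DevTools protocol request (hypothetical helper)."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def screenshot_messages():
    """Messages to send, in order: scroll to the top, then capture."""
    return [
        cdp_message("Runtime.evaluate", expression="window.scroll(0, 0)"),
        cdp_message("Page.captureScreenshot", format="jpeg"),
    ]
```
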
We rename method `BrozzlerWorker.full_and_thumb_jpegs` to
`BrozzlerWorker.thumb_jpeg` because, after our changes to
`Browser.screenshot`, we already get a JPEG from the browser.
`thumb_jpeg` now only returns a thumbnail. There is no need to read a PNG
and convert it to JPEG, so screenshots will be a bit faster
now :)
There were some weird screenshots when invoking `try_screenshot` at the
end, after `visit_hashtags` and `extract_outlinks`. The screenshot was
distorted.
Add option `full_page` to `Browser.screenshot`. The default behavior
remains the same.
We take inspiration from puppeteer to capture a screenshot of the full
page:
https://github.com/GoogleChrome/puppeteer/blob/master/lib/Page.js#L898
Add option `screenshot_full_page=False` to `Browser.browse_page` to use
the new feature when capturing a page.
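
A minimal sketch of the puppeteer-style approach: read the content size from a `Page.getLayoutMetrics` response, resize the emulated viewport to it with `Emulation.setDeviceMetricsOverride`, then capture. The function names and the `device_scale_factor` default are assumptions for illustration, not the actual `Browser.screenshot` implementation.

```python
import math


def full_page_metrics_override(layout_metrics, device_scale_factor=1):
    """Build Emulation.setDeviceMetricsOverride params from a
    Page.getLayoutMetrics response (hypothetical helper)."""
    size = layout_metrics["contentSize"]
    return {
        "width": math.ceil(size["width"]),
        "height": math.ceil(size["height"]),
        "deviceScaleFactor": device_scale_factor,
        "mobile": False,
    }


def screenshot_plan(full_page=False, layout_metrics=None):
    """Return the (method, params) calls to make, in order."""
    steps = []
    if full_page:
        steps.append(("Emulation.setDeviceMetricsOverride",
                      full_page_metrics_override(layout_metrics)))
    steps.append(("Page.captureScreenshot", {"format": "jpeg"}))
    return steps
```

With `full_page=False` the plan is a single capture call, which matches the unchanged default behavior.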
When capturing a page, we receive a LOT of messages from Chrome.
Examining these messages, we see that we can reduce them a bit to speed
up Brozzler.
We always use `Console.enable`, which returns all browser console output.
We also always use `Runtime.enable`. The docs say:
https://chromedevtools.github.io/devtools-protocol/1-3/Runtime#method-enable
"Enables reporting of execution contexts creation by means of
executionContextCreated event. When the reporting gets enabled the event
will be sent immediately for each existing execution context."
These outputs are useful when debugging but not in production.
If we disable them, we reduce the websocket traffic and improve
performance. With this PR, we enable them only when the current logging
level is `DEBUG`.
Counting the number of messages before and after the change, we see
improvements like:
https://www.gnome.org/technologies/ 220 -> 202 messages
https://www.whitehouse.gov/issues/budget-spending/ 203 -> 189 messages
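
The logging-level check could look roughly like the sketch below. `methods_to_enable` and its `always` parameter are hypothetical names; only `Console.enable` and `Runtime.enable` come from the description above.

```python
import logging

# Protocol domains that are only useful while debugging.
DEBUG_ONLY_METHODS = ("Console.enable", "Runtime.enable")


def methods_to_enable(logger, always=()):
    """Return the enable calls to send for the logger's effective level."""
    methods = list(always)
    if logger.isEnabledFor(logging.DEBUG):
        methods.extend(DEBUG_ONLY_METHODS)
    return methods
```
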
When the Chrome process dies and we try to read its STDOUT/STDERR, we get
`ValueError: I/O operation on closed file` or
`OSError: [Errno 9] Bad file descriptor`.
We modify the `readline_nonblock` method to catch these errors and return
the buffer it has read up to that point.
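
A simplified sketch of the idea, assuming a POSIX system where `select` works on pipes. It is not the actual Brozzler method: the signature and the `timeout` parameter are assumptions for illustration.

```python
import select


def readline_nonblock(f, timeout=0.1):
    """Read up to one line from an unbuffered binary file without
    blocking indefinitely. If the file is closed underneath us,
    return whatever was read so far instead of raising."""
    buf = b""
    try:
        while not buf.endswith(b"\n") and select.select([f], [], [], timeout)[0]:
            chunk = f.read(1)
            if not chunk:  # EOF: the writer closed its end
                break
            buf += chunk
    except (ValueError, OSError):
        # The process died and its pipes were closed; keep the partial buffer.
        pass
    return buf
```
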
If you use a JS behavior timeout smaller than 7 seconds, the JS behavior
will always take 7 seconds because `sleep(7)` is hard-coded there.
We make a minor change to use `min(timeout, 7)` for the sleep so it
finishes faster when using a smaller JS behavior timeout.
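
In isolation, the change amounts to capping the sleep at the behavior timeout. The helper name below is hypothetical.

```python
import time


def behavior_sleep_seconds(timeout, max_sleep=7):
    """How long to sleep: never longer than the behavior timeout."""
    return min(timeout, max_sleep)


def wait_for_behavior(timeout):
    """Sleep for the capped duration (hypothetical wrapper)."""
    time.sleep(behavior_sleep_seconds(timeout))
```
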
We used `self.headers.getheader` which no longer works. We replace it
with `self.headers.get`.
We change the code to write binary data to `self.wfile` because writing
`str` or `None` raises an exception.
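
Both fixes can be sketched with a small `http.server` handler. `EchoHandler` and `as_bytes` are hypothetical names used for illustration; they are not the handler that was actually changed.

```python
from http.server import BaseHTTPRequestHandler


def as_bytes(payload, encoding="utf-8"):
    """Coerce None, str or bytes-like payloads to bytes for wfile.write()."""
    if payload is None:
        return b""
    if isinstance(payload, str):
        return payload.encode(encoding)
    return bytes(payload)


class EchoHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # headers.get() replaces the older headers.getheader() call.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        # wfile is a binary stream, so only bytes may be written to it.
        self.wfile.write(as_bytes(body))
```
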
which encumbers the validation with additional requirements,
specifically makes it difficult to validate a subclass of `dict` because
it expects a constructor that works like `dict.__init__()`.