https://github.com/internetarchive/brozzler/pull/183#issuecomment-560562807
"We've had a number of cases where a page kept failing for one reason or
another, and it's bad. We can end up with tons of duplicate captures,
the crawl is not able to make progress, and the overall performance of
the cluster is impacted in cases like yours, where a browser is sitting
there doing nothing for five minutes."
If the environment variable `BROZZLER_EXTRA_CHROME_ARGS` is set, pass its
contents as extra Chromium command-line options.
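A minimal sketch of how the variable could be read and appended to the
Chromium command line. The helper name `extra_chrome_args`, the base
command, and the use of `shlex.split` are assumptions for illustration,
not necessarily what Brozzler does:
```python
import os
import shlex


def extra_chrome_args(environ=None):
    """Return extra Chromium options from BROZZLER_EXTRA_CHROME_ARGS, or []."""
    environ = os.environ if environ is None else environ
    raw = environ.get("BROZZLER_EXTRA_CHROME_ARGS", "")
    # shlex.split honours shell-style quoting, so a quoted value that
    # contains spaces stays a single argument.
    return shlex.split(raw)


# Hypothetical base command; the real argument list is built elsewhere.
base_args = ["chromium-browser", "--headless"]
example_env = {"BROZZLER_EXTRA_CHROME_ARGS": "--disable-gpu --window-size=1280,1024"}
chrome_args = base_args + extra_chrome_args(example_env)
print(chrome_args)
```
When the variable is unset or empty, the helper returns an empty list and
the command line is unchanged.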
Remove the `--no-sandbox` option. It's not good from a security point of
view.
When trying to run Brozzler in Docker, we get the following error:
```
Failed to move to new namespace: PID namespaces supported, Network
namespace supported, but failed: errno = Operation not permitted
Trace/breakpoint trap
```
This happens because Chromium uses sandboxing for increased security by
default, and it's not supported when running in a container.
Adding the Chromium option `--no-sandbox` fixes the problem.
This issue is common; I found various reports about it, like this one:
https://github.com/Zenika/alpine-chrome/issues/33
After analysing capture logs, we see that we didn't block many
google-analytics related URLs which are used for web statistics. We add
these to the blocked URLs.
In addition, we improve the existing block rules. We used to block
`*google-analytics.com/analytics.js`, but many sites append some kind of
parameter at the end, so these URLs weren't blocked. We add `*` at the end
of the existing rules to block these cases as well.
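To illustrate why the trailing `*` matters, here is a small sketch that
approximates the wildcard matching with Python's `fnmatch`. The browser's
own matcher may differ in details, and the helper `is_blocked` and the
sample URL are made up for this example:
```python
from fnmatch import fnmatchcase


def is_blocked(url, patterns):
    """Return True if the URL matches any wildcard block pattern."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


# Sample URL with a query string appended, as many sites do.
url = "https://www.google-analytics.com/analytics.js?id=UA-1"

old_rules = ["*google-analytics.com/analytics.js"]
new_rules = ["*google-analytics.com/analytics.js*"]

print(is_blocked(url, old_rules))  # the old rule requires the URL to end in analytics.js
print(is_blocked(url, new_rules))  # the trailing * also covers the query string
```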
Because we run JS behaviors before we capture the screenshot, the
browser may be scrolled down in the page. When we don't capture the
full page, we may get a screenshot of the bottom part of the page
instead of the top. To fix that, we run `window.scroll(0, 0)` before
capturing the screenshot.
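The ordering can be sketched as two Chrome DevTools Protocol messages,
sent one after the other. `Runtime.evaluate` and `Page.captureScreenshot`
are real protocol methods, but the helper `screenshot_commands`, the
message ids, and the `quality` value are assumptions for illustration and
not Brozzler's actual code:
```python
import json


def screenshot_commands(start_id=1):
    """Build the DevTools messages: scroll to the top, then capture a JPEG."""
    scroll = {
        "id": start_id,
        "method": "Runtime.evaluate",
        "params": {"expression": "window.scroll(0, 0)"},
    }
    capture = {
        "id": start_id + 1,
        "method": "Page.captureScreenshot",
        # The quality value is illustrative only.
        "params": {"format": "jpeg", "quality": 95},
    }
    # The scroll message must come first so the capture shows the page top.
    return [json.dumps(scroll), json.dumps(capture)]


for message in screenshot_commands():
    print(message)
```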
We rename the method `BrozzlerWorker.full_and_thumb_jpegs` to
`BrozzlerWorker.thumb_jpeg`. That's because, after our changes to
`Browser.screenshot`, we already get a JPEG from the browser.
`thumb_jpeg` only returns a thumbnail now. There is no need to read a PNG
and convert it to JPEG. This means that screenshots will be a bit faster
now :)
There were some weird screenshots when invoking `try_screenshot` at the
end, after `visit_hashtags` and `extract_outlinks`. The screenshot was
distorted.