A requirement of the latest `warcprox` 2.5.1,
https://github.com/internetarchive/doublethink/blob/Py311/setup.py,
requires `rethinkdb>=2.4.9,<2.5`, but Brozzler has `rethinkdb>=2.3,<2.4`.
This creates a conflict if they are in the same virtualenv.
We update Brozzler to use the same dependency.
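To illustrate why the two pins cannot be satisfied together, here is a minimal sketch that treats each specifier as a half-open version range and checks for overlap. The helper name `ranges_overlap` and the tuple encoding are assumptions for illustration only; a real resolver such as pip applies the full version specifier rules.

```python
def ranges_overlap(a, b):
    """Return True if two half-open version ranges [low, high) intersect."""
    (a_low, a_high), (b_low, b_high) = a, b
    return max(a_low, b_low) < min(a_high, b_high)


# rethinkdb>=2.4.9,<2.5 (from the doublethink setup.py linked above)
DOUBLETHINK_RANGE = ((2, 4, 9), (2, 5))
# rethinkdb>=2.3,<2.4 (the old Brozzler pin)
OLD_BROZZLER_RANGE = ((2, 3), (2, 4))

# No rethinkdb version satisfies both pins, hence the conflict.
conflict = not ranges_overlap(DOUBLETHINK_RANGE, OLD_BROZZLER_RANGE)
```

Once Brozzler uses the same range as the other package, the two ranges are identical and the overlap check trivially succeeds.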
The screenshot is an additional step we take when the capture is
successful. Why take a screenshot of 4xx/5xx responses? It's just extra
system load.
We already have the capture for archiving purposes.
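A minimal sketch of the gating logic described above, assuming the HTTP status code of the main response is available as an integer. The function name `should_take_screenshot` is hypothetical, and the actual condition in Brozzler may differ.

```python
def should_take_screenshot(status_code):
    """Skip the screenshot for 4xx/5xx responses to avoid extra load."""
    return status_code < 400
```

With this check, a 200 response still gets a screenshot, while a 404 or 503 response is captured for archiving but not screenshotted.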
Add `socket_timeout` option for yt-dlp
Mike Wilson reviewed this via Slack. We agreed that it may be helpful to expose this setting as a Brozzler command-line option the next time this code is updated.
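As a rough sketch of how such a command-line option could feed the yt-dlp options, assuming a flag named `--ytdlp-socket-timeout` (hypothetical, not an existing Brozzler option) and helper names invented for illustration. `socket_timeout` is the yt-dlp option key, in seconds.

```python
import argparse


def build_ydl_opts(socket_timeout=None):
    """Build the options dict that would be passed to yt-dlp."""
    opts = {}
    if socket_timeout is not None:
        # yt-dlp option: time to wait for unresponsive hosts, in seconds
        opts["socket_timeout"] = socket_timeout
    return opts


def parse_args(argv):
    """Parse a hypothetical CLI flag; the flag name is an assumption."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--ytdlp-socket-timeout",
        dest="ytdlp_socket_timeout",
        type=float,
        default=None,
        help="seconds to wait for unresponsive hosts in yt-dlp",
    )
    return parser.parse_args(argv)
```

Leaving the key out when the flag is not given keeps yt-dlp's own default behaviour.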
Set `navigator.platform = 'Win32'` instead of the default `Linux`, as we
usually run Brozzler on Linux.
Randomize the `navigator.deviceMemory` and
`navigator.hardwareConcurrency` to avoid browser fingerprinting.
Define `window.Notification` which is not defined because we run Chrome
with CLI parameter `--disable-notifications`.
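The following sketch shows one way the `navigator` overrides could be generated, here by building the JavaScript source from Python. The value pools and the function name `build_navigator_overrides` are assumptions for illustration; the actual script may perform the randomization in JavaScript and use different values.

```python
import random

# Hypothetical value pools; the real script may use different ones.
DEVICE_MEMORY_CHOICES = [4, 8]
HARDWARE_CONCURRENCY_CHOICES = [4, 8, 16]


def build_navigator_overrides(rng=random):
    """Return JS source that overrides selected navigator properties."""
    overrides = {
        "platform": "'Win32'",
        "deviceMemory": str(rng.choice(DEVICE_MEMORY_CHOICES)),
        "hardwareConcurrency": str(rng.choice(HARDWARE_CONCURRENCY_CHOICES)),
    }
    lines = [
        "Object.defineProperty(navigator, '%s', {get: () => %s});" % (name, value)
        for name, value in overrides.items()
    ]
    return "\n".join(lines)
```

Passing a seeded `random.Random` instance as `rng` makes the generated script reproducible, which is convenient in tests.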
The aim is to prevent Brozzler detection and blocking by antibot
systems. To do that, we need to run some JS before any other code runs
on page load and mock specific browser attributes which indicate that
Brozzler is a bot.
We add the option `stealth` in `Browser`, `brozzler.cli` and
`BrozzlerWorker`. It is disabled by default.
If enabled, we run `stealth.js` which is executed before anything else
on the page via `Page.addScriptToEvaluateOnNewDocument`.
For now, we mock only the graphics driver attributes.
If this is OK, we can add more antibot evasions in the same script.
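A minimal sketch of the mechanism, assuming the browser is driven by JSON messages over the Chrome DevTools Protocol. The function name, its parameters, and the message id handling are hypothetical; only the method name `Page.addScriptToEvaluateOnNewDocument` and its `source` parameter come from the protocol.

```python
import json


def new_document_script_messages(stealth=False, stealth_js="", first_id=1):
    """Return the CDP messages to send before navigating to a page.

    When stealth is disabled (the default), nothing is injected.
    """
    if not stealth:
        return []
    message = {
        "id": first_id,
        "method": "Page.addScriptToEvaluateOnNewDocument",
        # The script runs in every new document before the page's own scripts.
        "params": {"source": stealth_js},
    }
    return [json.dumps(message)]
```

Because the script is registered rather than evaluated once, it also applies to subsequent navigations in the same page session.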
There are many antibot tests; we are using this one: https://bot.sannysoft.com/
Inspired mainly by:
https://www.npmjs.com/package/puppeteer-extra-plugin-stealth