1345 Commits

Author SHA1 Message Date
Barbara Miller
a4195e1a83
bump version 2022-08-12 10:41:48 -07:00
Barbara Miller
50c2b424c2
Merge pull request #248 from vbanos/stealth2
Add more stealth evasions
2022-08-12 10:40:34 -07:00
Barbara Miller
60645f7f37
bump version 2022-08-05 15:58:55 -07:00
Barbara Miller
0b60a2e2f3
Merge pull request #249 from internetarchive/blocks-shrink
@adam-miller ok'd this elsewhere
2022-08-05 15:36:34 -07:00
Barbara Miller
7edb0f11b0 and decode() 2022-08-04 16:04:37 -07:00
Barbara Miller
a5ee78e662 zlib compression 2022-08-04 11:16:38 -07:00
Vangelis Banos
b5b7d9d52b Add more stealth evasions
Set `navigator.platform = 'Win32'` instead of the default `Linux` as we
usually run Brozzler on Linux.

Randomize the `navigator.deviceMemory` and
`navigator.hardwareConcurrency` to avoid browser fingerprinting.

Define `window.Notification` which is not defined because we run Chrome
with CLI parameter `--disable-notifications`.
2022-07-29 11:21:08 +00:00
Barbara Miller
39eb80567d
bump version 2022-06-22 16:13:59 -07:00
Barbara Miller
fa59a88a26
Merge pull request #247 from internetarchive/stealth-too 2022-06-22 16:13:12 -07:00
Barbara Miller
218a49e824 stealth for brozzler_worker 2022-06-22 14:14:50 -07:00
Barbara Miller
de8d67e1e7
bump version 2022-06-20 13:44:42 -07:00
Barbara Miller
fe0aaa1ff6
Merge pull request #246 from vbanos/stealth
Looks good, thank you, @vbanos!
2022-06-20 13:43:25 -07:00
Vangelis Banos
7a12925004 Add stealth parameter to avoid antibot systems
The aim is to prevent Brozzler detection and blocking by antibot
systems. To do that, we need to run some JS before any other code runs
on page load and mock specific browser attributes which indicate that
Brozzler is a bot.

We add the option `stealth` in `Browser`, `brozzler.cli` and
`BrozzlerWorker`. It is disabled by default.

If enabled, we run `stealth.js` which is executed before anything else
on the page via `Page.addScriptToEvaluateOnNewDocument`.

For now, we mock only the graphics driver attributes.
If this is OK, we can add more antibot evasions in the same script.

There are many antibot tests; we are using this one: https://bot.sannysoft.com/

Inspired mainly by:
https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
2022-06-17 10:53:12 +00:00
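The injection step described in the commit above can be sketched as a Chrome DevTools Protocol message. This is a minimal illustration, not Brozzler's actual code: `build_stealth_command` and `PLACEHOLDER_JS` are hypothetical names, and the real contents of `stealth.js` are not reproduced here. Only the CDP method name `Page.addScriptToEvaluateOnNewDocument` and its `source` parameter come from the protocol itself.

```python
import json

# Hypothetical stand-in for the contents of stealth.js (the real script is not shown here).
PLACEHOLDER_JS = "/* evasion code would go here */"


def build_stealth_command(script_source, msg_id=1):
    """Build the JSON text of a CDP command that registers a script
    to run in every new document before the page's own scripts."""
    return json.dumps({
        "id": msg_id,
        "method": "Page.addScriptToEvaluateOnNewDocument",
        "params": {"source": script_source},
    })


if __name__ == "__main__":
    # The resulting string would be sent over the browser's DevTools websocket.
    print(build_stealth_command(PLACEHOLDER_JS))
```

Registering the script this way, rather than evaluating it after load, is what lets the mocked attributes be in place before any detection code on the page runs.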
Barbara Miller
ddf7cb4cbc
bump version 2022-06-09 15:14:21 -07:00
Barbara Miller
f2d70e1e25
Merge pull request #245 from internetarchive/yt-dlp-log
yt-dlp: use 'youtube_dl' logger
2022-06-09 15:12:51 -07:00
Barbara Miller
14466a7fb3 'youtube_dl' logger 2022-06-08 14:30:32 -07:00
Adam Miller
1de63f0aea
Merge pull request #244 from internetarchive/yt-dlp-skip-live
yt-dlp should skip live streams
2022-04-27 15:29:07 -07:00
Adam Miller
66252e17c3
Merge pull request #243 from internetarchive/adds-hop-path-support
Adds hop path support
2022-04-26 12:10:43 -07:00
Adam Miller
eef8a1c432
Bump version 2022-04-26 09:55:08 -07:00
Adam Miller
05826942a9 Style fix 2022-04-20 22:49:18 +00:00
Barbara Miller
b693b8713f skip live streams 2022-04-03 17:50:27 -07:00
Adam Miller
cd16985724 Refactor of hop referrer passing 2022-03-24 21:38:47 +00:00
Barbara Miller
70bb544389
bump version 2022-03-22 13:59:48 -07:00
Barbara Miller
7ee6ea50d1
Merge pull request #242 from internetarchive/yt-dlp-03
for the record, @avdempsey ok'd this elsewhere
2022-03-22 10:23:58 -07:00
Barbara Miller
d5e41bf9ef skip vimeo special case 2022-03-22 10:00:18 -07:00
Barbara Miller
c52b4af608 vimeo/M3u8 handling, better logging 2022-03-21 20:26:20 -07:00
Barbara Miller
d67a05572d prefer video+audio files, debug postprocessor hook 2022-03-21 13:28:08 -07:00
Adam Miller
f4a9e77b06 Catching edge cases that were avoiding setting hop path information 2022-03-03 00:15:20 +00:00
Barbara Miller
7ea7e543a6
Merge pull request #241 from internetarchive/yt-dlp-too
yt-dlp for brozzler
2022-02-25 15:26:33 -08:00
Barbara Miller
25bb65a635 brozzler/ydl.py updates 2022-02-23 22:34:47 -08:00
Barbara Miller
0305db5e69 yt_dlp, not youtube-dl 2022-02-23 22:32:00 -08:00
Adam Miller
d61cec399e Merge branch 'master' into adds-hop-path-support 2022-02-09 18:10:37 +00:00
Barbara Miller
d9ac067e41
bump version, copyright statement 2022-01-18 17:45:58 -08:00
Barbara Miller
de199e789e
Merge pull request #237 from vbanos/disable-breakpad
Thanks, @vbanos!
2022-01-18 17:43:45 -08:00
Vangelis Banos
fdc84fb848 Add chrome options --disable-sync and --disable-breakpad
`--disable-sync` disables syncing to a Google account.

`--disable-breakpad` disables crashdump collection.

These options aren't useful for Brozzler. They are already used in
puppeteer
https://github.com/puppeteer/puppeteer/blob/main/src/node/Launcher.ts#L211

Docs in chrome-launcher
https://github.com/GoogleChrome/chrome-launcher/blob/master/docs/chrome-flags-for-tools.md
2022-01-18 10:09:39 +00:00
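As a rough sketch of how the two flags from the commit above might be added to a Chrome command line: `chrome_args` and the executable name are hypothetical and are not taken from Brozzler's source. Only the flag names `--disable-sync` and `--disable-breakpad` come from the commit message.

```python
def chrome_args(executable, extra=()):
    """Return an argv list for launching Chrome with account sync
    and crash-dump collection disabled."""
    return [
        executable,
        "--disable-sync",      # no syncing to a Google account
        "--disable-breakpad",  # no crash-dump collection
        *extra,
    ]


if __name__ == "__main__":
    # "chromium-browser" is an assumed executable name, for illustration only.
    print(chrome_args("chromium-browser", ["--disable-notifications"]))
```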
Alex Dempsey
427908e821
Merge pull request #233 from cclauss/codespell
Fix typos
2021-10-12 12:34:37 -07:00
Christian Clauss
a5ed291e65 Fix typos 2021-10-12 10:19:48 +02:00
Adam Miller
0f72233f3b Adding support for hop path information to be stored and passed along to warcprox 2021-08-31 19:44:55 +00:00
Barbara Miller
4f301f4e03
Merge pull request #225 from internetarchive/wt-376-yt-user-page-fix
Added new extractor type to brozzler's youtube-dl playlist handling
2021-06-08 14:43:42 -07:00
Barbara Miller
c311fbb41f
bump version, update copyright 2021-05-25 17:14:21 -07:00
Barbara Miller
b59c4395ed
Merge pull request #223 from vbanos/fix-AddressValueError
Skip invalid outlink
2021-05-25 17:12:35 -07:00
Vangelis Banos
7aabc5f655 Skip invalid outlink
When one of the outlinks is `http://-1/` `urlcanon.whatwg` raises an
unhandled exception `ipaddress.AddressValueError` and the capture fails.

We can skip the problematic outlink and keep the rest without crashing.
2021-05-23 11:31:47 +00:00
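The skip-and-continue approach in the commit above might look roughly like the following. This is a sketch under stated assumptions: `filter_outlinks` and `stand_in_parse` are hypothetical names, and `stand_in_parse` only imitates the failure mode using the standard library's `ipaddress` module. It is not `urlcanon.whatwg`, and Brozzler's real fix may differ.

```python
import ipaddress
from urllib.parse import urlsplit


def stand_in_parse(url):
    """Hypothetical stand-in for a URL parser: treats numeric-looking
    hosts as IPv4 addresses, raising AddressValueError on bad ones."""
    host = urlsplit(url).hostname or ""
    if host.lstrip("-").replace(".", "").isdigit():
        ipaddress.IPv4Address(host)


def filter_outlinks(outlinks, parse=stand_in_parse):
    """Keep every outlink that parses; skip those that raise
    AddressValueError instead of failing the whole capture."""
    kept = []
    for url in outlinks:
        try:
            parse(url)
        except ipaddress.AddressValueError:
            continue  # skip only the problematic outlink
        kept.append(url)
    return kept


if __name__ == "__main__":
    print(filter_outlinks(["http://example.com/", "http://-1/"]))
```

Catching the exception per outlink, rather than around the whole loop, is what preserves the remaining links when one of them is malformed.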
Pravin Visakan
eabdeb0238 Added user page extractor type to ytdl monkeypatch 2021-05-04 16:50:38 -07:00
Barbara Miller
0f27c9995a
bump version 2020-10-29 17:12:14 -07:00
jkafader
5005c619f6
Merge pull request #211 from internetarchive/galgeek-websocket-url-timeout
configurable websocket url timeout, default 60
2020-10-29 17:08:48 -07:00
Barbara Miller
11c5cfa865 add param for Chrome.start 2020-10-21 15:39:46 -07:00
Barbara Miller
dc50fe1db2
Merge pull request #212 from internetarchive/bump-version-to-1.5.23
bump version after merge
2020-10-13 15:21:18 -07:00
Barbara Miller
052c3552ca
bump version after merge 2020-10-13 15:19:50 -07:00
Barbara Miller
f2ebdca597
configurable websocket url timeout, default 60 2020-10-13 15:12:32 -07:00
Barbara Miller
bb7594a14d
Merge pull request #209 from vbanos/outlinks-timeout
Thanks, @vbanos!
2020-10-13 15:01:55 -07:00