1386 Commits

Author SHA1 Message Date
Barbara Miller
218a49e824 stealth for brozzler_worker 2022-06-22 14:14:50 -07:00
Barbara Miller
de8d67e1e7
bump version 2022-06-20 13:44:42 -07:00
Barbara Miller
fe0aaa1ff6
Merge pull request #246 from vbanos/stealth
Looks good, thank you, @vbanos!
2022-06-20 13:43:25 -07:00
Vangelis Banos
7a12925004 Add stealth parameter to avoid antibot systems
The aim is to prevent Brozzler detection and blocking by antibot
systems. To do that, we need to run some JS before any other code runs
on page load and mock specific browser attributes which indicate that
Brozzler is a bot.

We add the option `stealth` in `Browser`, `brozzler.cli` and
`BrozzlerWorker`. It is disabled by default.

If enabled, we run `stealth.js` which is executed before anything else
on the page via `Page.addScriptToEvaluateOnNewDocument`.

For now, we mock only the graphics driver attributes.
If this is OK, we can add more antibot evasions in the same script.

There are many antibot tests; we are using this one: https://bot.sannysoft.com/

Inspired mainly by:
https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
2022-06-17 10:53:12 +00:00
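
A minimal sketch of the mechanism the commit above describes, assuming the script is registered by sending a `Page.addScriptToEvaluateOnNewDocument` message over the DevTools websocket. The `STEALTH_JS` body and the `stealth_command` helper are hypothetical illustrations, not the contents of the actual `stealth.js` or Brozzler's code.

```python
import json

# Hypothetical stand-in for stealth.js: it overrides the WebGL vendor and
# renderer strings, which is one way to mock graphics driver attributes.
STEALTH_JS = """
const getParameter = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function (parameter) {
  if (parameter === 37445) return 'Intel Inc.';                // UNMASKED_VENDOR_WEBGL
  if (parameter === 37446) return 'Intel Iris OpenGL Engine';  // UNMASKED_RENDERER_WEBGL
  return getParameter.call(this, parameter);
};
"""


def stealth_command(message_id, source=STEALTH_JS):
    """Build the DevTools message that registers `source` to run on every
    new document, before any of the page's own scripts."""
    return json.dumps({
        "id": message_id,
        "method": "Page.addScriptToEvaluateOnNewDocument",
        "params": {"source": source},
    })
```

The resulting JSON string would be sent on the browser's debugging websocket before navigating to the page, so the override is already in place when antibot checks run.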
Barbara Miller
ddf7cb4cbc
bump version 2022-06-09 15:14:21 -07:00
Barbara Miller
f2d70e1e25
Merge pull request #245 from internetarchive/yt-dlp-log
yt-dlp: use 'youtube_dl' logger
2022-06-09 15:12:51 -07:00
Barbara Miller
14466a7fb3 'youtube_dl' logger 2022-06-08 14:30:32 -07:00
Adam Miller
1de63f0aea
Merge pull request #244 from internetarchive/yt-dlp-skip-live
yt-dlp should skip live streams
2022-04-27 15:29:07 -07:00
Adam Miller
66252e17c3
Merge pull request #243 from internetarchive/adds-hop-path-support
Adds hop path support
2022-04-26 12:10:43 -07:00
Adam Miller
eef8a1c432
Bump version 2022-04-26 09:55:08 -07:00
Adam Miller
05826942a9 Style fix 2022-04-20 22:49:18 +00:00
Barbara Miller
b693b8713f skip live streams 2022-04-03 17:50:27 -07:00
Adam Miller
cd16985724 Refactor of hop referrer passing 2022-03-24 21:38:47 +00:00
Barbara Miller
70bb544389
bump version 2022-03-22 13:59:48 -07:00
Barbara Miller
7ee6ea50d1
Merge pull request #242 from internetarchive/yt-dlp-03
for the record, @avdempsey ok'd this elsewhere
2022-03-22 10:23:58 -07:00
Barbara Miller
d5e41bf9ef skip vimeo special case 2022-03-22 10:00:18 -07:00
Barbara Miller
c52b4af608 vimeo/M3u8 handling, better logging 2022-03-21 20:26:20 -07:00
Barbara Miller
d67a05572d prefer video+audio files, debug postprocessor hook 2022-03-21 13:28:08 -07:00
Adam Miller
f4a9e77b06 Catching edge cases that were avoiding setting hop path information 2022-03-03 00:15:20 +00:00
Barbara Miller
7ea7e543a6
Merge pull request #241 from internetarchive/yt-dlp-too
yt-dlp for brozzler
2022-02-25 15:26:33 -08:00
Barbara Miller
25bb65a635 brozzler/ydl.py updates 2022-02-23 22:34:47 -08:00
Barbara Miller
0305db5e69 yt_dlp, not youtube-dl 2022-02-23 22:32:00 -08:00
Adam Miller
d61cec399e Merge branch 'master' into adds-hop-path-support 2022-02-09 18:10:37 +00:00
Barbara Miller
d9ac067e41
bump version, copyright statement 2022-01-18 17:45:58 -08:00
Barbara Miller
de199e789e
Merge pull request #237 from vbanos/disable-breakpad
Thanks, @vbanos!
2022-01-18 17:43:45 -08:00
Vangelis Banos
fdc84fb848 Add chrome options --disable-sync and --disable-breakpad
`--disable-sync` disables syncing to a Google account.

`--disable-breakpad` disables crashdump collection.

Neither syncing nor crashdump collection is useful for Brozzler, so we
disable both. Puppeteer already passes these options:
https://github.com/puppeteer/puppeteer/blob/main/src/node/Launcher.ts#L211

Docs in chrome-launcher
https://github.com/GoogleChrome/chrome-launcher/blob/master/docs/chrome-flags-for-tools.md
2022-01-18 10:09:39 +00:00
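
A rough sketch of how the two flags from the commit above could be included when assembling a Chrome command line. The `chrome_args` helper and its parameters are hypothetical and do not reflect Brozzler's actual launch code.

```python
def chrome_args(chrome_exe, port, extra_args=()):
    """Assemble a Chrome command line (hypothetical helper)."""
    args = [
        chrome_exe,
        f"--remote-debugging-port={port}",
        "--disable-sync",      # no syncing to a Google account
        "--disable-breakpad",  # no crashdump collection
    ]
    args.extend(extra_args)
    return args
```

For example, `chrome_args("chromium-browser", 9222)` returns a list suitable for passing to `subprocess.Popen`.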
Alex Dempsey
427908e821
Merge pull request #233 from cclauss/codespell
Fix typos
2021-10-12 12:34:37 -07:00
Christian Clauss
a5ed291e65 Fix typos 2021-10-12 10:19:48 +02:00
Adam Miller
0f72233f3b Adding support for hop path information to be stored and passed along to warcprox 2021-08-31 19:44:55 +00:00
Barbara Miller
4f301f4e03
Merge pull request #225 from internetarchive/wt-376-yt-user-page-fix
Added new extractor type to brozzler's youtube-dl playlist handling
2021-06-08 14:43:42 -07:00
Barbara Miller
c311fbb41f
bump version, update copyright 2021-05-25 17:14:21 -07:00
Barbara Miller
b59c4395ed
Merge pull request #223 from vbanos/fix-AddressValueError
Skip invalid outlink
2021-05-25 17:12:35 -07:00
Vangelis Banos
7aabc5f655 Skip invalid outlink
When one of the outlinks is `http://-1/` `urlcanon.whatwg` raises an
unhandled exception `ipaddress.AddressValueError` and the capture fails.

We can skip the problematic outlink and keep the rest without crashing.
2021-05-23 11:31:47 +00:00
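
The skip-and-continue pattern from the commit above can be sketched as follows. Since `urlcanon` is not used here, `check_host` is a hypothetical stand-in that raises the same `ipaddress.AddressValueError` for a host such as `-1`; the real parsing in Brozzler differs.

```python
import ipaddress
from urllib.parse import urlsplit


def check_host(url):
    """Hypothetical stand-in for URL parsing: validate numeric-looking hosts
    as IPv4 addresses, raising ipaddress.AddressValueError if invalid."""
    host = urlsplit(url).hostname or ""
    if host.lstrip("-").replace(".", "").isdigit():
        ipaddress.IPv4Address(host)
    return url


def valid_outlinks(outlinks):
    """Keep the outlinks that parse; skip the ones that raise."""
    kept = []
    for url in outlinks:
        try:
            kept.append(check_host(url))
        except ipaddress.AddressValueError:
            continue  # skip the problematic outlink, keep the rest
    return kept
```

With this shape, one bad outlink such as `http://-1/` is dropped while the remaining outlinks from the same page are still returned.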
Pravin Visakan
eabdeb0238 Added user page extractor type to ytdl monkeypatch 2021-05-04 16:50:38 -07:00
Barbara Miller
0f27c9995a
bump version 2020-10-29 17:12:14 -07:00
jkafader
5005c619f6
Merge pull request #211 from internetarchive/galgeek-websocket-url-timeout
configurable websocket url timeout, default 60
2020-10-29 17:08:48 -07:00
Barbara Miller
11c5cfa865 add param for Chrome.start 2020-10-21 15:39:46 -07:00
Barbara Miller
dc50fe1db2
Merge pull request #212 from internetarchive/bump-version-to-1.5.23
bump version after merge
2020-10-13 15:21:18 -07:00
Barbara Miller
052c3552ca
bump version after merge 2020-10-13 15:19:50 -07:00
Barbara Miller
f2ebdca597
configurable websocket url timeout, default 60 2020-10-13 15:12:32 -07:00
Barbara Miller
bb7594a14d
Merge pull request #209 from vbanos/outlinks-timeout
Thanks, @vbanos!
2020-10-13 15:01:55 -07:00
Vangelis Banos
8addaf31d5 Add option extract_outlinks_timeout
`Browser.extract_outlinks` has a default `timeout=60` param that cannot be
changed in any way. (It is always invoked using `extract_outlinks()`.)

We add param `extract_outlinks_timeout=60` to `BrozzlerWorker` and
`Browser.browse_page` to allow that.
2020-10-04 15:39:30 +00:00
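
A simplified sketch of how the new parameter could be threaded from the worker down to `extract_outlinks`. The classes below are stripped-down stand-ins that only share names with the ones mentioned in the commit; the `brozzle_page` method signature and the `last_timeout` attribute are assumptions for illustration.

```python
class Browser:
    """Stand-in for the real Browser class."""

    def __init__(self):
        self.last_timeout = None

    def extract_outlinks(self, timeout=60):
        # The real method talks to the browser; here we only record the value.
        self.last_timeout = timeout
        return []

    def browse_page(self, page_url, extract_outlinks_timeout=60):
        # Pass the caller's value instead of relying on the default.
        return self.extract_outlinks(timeout=extract_outlinks_timeout)


class BrozzlerWorker:
    """Stand-in for the real worker class."""

    def __init__(self, extract_outlinks_timeout=60):
        self._extract_outlinks_timeout = extract_outlinks_timeout

    def brozzle_page(self, browser, page_url):
        return browser.browse_page(
            page_url, extract_outlinks_timeout=self._extract_outlinks_timeout
        )
```

Keeping `60` as the default at every level means existing callers behave as before, while a worker constructed with a different value changes the timeout for every page it handles.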
Barbara Miller
18d3f5f930
Merge pull request #208 from internetarchive/galgeek-patch-2
based on PR #207, thanks @cclauss!
2020-09-21 18:06:03 -07:00
Barbara Miller
297eaac6dd
update travis.yml and test! 2020-09-21 17:08:39 -07:00
Barbara Miller
c744bb2f92
update copyright 2020-09-01 19:05:21 -07:00
Barbara Miller
d599778c27
Merge pull request #206 from internetarchive/galgeek-patch-1
bump version after merge
2020-08-05 09:24:28 -07:00
Barbara Miller
84d6bb43fa
bump version after merge 2020-08-05 09:23:58 -07:00
Barbara Miller
5a6ecb09d5
Merge pull request #205 from vbanos/behavior-timeout-zero
Skip loading behavior when behavior_timeout=0

behavior_timeout is an existing parameter to `Browser.browse_page`
2020-08-04 16:18:58 -07:00
Neil Minton
12913cccf0
Merge pull request #204 from galgeek/noplaylist-ydl
youtube-dl option noplaylist: True
2020-08-04 14:12:14 -04:00
Vangelis Banos
8b10587031 Skip loading behavior when behavior_timeout=0
The user may set `behavior_timeout=0`, meaning that they don't want
to run the behavior. As it is now, Brozzler still invokes
`brozzler.behavior_script` to load the script and `self.run_behavior`
to execute it.
The behavior is started using `Runtime.evaluate` but is then
terminated immediately because of timeout=0.

It is better to skip behavior loading and running when
`behavior_timeout=0`.
2020-08-04 06:27:21 +00:00
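
The guard described in the commit above might look roughly like this. The `maybe_run_behavior` function and its `load_script` and `run_script` callables are hypothetical placeholders for `brozzler.behavior_script` and `self.run_behavior`, whose real signatures are not shown in this log.

```python
def maybe_run_behavior(page_url, behavior_timeout, load_script, run_script):
    """Load and run a behavior unless the timeout is zero.

    Returns True if the behavior ran, False if it was skipped.
    """
    if behavior_timeout == 0:
        # Nothing useful can happen in zero seconds, so do not even load it.
        return False
    script = load_script(page_url)
    run_script(script, timeout=behavior_timeout)
    return True
```

Checking the timeout before loading avoids both the script lookup and a `Runtime.evaluate` call that would be cancelled immediately.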