1873 commits

Author SHA1 Message Date
Misty De Méo
2f623f7e6d behaviors: optionally collect outlinks while looping
This solves one of the existing limitations of our outlink gathering
system: we run it once, after all behaviours have completed. For
some interactive pages, particularly single-page apps with paginated
data, it means that we'll completely miss that content since it won't
be in the DOM anymore by the time we get around to it.

In a previous PR, #433, I made the outlink gathering logic reusable
by ensuring it's possible for us to dynamically call the outlink
gathering function at any time. In addition, in #429, I made it
possible for behaviours to return outlinks to Brozzler; if the
behaviour chooses to return outlinks, then Brozzler will add them to
the set it extracts after behaviours complete.

This branch uses both of those by introducing new functionality in
behaviours. We always inject the outlink gathering code before running
behaviours, so they can now run it at will. It also introduces two new
behaviour parameters:

* `extractOutlinks` - If set to `true`, then the behaviour script will
  call the outlink gathering logic and return any outlinks it harvested
  to Brozzler. Defaults to `false`.
* `extractOutlinksInLoop` - If set to `true`, then the behaviour script
  will gather outlinks on every iteration of the loop. This combines
  well with `repeatSameElement`, since it means the behaviour script can
  click a `next` pagination button and then immediately gather whatever
  new outlinks have appeared on the page. Defaults to `false`.
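
As a rough illustration of why gathering inside the loop matters, here is a
small Python simulation. It is only a sketch: the real logic lives in the
JavaScript behaviour template, and `run_loop` and `pages` are hypothetical
names invented for this example.

```python
def run_loop(pages, extract_outlinks_in_loop):
    """Simulate clicking "next" through `pages`.

    Each entry in `pages` is the list of links visible in the DOM after
    one click (hypothetical stand-in for a paginated single-page app).
    """
    collected = set()
    visible = []
    for visible in pages:  # each iteration stands in for one click
        if extract_outlinks_in_loop:
            collected.update(visible)
    # A single pass after the loop only sees what is still in the DOM.
    collected.update(visible)
    return collected


pages = [["/a", "/b"], ["/c"], ["/d"]]
only_at_end = run_loop(pages, extract_outlinks_in_loop=False)
every_iteration = run_loop(pages, extract_outlinks_in_loop=True)
```

With gathering only at the end, just the last page's links survive; with
gathering in the loop, links from every page are kept.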
2025-12-19 12:16:07 -08:00
Misty De Méo
b6078ceee7 behaviors: allow extracting outlinks
Currently, our JavaScript outlink extraction happens purely via our
non-configurable extract-outlinks.js script. However, given that
many sites have unpredictable behaviour that we may want special
handling for, it would be useful to configure this on a per-site
basis. We already have a per-site mechanism for interacting with
sites, the behaviour system; if we expand it to also provide
outlinks, we get a much more flexible way to handle complex or
special-case websites.

This extends the behaviour system so that we can now return a
JavaScript object with information about the site. That object
should contain at least the "finished" key, which is a boolean that
works like the simple boolean returned by older versions. The
object can additionally contain an "outlinks" key which, if present,
should be an array of links for brozzler to handle as outlinks.

I've retained backwards compatibility by checking to see if the
returned object is a boolean and handling it like we did previously.
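
The backwards-compatible handling described above could look roughly like
this on the Python side. This is a minimal sketch, not the actual Brozzler
code: `parse_behavior_result` is a hypothetical helper, and it assumes the
script's return value has already been decoded into a Python `bool` or
`dict`.

```python
def parse_behavior_result(result):
    """Return (finished, outlinks) from a behaviour script's return value."""
    # Older behaviours return a bare boolean meaning "finished".
    if isinstance(result, bool):
        return result, frozenset()
    # Newer behaviours return an object with "finished" and,
    # optionally, "outlinks".
    finished = bool(result.get("finished", False))
    outlinks = frozenset(result.get("outlinks") or [])
    return finished, outlinks


legacy = parse_behavior_result(True)
modern = parse_behavior_result(
    {"finished": False, "outlinks": ["https://example.com/a"]}
)
```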
2025-12-19 12:16:02 -08:00
Misty De Méo
874163beec browser: correctly set outlinks type
We inconsistently set this to a `list` when we skipped outlink
extraction, instead of the `frozenset` we use when actually
fetching outlinks. I've annotated the method that returns it for
clarity.
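
A sketch of the idea, with a hypothetical function name and signature (the
real method is not shown in this message): both branches return the same
type, and the annotation documents it.

```python
def extract_outlinks(page_links, skip=False) -> frozenset[str]:
    """Return outlinks as a frozenset whether or not extraction ran."""
    if skip:
        # Previously this branch would have produced a list.
        return frozenset()
    return frozenset(page_links)


skipped = extract_outlinks(["https://example.com/a"], skip=True)
fetched = extract_outlinks(["https://example.com/a", "https://example.com/a"])
```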
2025-12-19 12:15:58 -08:00
Misty De Méo
ae779eefa8 outlinks: simplify outlink parsing
The outlinks are collected as HTMLAnchorElement objects. The previous
version handled stringifying them by collecting the entire set of
objects into a single newline-delimited string, then splitting it back
up again in Python. It seems easier to just send back a JSON array of
strings and have Python iterate over them that way.
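
For illustration, the Python side of that could be as small as the sketch
below. It assumes the array arrives as JSON text; if the browser protocol
layer has already decoded it into a list, the `json.loads` step is
unnecessary. `parse_outlinks` is a hypothetical name.

```python
import json


def parse_outlinks(payload: str) -> frozenset[str]:
    """Decode a JSON array of URL strings into a set of outlinks."""
    links = json.loads(payload)
    # Ignore anything that is not a non-empty string.
    return frozenset(
        link for link in links if isinstance(link, str) and link
    )


outlinks = parse_outlinks(
    '["https://example.com/a", "https://example.com/b", "https://example.com/a"]'
)
```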
2025-12-19 12:15:49 -08:00
Misty De Méo
9da38cc40b behaviors: add loop control selector
Right now, our loop config in behaviours is purely based on the loop
limit. We can be smarter than that, though - sometimes we can know
when it's safe to *stop* clicking the same element over and over by
running another selector.

A good example of this is pagination. Most sites with
JavaScript-based pagination will also disable the "next" button once
the browser has reached the last page of results. Using that, we can
configure a behaviour that looks something like this:

```yaml
- url_regex: '^.*$'
  behavior_js_template: umbraBehavior.js.j2
  request_idle_timeout_sec: 10
  default_parameters:
    interval: 10
    actions:
      - selector: .next
        repeatSameElement: true
        repeatUntilSelector: .next[disabled]
        # the repeatUntil selector makes long loops safe
        # since we'll break out of the loop once we hit the
        # disabled button!
        limit: 1000
```
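
The loop control itself can be modelled in a few lines of Python. This is a
sketch under assumptions, not the behaviour template's actual code:
`click_until`, `click`, and `matches_stop_selector` are hypothetical names,
and checking the stop selector before each click is a guess at the ordering.

```python
def click_until(click, matches_stop_selector, limit):
    """Click repeatedly; stop at `limit` or when the stop selector matches."""
    clicks = 0
    while clicks < limit and not matches_stop_selector():
        click()
        clicks += 1
    return clicks


# A fake five-page pager: "next" is disabled once page 5 is reached.
state = {"page": 1}


def fake_click():
    state["page"] += 1


def next_is_disabled():
    return state["page"] >= 5


total_clicks = click_until(fake_click, next_is_disabled, limit=1000)
```

The high limit never comes into play here because the stop condition ends
the loop first.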
2025-12-19 12:15:45 -08:00
Misty De Méo
cca8eea285 browser: make outlink gathering reusable
The outlink JS always executes its function, which means we can't safely
inject it if we want other JavaScript to be able to execute outlink
extraction. This updates it to instead expose a single outlink gathering
function, which is then explicitly called by the outlink gatherer after
injecting the JS.
2025-12-19 12:15:41 -08:00
Misty De Méo
dc2873a3bc deps: urllib3 2.6.1 / brotli 1.2.0 / requests 2.32.5 2025-12-19 12:15:30 -08:00
renovate[bot]
2ef65ca0dd fix(deps): update dependency yt-dlp to v2025.12.8 2025-12-08 00:44:01 +00:00
Adam Miller
aa36cb930b chore: address ci formatting 2025-11-13 15:47:11 -08:00
Adam Miller
58ba082a21 fix: We were applying the max_sites_to_claim filter too early. Many sites in a single crawl prevent claimable sites from getting through. 2025-11-13 15:46:59 -08:00
renovate[bot]
5222fdacdc fix(deps): update dependency yt-dlp to v2025.11.12 2025-11-12 03:49:54 +00:00
Misty De Méo
3757d0681a ci: install deno 2025-11-07 14:48:15 -08:00
renovate[bot]
0f8ea3cb6d fix(deps): update dependency yt-dlp to v2025.10.22 2025-10-27 20:09:24 +00:00
Misty De Méo
90a8dc5ebe deps: bump minimum python to 3.10
3.9 is now EOL, and yt-dlp no longer supports it.
2025-10-27 12:57:35 -07:00
renovate[bot]
47d8d4d19e fix(deps): update dependency yt-dlp to v2025.10.14 2025-10-15 02:00:38 +00:00
Misty De Méo
da8b00016a ssl: allow fetching pages needing legacy renegotiation
Unsafe legacy renegotiation is disabled by default in requests, but
it's needed to access some webpages that real browsers are able to
safely access. This leaves it disabled by default when fetching
headers, while logging and retrying with it enabled if that fails.
robots.txt fetching is always done with legacy renegotiation on.
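
The retry pattern can be sketched with only the standard library. This is
not the code from this commit, which works through requests; `make_context`,
`fetch_with_fallback`, and the `fetch` callable are hypothetical names used
for illustration.

```python
import ssl

# ssl.OP_LEGACY_SERVER_CONNECT exists from Python 3.12; 0x4 is the
# underlying OpenSSL option value, used as a fallback on older versions.
OP_LEGACY_SERVER_CONNECT = getattr(ssl, "OP_LEGACY_SERVER_CONNECT", 0x4)


def make_context(legacy: bool) -> ssl.SSLContext:
    """Build a default TLS context, optionally allowing legacy renegotiation."""
    ctx = ssl.create_default_context()
    if legacy:
        ctx.options |= OP_LEGACY_SERVER_CONNECT
    return ctx


def fetch_with_fallback(fetch, url, log=print):
    """Try a normal fetch first; on an SSL error, log and retry with legacy on.

    `fetch` is any callable taking (url, ssl_context).
    """
    try:
        return fetch(url, make_context(legacy=False))
    except ssl.SSLError as exc:
        log(f"retrying {url} with legacy renegotiation: {exc}")
        return fetch(url, make_context(legacy=True))
```

Passing `fetch` in as a callable keeps the sketch independent of any
particular HTTP client.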
2025-10-02 10:54:00 -07:00
renovate[bot]
4aeca4ad91 fix(deps): update dependency yt-dlp to v2025.9.26 2025-09-27 00:38:46 +00:00
renovate[bot]
31cb217751 fix(deps): update dependency yt-dlp to v2025.9.23 2025-09-23 10:40:06 +00:00
renovate[bot]
87851d1ba1 fix(deps): update dependency yt-dlp to v2025.9.5 2025-09-06 16:06:14 +00:00
renovate[bot]
61d57e63c3 fix(deps): update dependency yt-dlp to v2025.8.27 2025-08-28 02:07:23 +00:00
Barbara Miller
cdd4d6500a bump qa version 2025-08-27 13:56:56 -07:00
Barbara Miller
fcf966fca4 Merge branch 'predup_type_playlist' into qa 2025-08-27 13:56:16 -07:00
Barbara Miller
cdf5db4f47 self._video_data = None sometimes 2025-08-27 13:37:05 -07:00
Barbara Miller
7c93eb7a8d if self._video_data 2025-08-27 11:36:40 -07:00
Barbara Miller
09fd339447 should_ytdlp predup only when worker.video_data 2025-08-27 11:27:17 -07:00
Barbara Miller
70a8837673 Merge branch 'predup_type_playlist' into qa 2025-08-27 11:00:21 -07:00
Barbara Miller
bade30b2ff ruff'd (added final comma 8|) 2025-08-25 11:34:32 -07:00
Barbara Miller
fef6622cb1 bump qa version 2025-08-25 11:24:44 -07:00
Barbara Miller
29549d2255 Merge branch 'predup_type_playlist' into qa 2025-08-25 11:23:45 -07:00
Barbara Miller
cd8ffa2e35 bump qa version 2025-08-25 10:52:13 -07:00
Barbara Miller
fc7fa6d0ae fix last buglet & tweak logging 2025-08-25 10:42:02 -07:00
renovate[bot]
f85e44a011 fix(deps): update dependency yt-dlp to v2025.8.22 2025-08-24 13:23:33 +00:00
renovate[bot]
252ee93319 chore(deps): update dependency yt-dlp to v2025.8.22 2025-08-23 01:55:49 +00:00
Barbara Miller
c8480ff8ab Merge branch 'predup_type_playlist' into qa 2025-08-22 18:18:50 -07:00
Barbara Miller
b7949b5d60 ruff'd 2025-08-22 18:04:49 -07:00
Barbara Miller
cd658907b3 ruff'd 2025-08-22 18:03:43 -07:00
Barbara Miller
a4f7ab5ddf Merge branch 'predup_type_playlist' into qa 2025-08-22 17:05:29 -07:00
Barbara Miller
ae67bcd4a1 fix buglets 2025-08-22 17:02:03 -07:00
Misty De Méo
3cd98d8944 deps: update lockfile 2025-08-22 13:29:58 -07:00
renovate[bot]
609f923d32 chore(deps): update dependency yt-dlp to v2025.8.20 2025-08-21 20:01:37 +00:00
Misty De Méo
e2eff5ebed ci: fix branch check 2025-08-21 12:49:27 -07:00
Misty De Méo
c5c44db119 ci: update renovate branch format 2025-08-21 12:41:45 -07:00
Misty De Méo
13328883b3 ci: migrate yt-dlp autotest to renovate
This replaces the previous yt-dlp auto-test and merge workflow to use
Renovate instead of Dependabot, since we've found that Dependabot is
no longer able to update our dependencies.
2025-08-21 12:38:59 -07:00
Misty De Méo
07b3c5e28d deps: pin yt-dlp to specific version 2025-08-21 11:38:57 -07:00
Barbara Miller
f091f51d48 bump qa version 2025-08-21 10:18:25 -07:00
Barbara Miller
6fbebdc0e9 Merge branch 'predup_type_playlist' into qa 2025-08-21 10:17:48 -07:00
Barbara Miller
67d0ac98c8 more better result handling for recent_video_capture_exists 2025-08-21 10:17:20 -07:00
Barbara Miller
d599a1aa1c bump qa version 2025-08-20 14:01:21 -07:00
Barbara Miller
9575df7b1c Merge branch 'predup_type_playlist' into qa 2025-08-20 14:00:50 -07:00
Barbara Miller
978858327b if result_tuple[0] 2025-08-20 13:56:08 -07:00