This solves one of the existing limitations of our outlink gathering
system: we run it once, after all behaviours have completed. For
some interactive pages, particularly single-page apps with paginated
data, this means we'll miss that content entirely, since it will no
longer be in the DOM by the time extraction runs.
In a previous PR, #433, I made the outlink gathering logic reusable
by ensuring it's possible for us to dynamically call the outlink
gathering function at any time. In addition, in #429, I made it
possible for behaviours to return outlinks to Brozzler; if the
behaviour chooses to return outlinks, then Brozzler will add them to
the set it extracts after behaviours complete.
This branch builds on both of those changes. We always inject the
outlink gathering code before running behaviours, so they can now call
it at will. It also introduces two new behaviour parameters:
* `extractOutlinks` - If set to `true`, then the behaviour script will
call the outlink gathering logic and return any outlinks it harvested
to Brozzler. Defaults to `false`.
* `extractOutlinksInLoop` - If set to `true`, then the behaviour script
will gather outlinks on every iteration of the loop. This pairs well
with `repeatSameElement`, since the behaviour script can click a
`next` pagination button and then immediately gather whatever new
outlinks have appeared on the page. Defaults to `false`.
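As a sketch, enabling these in a behaviour config might look like the
following (the selector, URL regex, and exact placement of the new
parameters are illustrative):

```yaml
- url_regex: '^https?://(www\.)?example\.com/.*$'
  behavior_js_template: umbraBehavior.js.j2
  default_parameters:
    actions:
      - selector: .next
        repeatSameElement: true
        extractOutlinksInLoop: true  # gather outlinks each iteration
        limit: 50
```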
Currently, our JavaScript outlink extraction happens purely via our
non-configurable extract-outlinks.js script. However, many sites
behave unpredictably and may need special handling, so it would be
useful to configure extraction on a per-site basis. We already have a
system for interacting with sites, the behaviour system; if we expand
it to also provide outlinks, we get a much more flexible way to
handle complex or special-case websites.
This extends the behaviour system so that behaviours can now return a
JavaScript object with information about the site. That object must
contain at least a "finished" key, a boolean that works like the
simple boolean returned by older versions. The object can
additionally contain an "outlinks" key which, if present, should be
an array of links for Brozzler to handle as outlinks.
I've retained backwards compatibility by checking to see if the
returned object is a boolean and handling it like we did previously.
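A sketch of how the Python side might distinguish the two return
shapes (function and variable names are illustrative, not Brozzler's
actual ones):

```python
def handle_behavior_result(result):
    """Interpret a behaviour script's return value.

    Older behaviours return a bare boolean ("am I finished?");
    newer ones return an object with a required "finished" key
    and an optional "outlinks" array.
    """
    if isinstance(result, bool):
        # backwards compatibility: treat a bare boolean as before
        return result, []
    finished = result["finished"]
    outlinks = result.get("outlinks", [])
    return finished, outlinks
```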
When we skipped outlink extraction, we inconsistently assigned this a
`list` rather than the `frozenset` we use when actually fetching
them. I've annotated the method that returns it for clarity.
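For illustration, the fix amounts to returning the same type on both
paths (method name and signature are hypothetical):

```python
def outlinks(extraction_skipped: bool, raw_links: list[str]) -> frozenset[str]:
    # previously the skip path returned a `list`; both paths now
    # return a `frozenset`, matching the annotation
    if extraction_skipped:
        return frozenset()
    return frozenset(raw_links)
```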
The outlinks are collected as HTMLAnchorElement objects. The previous
version handled stringifying them by collecting the entire set of
objects into a single newline-delimited string, then splitting it back
up again in Python. It seems easier to just send back a JSON array of
strings and have Python iterate over them that way.
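The difference, sketched in Python (the browser-side stringification
is simulated here with literal values):

```python
import json

# old approach: the page sent one newline-delimited string,
# which Python then split back apart
blob = "https://example.com/1\nhttps://example.com/2"
old_links = blob.split("\n")

# new approach: the page serializes each anchor's href into a JSON
# array of strings, and Python just parses and iterates over it
payload = json.dumps(["https://example.com/1", "https://example.com/2"])
new_links = json.loads(payload)

assert old_links == new_links
```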
Right now, our loop config in behaviours is based purely on the loop
limit. We can be smarter than that, though - sometimes we can tell
it's safe to *stop* clicking the same element over and over by
evaluating another selector.
A good example of this is pagination. Most sites with
JavaScript-based pagination will disable the "next" button once
the browser has reached the last page of results. Using that, we can
configure a behaviour that looks something like this:
```yaml
- url_regex: '^.*$'
  behavior_js_template: umbraBehavior.js.j2
  request_idle_timeout_sec: 10
  default_parameters:
    interval: 10
    actions:
      - selector: .next
        repeatSameElement: true
        repeatUntilSelector: .next[disabled]
        # the repeatUntil selector makes long loops safe, since
        # we'll break out of the loop once we hit the disabled
        # button!
        limit: 1000
```
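The loop semantics can be sketched in Python with stubbed-out helpers
(names are illustrative; the real logic lives in the behaviour JS):

```python
def repeat_click(click, until_matches, limit):
    """Click the same element up to `limit` times, stopping early
    once the `repeatUntilSelector` equivalent starts matching."""
    clicks = 0
    for _ in range(limit):
        if until_matches():  # e.g. '.next[disabled]' appeared
            break
        click()
        clicks += 1
    return clicks

# simulate a site with 5 pages: "next" disables on the last page
state = {"page": 1}
clicks = repeat_click(
    click=lambda: state.__setitem__("page", state["page"] + 1),
    until_matches=lambda: state["page"] >= 5,
    limit=1000,
)
# the limit of 1000 is never reached; we stop after 4 clicks
```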
The outlink JS currently executes as soon as it's injected, which
means we can't safely inject it if we want other JavaScript to be
able to trigger outlink extraction itself. This updates it to instead
expose a single outlink gathering function, which the outlink
gatherer explicitly calls after injecting the JS.
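A Python analogue of the change, with `exec` standing in for script
injection (the script bodies and function name are hypothetical):

```python
# before: the injected script ran immediately, so injecting it
# again, or alongside a behaviour, re-ran extraction as a side effect
auto_running = "collected = ['https://example.com/a']"

# after: injection only *defines* a function; the outlink gatherer
# calls it explicitly when it actually wants the links
defining_only = """
def gather_outlinks():
    return ['https://example.com/a']
"""

ns = {}
exec(defining_only, ns)          # safe: nothing runs on injection
links = ns["gather_outlinks"]()  # explicit call by the gatherer
```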
Unsafe legacy renegotiation is disabled by default in requests, but
it's needed to access some webpages that real browsers are able to
safely access. This leaves it disabled by default when fetching
headers, while logging and retrying with it enabled if that fails.
robots.txt fetching is always done with legacy renegotiation on.
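A stdlib sketch of the retry pattern, using `urllib` rather than our
actual fetch code (the `getattr` fallback covers Python versions that
predate the named constant, which was added in 3.12):

```python
import ssl
import urllib.error
import urllib.request


def legacy_renegotiation_context() -> ssl.SSLContext:
    ctx = ssl.create_default_context()
    # OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT flag (raw value 0x4)
    ctx.options |= getattr(ssl, "OP_LEGACY_SERVER_CONNECT", 0x4)
    return ctx


def fetch_headers(url: str) -> dict:
    req = urllib.request.Request(url, method="HEAD")
    try:
        # first attempt: legacy renegotiation stays disabled
        with urllib.request.urlopen(req, timeout=30) as resp:
            return dict(resp.headers)
    except urllib.error.URLError:
        # some servers only complete the handshake with unsafe
        # legacy renegotiation; log (elided here) and retry with it
        ctx = legacy_renegotiation_context()
        with urllib.request.urlopen(req, timeout=30, context=ctx) as resp:
            return dict(resp.headers)
```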
This replaces the previous yt-dlp auto-test-and-merge workflow with
one based on Renovate instead of Dependabot, since we've found that
Dependabot is no longer able to update our dependencies.