Commit graph

1748 commits

Author SHA1 Message Date
Misty De Méo
9d0c61fee6 deps: yt-dlp 2025.12.08
2026-01-06 14:58:28 -08:00
Adam Miller
b74d7d6925
Merge pull request #438 from internetarchive/adam/handle-chrome-error-pages
fix: attempting to detect chrome error pages and trigger retry logic
2026-01-06 13:35:56 -08:00
Adam Miller
7512c84e67 fix: reverting retry count change 2026-01-06 13:20:28 -08:00
Adam Miller
d0ef974313 chore: log chrome error page, and fix retry attempt comparison 2026-01-06 11:40:31 -08:00
Adam Miller
85267b8edb chore: handle PageConnectionError in CLI mode 2026-01-06 11:39:50 -08:00
Adam Miller
991f1741fa fix: attempting to detect chrome error pages and trigger retry logic 2025-12-23 15:06:03 -08:00
Misty De Méo
7e9d7f1130 behaviors: optionally collect outlinks while looping
This solves one of the existing limitations of our outlink gathering
system: we run it once, after all behaviours have completed. For
some interactive pages, particularly single-page apps with paginated
data, it means that we'll completely miss that content since it won't
be in the DOM anymore by the time we get around to it.

In a previous PR, #433, I made the outlink gathering logic reusable
by ensuring it's possible for us to dynamically call the outlink
gathering function at any time. In addition, in #429, I made it
possible for behaviours to return outlinks to Brozzler; if the
behaviour chooses to return outlinks, then Brozzler will add them to
the set it extracts after behaviours complete.

This branch uses both of those by introducing new functionality in
behaviours. We always inject the outlink gathering code before running
behaviours, so they can now run it at will. It also introduces two new
behaviour parameters:

* `extractOutlinks` - If set to `true`, then the behaviour script will
  call the outlink gathering logic and return any outlinks it harvested
  to Brozzler. Defaults to `false`.
* `extractOutlinksInLoop` - If set to `true`, then the behaviour script
  will gather outlinks on every iteration of the loop. This combines well
  with `repeatSameElement`, since it means the behaviour script can
  click a `next` pagination button and then immediately gather whatever
  new outlinks have appeared on the page. Defaults to `false`.
2025-12-19 12:14:47 -08:00
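The per-iteration gathering described for `extractOutlinksInLoop` in 7e9d7f1130 can be sketched in Python. This is only an illustration of the idea: the real behaviour runs as JavaScript in the page, and `click_and_collect`, `click`, and `gather_outlinks` are hypothetical names that do not appear in the commit.

```python
from typing import Callable, Iterable


def click_and_collect(
    click: Callable[[], None],
    gather_outlinks: Callable[[], Iterable[str]],
    iterations: int,
) -> set[str]:
    """Click repeatedly, gathering outlinks right after every click.

    Hypothetical helper: it stands in for a behaviour loop that has
    per-iteration outlink gathering switched on.
    """
    collected: set[str] = set()
    for _ in range(iterations):
        click()
        # Gather immediately, before the next click replaces the content.
        collected.update(gather_outlinks())
    return collected
```

Gathering inside the loop is what keeps links from intermediate pages; a single pass after the loop would only see whatever the last click left in the DOM.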
Misty De Méo
00edb56aa1 behaviors: allow extracting outlinks
Currently, our JavaScript outlink extraction happens purely via our
non-configurable extract-outlinks.js script. However, given that
many sites can have unpredictable behaviour we may want special
handling for, it would be great to let us configure this on a per-
site basis. We already have a system for this for interacting with
sites using our behaviour system; if we expand this to also provide
outlinks, we can give ourselves a much more flexible system to
handle complex or special-case websites.

This extends the behaviour system so that we can now return a
JavaScript object with information about the site. That object
should contain at least the "finished" key, which is a boolean that
works like the simple boolean returned by older versions. The
object can additionally contain an "outlinks" key which, if present,
should be an array of links for brozzler to handle as outlinks.

I've retained backwards compatibility by checking to see if the
returned object is a boolean and handling it like we did previously.
2025-12-19 12:05:23 -08:00
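The backwards-compatible handling described in 00edb56aa1 might look roughly like the sketch below. The helper name `parse_behavior_result` and its return shape are assumptions made for illustration; only the "finished" and "outlinks" keys come from the commit message.

```python
from typing import Any


def parse_behavior_result(result: Any) -> tuple[bool, list[str]]:
    """Normalise a behaviour's return value into (finished, outlinks).

    Hypothetical helper; the name and signature are illustrative only.
    """
    # Older behaviours return a bare boolean meaning "finished".
    if isinstance(result, bool):
        return result, []
    # Newer behaviours return an object with a "finished" key and an
    # optional "outlinks" array.
    if isinstance(result, dict):
        finished = bool(result.get("finished", False))
        outlinks = [str(link) for link in result.get("outlinks") or []]
        return finished, outlinks
    # Anything else is treated as "not finished" in this sketch.
    return False, []
```

Checking for a boolean first keeps older behaviour scripts working without changes.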
Misty De Méo
eec9234264 browser: correctly set outlinks type
We inconsistently assigned this to a `list` if we skipped outlink
extraction instead of a `frozenset` like we do when actually
fetching them. I've annotated the method that returns it for
clarity.
2025-12-19 09:26:20 -08:00
Misty De Méo
33fffdfefd outlinks: simplify outlink parsing
The outlinks are collected as HTMLAnchorElement objects. The previous
version handled stringifying them by collecting the entire set of
objects into a single newline-delimited string, then splitting it back
up again in Python. It seems easier to just send back a JSON array of
strings and have Python iterate over them that way.
2025-12-17 16:49:41 -08:00
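On the Python side, the approach described in 33fffdfefd amounts to decoding a JSON array instead of splitting a newline-delimited string. A minimal sketch, assuming the browser returns the array as a JSON string; `parse_outlinks` is a hypothetical name, not the project's function:

```python
import json


def parse_outlinks(payload: str) -> frozenset[str]:
    """Decode a JSON array of URL strings into a frozenset.

    Hypothetical helper; the name and validation rules are assumptions.
    """
    links = json.loads(payload)
    if not isinstance(links, list):
        raise ValueError("expected a JSON array of outlink strings")
    # Keep only non-empty strings; the frozenset also drops duplicates.
    return frozenset(link for link in links if isinstance(link, str) and link)
```

Returning a `frozenset` here matches the return type that eec9234264 settles on for outlinks.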
Misty De Méo
93bb1a9a35 behaviors: add loop control selector
Right now, our loop config in behaviours is purely based on the loop
limit. We can be smarter than that, though - sometimes we can know
when it's safe to *stop* clicking the same element over and over by
running another selector.

A good example of this is page pagination. Most sites with
JavaScript-based pagination will also disable the "next" button once
the browser has reached the last page of results. Using that, we can
configure a behaviour that looks something like this:

```yaml
- url_regex: '^.*$'
  behavior_js_template: umbraBehavior.js.j2
  request_idle_timeout_sec: 10
  default_parameters:
    interval: 10
    actions:
      - selector: .next
        repeatSameElement: true
        repeatUntilSelector: .next[disabled]
        # the repeatUntil selector makes long loops safe
        # since we'll break out of the loop once we hit the
        # disabled button!
        limit: 1000
```
2025-12-17 13:24:51 -08:00
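The loop control from 93bb1a9a35 combines two exit conditions: the existing `limit` and the new `repeatUntilSelector` check. The Python sketch below models only that control flow. The real loop lives in the JavaScript behaviour template, and `repeat_click`, `click`, and `should_stop` are hypothetical names.

```python
from typing import Callable


def repeat_click(
    click: Callable[[], None],
    should_stop: Callable[[], bool],
    limit: int,
) -> int:
    """Click until should_stop() is true or `limit` clicks have been made.

    `should_stop` stands in for "the repeatUntilSelector matches an element".
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < limit and not should_stop():
        click()
        clicks += 1
    return clicks
```

Because the stop check ends the loop early, `limit` can be set high, as in the YAML example, without forcing that many clicks on short result sets.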
Misty De Méo
bf692bf144 browser: make outlink gathering reusable
The outlink JS always executes its function, which means we can't safely
inject it if we want other JavaScript to be able to execute outlink
extraction. This updates it to instead expose a single outlink gathering
function, which is then explicitly called by the outlink gatherer after
injecting the JS.
2025-12-17 13:15:28 -08:00
Misty De Méo
47f5c06ee4 deps: urllib3 2.6.1 / brotli 1.2.0 / requests 2.32.5
2025-12-11 09:34:26 -08:00
Alex Dempsey
4e2098009f
Merge pull request #424 from vbanos/get-version-opt
Optimize check_version
2025-12-02 14:38:34 -08:00
vbanos
7bf52b0b79 Optimize check_version
Every time we create a new `Chrome` instance, we run `check_version`
which executes a subprocess that runs `chrome_exe --version`.

Chrome version changes very rarely and when it does, we restart Brozzler anyway.

Since we create a new `Chrome` instance every time we run a fresh
`Browser` (which may be a lot), it makes sense to cache `check_version`.
2025-12-02 22:45:30 +01:00
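One way to cache a version check like the one described in 7bf52b0b79 is `functools.lru_cache`. This is a sketch under assumptions: the commit does not say how the cache is implemented, and the real `check_version` is described in the context of the `Chrome` class rather than as a module-level function.

```python
import functools
import subprocess


@functools.lru_cache(maxsize=None)
def check_version(chrome_exe: str) -> str:
    """Return the output of `chrome_exe --version`.

    The subprocess runs once per distinct executable path; later calls
    with the same path are served from the cache.
    """
    result = subprocess.run(
        [chrome_exe, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

The cache lasts for the lifetime of the process, which fits the commit's observation that Brozzler is restarted whenever Chrome is upgraded.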
Alex Dempsey
d12ed3af6a
Merge pull request #420 from vbanos/screenshot-limits
Limit screenshot width and height
2025-11-19 10:44:26 -08:00
vbanos
63dd934fa7 Set default height to 20k and fix formatting 2025-11-19 13:24:16 +01:00
Adam Miller
c7f179438c
Merge pull request #423 from internetarchive/misty/release_1_8_1
release: 1.8.1
2025-11-13 16:45:45 -08:00
Misty De Méo
771f553572 release: 1.8.1 2025-11-13 15:37:51 -08:00
Adam Miller
78263af2f7
Merge pull request #422 from internetarchive/adam/fix_claim_sites_pre_filter
fix: We were applying the max_sites_to_claim filter too early. Many s…
2025-11-13 11:30:20 -08:00
Adam Miller
4430605ef1 chore: address ci formatting 2025-11-13 11:19:04 -08:00
Adam Miller
96942d40f9 fix: We were applying the max_sites_to_claim filter too early. Many sites in a single crawl prevent claimable sites from getting through. 2025-11-13 11:14:11 -08:00
vbanos
d5b66b8e78 Limit screenshot width and height
There are cases where screenshot width / height become huge. We need to
limit them to avoid system overload.

Set default limits of width=2k and height=10k pixels.

Define `Browser` params to override limits when necessary.
2025-11-10 12:27:06 +01:00
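The limiting described in d5b66b8e78 comes down to clamping each dimension to a maximum. In the sketch below, the function name and parameter names are hypothetical, and the defaults read "2k" and "10k" as 2000 and 10000 pixels, which is an assumption about the exact values.

```python
def clamp_screenshot_size(
    width: int,
    height: int,
    max_width: int = 2000,    # assumed reading of "2k"
    max_height: int = 10000,  # assumed reading of "10k"
) -> tuple[int, int]:
    """Clamp a requested screenshot size to the configured maximums."""
    return min(width, max_width), min(height, max_height)
```

Callers that need a larger capture can pass their own maximums, in the same spirit as the `Browser` parameters the commit adds for overriding the limits.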
Misty De Méo
4d1fb31bc6 ci: install deno
2025-11-07 14:57:42 -08:00
Misty De Méo
9a47de68ee deps: bump minimum python to 3.10
3.9 is now EOL, and yt-dlp no longer supports it.
2025-10-27 12:47:28 -07:00
Misty De Méo
1678d163d4 deps: bump pluggy
Fixes a warning under Python 3.14.
https://docs.python.org/3.14/whatsnew/3.14.html#pep-765-control-flow-in-finally-blocks
2025-10-08 14:08:00 -07:00
Misty De Méo
44484583cb release: 1.8.0
2025-10-07 13:42:16 -07:00
Misty De Méo
3587f6e486 pyproject: add Misty 2025-10-07 13:42:16 -07:00
Misty De Méo
04d06ca49e deps: bump locked yt-dlp
The locked version hasn't been upgraded for a while.
2025-10-07 12:06:55 -07:00
Barbara Miller
ba9e4f1be7
Merge pull request #367 from galgeek/barbara/header_request_timeout_60
increase HEADER_REQUEST_TIMEOUT
2025-10-07 12:00:15 -07:00
Barbara Miller
35153266a1
Merge branch 'master' into barbara/header_request_timeout_60 2025-10-07 11:00:39 -07:00
Misty De Méo
98a829f269 ssl: allow fetching pages needing legacy renegotiation
Unsafe legacy renegotiation is disabled by default in requests, but
it's needed to access some webpages that real browsers are able to
safely access. This leaves it disabled by default when fetching
headers, while logging and retrying with it enabled if that fails.
robots.txt fetching is always done with legacy renegotiation on.
2025-10-06 09:28:36 -07:00
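The "strict first, then retry with legacy renegotiation" flow from 98a829f269 can be sketched with the standard library alone. Everything named here is an assumption: `make_ssl_context` and `fetch_with_fallback` are hypothetical helpers, `fetch` is a caller-supplied function, and wiring the context into requests is not shown. The sketch uses OpenSSL's `SSL_OP_LEGACY_SERVER_CONNECT` option, exposed as `ssl.OP_LEGACY_SERVER_CONNECT` in Python 3.12 and later.

```python
import logging
import ssl
from typing import Any, Callable

# Python 3.12+ exposes the constant; 0x4 is the OpenSSL value on older versions.
LEGACY_SERVER_CONNECT = getattr(ssl, "OP_LEGACY_SERVER_CONNECT", 0x4)


def make_ssl_context(allow_legacy: bool = False) -> ssl.SSLContext:
    """Build a default TLS context, optionally allowing legacy renegotiation."""
    ctx = ssl.create_default_context()
    if allow_legacy:
        ctx.options |= LEGACY_SERVER_CONNECT
    return ctx


def fetch_with_fallback(
    fetch: Callable[[str, ssl.SSLContext], Any], url: str
) -> Any:
    """Try a strict context first; on an SSL error, log and retry with legacy on."""
    try:
        return fetch(url, make_ssl_context())
    except ssl.SSLError as exc:
        logging.warning("retrying %s with legacy renegotiation: %s", url, exc)
        return fetch(url, make_ssl_context(allow_legacy=True))
```

Keeping the strict context as the first attempt means the weaker setting is only used for servers that actually need it, and each such case is logged.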
TheTechRobo
89d06af104 Fix workaround
2025-08-28 15:25:59 -03:00
TheTechRobo
a19d8c7814 Add test for network monitoring 2025-08-28 15:25:59 -03:00
TheTechRobo
a5b2ecbda9 Track network activity and wait for idle when visiting hashtags 2025-08-28 15:25:59 -03:00
Misty De Méo
3e82a55207 ci: migrate yt-dlp autotest to renovate
This replaces the previous yt-dlp auto-test and merge workflow to use
Renovate instead of Dependabot, since we've found that Dependabot is
no longer able to update our dependencies.
2025-08-21 13:25:29 -07:00
renovate[bot]
2b0bc419c0 chore(config): migrate config renovate.json 2025-08-21 13:14:07 -07:00
Misty De Méo
10db4bd19f renovate: disable everything but yt-dlp 2025-08-21 12:39:20 -07:00
Misty De Méo
9b7999989b renovate: customizations 2025-08-21 12:39:20 -07:00
renovate[bot]
b889cedf64 Add renovate.json 2025-08-21 12:39:20 -07:00
Misty De Méo
972b816878 deps: warctools 5.0.1
Silences a noisy warning; no other changes.
2025-08-18 15:32:43 -07:00
Misty De Méo
6261ea15ad tests: add some silenced warnings
These come from a dependency we can't affect right now.
2025-08-18 15:20:12 -07:00
Misty De Méo
940dadfc12 worker: add missing import
2025-07-30 14:17:30 -07:00
Misty De Méo
5ee31cd879 browser: fix json separators 2025-07-30 14:17:30 -07:00
TheTechRobo
08bb09ff06 Add --no-headless option to brozzle-page and brozzler-worker CLI
2025-07-28 15:04:00 -07:00
TheTechRobo
7d7968e833 Add headless option to Chrome.start 2025-07-28 15:04:00 -07:00
Misty De Méo
f719b61983 docs: bump README copyright year
2025-07-28 14:19:46 -07:00
Misty De Méo
43b7e57147 docs: remove outdated README comment 2025-07-28 14:19:46 -07:00
Misty De Méo
4c77515063 deps: warctools 5.0.0
Needed for the warcprox import to work.
2025-07-21 12:40:11 -07:00
Misty De Méo
99575b03b4 ci: always run full test suite
We previously ran the full suite, including test_brozzling, on a daily
timer because it took an enormous amount of time to run. I'd been under
the impression this was because it *had* to take that long to do the
work it was performing, but it looks like it hadn't been necessary and
the suite has been sped up massively since. We can now run it in about
six and a half minutes, which is perfectly fine to run on every PR.
2025-07-21 12:40:11 -07:00