brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2026-01-07 12:05:32 -05:00

Author	SHA1	Message	Date
Misty De Méo	9d0c61fee6	deps: yt-dlp 2025.12.08	2026-01-06 14:58:28 -08:00
Adam Miller	b74d7d6925	Merge pull request #438 from internetarchive/adam/handle-chrome-error-pages fix: attempting to detect chrome error pages and trigger retry logic	2026-01-06 13:35:56 -08:00
Adam Miller	7512c84e67	fix: reverting retry count change	2026-01-06 13:20:28 -08:00
Adam Miller	d0ef974313	chore: log chrome error page, and fix retry attempt comparison	2026-01-06 11:40:31 -08:00
Adam Miller	85267b8edb	chore: handle PageConnectionError in CLI mode	2026-01-06 11:39:50 -08:00
Adam Miller	991f1741fa	fix: attempting to detect chrome error pages and trigger retry logic	2025-12-23 15:06:03 -08:00
Misty De Méo	7e9d7f1130	behaviors: optionally collect outlinks while looping This solves one of the existing limitations of our outlink gathering system: we run it once, after all behaviours have completed. For some interactive pages, particularly single-page apps with paginated data, it means that we'll completely miss that content since it won't be in the DOM anymore by the time we get around to it. In a previous PR, #433, I made the outlink gathering logic reusable by ensuring it's possible for us to dynamically call the outlink gathering function at any time. In addition, in #429, I made it possible for behaviours to return outlinks to Brozzler; if the behaviour chooses to return outlinks, then Brozzler will add them to the set it extracts after behaviours complete. This branch uses both of those by introducing new functionality in behaviours. We always inject the outlink gathering code before running behaviours, so they can now run it at will. It also introduces two new behaviour parameters: * `extractOutlinks` - If set to `true`, then the behaviour script will call the outlink gathering logic and return any outlinks it harvested to Brozzler. Defaults to `false`. * `extractOutlinksInLoop` - If set to `true`, then the behaviour script will gather outlinks every iteration of the loop. This combines great with `repeatSameElement`, since it means the behaviour script can click a `next` pagination button and then immediately gather whatever new outlinks have appeared on the page. Defaults to `false`.	2025-12-19 12:14:47 -08:00
Misty De Méo	00edb56aa1	behaviors: allow extracting outlinks Currently, our JavaScript outlink extraction happens purely via our non-configurable extract-outlinks.js script. However, given that many sites can have unpredictable behaviour we may want special handling for, it would be great to let us configure this on a per-site basis. We already have a system for this for interacting with sites using our behaviour system; if we expand this to also provide outlinks, we can give ourselves a much more flexible system to handle complex or special-case websites. This extends the behaviour system so that we can now return a JavaScript object with information about the site. That object should contain at least the "finished" key, which is a boolean that works like the simple boolean returned by older versions. The object can additionally contain an "outlinks" key which, if present, should be an array of links for brozzler to handle as outlinks. I've retained backwards compatibility by checking to see if the returned object is a boolean and handling it like we did previously.	2025-12-19 12:05:23 -08:00
Misty De Méo	eec9234264	browser: correctly set outlinks type We inconsistently assigned this to a `list` if we skipped outlink extraction instead of a `frozenset` like we do when actually fetching them. I've annotated the method that returns it for clarity.	2025-12-19 09:26:20 -08:00
Misty De Méo	33fffdfefd	outlinks: simplify outlink parsing The outlinks are collected as HTMLAnchorElement objects. The previous version handled stringifying them by collecting the entire set of objects into a single newline-delimited string, then splitting it back up again in Python. It seems easier to just send back a JSON array of strings and have Python iterate over them that way.	2025-12-17 16:49:41 -08:00
Misty De Méo	93bb1a9a35	behaviors: add loop control selector Right now, our loop config in behaviours is purely based on the loop limit. We can be smarter than that, though - sometimes we can know when it's safe to stop clicking the same element over and over by running another selector. A good example of this is page pagination. Most sites with JavaScript-based pagination will also disable the "next" button once the browser has reached the last page of results. Using that, we can configure a behaviour that looks something like this:
```yaml
- url_regex: '^.*$'
  behavior_js_template: umbraBehavior.js.j2
  request_idle_timeout_sec: 10
  default_parameters:
    interval: 10
    actions:
      - selector: .next
        repeatSameElement: true
        repeatUntilSelector: .next[disabled]
        # the repeatUntil selector makes long loops safe
        # since we'll break out of the loop once we hit the
        # disabled button!
        limit: 1000
```
	2025-12-17 13:24:51 -08:00
Misty De Méo	bf692bf144	browser: make outlink gathering reusable The outlink JS always executes its function, which means we can't safely inject it if we want other JavaScript to be able to execute outlink extraction. This updates it to instead expose a single outlink gathering function, which is then explicitly called by the outlink gatherer after injecting the JS.	2025-12-17 13:15:28 -08:00
Misty De Méo	47f5c06ee4	deps: urllib3 2.6.1 / brotli 1.2.0 / requests 2.32.5	2025-12-11 09:34:26 -08:00
Alex Dempsey	4e2098009f	Merge pull request #424 from vbanos/get-version-opt Optimize check_version	2025-12-02 14:38:34 -08:00
vbanos	7bf52b0b79	Optimize check_version Every time we create a new `Chrome` instance, we run `check_version` which executes a subprocess that runs `chrome_exe --version`. Chrome version changes very rarely and when it does, we restart Brozzler anyway. Since we create a new `Chrome` instance every time we run a fresh `Browser` (which may be a lot), it makes sense to cache `check_version`.	2025-12-02 22:45:30 +01:00
Alex Dempsey	d12ed3af6a	Merge pull request #420 from vbanos/screenshot-limits Limit screenshot width and height	2025-11-19 10:44:26 -08:00
vbanos	63dd934fa7	Set default height to 20k and fix formatting	2025-11-19 13:24:16 +01:00
Adam Miller	c7f179438c	Merge pull request #423 from internetarchive/misty/release_1_8_1 release: 1.8.1	2025-11-13 16:45:45 -08:00
Misty De Méo	771f553572	release: 1.8.1	2025-11-13 15:37:51 -08:00
Adam Miller	78263af2f7	Merge pull request #422 from internetarchive/adam/fix_claim_sites_pre_filter fix: We were applying the max_sites_to_claim filter too early. Many s…	2025-11-13 11:30:20 -08:00
Adam Miller	4430605ef1	chore: address ci formatting	2025-11-13 11:19:04 -08:00
Adam Miller	96942d40f9	fix: We were applying the max_sites_to_claim filter too early. Many sites in a single crawl prevent claimable sites from getting through.	2025-11-13 11:14:11 -08:00
vbanos	d5b66b8e78	Limit screenshot width and height There are cases where screenshot width / height become huge. We need to limit them to avoid system overload. Set default limits of width=2k and height=10k pixels. Define `Browser` params to override limits when necessary.	2025-11-10 12:27:06 +01:00
Misty De Méo	4d1fb31bc6	ci: install deno	2025-11-07 14:57:42 -08:00
Misty De Méo	9a47de68ee	deps: bump minimum python to 3.10 3.9 is now EOL, and yt-dlp no longer supports it.	2025-10-27 12:47:28 -07:00
Misty De Méo	1678d163d4	deps: bump pluggy Fixes a warning under Python 3.14. https://docs.python.org/3.14/whatsnew/3.14.html#pep-765-control-flow-in-finally-blocks	2025-10-08 14:08:00 -07:00
Misty De Méo	44484583cb	release: 1.8.0	2025-10-07 13:42:16 -07:00
Misty De Méo	3587f6e486	pyproject: add Misty	2025-10-07 13:42:16 -07:00
Misty De Méo	04d06ca49e	deps: bump locked yt-dlp The locked version hasn't been upgraded for a while.	2025-10-07 12:06:55 -07:00
Barbara Miller	ba9e4f1be7	Merge pull request #367 from galgeek/barbara/header_request_timeout_60 increase HEADER_REQUEST_TIMEOUT	2025-10-07 12:00:15 -07:00
Barbara Miller	35153266a1	Merge branch 'master' into barbara/header_request_timeout_60	2025-10-07 11:00:39 -07:00
Misty De Méo	98a829f269	ssl: allow fetching pages needing legacy renegotiation Unsafe legacy renegotiation is disabled by default in requests, but it's needed to access some webpages that real browsers are able to safely access. This leaves it disabled by default when fetching headers, while logging and retrying with it enabled if that fails. robots.txt fetching is always done with legacy renegotiation on.	2025-10-06 09:28:36 -07:00
TheTechRobo	89d06af104	Fix workaround	2025-08-28 15:25:59 -03:00
TheTechRobo	a19d8c7814	Add test for network monitoring	2025-08-28 15:25:59 -03:00
TheTechRobo	a5b2ecbda9	Track network activity and wait for idle when visiting hashtags	2025-08-28 15:25:59 -03:00
Misty De Méo	3e82a55207	ci: migrate yt-dlp autotest to renovate This replaces the previous yt-dlp auto-test and merge workflow to use Renovate instead of Dependabot, since we've found that Dependabot is no longer able to update our dependencies.	2025-08-21 13:25:29 -07:00
renovate[bot]	2b0bc419c0	chore(config): migrate config renovate.json	2025-08-21 13:14:07 -07:00
Misty De Méo	10db4bd19f	renovate: disable everything but yt-dlp	2025-08-21 12:39:20 -07:00
Misty De Méo	9b7999989b	renovate: customizations	2025-08-21 12:39:20 -07:00
renovate[bot]	b889cedf64	Add renovate.json	2025-08-21 12:39:20 -07:00
Misty De Méo	972b816878	deps: warctools 5.0.1 Silences a noisy warning; no other changes.	2025-08-18 15:32:43 -07:00
Misty De Méo	6261ea15ad	tests: add some silenced warnings These come from a dependency we can't affect right now.	2025-08-18 15:20:12 -07:00
Misty De Méo	940dadfc12	worker: add missing import	2025-07-30 14:17:30 -07:00
Misty De Méo	5ee31cd879	browser: fix json separators	2025-07-30 14:17:30 -07:00
TheTechRobo	08bb09ff06	Add `--no-headless` option to brozzle-page and brozzler-worker CLI	2025-07-28 15:04:00 -07:00
TheTechRobo	7d7968e833	Add `headless` option to Chrome.start	2025-07-28 15:04:00 -07:00
Misty De Méo	f719b61983	docs: bump README copyright year	2025-07-28 14:19:46 -07:00
Misty De Méo	43b7e57147	docs: remove outdated README comment	2025-07-28 14:19:46 -07:00
Misty De Méo	4c77515063	deps: warctools 5.0.0 Needed for the warcprox import to work.	2025-07-21 12:40:11 -07:00
Misty De Méo	99575b03b4	ci: always run full test suite We previously ran the full suite, including test_brozzling, on a daily timer because it took an enormous amount of time to run. I'd been under the impression this was because it had to take that long to do the work it was performing, but it looks like it hadn't been necessary and the suite has been sped up massively since. We can now run it in about six and a half minutes, which is perfectly fine to run on every PR.	2025-07-21 12:40:11 -07:00
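
The `behaviors: allow extracting outlinks` commit (00edb56aa1) describes a behaviour result that may be either a bare boolean or an object with a `finished` key and an optional `outlinks` array. A minimal sketch of how the Python side could normalize both shapes is shown below. The function name `normalize_behavior_result` is hypothetical and is not taken from the brozzler source; returning a `frozenset` follows the type mentioned in eec9234264, and treating unknown shapes as "not finished" is an assumption of this sketch.

```python
from typing import Any


def normalize_behavior_result(result: Any) -> tuple[bool, frozenset[str]]:
    """Return (finished, outlinks) for either shape of behaviour result.

    Hypothetical helper for illustration; not brozzler's actual code.
    """
    # Older behaviours return a bare boolean that means "finished".
    if isinstance(result, bool):
        return result, frozenset()
    # Newer behaviours return an object with "finished" and, optionally,
    # an "outlinks" array of link strings.
    if isinstance(result, dict):
        finished = bool(result.get("finished", False))
        outlinks = result.get("outlinks") or []
        return finished, frozenset(str(link) for link in outlinks)
    # Assumption: anything else is treated as "not finished, no outlinks".
    return False, frozenset()
```

Checking for `bool` first keeps older behaviour scripts working unchanged, which is the backwards-compatibility property the commit message calls out.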

1748 commits
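
The `Optimize check_version` commit (7bf52b0b79) explains that running `chrome_exe --version` in a subprocess for every new `Chrome` instance is wasteful, because the version rarely changes during a process lifetime. One way to get that effect, sketched here under the assumption that memoizing by executable path is acceptable, is `functools.lru_cache`. The function name `cached_version` is hypothetical, and this is not necessarily how brozzler implements the cache.

```python
import functools
import subprocess
import sys


@functools.lru_cache(maxsize=None)
def cached_version(exe: str) -> str:
    """Run `exe --version` once per distinct path and remember the output.

    Hypothetical helper for illustration; not brozzler's actual code.
    """
    completed = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=True
    )
    # Some programs print their version to stderr instead of stdout.
    return (completed.stdout or completed.stderr).strip()


if __name__ == "__main__":
    # The current Python interpreter stands in for a browser binary here.
    print(cached_version(sys.executable))
    print(cached_version(sys.executable))  # served from the cache
```

The trade-off is the one the commit message accepts: a version change is only noticed after the process restarts.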