This solves one of the existing limitations of our outlink gathering
system: we run it once, after all behaviours have completed. For
some interactive pages, particularly single-page apps with paginated
data, this means we'll miss that content entirely, since it will no
longer be in the DOM by the time extraction runs.
In a previous PR, #433, I made the outlink gathering logic reusable
by ensuring it's possible for us to dynamically call the outlink
gathering function at any time. In addition, in #429, I made it
possible for behaviours to return outlinks to Brozzler; if the
behaviour chooses to return outlinks, then Brozzler will add them to
the set it extracts after behaviours complete.
This branch builds on both of those changes. We always inject the
outlink gathering code before running behaviours, so they can now call
it at will. It also introduces two new behaviour parameters:
* `extractOutlinks` - If set to `true`, then the behaviour script will
call the outlink gathering logic and return any outlinks it harvested
to Brozzler. Defaults to `false`.
* `extractOutlinksInLoop` - If set to `true`, then the behaviour script
will gather outlinks on every iteration of the loop. This pairs well
with `repeatSameElement`, since the behaviour script can click a
`next` pagination button and then immediately gather whatever new
outlinks have appeared on the page. Defaults to `false`.
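As a sketch, enabling these in a behaviour config might look like the
following (the selector, URL regex, and exact placement of the new
parameters are illustrative):

```yaml
- url_regex: '^https?://(www\.)?example\.com/.*$'
  behavior_js_template: umbraBehavior.js.j2
  default_parameters:
    actions:
      - selector: .next
        repeatSameElement: true
        extractOutlinksInLoop: true  # gather outlinks each iteration
        limit: 50
```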
Currently, our JavaScript outlink extraction happens purely via our
non-configurable extract-outlinks.js script. However, many sites
behave unpredictably and may need special handling, so it would be
useful to configure extraction on a per-site basis. We already have a
system for interacting with sites, the behaviour system; if we expand
it to also provide outlinks, we get a much more flexible way to
handle complex or special-case websites.
This extends the behaviour system so that behaviours can now return a
JavaScript object with information about the site. That object must
contain at least a "finished" key, a boolean that works like the
simple boolean returned by older versions. The object can
additionally contain an "outlinks" key which, if present, should be
an array of links for Brozzler to handle as outlinks.
I've retained backwards compatibility by checking to see if the
returned object is a boolean and handling it like we did previously.
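A sketch of how the Python side might distinguish the two return
shapes (function and variable names are illustrative, not Brozzler's
actual ones):

```python
def handle_behavior_result(result):
    """Interpret a behaviour script's return value.

    Older behaviours return a bare boolean ("am I finished?");
    newer ones return an object with a required "finished" key
    and an optional "outlinks" array.
    """
    if isinstance(result, bool):
        # backwards compatibility: treat a bare boolean as before
        return result, []
    finished = result["finished"]
    outlinks = result.get("outlinks", [])
    return finished, outlinks
```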
When we skipped outlink extraction, we inconsistently assigned this a
`list` rather than the `frozenset` we use when actually fetching
them. I've annotated the method that returns it for clarity.
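For illustration, the fix amounts to returning the same type on both
paths (method name and signature are hypothetical):

```python
def outlinks(extraction_skipped: bool, raw_links: list[str]) -> frozenset[str]:
    # previously the skip path returned a `list`; both paths now
    # return a `frozenset`, matching the annotation
    if extraction_skipped:
        return frozenset()
    return frozenset(raw_links)
```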
The outlinks are collected as HTMLAnchorElement objects. The previous
version handled stringifying them by collecting the entire set of
objects into a single newline-delimited string, then splitting it back
up again in Python. It seems easier to just send back a JSON array of
strings and have Python iterate over them that way.
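The difference, sketched in Python (the browser-side stringification
is simulated here with literal values):

```python
import json

# old approach: the page sent one newline-delimited string,
# which Python then split back apart
blob = "https://example.com/1\nhttps://example.com/2"
old_links = blob.split("\n")

# new approach: the page serializes each anchor's href into a JSON
# array of strings, and Python just parses and iterates over it
payload = json.dumps(["https://example.com/1", "https://example.com/2"])
new_links = json.loads(payload)

assert old_links == new_links
```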
Right now, our loop config in behaviours is based purely on the loop
limit. We can be smarter than that, though - sometimes we can tell
it's safe to *stop* clicking the same element over and over by
evaluating another selector.
A good example of this is pagination. Most sites with
JavaScript-based pagination will disable the "next" button once
the browser has reached the last page of results. Using that, we can
configure a behaviour that looks something like this:
```yaml
- url_regex: '^.*$'
  behavior_js_template: umbraBehavior.js.j2
  request_idle_timeout_sec: 10
  default_parameters:
    interval: 10
    actions:
      - selector: .next
        repeatSameElement: true
        repeatUntilSelector: .next[disabled]
        # the repeatUntil selector makes long loops safe, since
        # we'll break out of the loop once we hit the disabled
        # button!
        limit: 1000
```
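The loop semantics can be sketched in Python with stubbed-out helpers
(names are illustrative; the real logic lives in the behaviour JS):

```python
def repeat_click(click, until_matches, limit):
    """Click the same element up to `limit` times, stopping early
    once the `repeatUntilSelector` equivalent starts matching."""
    clicks = 0
    for _ in range(limit):
        if until_matches():  # e.g. '.next[disabled]' appeared
            break
        click()
        clicks += 1
    return clicks

# simulate a site with 5 pages: "next" disables on the last page
state = {"page": 1}
clicks = repeat_click(
    click=lambda: state.__setitem__("page", state["page"] + 1),
    until_matches=lambda: state["page"] >= 5,
    limit=1000,
)
# the limit of 1000 is never reached; we stop after 4 clicks
```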
The outlink JS currently executes as soon as it's injected, which
means we can't safely inject it if we want other JavaScript to be
able to trigger outlink extraction itself. This updates it to instead
expose a single outlink gathering function, which the outlink
gatherer explicitly calls after injecting the JS.
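A Python analogue of the change, with `exec` standing in for script
injection (the script bodies and function name are hypothetical):

```python
# before: the injected script ran immediately, so injecting it
# again, or alongside a behaviour, re-ran extraction as a side effect
auto_running = "collected = ['https://example.com/a']"

# after: injection only *defines* a function; the outlink gatherer
# calls it explicitly when it actually wants the links
defining_only = """
def gather_outlinks():
    return ['https://example.com/a']
"""

ns = {}
exec(defining_only, ns)          # safe: nothing runs on injection
links = ns["gather_outlinks"]()  # explicit call by the gatherer
```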
Unsafe legacy renegotiation is disabled by default in requests, but
it's needed to access some webpages that real browsers are able to
safely access. This leaves it disabled by default when fetching
headers, while logging and retrying with it enabled if that fails.
robots.txt fetching is always done with legacy renegotiation on.
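A stdlib sketch of the retry pattern, using `urllib` rather than our
actual fetch code (the `getattr` fallback covers Python versions that
predate the named constant, which was added in 3.12):

```python
import ssl
import urllib.error
import urllib.request


def legacy_renegotiation_context() -> ssl.SSLContext:
    ctx = ssl.create_default_context()
    # OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT flag (raw value 0x4)
    ctx.options |= getattr(ssl, "OP_LEGACY_SERVER_CONNECT", 0x4)
    return ctx


def fetch_headers(url: str) -> dict:
    req = urllib.request.Request(url, method="HEAD")
    try:
        # first attempt: legacy renegotiation stays disabled
        with urllib.request.urlopen(req, timeout=30) as resp:
            return dict(resp.headers)
    except urllib.error.URLError:
        # some servers only complete the handshake with unsafe
        # legacy renegotiation; log (elided here) and retry with it
        ctx = legacy_renegotiation_context()
        with urllib.request.urlopen(req, timeout=30, context=ctx) as resp:
            return dict(resp.headers)
```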
This replaces the previous yt-dlp auto-test-and-merge workflow with
one based on Renovate instead of Dependabot, since we've found that
Dependabot is no longer able to update our dependencies.