Federation catch up mode is very inefficient if the number of events
that the remote server has missed is small, since handling gaps can be
very expensive, c.f. #9492.
Instead of going into catch up mode whenever we see an error, we instead
do so only if we've backed off from trying the remote for more than an
hour (the assumption being that in such a case it is more than a
transient failure).
Background: When we receive incoming federation traffic, and notice that we are missing prev_events from
the incoming traffic, first we do a `/get_missing_events` request, and then if we still have missing prev_events,
we set up new backwards-extremities. To do that, we need to make a `/state_ids` request to ask the remote
server for the state at those prev_events, and then we may need to then ask the remote server for any events
in that state which we don't already have, as well as the auth events for those missing state events, so that we
can auth them.
This PR attempts to optimise the processing of that state request. The `state_ids` API returns a list of the state
events, as well as a list of all the auth events for *all* of those state events. The optimisation comes from the
observation that we are currently loading all of those auth events into memory at the start of the operation, but
we almost certainly aren't going to need *all* of the auth events. Rather, we can check that we have them, and
leave the actual load into memory for later. (Ideally the federation API would tell us which auth events we're
actually going to need, but it doesn't.)
The effect of this is to reduce the number of events that I need to load for an event in Matrix HQ from about
60000 to about 22000, which means it can stay in my in-memory cache, whereas previously the sheer number
of events meant that all 60K events had to be loaded from db for each request, due to the amount of cache
churn. (NB I've already tripled the size of the cache from its default of 10K).
Unfortunately I've ended up basically C&Ping `_get_state_for_room` and `_get_events_from_store_or_dest` into
a new method, because `_get_state_for_room` is also called during backfill, which expects the auth events to be
returned, so the same tricks don't work. That said, I don't really know why that codepath is completely different
(ultimately we're doing the same thing in setting up a new backwards extremity) so I've left a TODO suggesting
that we clean it up.
We either need to pass the auth provider over the replication api, or make sure
we report the auth provider on the worker that received the request. I've gone
with the latter.
Earlier [I was convinced](https://github.com/matrix-org/synapse/issues/9565) that we didn't have an Admin API for listing media uploaded by a user. Foolishly I was looking under the Media Admin API documentation, instead of the User Admin API documentation.
I thought it'd be helpful to link to the latter so others don't hit the same dead end :)
The hashes are from commits due to auto-formatting, e.g. running black.
git can be configured to use this automatically by running the following:
git config blame.ignoreRevsFile .git-blame-ignore-revs
I noticed that I'd occasionally have `scripts-dev/lint.sh` fail when messing about with config options in my PR. The script calls `scripts-dev/config-lint.sh`, which attempts some validation on the sample config.
It does this by using `sed` to edit the sample_config, and then seeing if the file changed using `git diff`.
The problem is: if you changed the sample_config as part of your commit, this script will error regardless.
This PR attempts to change the check so that existing, unstaged changes to the sample_config will not cause the script to report an invalid file.
Unfortunately this doesn't test re-joining the room since
that requires having another homeserver to query over
federation, which isn't easily doable in unit tests.
This great big stack of commits is a a whole load of hoop-jumping to make it easier to store additional values in login tokens, and then to actually store the SSO Identity Provider in the login token. (Making use of that data will follow in a subsequent PR.)
Turns out matrix.org has an event that has duplicate auth events (which really isn't supposed to happen, but here we are). This caused the background update to fail due to `UniqueViolation`.