* Make lock better handle process being killed
If the process gets killed and restarted (so that it didn't have a
chance to drop its locks gracefully) then there may still be locks in
the DB that are for the same instance that haven't yet timed out but are
safe to delete.
We handle this case by a) checking if the current instance already has
taken out the lock, and b) if not then ignoring locks that are for the
same instance.
* Periodically check for old staged events
This is to protect against other instances dying and their locks timing
out.
This fixes a "Event not signed by authorising server" error when
transition room member from join -> join, e.g. when updating a
display name or avatar URL for restricted rooms.
* Factor more stuff out of `_get_events_and_persist`
It turns out that the event-sorting algorithm in `_get_events_and_persist` is
also useful in other circumstances. Here we move the current
`_auth_and_persist_fetched_events` to `_auth_and_persist_fetched_events_inner`,
and then factor the sorting part out to `_auth_and_persist_fetched_events`.
* `_get_remote_auth_chain_for_event`: remove redundant `outlier` assignment
`get_event_auth` returns events with the outlier flag already set, so this is
redundant (though we need to update a test where `get_event_auth` is mocked).
* `_get_remote_auth_chain_for_event`: move existing-event tests earlier
Move a couple of tests outside the loop. This is a bit inefficient for now, but
a future commit will make it better. It should be functionally identical.
* `_get_remote_auth_chain_for_event`: use `_auth_and_persist_fetched_events`
We can use the same codepath for persisting the events fetched as part of an
auth chain as for those fetched individually by `_get_events_and_persist` for
building the state at a backwards extremity.
* `_get_remote_auth_chain_for_event`: use a dict for efficiency
`_auth_and_persist_fetched_events` sorts the events itself, so we no longer
need to care about maintaining the ordering from `get_event_auth` (and no
longer need to sort by depth in `get_event_auth`).
That means that we can use a map, making it easier to filter out events we
already have, etc.
* changelog
* `_auth_and_persist_fetched_events`: improve docstring
Here we split on_receive_pdu into two functions (on_receive_pdu and process_pulled_event), rather than having both cases in the same method. There's a tiny bit of overlap, but not that much.
If the new /hierarchy API does not exist on all destinations,
fallback to querying the /spaces API and translating the results.
This is a backwards compatibility hack since not all of the
federated homeservers will update at the same time.
* Include outlier status in `str(event)`
In places where we log event objects, knowing whether or not you're dealing
with an outlier is super useful.
* Remove duplicated logging in get_missing_events
When we process events received from get_missing_events, we log them twice
(once in `_get_missing_events_for_pdu`, and once in `on_receive_pdu`). Reduce
the duplication by removing the logging in `on_receive_pdu`, and ensuring the
call sites do sensible logging.
* log in `on_receive_pdu` when we already have the event
* Log which prev_events we are missing
* changelog
Instead of wrapping the JSON into an object, this creates concrete
instances for Transaction and Edu. This allows for improved type
hints and simplified code.
If the federation client receives an M_UNABLE_TO_AUTHORISE_JOIN or
M_UNABLE_TO_GRANT_JOIN response it will attempt another server
before giving up completely.
Improves type hints for:
* parse_{boolean,integer}
* parse_{boolean,integer}_from_args
* parse_json_{value,object}_from_request
And fixes any incorrect calls that resulted from unknown types.
This is to help with performance, where trying to connect to thousands
of hosts at once can consume a lot of CPU (due to TLS etc).
Co-authored-by: Brendan Abolivier <babolivier@matrix.org>
Fixes#9490
This will break a couple of SyTest that are expecting failures to be added to the response of a federation /send, which obviously doesn't happen now that things are asynchronous.
Two drawbacks:
Currently there is no logic to handle any events left in the staging area after restart, and so they'll only be handled on the next incoming event in that room. That can be fixed separately.
We now only process one event per room at a time. This can be fixed up further down the line.
The idea here is to stop people sending things that aren't joins/leaves/knocks through these endpoints: previously you could send anything you liked through them. I wasn't able to find any security holes from doing so, but it doesn't sound like a good thing.