Optimize how we calculate `likely_domains` during backfill because I've seen this take 17s in production just to `get_current_state` which is used to `get_domains_from_state` (see case [*2. Loading tons of events* in the `/messages` investigation issue](https://github.com/matrix-org/synapse/issues/13356)).
There are 3 ways we currently calculate hosts that are in the room:
1. `get_current_state` -> `get_domains_from_state`
- Used in `backfill` to calculate `likely_domains` and `/timestamp_to_event` because it was cargo-culted from `backfill`
- This one is being eliminated in favor of `get_current_hosts_in_room` in this PR 🕳
1. `get_current_hosts_in_room`
- Used for other federation things like sending read receipts and typing indicators
1. `get_hosts_in_room_at_events`
- Used when pushing out events over federation to other servers in the `_process_event_queue_loop`
Fix https://github.com/matrix-org/synapse/issues/13626
Part of https://github.com/matrix-org/synapse/issues/13356
Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.2tvwz3yhcafh)
### Query performance
#### Before
The query from `get_current_state` sucks just because we have to get all 80k events. And we see almost the exact same performance locally trying to get all of these events (16s vs 17s):
```
synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
Time: 16035.612 ms (00:16.036)
synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
Time: 4243.237 ms (00:04.243)
```
But what about `get_current_hosts_in_room`: When there is 8M rows in the `current_state_events` table, the previous query in `get_current_hosts_in_room` took 13s from complete freshness (when the events were first added). But takes 930ms after a Postgres restart or 390ms if running back to back to back.
```sh
$ psql synapse
synapse=# \timing on
synapse=# SELECT COUNT(DISTINCT substring(state_key FROM '@[^:]*:(.*)$'))
FROM current_state_events
WHERE
type = 'm.room.member'
AND membership = 'join'
AND room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
count
-------
4130
(1 row)
Time: 13181.598 ms (00:13.182)
synapse=# SELECT COUNT(*) from current_state_events where room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
count
-------
80814
synapse=# SELECT COUNT(*) from current_state_events;
count
---------
8162847
synapse=# SELECT pg_size_pretty( pg_total_relation_size('current_state_events') );
pg_size_pretty
----------------
4702 MB
```
#### After
I'm not sure how long it takes from complete freshness as I only really get that opportunity once (maybe restarting computer but that's cumbersome) and it's not really relevant to normal operating times. Maybe you get closer to the fresh times the more access variability there is so that Postgres caches aren't as exact. Update: The longest I've seen this run for is 6.4s and 4.5s after a computer restart.
After a Postgres restart, it takes 330ms and running back to back takes 260ms.
```sh
$ psql synapse
synapse=# \timing on
Timing is on.
synapse=# SELECT
substring(c.state_key FROM '@[^:]*:(.*)$') as host
FROM current_state_events c
/* Get the depth of the event from the events table */
INNER JOIN events AS e USING (event_id)
WHERE
c.type = 'm.room.member'
AND c.membership = 'join'
AND c.room_id = '!OGEhHVWSdvArJzumhm:matrix.org'
GROUP BY host
ORDER BY min(e.depth) ASC;
Time: 333.800 ms
```
#### Going further
To improve things further we could add a `limit` parameter to `get_current_hosts_in_room`. Realistically, we don't need 4k domains to choose from because there is no way we're going to query that many before we a) probably get an answer or b) we give up.
Another thing we can do is optimize the query to use a index skip scan:
- https://wiki.postgresql.org/wiki/Loose_indexscan
- Index Skip Scan, https://commitfest.postgresql.org/37/1741/
- https://www.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
Part of #13019
This changes all the permission-related methods to rely on the Requester instead of the UserID. This is a first step towards enabling scoped access tokens at some point, since I expect the Requester to have scope-related informations in it.
It also changes methods which figure out the user/device/appservice out of the access token to return a Requester instead of something else. This avoids having store-related objects in the methods signatures.
Inspired by the room batch handler, this uses previous event inserts to
pre-populate prev events during room creation, reducing the number of
queries required to create a room.
Signed off by Nick @ Beeper (@Fizzadar)
Complement tests: https://github.com/matrix-org/complement/pull/405
This happens when you have some messages imported before the room is created.
Then use MSC3030 to look backwards before the room creation from a remote
federated server. The server won't find anything locally, but will ask over
federation which will have the remote event. The previous logic would
choke on not having the local event assigned.
```
Failed to fetch /timestamp_to_event from hs2 because of exception(UnboundLocalError) local variable 'local_event' referenced before assignment args=("local variable 'local_event' referenced before assignment",)
```
Instead, use the `room_version` property of the event we're validating.
The `room_version` was originally added as a parameter somewhere around #4482,
but really it's been redundant since #6875 added a `room_version` field to `EventBase`.
* Update worker docs to remove group endpoints.
* Removes an unused parameter to `ApplicationService`.
* Break dependency between media repo and groups.
* Avoid copying `m.room.related_groups` state events during room upgrades.
Refactor and convert `Linearizer` to async. This makes a `Linearizer`
cancellation bug easier to fix.
Also refactor to use an async context manager, which eliminates an
unlikely footgun where code that doesn't immediately use the context
manager could forget to release the lock.
Signed-off-by: Sean Quah <seanq@element.io>
This is some odds and ends found during the review of #11791
and while continuing to work in this code:
* Return attrs classes instead of dictionaries from some methods
to improve type safety.
* Call `get_bundled_aggregations` fewer times.
* Adds a missing assertion in the tests.
* Do not return empty bundled aggregations for an event (preferring
to not include the bundle at all, as the docstring states).
This makes the serialization of events synchronous (and it no
longer access the database), but we must manually calculate and
provide the bundled aggregations.
Overall this should cause no change in behavior, but is prep work
for other improvements.
This adds some opentracing annotations to ResponseCache, to make it easier to see what's going on; in particular, it adds a link back to the initial trace which is actually doing the work of generating the response.
MSC3030: https://github.com/matrix-org/matrix-doc/pull/3030
Client API endpoint. This will also go and fetch from the federation API endpoint if unable to find an event locally or we found an extremity with possibly a closer event we don't know about.
```
GET /_matrix/client/unstable/org.matrix.msc3030/rooms/<roomID>/timestamp_to_event?ts=<timestamp>&dir=<direction>
{
"event_id": ...
"origin_server_ts": ...
}
```
Federation API endpoint:
```
GET /_matrix/federation/unstable/org.matrix.msc3030/timestamp_to_event/<roomID>?ts=<timestamp>&dir=<direction>
{
"event_id": ...
"origin_server_ts": ...
}
```
Co-authored-by: Erik Johnston <erik@matrix.org>
If `room_list_publication_rules` was configured with a rule with a
non-wildcard alias and a room was created with an alias then an
internal server error would have been thrown.
This fixes the error and properly applies the publication rules
during room creation.
Co-authored-by: Dirk Klimpel <5740567+dklimpel@users.noreply.github.com>
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
The shared ratelimit function was replaced with a dedicated
RequestRatelimiter class (accessible from the HomeServer
object).
Other properties were copied to each sub-class that inherited
from BaseHandler.
Broadly, the existing `event_auth.check` function has two parts:
* a validation section: checks that the event isn't too big, that it has the rught signatures, etc.
This bit is independent of the rest of the state in the room, and so need only be done once
for each event.
* an auth section: ensures that the event is allowed, given the rest of the state in the room.
This gets done multiple times, against various sets of room state, because it forms part of
the state res algorithm.
Currently, this is implemented with `do_sig_check` and `do_size_check` parameters, but I think
that makes everything hard to follow. Instead, we split the function in two and call each part
separately where it is needed.
This is in the context of creating new module callbacks that modules in https://github.com/matrix-org/synapse-dinsic can use, in an effort to reconcile the spam checker API in synapse-dinsic with the one in mainline.
This adds a callback that's fairly similar to user_may_create_room except it also allows processing based on the invites sent at room creation.
Adds missing type hints to methods in the synapse.handlers
module and requires all methods to have type hints there.
This also removes the unused construct_auth_difference method
from the FederationHandler.
This is part of my ongoing war against BaseHandler. I've moved kick_guest_users into RoomMemberHandler (since it calls out to that handler anyway), and split maybe_kick_guest_users into the two places it is called.
* Make historical messages available to federated servers
Part of MSC2716: https://github.com/matrix-org/matrix-doc/pull/2716
Follow-up to https://github.com/matrix-org/synapse/pull/9247
* Debug message not available on federation
* Add base starting insertion point when no chunk ID is provided
* Fix messages from multiple senders in historical chunk
Follow-up to https://github.com/matrix-org/synapse/pull/9247
Part of MSC2716: https://github.com/matrix-org/matrix-doc/pull/2716
---
Previously, Synapse would throw a 403,
`Cannot force another user to join.`,
because we were trying to use `?user_id` from a single virtual user
which did not match with messages from other users in the chunk.
* Remove debug lines
* Messing with selecting insertion event extremeties
* Move db schema change to new version
* Add more better comments
* Make a fake requester with just what we need
See https://github.com/matrix-org/synapse/pull/10276#discussion_r660999080
* Store insertion events in table
* Make base insertion event float off on its own
See https://github.com/matrix-org/synapse/pull/10250#issuecomment-875711889
Conflicts:
synapse/rest/client/v1/room.py
* Validate that the app service can actually control the given user
See https://github.com/matrix-org/synapse/pull/10276#issuecomment-876316455
Conflicts:
synapse/rest/client/v1/room.py
* Add some better comments on what we're trying to check for
* Continue debugging
* Share validation logic
* Add inserted historical messages to /backfill response
* Remove debug sql queries
* Some marker event implemntation trials
* Clean up PR
* Rename insertion_event_id to just event_id
* Add some better sql comments
* More accurate description
* Add changelog
* Make it clear what MSC the change is part of
* Add more detail on which insertion event came through
* Address review and improve sql queries
* Only use event_id as unique constraint
* Fix test case where insertion event is already in the normal DAG
* Remove debug changes
* Switch to chunk events so we can auth via power_levels
Previously, we were using `content.chunk_id` to connect one
chunk to another. But these events can be from any `sender`
and we can't tell who should be able to send historical events.
We know we only want the application service to do it but these
events have the sender of a real historical message, not the
application service user ID as the sender. Other federated homeservers
also have no indicator which senders are an application service on
the originating homeserver.
So we want to auth all of the MSC2716 events via power_levels
and have them be sent by the application service with proper
PL levels in the room.
* Switch to chunk events for federation
* Add unstable room version to support new historical PL
* Fix federated events being rejected for no state_groups
Add fix from https://github.com/matrix-org/synapse/pull/10439
until it merges.
* Only connect base insertion event to prev_event_ids
Per discussion with @erikjohnston,
https://matrix.to/#/!UytJQHLQYfvYWsGrGY:jki.re/$12bTUiObDFdHLAYtT7E-BvYRp3k_xv8w0dUQHibasJk?via=jki.re&via=matrix.org
* Make it possible to get the room_version with txn
* Allow but ignore historical events in unsupported room version
See https://github.com/matrix-org/synapse/pull/10245#discussion_r675592489
We can't reject historical events on unsupported room versions because homeservers without knowledge of MSC2716 or the new room version don't reject historical events either.
Since we can't rely on the auth check here to stop historical events on unsupported room versions, I've added some additional checks in the processing/persisting code (`synapse/storage/databases/main/events.py` -> `_handle_insertion_event` and `_handle_chunk_event`). I've had to do some refactoring so there is method to fetch the room version by `txn`.
* Move to unique index syntax
See https://github.com/matrix-org/synapse/pull/10245#discussion_r675638509
* High-level document how the insertion->chunk lookup works
* Remove create_event fallback for room_versions
See https://github.com/matrix-org/synapse/pull/10245/files#r677641879
* Use updated method name
Part of #9744
Removes all redundant `# -*- coding: utf-8 -*-` lines from files, as python 3 automatically reads source code as utf-8 now.
`Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>`
- Update black version to the latest
- Run black auto formatting over the codebase
- Run autoformatting according to [`docs/code_style.md
`](80d6dc9783/docs/code_style.md)
- Update `code_style.md` docs around installing black to use the correct version
This PR allows `ThirdPartyEventRules` modules to view, manipulate and block changes to the state of whether a room is published in the public rooms directory.
While the idea of whether a room is in the public rooms list is not kept within an event in the room, `ThirdPartyEventRules` generally deal with controlling which modifications can happen to a room. Public rooms fits within that idea, even if its toggle state isn't controlled through a state event.
The idea is that in future tokens will encode a mapping of instance to position. However, we don't want to include the full instance name in the string representation, so instead we'll have a mapping between instance name and an immutable integer ID in the DB that we can use instead. We'll then do the lookup when we serialize/deserialize the token (we could alternatively pass around an `Instance` type that includes both the name and ID, but that turns out to be a lot more invasive).
This is *not* ready for production yet. Caveats:
1. We should write some tests...
2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.