Optimize how we calculate `likely_domains` during backfill because I've seen this take 17s in production just to `get_current_state` which is used to `get_domains_from_state` (see case [*2. Loading tons of events* in the `/messages` investigation issue](https://github.com/matrix-org/synapse/issues/13356)).
There are 3 ways we currently calculate hosts that are in the room:
1. `get_current_state` -> `get_domains_from_state`
- Used in `backfill` to calculate `likely_domains` and `/timestamp_to_event` because it was cargo-culted from `backfill`
- This one is being eliminated in favor of `get_current_hosts_in_room` in this PR 🕳
1. `get_current_hosts_in_room`
- Used for other federation things like sending read receipts and typing indicators
1. `get_hosts_in_room_at_events`
- Used when pushing out events over federation to other servers in the `_process_event_queue_loop`
Fix https://github.com/matrix-org/synapse/issues/13626
Part of https://github.com/matrix-org/synapse/issues/13356
Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.2tvwz3yhcafh)
### Query performance
#### Before
The query from `get_current_state` sucks just because we have to get all 80k events. And we see almost the exact same performance locally trying to get all of these events (16s vs 17s):
```
synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
Time: 16035.612 ms (00:16.036)
synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
Time: 4243.237 ms (00:04.243)
```
But what about `get_current_hosts_in_room`: When there is 8M rows in the `current_state_events` table, the previous query in `get_current_hosts_in_room` took 13s from complete freshness (when the events were first added). But takes 930ms after a Postgres restart or 390ms if running back to back to back.
```sh
$ psql synapse
synapse=# \timing on
synapse=# SELECT COUNT(DISTINCT substring(state_key FROM '@[^:]*:(.*)$'))
FROM current_state_events
WHERE
type = 'm.room.member'
AND membership = 'join'
AND room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
count
-------
4130
(1 row)
Time: 13181.598 ms (00:13.182)
synapse=# SELECT COUNT(*) from current_state_events where room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
count
-------
80814
synapse=# SELECT COUNT(*) from current_state_events;
count
---------
8162847
synapse=# SELECT pg_size_pretty( pg_total_relation_size('current_state_events') );
pg_size_pretty
----------------
4702 MB
```
#### After
I'm not sure how long it takes from complete freshness as I only really get that opportunity once (maybe restarting computer but that's cumbersome) and it's not really relevant to normal operating times. Maybe you get closer to the fresh times the more access variability there is so that Postgres caches aren't as exact. Update: The longest I've seen this run for is 6.4s and 4.5s after a computer restart.
After a Postgres restart, it takes 330ms and running back to back takes 260ms.
```sh
$ psql synapse
synapse=# \timing on
Timing is on.
synapse=# SELECT
substring(c.state_key FROM '@[^:]*:(.*)$') as host
FROM current_state_events c
/* Get the depth of the event from the events table */
INNER JOIN events AS e USING (event_id)
WHERE
c.type = 'm.room.member'
AND c.membership = 'join'
AND c.room_id = '!OGEhHVWSdvArJzumhm:matrix.org'
GROUP BY host
ORDER BY min(e.depth) ASC;
Time: 333.800 ms
```
#### Going further
To improve things further we could add a `limit` parameter to `get_current_hosts_in_room`. Realistically, we don't need 4k domains to choose from because there is no way we're going to query that many before we a) probably get an answer or b) we give up.
Another thing we can do is optimize the query to use a index skip scan:
- https://wiki.postgresql.org/wiki/Loose_indexscan
- Index Skip Scan, https://commitfest.postgresql.org/37/1741/
- https://www.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
Use dedicated `get_local_users_in_room` to find local users when calculating `join_authorised_via_users_server` ("the authorising user for joining a restricted room") of a `/make_join` request.
Found while working on https://github.com/matrix-org/synapse/pull/13575#discussion_r953023755 but it's not related.
Part of #13019
This changes all the permission-related methods to rely on the Requester instead of the UserID. This is a first step towards enabling scoped access tokens at some point, since I expect the Requester to have scope-related informations in it.
It also changes methods which figure out the user/device/appservice out of the access token to return a Requester instead of something else. This avoids having store-related objects in the methods signatures.
Use a state filter or accept partial state in a few places where we
request state, to avoid blocking.
To make lazy-loading `/sync`s work, we need to provide the memberships
of event senders, which are not guaranteed to be in the room state.
Instead we dig through auth events for memberships to present to
clients. The auth events of an event are guaranteed to contain a
passable membership event, otherwise the event would have been rejected.
Note that this only covers the common code paths encountered during
testing. There has been no exhaustive checking of all sync code paths.
Fixes#13146.
Signed-off-by: Sean Quah <seanq@matrix.org>
Add some miscellaneous comments to document sync, especially around
`compute_state_delta`.
Signed-off-by: Sean Quah <seanq@matrix.org>
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
This adds support for the stable identifiers of MSC2285 while
continuing to support the unstable identifiers behind the configuration
flag. These will be removed in a future version.
Signed-off-by: Andrew Doh <andrewddo@gmail.com>
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
Co-authored-by: Andrew Morgan <andrewm@element.io>
Co-authored-by: Brendan Abolivier <babolivier@matrix.org>
Previously, `_resolve_state_at_missing_prevs` returned the resolved
state before an event and a partial state flag. These were unwieldy to
carry around would only ever be used to build an event context. Build
the event context directly instead.
Signed-off-by: Sean Quah <seanq@matrix.org>
==============================
This RC reintroduces support for `account_threepid_delegates.email`, which was removed in 1.64.0rc1. It remains deprecated and will be removed altogether in a future release. ([\#13406](https://github.com/matrix-org/synapse/issues/13406))
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEv27Axt/F4vrTL/8QOSor00I9eP8FAmLj4a8ACgkQOSor00I9
eP8biQf/c8yY2mbeRZcBKtp6yoQCRYQvboSMEXyi+dLe1hNqdhSZwRQcAoFuAFwE
WdScDvoTaElUxv0v6eCI1x9CoXnZ6xpDShvK39j5Yhzv+1tNsm5Uq9imyG3jK5i6
U/3Gt6CrCsS01VkGslQ3B5I6MFtbC6ZZK9O48yg+GD8Oqw2HH/gllr5swyVbKdbc
GGhRBHvgXn+w6d/KnKt8uRxJqIpDt9JMga+WdB8CwFR5WnWbGdw24KsyxmBuOLC3
caQRiluJL/X4jApUpfsJMBBd/jrDod5wWDFO/4P+v0+2d3Ts+hKezZbt5h1VIYSw
szZXbzxn5RNDkNiJDpOOOMYQ5DXGmA==
=3/nK
-----END PGP SIGNATURE-----
Merge tag 'v1.64.0rc2' into develop
Synapse 1.64.0rc2 (2022-07-29)
==============================
This RC reintroduces support for `account_threepid_delegates.email`, which was removed in 1.64.0rc1. It remains deprecated and will be removed altogether in a future release. ([\#13406](https://github.com/matrix-org/synapse/issues/13406))
The `room_id` field represented the parent space for each room
and was made redundant by changes in the API shape where the
`children_state` is now nested underneath each `room`.
The room ID of each child is in the `state_key` field and is still
available.
Avoid blocking on full state in `_resolve_state_at_missing_prevs` and
return a new flag indicating whether the resolved state is partial.
Thread that flag around so that it makes it into the event context.
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
Previously, TLS could only be used with STARTTLS.
Add a new option `force_tls`, where TLS is used from the start.
Implicit TLS is recommended over STARTLS,
see https://datatracker.ietf.org/doc/html/rfc8314Fixes#8046.
Signed-off-by: Jan Schär <jan@jschaer.ch>
See #10826 and #10786 for context as to why we had to disable pruning on
those caches.
Now that `get_users_who_share_room_with_user` is called frequently only
for presence, we just need to make calls to it less frequent and then we
can remove the various levels of caching that is going on.
When a room has the partial state flag, we may not have an accurate
`m.room.member` event for event senders in the room's current state, and
so cannot perform soft fail checks correctly. Skip the soft fail check
entirely in this case.
As an alternative, we could block until we have full state, but that
would prevent us from receiving incoming events over federation, which
is undesirable.
Signed-off-by: Sean Quah <seanq@matrix.org>
More prep work for asyncronous caching, also makes all process_replication_rows methods consistent (presence handler already is so).
Signed off by Nick @ Beeper (@Fizzadar)
* Replace `get_new_events_for_appservice` with `get_all_new_events_stream`
The functions were near identical and this brings the AS worker closer
to the way federation senders work which can allow for multiple workers
to handle AS traffic.
* Pull received TS alongside events when processing the stream
This avoids an extra query -per event- when both federation sender
and appservice pusher process events.
There is a corner in `_check_event_auth` (long known as "the weird corner") where, if we get an event with auth_events which don't match those we were expecting, we attempt to resolve the diffence between our state and the remote's with a state resolution.
This isn't specced, and there's general agreement we shouldn't be doing it.
However, it turns out that the faster-joins code was relying on it, so we need to introduce something similar (but rather simpler) for that.
* Drop support for v1 unbind
Signed-off-by: Jacek Kusnierz <jacek.kusnierz@tum.de>
* Add changelog
Signed-off-by: Jacek Kusnierz <jacek.kusnierz@tum.de>
* Update changelog.d/13240.misc
* Drop support for delegating email validation
Delegating email validation to an IS is insecure (since it allows the owner of
the IS to do a password reset on your HS), and has long been deprecated. It
will now cause a config error at startup.
* Update unit test which checks for email verification
Give it an `email` config instead of a threepid delegate
* Remove unused method `requestEmailToken`
* Simplify config handling for email verification
Rather than an enum and a boolean, all we need here is a single bool, which
says whether we are or are not doing email verification.
* update docs
* changelog
* upgrade.md: fix typo
* update version number
this will be in 1.64, not 1.63
* update version number
this one too
Inspired by the room batch handler, this uses previous event inserts to
pre-populate prev events during room creation, reducing the number of
queries required to create a room.
Signed off by Nick @ Beeper (@Fizzadar)
Complement tests: https://github.com/matrix-org/complement/pull/405
This happens when you have some messages imported before the room is created.
Then use MSC3030 to look backwards before the room creation from a remote
federated server. The server won't find anything locally, but will ask over
federation which will have the remote event. The previous logic would
choke on not having the local event assigned.
```
Failed to fetch /timestamp_to_event from hs2 because of exception(UnboundLocalError) local variable 'local_event' referenced before assignment args=("local variable 'local_event' referenced before assignment",)
```
Bounce recalculation of current state to the correct event persister and
move recalculation of current state into the event persistence queue, to
avoid concurrent updates to a room's current state.
Also give recalculation of a room's current state a real stream
ordering.
Signed-off-by: Sean Quah <seanq@matrix.org>
Whenever we want to persist an event, we first compute an event context,
which includes the state at the event and a flag indicating whether the
state is partial. After a lot of processing, we finally try to store the
event in the database, which can fail for partial state events when the
containing room has been un-partial stated in the meantime.
We detect the race as a foreign key constraint failure in the data store
layer and turn it into a special `PartialStateConflictError` exception,
which makes its way up to the method in which we computed the event
context.
To make things difficult, the exception needs to cross a replication
request: `/fed_send_events` for events coming over federation and
`/send_event` for events from clients. We transport the
`PartialStateConflictError` as a `409 Conflict` over replication and
turn `409`s back into `PartialStateConflictError`s on the worker making
the request.
All client events go through
`EventCreationHandler.handle_new_client_event`, which is called in
*a lot* of places. Instead of trying to update all the code which
creates client events, we turn the `PartialStateConflictError` into a
`429 Too Many Requests` in
`EventCreationHandler.handle_new_client_event` and hope that clients
take it as a hint to retry their request.
On the federation event side, there are 7 places which compute event
contexts. 4 of them use outlier event contexts:
`FederationEventHandler._auth_and_persist_outliers_inner`,
`FederationHandler.do_knock`, `FederationHandler.on_invite_request` and
`FederationHandler.do_remotely_reject_invite`. These events won't have
the partial state flag, so we do not need to do anything for then.
The remaining 3 paths which create events are
`FederationEventHandler.process_remote_join`,
`FederationEventHandler.on_send_membership_event` and
`FederationEventHandler._process_received_pdu`.
We can't experience the race in `process_remote_join`, unless we're
handling an additional join into a partial state room, which currently
blocks, so we make no attempt to handle it correctly.
`on_send_membership_event` is only called by
`FederationServer._on_send_membership_event`, so we catch the
`PartialStateConflictError` there and retry just once.
`_process_received_pdu` is called by `on_receive_pdu` for incoming
events and `_process_pulled_event` for backfill. The latter should never
try to persist partial state events, so we ignore it. We catch the
`PartialStateConflictError` in `on_receive_pdu` and retry just once.
Refering to the graph of code paths in
https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648
may make the above make more sense.
Signed-off-by: Sean Quah <seanq@matrix.org>
When we fail to persist a federation event, we kick off a task to remove
its push actions in the background, using the current logging context.
Since we don't `await` that task, we may finish our logging context
before the task finishes. There's no reason to not `await` the task, so
let's do that.
Signed-off-by: Sean Quah <seanq@matrix.org>
* Add auth events to events used in tests
* Move some event auth checks out to a different method
Some of the event auth checks apply to an event's auth_events, rather than the
state at the event - which means they can play no part in state
resolution. Move them out to a separate method.
* Rename check_auth_rules_for_event
Now it only checks the state-dependent auth rules, it needs a better name.
Fixes#11887 hopefully.
The core change here is that `event_push_summary` now holds a summary of counts up until a much more recent point, meaning that the range of rows we need to count in `event_push_actions` is much smaller.
This needs two major changes:
1. When we get a receipt we need to recalculate `event_push_summary` rather than just delete it
2. The logic for deleting `event_push_actions` is now divorced from calculating `event_push_summary`.
In future it would be good to calculate `event_push_summary` while we persist a new event (it should just be a case of adding one to the relevant rows in `event_push_summary`), as that will further simplify the get counts logic and remove the need for us to periodically update `event_push_summary` in a background job.
This simplifies the access token verification logic by removing the `rights`
parameter which was only ever used for the unsubscribe link in email
notifications. The latter has been moved under the `/_synapse` namespace,
since it is not a standard API.
This also makes the email verification link more secure, by embedding the
app_id and pushkey in the macaroon and verifying it. This prevents the user
from tampering the query parameters of that unsubscribe link.
Macaroon generation is refactored:
- Centralised all macaroon generation and verification logic to the
`MacaroonGenerator`
- Moved to `synapse.utils`
- Changed the constructor to require only a `Clock`, hostname, and a secret key
(instead of a full `Homeserver`).
- Added tests for all methods.
Instead, use the `room_version` property of the event we're checking.
The `room_version` was originally added as a parameter somewhere around #4482,
but really it's been redundant since #6875 added a `room_version` field to `EventBase`.
Instead, use the `room_version` property of the event we're validating.
The `room_version` was originally added as a parameter somewhere around #4482,
but really it's been redundant since #6875 added a `room_version` field to `EventBase`.
By always using delete_devices and sometimes passing a list
with a single device ID.
Previously these methods had gotten out of sync with each
other and it seems there's little benefit to the single-device
variant.
* Update worker docs to remove group endpoints.
* Removes an unused parameter to `ApplicationService`.
* Break dependency between media repo and groups.
* Avoid copying `m.room.related_groups` state events during room upgrades.
Currently, we try to pull the event corresponding to a sync token from the database. However, when
we fetch redaction events, we check the target of that redaction (because we aren't allowed to send
redactions to clients without validating them). So, if the sync token points to a redaction of an event
that we don't have, we have a problem.
It turns out we don't really need that event, and can just work with its ID and metadata, which
sidesteps the whole problem.