This is #17291 (which got reverted), with some added fixups, and changes so that the tests actually pick up the error.
The problem was that we were not calculating any new chain IDs due to a
missing `not` in a condition.
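Purely as an illustration (the helper below is made up, not the real chain cover code), the regression was equivalent to the `not` in a check like this being dropped, so no event was ever treated as needing a new chain ID:

```python
def events_needing_chain_ids(event_ids, existing_chain_ids):
    # With the `not` dropped, this would return events that *already* had
    # chain IDs, so no new chain IDs were ever calculated.
    return [eid for eid in event_ids if eid not in existing_chain_ids]

print(events_needing_chain_ids(["$a", "$b"], {"$a"}))  # ['$b']
```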
Reduce the replication traffic of device lists by no longer sending every destination that needs the device list update over replication. Instead, a "hosts to send to have been calculated" notification is sent over replication, and the federation senders then read the destinations from the DB.
For non-federation senders this should heavily reduce the impact of a
user in many large rooms changing a device.
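Roughly, the shape of the change looks like the sketch below, using in-memory stand-ins; the real code goes through Synapse's replication streams and the `device_lists_outbound_pokes` store:

```python
outbound_pokes = [
    # (user_id, device_id, destination) rows written when the device changes.
    ("@alice:example.org", "DEV1", "remote1.example.org"),
    ("@alice:example.org", "DEV1", "remote2.example.org"),
]

def replicate_old(send):
    # Before: one replication row per destination.
    for user_id, device_id, destination in outbound_pokes:
        send(("device_list_update", user_id, device_id, destination))

def replicate_new(send):
    # After: a single "hosts have been calculated" notification; federation
    # senders then read the destinations from the database themselves.
    send(("device_list_hosts_calculated", "@alice:example.org", "DEV1"))

def destinations_from_db(user_id, device_id):
    return [d for (u, dev, d) in outbound_pokes if (u, dev) == (user_id, device_id)]

replicate_new(print)
print(destinations_from_db("@alice:example.org", "DEV1"))
```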
This reverts commit bdf82efea5 (#17291)
The reverted change seems to have stopped persisting auth chains for new events, and so is causing state resolution to fall back to the slow methods.
We calculate the auth chain links outside of the main persist event
transaction to ensure that we do not block other event sending during
the calculation.
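A minimal sketch of that two-phase shape, with a heavily simplified stand-in schema rather than the real chain cover tables:

```python
import sqlite3

def calculate_chain_links(events):
    # Stand-in for the (potentially expensive) auth chain cover calculation.
    return [(e["event_id"], a) for e in events for a in e["auth_events"]]

def persist_events(conn, events):
    links = calculate_chain_links(events)  # computed outside the transaction
    with conn:                             # short, write-only transaction
        conn.executemany(
            "INSERT INTO chain_links (origin, target) VALUES (?, ?)", links
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chain_links (origin TEXT, target TEXT)")
persist_events(conn, [{"event_id": "$b", "auth_events": ["$a"]}])
print(conn.execute("SELECT * FROM chain_links").fetchall())  # [('$b', '$a')]
```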
Sort is no longer configurable: we always sort rooms by the `stream_ordering` of the last event in the room, or, in cases of leave/ban/invite/knock, by the point up to which the user can see.
Add `event.internal_metadata.instance_name` (the worker instance that persisted the event) to go alongside the existing `event.internal_metadata.stream_ordering`.
`instance_name` is useful to properly compare and query for events with a token since you need to compare both the `stream_ordering` and `instance_name` against the vector clock/`instance_map` in the `RoomStreamToken`.
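As a self-contained illustration (made-up names, not the real `RoomStreamToken` API) of why the instance name has to come along for the comparison:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Token:
    stream: int                                    # default position
    instance_map: Dict[str, int] = field(default_factory=dict)

    def pos_for(self, instance_name: str) -> int:
        # Each writer can be at a different position (the "vector clock").
        return self.instance_map.get(instance_name, self.stream)

def before_or_at(token: Token, instance_name: str, stream_ordering: int) -> bool:
    # Comparing stream_ordering alone is not enough; the same number means
    # different things depending on which worker persisted the event.
    return stream_ordering <= token.pos_for(instance_name)

token = Token(stream=100, instance_map={"worker1": 105})
print(before_or_at(token, "worker1", 103))  # True
print(before_or_at(token, "worker2", 103))  # False
```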
This is pre-requisite work and may be used in https://github.com/element-hq/synapse/pull/17293
Adding `event.internal_metadata.instance_name` was first mentioned in the initial Sliding Sync PR while pairing with @erikjohnston, see 09609cb0db (diff-5cd773fb307aa754bd3948871ba118b1ef0303f4d72d42a2d21e38242bf4e096R405-R410)
PR where this was introduced: https://github.com/matrix-org/synapse/pull/14817
### What does this affect?
`get_last_event_in_room_before_stream_ordering(...)` is used in Sync v2 in a lot of different state calculations.
`get_last_event_in_room_before_stream_ordering(...)` is also used in `/rooms/{roomId}/members`
https://github.com/matrix-org/matrix-spec-proposals/pull/4151
This is intended to be enabled by default for immediate use. When FCP is complete, the unstable endpoint will be dropped and the stable endpoint supported instead; no backwards compatibility is expected for the unstable endpoint.
If the stream ID in the unconverted table is ahead of the device lists ID generator, then any `/sync` request that used an ID from ahead of the generator can break.
The fix is to make sure we add the unconverted table to the list of
tables we check at start up.
Broke in https://github.com/element-hq/synapse/pull/17229
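Illustratively (simplified stand-in code, not the real ID generator), the startup position needs to be the maximum over every table that feeds the stream, including the previously unconverted one:

```python
import sqlite3

def startup_position(conn, tables):
    # The generator must start at the max over *all* tables that feed the
    # stream, or a table that has raced ahead leaves it behind.
    return max(
        conn.execute(f"SELECT COALESCE(MAX(stream_id), 0) FROM {table}").fetchone()[0]
        for table in tables
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_lists_stream (stream_id INTEGER)")
conn.execute("CREATE TABLE device_lists_changes_in_room (stream_id INTEGER)")
conn.execute("INSERT INTO device_lists_changes_in_room VALUES (7)")
# Including the previously unconverted table means we start at 7, not 0.
print(startup_position(conn, ["device_lists_stream", "device_lists_changes_in_room"]))
```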
Use fully-qualified `PersistedEventPosition` (`instance_name` and `stream_ordering`) when returning `RoomsForUser` to facilitate proper comparisons and `RoomStreamToken` generation.
Spawning from https://github.com/element-hq/synapse/pull/17187 where we want to utilize this change
Otherwise things will get confused.
An alternative would be to make sure that for lagging streams we don't return anything (and that the returned `next_batch` token doesn't go backwards), but that is a faff.
There is a problem with `StreamIdGenerator` where it can go backwards
over restarts when a stream ID is requested but then not inserted into
the DB. This is problematic if we want to land #17215, and is generally
a potential cause for all sorts of nastiness.
Instead of trying to fix `StreamIdGenerator`, we may as well move to `MultiWriterIdGenerator`, which does not suffer from this problem (the latest positions are stored in the `stream_positions` table). This involves adding SQLite support to the class.

This only changes the ID generators that were already using `MultiWriterIdGenerator` under Postgres; a separate PR will move the remaining uses of `StreamIdGenerator` over.
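A simplified, self-contained illustration of the difference; these are not the real classes, and the real generator persists positions periodically rather than on every allocation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (stream_id INTEGER)")
conn.execute("CREATE TABLE stream_positions (instance TEXT PRIMARY KEY, position INTEGER)")

# StreamIdGenerator-style: the next ID is derived from what made it into the table.
def next_id_from_table():
    (current,) = conn.execute("SELECT COALESCE(MAX(stream_id), 0) FROM events").fetchone()
    return current + 1

allocated = next_id_from_table()   # 1, handed out (e.g. baked into a token)...
# ...but never inserted; after a restart the same ID is handed out again, so
# the stream has effectively gone backwards relative to tokens already issued.
assert next_id_from_table() == allocated

# MultiWriterIdGenerator-style: positions survive in stream_positions, so a
# requested-but-unused ID is never reissued.
def next_id_persisted(instance="master"):
    row = conn.execute(
        "SELECT position FROM stream_positions WHERE instance = ?", (instance,)
    ).fetchone()
    new = (row[0] if row else 0) + 1
    conn.execute(
        "INSERT INTO stream_positions (instance, position) VALUES (?, ?)"
        " ON CONFLICT(instance) DO UPDATE SET position = excluded.position",
        (instance, new),
    )
    return new

print(next_id_persisted(), next_id_persisted())  # 1 2 (never goes backwards)
```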
Re-introduces #17191, and includes #17197 and #17214
The basic idea is to stop calling `get_rooms_for_user` everywhere, and
instead use the table `device_lists_changes_in_room`.
Commits reviewable one-by-one.
It's almost always more efficient to query the rooms that have device list changes, rather than looking at the list of all users whose devices have changed and then looking for shared rooms.
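The query shape this is getting at, as a rough sketch (the column names follow the real table, the surrounding code is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE device_lists_changes_in_room"
    " (user_id TEXT, device_id TEXT, room_id TEXT, stream_id INTEGER)"
)
conn.execute("INSERT INTO device_lists_changes_in_room VALUES ('@a:x', 'DEV', '!r1:x', 5)")

def rooms_with_device_list_changes(from_id, to_id):
    # One indexed range scan over the changes table, instead of calling
    # get_rooms_for_user for every user whose devices changed.
    return {
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT room_id FROM device_lists_changes_in_room"
            " WHERE stream_id > ? AND stream_id <= ?",
            (from_id, to_id),
        )
    }

print(rooms_with_device_list_changes(0, 10))  # {'!r1:x'}
```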
Weakness in auth chain indexing allows DoS from remote room members
through disk fill and high CPU usage.
A remote Matrix user with malicious intent, sharing a room with Synapse
instances before 1.104.1, can dispatch specially crafted events to
exploit a weakness in how the auth chain cover index is calculated. This
can induce high CPU consumption and accumulate excessive data in the
database of such instances, resulting in a denial of service.
Servers in private federations, or those that do not federate, are not
affected.
Resurrecting https://github.com/matrix-org/synapse/pull/13918.
This should reduce the IOPs incurred by joining to the events table to look up the stream ordering, which happens in many receipt-handling code paths. Like the previous PR, I believe sufficient time has passed between the original migration in DB schema 72 and now to merge this as-is. It's highly unlikely that both the migration is still ongoing AND that (active) users still have any receipts from prior to that date.
In the unlikely event that there is a receipt without a populated `event_stream_ordering`, Synapse will behave just as it does now when receipts exist for events that don't (yet): for push action calculation the receipts are just ignored.
I've removed the validation on event IDs as this is already covered
here:
59ceabcb97/synapse/handlers/receipts.py (L189-L192)
Before, we were pulling out *all* read receipts for a user for every event we pushed, which also pulled out the event rows for each receipt, causing load on the events table. Instead, let's only pull out the relevant receipts.
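A simplified sketch of the new query shape, loosely based on the `receipts_linearized` columns; the helper itself is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE receipts_linearized"
    " (user_id TEXT, room_id TEXT, thread_id TEXT, event_stream_ordering INTEGER)"
)
conn.execute("INSERT INTO receipts_linearized VALUES ('@a:x', '!r:x', NULL, 5)")

def unread_after_receipt(user_id, room_id, thread_id, event_stream_ordering):
    # Only the receipt for this room (and thread) is fetched, and the
    # denormalised event_stream_ordering avoids a join to the events table.
    (last_read,) = conn.execute(
        "SELECT COALESCE(MAX(event_stream_ordering), 0) FROM receipts_linearized"
        " WHERE user_id = ? AND room_id = ?"
        " AND (thread_id = ? OR thread_id IS NULL)",
        (user_id, room_id, thread_id),
    ).fetchone()
    return event_stream_ordering > last_read

print(unread_after_receipt("@a:x", "!r:x", None, 7))  # True
print(unread_after_receipt("@a:x", "!r:x", None, 3))  # False
```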
Fixes https://github.com/element-hq/synapse/issues/16680, as well as a related bug where servers to which we had *never* successfully sent an event would not be retried.
In order to fix the case of pending to-device messages, we hook into the
existing `wake_destinations_needing_catchup` process, by extending it to
look for destinations that have pending to-device messages. The
federation transmission loop then attempts to send the pending to-device
messages as normal.
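Roughly, the catch-up query grows a union with the to-device outbox, as in this simplified sketch (the `pdu_catchup` table is a stand-in for the existing un-sent-PDUs bookkeeping):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the existing "destinations with un-sent PDUs" bookkeeping.
conn.execute("CREATE TABLE pdu_catchup (destination TEXT)")
# Queue of to-device messages waiting to be sent to remote servers.
conn.execute("CREATE TABLE device_federation_outbox (destination TEXT, stream_id INTEGER)")
conn.execute("INSERT INTO device_federation_outbox VALUES ('remote.example.org', 3)")

def destinations_needing_catchup():
    rows = conn.execute(
        "SELECT destination FROM pdu_catchup"
        " UNION "
        "SELECT DISTINCT destination FROM device_federation_outbox"
    )
    return [r[0] for r in rows]

# remote.example.org is woken up even though it only has pending to-device
# messages, and the normal federation transmission loop then sends them.
print(destinations_needing_catchup())  # ['remote.example.org']
```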