Weakness in auth chain indexing allows DoS from remote room members
through disk fill and high CPU usage.
A remote Matrix user with malicious intent, sharing a room with Synapse
instances before 1.104.1, can dispatch specially crafted events to
exploit a weakness in how the auth chain cover index is calculated. This
can induce high CPU consumption and accumulate excessive data in the
database of such instances, resulting in a denial of service.
Servers in private federations, or those that do not federate, are not
affected.
During the migration the automated script to update the copyright
headers accidentally got rid of some of the existing copyright lines.
Reinstate them.
For now this maintains compatibility with old Synapses by falling back
to using transaction semantics on a per-access-token basis. A future
version of Synapse will drop support for this.
Updates the database schema to require a thread_id (by adding a
constraint that the column is non-null) for event_push_actions,
event_push_actions_staging, and event_push_actions_summary.
For PostgreSQL we add the constraint as NOT VALID, then
VALIDATE the constraint in a background job to avoid locking
the table during an upgrade.
Each table is updated as a separate schema delta to avoid
deadlocks between them.
For SQLite we simply rebuild the table & copy the data.
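
As a rough illustration of the two paths described above, the sketch below holds the PostgreSQL statements as plain strings (they are not executed here) and runs the SQLite rebuild-and-copy against an in-memory database. The constraint name, the two-column table shape, and the helper `rebuild_with_not_null` are assumptions made for the example, not the actual schema delta.

```python
import sqlite3

# PostgreSQL path (not executed here): add the check without scanning the
# table, then validate it later. The constraint name is hypothetical.
POSTGRES_ADD = (
    "ALTER TABLE event_push_actions "
    "ADD CONSTRAINT event_push_actions_thread_id "
    "CHECK (thread_id IS NOT NULL) NOT VALID"
)
POSTGRES_VALIDATE = (
    "ALTER TABLE event_push_actions "
    "VALIDATE CONSTRAINT event_push_actions_thread_id"
)


def rebuild_with_not_null(conn: sqlite3.Connection) -> None:
    """SQLite path: create a stricter table, copy the rows, swap the names."""
    conn.executescript(
        """
        CREATE TABLE event_push_actions_new (
            event_id TEXT NOT NULL,
            thread_id TEXT NOT NULL
        );
        INSERT INTO event_push_actions_new (event_id, thread_id)
            SELECT event_id, thread_id FROM event_push_actions;
        DROP TABLE event_push_actions;
        ALTER TABLE event_push_actions_new RENAME TO event_push_actions;
        """
    )


# Simplified stand-in for the real table: thread_id starts out nullable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_push_actions (event_id TEXT NOT NULL, thread_id TEXT)"
)
conn.execute("INSERT INTO event_push_actions VALUES ('$a', 'main')")
rebuild_with_not_null(conn)
```

After the rebuild, inserting a row with a null `thread_id` is rejected by SQLite, which is the property the delta is after.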
Updates the database schema to require a thread_id (by adding a
constraint that the column is non-null) for event_push_actions,
event_push_actions_staging, and event_push_actions_summary.
For PostgreSQL we add the constraint as NOT VALID, then
VALIDATE the constraint in a background job to avoid locking
the table during an upgrade.
For SQLite we simply rebuild the table & copy the data.
Clean-up from adding the thread_id column, which was initially
null but backfilled with values. It is desirable to require it to now
be non-null.
In addition to altering this column to be non-null, we clean up
obsolete background jobs, indexes, and just-in-time updating
code.
* Add `event_stream_ordering` column to membership state tables
Specifically this adds the column to `current_state_events`,
`local_current_membership` and `room_memberships`. Each of these tables
is regularly joined with the `events` table to get the stream ordering
and denormalising this into each table will yield significant query
performance improvements once used.
* Make denormalised `event_stream_ordering` columns foreign keys
* Add comment in schema file explaining new denormalised columns
* Add triggers to enforce consistency of `event_stream_ordering` columns
* Re-order purge room tables to account for foreign keys
* Bump schema version to 75
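
A minimal sketch of how a trigger could keep a denormalised `event_stream_ordering` column consistent with `events`, using SQLite in memory. The cut-down table definitions, the trigger name, and the `try_insert` helper are assumptions for illustration only; the real schema and triggers may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        event_id TEXT PRIMARY KEY,
        stream_ordering INTEGER NOT NULL
    );
    CREATE TABLE room_memberships (
        event_id TEXT NOT NULL REFERENCES events (event_id),
        user_id TEXT NOT NULL,
        event_stream_ordering INTEGER
    );
    -- Hypothetical trigger: abort inserts whose denormalised value
    -- disagrees with the matching row in `events`.
    CREATE TRIGGER room_memberships_check_ordering
    BEFORE INSERT ON room_memberships
    FOR EACH ROW
    BEGIN
        SELECT RAISE(ABORT, 'event_stream_ordering mismatch')
        WHERE EXISTS (
            SELECT 1 FROM events
            WHERE events.event_id = NEW.event_id
              AND events.stream_ordering != NEW.event_stream_ordering
        );
    END;
    """
)


def try_insert(event_id: str, user_id: str, ordering: int) -> bool:
    """Return True if the membership row was accepted by the trigger."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO room_memberships VALUES (?, ?, ?)",
                (event_id, user_id, ordering),
            )
        return True
    except sqlite3.DatabaseError:
        return False


with conn:
    conn.execute("INSERT INTO events VALUES ('$a', 7)")

ok = try_insert("$a", "@alice:example.org", 7)   # matches events
bad = try_insert("$a", "@alice:example.org", 8)  # disagrees with events
```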
Co-authored-by: David Robertson <david.m.robertson1@gmail.com>
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
When retrieving counts of notifications, segment the results based on the
thread ID, but choose whether to return them as individual threads or as
a single summed field by letting the client opt in via a sync flag.
The summarization code is also updated to be per thread, instead of per
room.
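
The opt-in behaviour can be sketched roughly as follows. The function name, the `(thread_id, count)` row shape, and the `"total"` key used for the summed case are all hypothetical; this only illustrates segmenting by thread and then either keeping or collapsing the segments.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def notification_counts(
    rows: Iterable[Tuple[str, int]], per_thread: bool
) -> Dict[str, int]:
    """Aggregate (thread_id, notify_count) rows for a single room.

    With per_thread=True the counts stay keyed by thread ID; otherwise
    they are collapsed into one summed field (hypothetical key "total").
    """
    by_thread: Counter = Counter()
    for thread_id, count in rows:
        by_thread[thread_id] += count
    if per_thread:
        return dict(by_thread)
    return {"total": sum(by_thread.values())}


rows = [("main", 2), ("$thread_root", 3), ("main", 1)]
threaded = notification_counts(rows, per_thread=True)
summed = notification_counts(rows, per_thread=False)
```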
Adds a `thread_id` column to the `event_push_actions`, `event_push_actions_staging`,
and `event_push_summary` tables. This will allow notifications to be segmented by the thread
in a future pull request. The `thread_id` column stores the root event ID or the special
value `"main"`.
The `thread_id` column for `event_push_actions` and `event_push_summary` is
backfilled with `"main"` for all existing rows. New entries into `event_push_actions`
and `event_push_actions_staging` will get the proper thread ID.
`receipts_linearized` and `receipts_graph` also gain a `thread_id` column, which is similar,
except `NULL` is a special value meaning the receipt is "unthreaded".
See MSC3771 and MSC3773 for where this data will be useful.
* Remove redundant references to `event_edges.room_id`
We don't need to care about the room_id here, because we are already checking
the event id.
* Clean up the event_edges table
We make a number of changes to `event_edges`:
* We give the `room_id` and `is_state` columns defaults (null and false
respectively) so that we can stop populating them.
* We drop any rows that have `is_state` set true - they should no longer
exist.
* We drop any rows that do not exist in `events` - these should not exist
either.
* We drop the old unique constraint on all the columns, which wasn't much use.
* We create a new unique index on `(event_id, prev_event_id)`.
* We add a foreign key constraint to `events`.
These happen rather differently depending on whether we are on Postgres or
SQLite. For SQLite, we just rebuild the whole table, copying only the rows we
want to keep. For Postgres, we try to do things in the background as much as
possible.
* Stop populating `event_edges.room_id` and `is_state`
We can just rely on the defaults.
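
For the SQLite side, the rebuild can be pictured with the sketch below, run against an in-memory database. The reduced column set, the `new_event_edges` name, the index name, and the use of `DISTINCT` to keep the new unique index satisfiable are assumptions for the example rather than the actual delta.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (event_id TEXT PRIMARY KEY);
    CREATE TABLE event_edges (
        event_id TEXT NOT NULL,
        prev_event_id TEXT NOT NULL,
        room_id TEXT NOT NULL,
        is_state BOOL NOT NULL
    );
    INSERT INTO events VALUES ('$a'), ('$b');
    INSERT INTO event_edges VALUES
        ('$b', '$a', '!r', 0),     -- kept
        ('$b', '$a', '!r', 1),     -- dropped: is_state is true
        ('$gone', '$a', '!r', 0);  -- dropped: not present in events

    -- Rebuild: defaults for room_id / is_state, foreign key to events,
    -- and only the rows we want to keep are copied across.
    CREATE TABLE new_event_edges (
        event_id TEXT NOT NULL REFERENCES events (event_id),
        prev_event_id TEXT NOT NULL,
        room_id TEXT NULL,
        is_state BOOL NOT NULL DEFAULT 0
    );
    INSERT INTO new_event_edges (event_id, prev_event_id)
        SELECT DISTINCT ee.event_id, ee.prev_event_id
        FROM event_edges AS ee
        JOIN events AS e ON e.event_id = ee.event_id
        WHERE NOT ee.is_state;
    DROP TABLE event_edges;
    ALTER TABLE new_event_edges RENAME TO event_edges;
    CREATE UNIQUE INDEX event_edges_event_id_prev_event_id_idx
        ON event_edges (event_id, prev_event_id);
    """
)

remaining = conn.execute(
    "SELECT event_id, prev_event_id, room_id, is_state FROM event_edges"
).fetchall()
```

Because the copy only names `event_id` and `prev_event_id`, the other two columns fall back to their defaults, which mirrors the later step of no longer populating them.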
This is a first step in dealing with #7721.
The idea is basically that rather than calculating the full set of users a device list update needs to be sent to up front, we instead simply record the rooms the user was in at the time of the change. This will allow a few things:
1. we can defer calculating the set of remote servers that need to be poked about the change; and
2. during `/sync` and `/keys/changes` we can also avoid calculating users who share rooms with other users, and instead just look at the rooms that have changed.
However, care needs to be taken to correctly handle server downgrades. As such this PR writes to both `device_lists_changes_in_room` and the `device_lists_outbound_pokes` table synchronously. In a future release we can then bump the database schema compat version to `69` and then we can assume that the new `device_lists_changes_in_room` exists and is handled.
There is a temporary option to disable writing to `device_lists_outbound_pokes` synchronously, allowing us to test the new code path does work (and by implication upgrading to a future release and downgrading to this one will work correctly).
Note: Ideally we'd do the calculation of room to servers on a worker (e.g. the background worker), but currently only master can write to the `device_lists_outbound_pokes` table.
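
The idea of recording rooms up front and deferring the server calculation might look something like the following in-memory sketch. The list standing in for `device_lists_changes_in_room`, both function names, and the `room_servers` mapping are hypothetical and do not reflect Synapse's storage layer.

```python
from typing import Dict, List, Set, Tuple

# (user_id, device_id, room_id) rows, standing in for the
# `device_lists_changes_in_room` table.
changes_in_room: List[Tuple[str, str, str]] = []


def record_device_change(
    user_id: str, device_id: str, joined_rooms: Set[str]
) -> None:
    """Cheap step at change time: note only the rooms the user is in."""
    for room_id in sorted(joined_rooms):
        changes_in_room.append((user_id, device_id, room_id))


def servers_to_poke(
    room_servers: Dict[str, Set[str]], local_server: str
) -> Set[str]:
    """Deferred step: expand the recorded rooms into remote servers."""
    servers: Set[str] = set()
    for _user_id, _device_id, room_id in changes_in_room:
        servers |= room_servers.get(room_id, set())
    servers.discard(local_server)
    return servers


record_device_change("@alice:hs1", "DEV1", {"!a", "!b"})
pokes = servers_to_poke(
    {"!a": {"hs1", "hs2"}, "!b": {"hs1", "hs3"}}, local_server="hs1"
)
```

The same recorded rows could also serve the `/sync` and `/keys/changes` case, since a reader only needs to check whether any room it cares about appears in them.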
Switching to a sequence means there's no need to track `last_txn` on the
AS state table to generate new TXN IDs. This also means that there is
no longer contention between the AS scheduler and AS handler on updates
to the `application_services_state` table, which will prevent serialization
errors during the complete AS txn transaction.