forked-synapse/synapse/storage
Sean Quah 68db233f0c
Handle race between persisting an event and un-partial stating a room (#13100)
Whenever we want to persist an event, we first compute an event context,
which includes the state at the event and a flag indicating whether the
state is partial. After a lot of processing, we finally try to store the
event in the database, which can fail for partial state events when the
containing room has been un-partial stated in the meantime.

We detect the race as a foreign key constraint failure in the data store
layer and turn it into a special `PartialStateConflictError` exception,
which makes its way up to the method in which we computed the event
context.

To make things difficult, the exception needs to cross a replication
request: `/fed_send_events` for events coming over federation and
`/send_event` for events from clients. We transport the
`PartialStateConflictError` as a `409 Conflict` over replication and
turn `409`s back into `PartialStateConflictError`s on the worker making
the request.

All client events go through
`EventCreationHandler.handle_new_client_event`, which is called in
*a lot* of places. Instead of trying to update all the code which
creates client events, we turn the `PartialStateConflictError` into a
`429 Too Many Requests` in
`EventCreationHandler.handle_new_client_event` and hope that clients
take it as a hint to retry their request.

On the federation event side, there are 7 places which compute event
contexts. 4 of them use outlier event contexts:
`FederationEventHandler._auth_and_persist_outliers_inner`,
`FederationHandler.do_knock`, `FederationHandler.on_invite_request` and
`FederationHandler.do_remotely_reject_invite`. These events won't have
the partial state flag, so we do not need to do anything for then.

The remaining 3 paths which create events are
`FederationEventHandler.process_remote_join`,
`FederationEventHandler.on_send_membership_event` and
`FederationEventHandler._process_received_pdu`.

We can't experience the race in `process_remote_join`, unless we're
handling an additional join into a partial state room, which currently
blocks, so we make no attempt to handle it correctly.

`on_send_membership_event` is only called by
`FederationServer._on_send_membership_event`, so we catch the
`PartialStateConflictError` there and retry just once.

`_process_received_pdu` is called by `on_receive_pdu` for incoming
events and `_process_pulled_event` for backfill. The latter should never
try to persist partial state events, so we ignore it. We catch the
`PartialStateConflictError` in `on_receive_pdu` and retry just once.

Refering to the graph of code paths in
https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648
may make the above make more sense.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-07-05 16:12:52 +01:00
..
controllers Handle race between persisting an event and un-partial stating a room (#13100) 2022-07-05 16:12:52 +01:00
databases Handle race between persisting an event and un-partial stating a room (#13100) 2022-07-05 16:12:52 +01:00
engines Use dummy fallback engines if imports fail (#12979) 2022-06-07 17:33:55 +01:00
schema Merge remote-tracking branch 'origin/release-v1.62' into develop 2022-06-30 13:27:24 -04:00
util Wait for lazy join to complete when getting current state (#12872) 2022-06-01 16:02:53 +01:00
__init__.py Rename storage classes (#12913) 2022-05-31 12:17:50 +00:00
_base.py Reduce state pulled from DB due to sending typing and receipts over federation (#12964) 2022-06-06 16:46:11 +01:00
background_updates.py Replace noop background updates with DELETE. (#12954) 2022-06-13 14:06:27 -04:00
database.py Don't actually one-line the SQL statements we send to the DB (#13129) 2022-06-30 10:43:24 +02:00
keys.py Use auto_attribs/native type hints for attrs classes. (#11692) 2022-01-13 13:49:28 +00:00
prepare_database.py Use dummy fallback engines if imports fail (#12979) 2022-06-07 17:33:55 +01:00
push_rule.py Remove redundant "coding: utf-8" lines (#9786) 2021-04-14 15:34:27 +01:00
roommember.py Do not include rooms with an unknown room version in a sync response. (#10644) 2021-08-19 11:12:55 -04:00
state.py Faster joins: add issue links to the TODOs (#13004) 2022-06-09 10:13:03 +00:00
types.py Tidy up and type-hint the database engine modules (#12734) 2022-05-17 00:34:38 +01:00