Handle race between persisting an event and un-partial stating a room (#13100)

Whenever we want to persist an event, we first compute an event context,
which includes the state at the event and a flag indicating whether the
state is partial. After a lot of processing, we finally try to store the
event in the database, which can fail for partial state events when the
containing room has been un-partial stated in the meantime.

We detect the race as a foreign key constraint failure in the data store
layer and turn it into a special `PartialStateConflictError` exception,
which makes its way up to the method in which we computed the event
context.

To make things difficult, the exception needs to cross a replication
request: `/fed_send_events` for events coming over federation and
`/send_event` for events from clients. We transport the
`PartialStateConflictError` as a `409 Conflict` over replication and
turn `409`s back into `PartialStateConflictError`s on the worker making
the request.

All client events go through
`EventCreationHandler.handle_new_client_event`, which is called in
*a lot* of places. Instead of trying to update all the code which
creates client events, we turn the `PartialStateConflictError` into a
`429 Too Many Requests` in
`EventCreationHandler.handle_new_client_event` and hope that clients
take it as a hint to retry their request.

On the federation event side, there are 7 places which compute event
contexts. 4 of them use outlier event contexts:
`FederationEventHandler._auth_and_persist_outliers_inner`,
`FederationHandler.do_knock`, `FederationHandler.on_invite_request` and
`FederationHandler.do_remotely_reject_invite`. These events won't have
the partial state flag, so we do not need to do anything for then.

The remaining 3 paths which create events are
`FederationEventHandler.process_remote_join`,
`FederationEventHandler.on_send_membership_event` and
`FederationEventHandler._process_received_pdu`.

We can't experience the race in `process_remote_join`, unless we're
handling an additional join into a partial state room, which currently
blocks, so we make no attempt to handle it correctly.

`on_send_membership_event` is only called by
`FederationServer._on_send_membership_event`, so we catch the
`PartialStateConflictError` there and retry just once.

`_process_received_pdu` is called by `on_receive_pdu` for incoming
events and `_process_pulled_event` for backfill. The latter should never
try to persist partial state events, so we ignore it. We catch the
`PartialStateConflictError` in `on_receive_pdu` and retry just once.

Refering to the graph of code paths in
https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648
may make the above make more sense.

Signed-off-by: Sean Quah <seanq@matrix.org>
This commit is contained in:
Sean Quah 2022-07-05 16:12:52 +01:00 committed by GitHub
parent 6ba732fefe
commit 68db233f0c
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
10 changed files with 234 additions and 74 deletions

View file

@ -64,6 +64,7 @@ from synapse.replication.http.federation import (
ReplicationFederationSendEventsRestServlet,
)
from synapse.state import StateResolutionStore
from synapse.storage.databases.main.events import PartialStateConflictError
from synapse.storage.databases.main.events_worker import EventRedactBehaviour
from synapse.storage.state import StateFilter
from synapse.types import (
@ -275,7 +276,16 @@ class FederationEventHandler:
affected=pdu.event_id,
)
await self._process_received_pdu(origin, pdu, state_ids=None)
try:
await self._process_received_pdu(origin, pdu, state_ids=None)
except PartialStateConflictError:
# The room was un-partial stated while we were processing the PDU.
# Try once more, with full state this time.
logger.info(
"Room %s was un-partial stated while processing the PDU, trying again.",
room_id,
)
await self._process_received_pdu(origin, pdu, state_ids=None)
async def on_send_membership_event(
self, origin: str, event: EventBase
@ -306,6 +316,9 @@ class FederationEventHandler:
Raises:
SynapseError if the event is not accepted into the room
PartialStateConflictError if the room was un-partial stated in between
computing the state at the event and persisting it. The caller should
retry exactly once in this case.
"""
logger.debug(
"on_send_membership_event: Got event: %s, signatures: %s",
@ -423,6 +436,8 @@ class FederationEventHandler:
Raises:
SynapseError if the response is in some way invalid.
PartialStateConflictError if the homeserver is already in the room and it
has been un-partial stated.
"""
create_event = None
for e in state:
@ -1084,10 +1099,14 @@ class FederationEventHandler:
state_ids: Normally None, but if we are handling a gap in the graph
(ie, we are missing one or more prev_events), the resolved state at the
event
event. Must not be partial state.
backfilled: True if this is part of a historical batch of events (inhibits
notification to clients, and validation of device keys.)
PartialStateConflictError: if the room was un-partial stated in between
computing the state at the event and persisting it. The caller should retry
exactly once in this case. Will never be raised if `state_ids` is provided.
"""
logger.debug("Processing event: %s", event)
assert not event.internal_metadata.outlier
@ -1933,6 +1952,9 @@ class FederationEventHandler:
event: The event itself.
context: The event context.
backfilled: True if the event was backfilled.
PartialStateConflictError: if attempting to persist a partial state event in
a room that has been un-partial stated.
"""
# this method should not be called on outliers (those code paths call
# persist_events_and_notify directly.)
@ -1985,6 +2007,10 @@ class FederationEventHandler:
Returns:
The stream ID after which all events have been persisted.
Raises:
PartialStateConflictError: if attempting to persist a partial state event in
a room that has been un-partial stated.
"""
if not event_and_contexts:
return self._store.get_room_max_stream_ordering()
@ -1993,14 +2019,19 @@ class FederationEventHandler:
if instance != self._instance_name:
# Limit the number of events sent over replication. We choose 200
# here as that is what we default to in `max_request_body_size(..)`
for batch in batch_iter(event_and_contexts, 200):
result = await self._send_events(
instance_name=instance,
store=self._store,
room_id=room_id,
event_and_contexts=batch,
backfilled=backfilled,
)
try:
for batch in batch_iter(event_and_contexts, 200):
result = await self._send_events(
instance_name=instance,
store=self._store,
room_id=room_id,
event_and_contexts=batch,
backfilled=backfilled,
)
except SynapseError as e:
if e.code == HTTPStatus.CONFLICT:
raise PartialStateConflictError()
raise
return result["max_stream_id"]
else:
assert self._storage_controllers.persistence