Handle race between persisting an event and un-partial stating a room (#13100)
Whenever we want to persist an event, we first compute an event context, which includes the state at the event and a flag indicating whether the state is partial. After a lot of processing, we finally try to store the event in the database, which can fail for partial state events when the containing room has been un-partial stated in the meantime. We detect the race as a foreign key constraint failure in the data store layer and turn it into a special `PartialStateConflictError` exception, which makes its way up to the method in which we computed the event context.

To make things difficult, the exception needs to cross a replication request: `/fed_send_events` for events coming over federation and `/send_event` for events from clients. We transport the `PartialStateConflictError` as a `409 Conflict` over replication and turn `409`s back into `PartialStateConflictError`s on the worker making the request (see the sketch after this message).

All client events go through `EventCreationHandler.handle_new_client_event`, which is called in *a lot* of places. Instead of trying to update all the code which creates client events, we turn the `PartialStateConflictError` into a `429 Too Many Requests` in `EventCreationHandler.handle_new_client_event` and hope that clients take it as a hint to retry their request.

On the federation event side, there are 7 places which compute event contexts. 4 of them use outlier event contexts: `FederationEventHandler._auth_and_persist_outliers_inner`, `FederationHandler.do_knock`, `FederationHandler.on_invite_request` and `FederationHandler.do_remotely_reject_invite`. These events won't have the partial state flag, so we do not need to do anything for them.

The remaining 3 paths which create events are `FederationEventHandler.process_remote_join`, `FederationEventHandler.on_send_membership_event` and `FederationEventHandler._process_received_pdu`. We can't experience the race in `process_remote_join`, unless we're handling an additional join into a partial state room, which currently blocks, so we make no attempt to handle it correctly. `on_send_membership_event` is only called by `FederationServer._on_send_membership_event`, so we catch the `PartialStateConflictError` there and retry just once. `_process_received_pdu` is called by `on_receive_pdu` for incoming events and by `_process_pulled_event` for backfill. The latter should never try to persist partial state events, so we ignore it. We catch the `PartialStateConflictError` in `on_receive_pdu` and retry just once.

Referring to the graph of code paths in https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648 may make the above clearer.

Signed-off-by: Sean Quah <seanq@matrix.org>
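As a rough illustration of the replication transport described above, here is a minimal sketch of the worker-side conversion, assuming a simplified wrapper: `call_persistence_endpoint` and its `send_request` argument are hypothetical, not Synapse's actual replication client code; only `PartialStateConflictError`, `HttpResponseException` and the `409` status come from the commit message.

```python
# Minimal sketch (hypothetical wrapper, not Synapse's real replication client)
# of turning a 409 Conflict from `/fed_send_events` or `/send_event` back into
# a `PartialStateConflictError` on the worker making the request.
from synapse.api.errors import HttpResponseException
from synapse.storage.databases.main.events import PartialStateConflictError


async def call_persistence_endpoint(send_request):
    """`send_request` is a hypothetical coroutine performing the HTTP call."""
    try:
        return await send_request()
    except HttpResponseException as e:
        if e.code == 409:
            # The event persister hit the partial-state race; re-raise the
            # original exception type so callers can handle or retry it.
            raise PartialStateConflictError()
        raise
```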
parent 6ba732fefe
commit 68db233f0c
10 changed files with 234 additions and 74 deletions
synapse/handlers/federation.py:

```diff
@@ -45,6 +45,7 @@ from synapse.api.errors import (
     FederationDeniedError,
     FederationError,
     HttpResponseException,
+    LimitExceededError,
     NotFoundError,
     RequestSendFailed,
     SynapseError,
@@ -64,6 +65,7 @@ from synapse.replication.http.federation import (
     ReplicationCleanRoomRestServlet,
     ReplicationStoreRoomOnOutlierMembershipRestServlet,
 )
+from synapse.storage.databases.main.events import PartialStateConflictError
 from synapse.storage.databases.main.events_worker import EventRedactBehaviour
 from synapse.storage.state import StateFilter
 from synapse.types import JsonDict, StateMap, get_domain_from_id
@@ -549,15 +551,29 @@ class FederationHandler:
             # https://github.com/matrix-org/synapse/issues/12998
             await self.store.store_partial_state_room(room_id, ret.servers_in_room)
 
-            max_stream_id = await self._federation_event_handler.process_remote_join(
-                origin,
-                room_id,
-                auth_chain,
-                state,
-                event,
-                room_version_obj,
-                partial_state=ret.partial_state,
-            )
+            try:
+                max_stream_id = (
+                    await self._federation_event_handler.process_remote_join(
+                        origin,
+                        room_id,
+                        auth_chain,
+                        state,
+                        event,
+                        room_version_obj,
+                        partial_state=ret.partial_state,
+                    )
+                )
+            except PartialStateConflictError as e:
+                # The homeserver was already in the room and it is no longer partial
+                # stated. We ought to be doing a local join instead. Turn the error into
+                # a 429, as a hint to the client to try again.
+                # TODO(faster_joins): `_should_perform_remote_join` suggests that we may
+                #   do a remote join for restricted rooms even if we have full state.
+                logger.error(
+                    "Room %s was un-partial stated while processing remote join.",
+                    room_id,
+                )
+                raise LimitExceededError(msg=e.msg, errcode=e.errcode, retry_after_ms=0)
 
             if ret.partial_state:
                 # Kick off the process of asynchronously fetching the state for this
@@ -1567,11 +1583,6 @@ class FederationHandler:
 
                     # we raced against more events arriving with partial state. Go round
                     # the loop again. We've already logged a warning, so no need for more.
-                    # TODO(faster_joins): there is still a race here, whereby incoming events which raced
-                    # with us will fail to be persisted after the call to `clear_partial_state_room` due to
-                    # having partial state.
-                    # https://github.com/matrix-org/synapse/issues/12988
-                    #
                     continue
 
             events = await self.store.get_events_as_list(
```
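The retry-once handling in `FederationServer._on_send_membership_event` and `on_receive_pdu` described in the commit message is not part of the hunks shown above. The following is a minimal sketch of that pattern under stated assumptions: `persist_pdu` is a hypothetical stand-in for the real persistence call, not Synapse's actual code.

```python
# Minimal sketch of the "retry just once" pattern from the commit message;
# `persist_pdu` is a hypothetical stand-in for the real persistence call in
# `on_receive_pdu` / `FederationServer._on_send_membership_event`.
import logging

from synapse.storage.databases.main.events import PartialStateConflictError

logger = logging.getLogger(__name__)


async def persist_with_one_retry(persist_pdu, origin, pdu):
    try:
        await persist_pdu(origin, pdu)
    except PartialStateConflictError:
        # The room was un-partial stated while we were persisting. Recomputing
        # the event context now sees the room with full state, so a single
        # retry suffices; a second failure would not be this race.
        logger.info(
            "Room %s was un-partial stated during persistence; retrying event %s",
            pdu.room_id,
            pdu.event_id,
        )
        await persist_pdu(origin, pdu)
```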