Commit Graph

608 Commits

Author SHA1 Message Date
Sean Quah
68db233f0c
Handle race between persisting an event and un-partial stating a room (#13100)
Whenever we want to persist an event, we first compute an event context,
which includes the state at the event and a flag indicating whether the
state is partial. After a lot of processing, we finally try to store the
event in the database, which can fail for partial state events when the
containing room has been un-partial stated in the meantime.

We detect the race as a foreign key constraint failure in the data store
layer and turn it into a special `PartialStateConflictError` exception,
which makes its way up to the method in which we computed the event
context.

To make things difficult, the exception needs to cross a replication
request: `/fed_send_events` for events coming over federation and
`/send_event` for events from clients. We transport the
`PartialStateConflictError` as a `409 Conflict` over replication and
turn `409`s back into `PartialStateConflictError`s on the worker making
the request.

All client events go through
`EventCreationHandler.handle_new_client_event`, which is called in
*a lot* of places. Instead of trying to update all the code which
creates client events, we turn the `PartialStateConflictError` into a
`429 Too Many Requests` in
`EventCreationHandler.handle_new_client_event` and hope that clients
take it as a hint to retry their request.

On the federation event side, there are 7 places which compute event
contexts. 4 of them use outlier event contexts:
`FederationEventHandler._auth_and_persist_outliers_inner`,
`FederationHandler.do_knock`, `FederationHandler.on_invite_request` and
`FederationHandler.do_remotely_reject_invite`. These events won't have
the partial state flag, so we do not need to do anything for then.

The remaining 3 paths which create events are
`FederationEventHandler.process_remote_join`,
`FederationEventHandler.on_send_membership_event` and
`FederationEventHandler._process_received_pdu`.

We can't experience the race in `process_remote_join`, unless we're
handling an additional join into a partial state room, which currently
blocks, so we make no attempt to handle it correctly.

`on_send_membership_event` is only called by
`FederationServer._on_send_membership_event`, so we catch the
`PartialStateConflictError` there and retry just once.

`_process_received_pdu` is called by `on_receive_pdu` for incoming
events and `_process_pulled_event` for backfill. The latter should never
try to persist partial state events, so we ignore it. We catch the
`PartialStateConflictError` in `on_receive_pdu` and retry just once.

Refering to the graph of code paths in
https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648
may make the above make more sense.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-07-05 16:12:52 +01:00
David Robertson
97e9fbe1b2
Type annotations in synapse.databases.main.devices (#13025)
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
2022-06-15 15:20:04 +00:00
Patrick Cloke
cf05258f76
Remove groups replication code. (#12900)
The replication logic for groups is no longer used, so the message
passing infrastructure can be removed.
2022-05-31 13:04:08 -04:00
Erik Johnston
1e453053cb
Rename storage classes (#12913) 2022-05-31 12:17:50 +00:00
reivilibre
39dee30f01
Send USER_IP commands on a different Redis channel, in order to reduce traffic to workers that do not process these commands. (#12809) 2022-05-20 15:28:23 +01:00
reivilibre
177b884ad7
Lay some foundation work to allow workers to only subscribe to some kinds of messages, reducing replication traffic. (#12672) 2022-05-19 16:29:08 +01:00
Andrew Morgan
83be72d76c
Add StreamKeyType class and replace string literals with constants (#12567) 2022-05-16 15:35:31 +00:00
Sean Quah
a559c8b0d9
Respect the @cancellable flag for ReplicationEndpoints (#12700)
While `ReplicationEndpoint`s register themselves via `JsonResource`,
they pass a method that calls the handler, instead of the handler itself,
to `register_paths`. As a result, `JsonResource` will not correctly pick
up the `@cancellable` flag and we have to apply it ourselves.

Signed-off-by: Sean Quah <seanq@element.io>
2022-05-11 12:25:39 +01:00
Shay
d80a7ab151
Update replication.md with info on TCP module structure (#12621) 2022-05-09 14:46:43 -07:00
Šimon Brandner
ef86cf3d28
Update _on_new_receipts() to work with MSC2285 changes. (#12636) 2022-05-05 13:25:51 +00:00
Erik Johnston
c0379d6e5b
Reduce log spam when running multiple event persisters (#12610) 2022-05-05 10:20:23 +01:00
Erik Johnston
d1cd96ce29
Add opentracing spans to calls to external cache (#12380) 2022-04-07 13:18:29 +01:00
Sean Quah
800ba87cc8
Refactor and convert Linearizer to async (#12357)
Refactor and convert `Linearizer` to async. This makes a `Linearizer`
cancellation bug easier to fix.

Also refactor to use an async context manager, which eliminates an
unlikely footgun where code that doesn't immediately use the context
manager could forget to release the lock.

Signed-off-by: Sean Quah <seanq@element.io>
2022-04-05 15:43:52 +01:00
Erik Johnston
66053b6bfb
Prefill more stream change caches. (#12372) 2022-04-05 14:26:41 +01:00
Erik Johnston
b446c99ac9
Prefill the device_list_stream_cache (#12367)
* Prefill the device_list_stream_cache

* Newsfile

* Newsfile
2022-04-04 20:12:25 +01:00
Erik Johnston
5c9e39e619
Track device list updates per room. (#12321)
This is a first step in dealing with #7721.

The idea is basically that rather than calculating the full set of users a device list update needs to be sent to up front, we instead simply record the rooms the user was in at the time of the change. This will allow a few things:

1. we can defer calculating the set of remote servers that need to be poked about the change; and
2. during `/sync` and `/keys/changes` we can avoid also avoid calculating users who share rooms with other users, and instead just look at the rooms that have changed.

However, care needs to be taken to correctly handle server downgrades. As such this PR writes to both `device_lists_changes_in_room` and the `device_lists_outbound_pokes` table synchronously. In a future release we can then bump the database schema compat version to `69` and then we can assume that the new `device_lists_changes_in_room` exists and is handled.

There is a temporary option to disable writing to `device_lists_outbound_pokes` synchronously, allowing us to test the new code path does work (and by implication upgrading to a future release and downgrading to this one will work correctly).

Note: Ideally we'd do the calculation of room to servers on a worker (e.g. the background worker), but currently only master can write to the `device_list_outbound_pokes` table.
2022-04-04 15:25:20 +01:00
reivilibre
f871222880
Move update_client_ip background job from the main process to the background worker. (#12251) 2022-04-01 13:08:55 +01:00
David Robertson
a2b00a4486
Bump black and click versions (#12320) 2022-03-29 10:41:19 +00:00
reivilibre
4a53f35737
Improve code documentation for the typing stream over replication. (#12211) 2022-03-11 14:00:15 +00:00
Patrick Cloke
3e4af36bc8
Rename get_tcp_replication to get_replication_command_handler. (#12192)
Since the object it returns is a ReplicationCommandHandler.

This is clean-up from adding support to Redis where the command handler
was added as an additional layer of abstraction from the TCP protocol.
2022-03-10 13:01:56 +00:00
Nick Mills-Barrett
180d8ff0d4
Retry some http replication failures (#12182)
This allows for the target process to be down for around a minute
which provides time for restarts during synapse upgrades/config updates.

Closes: #12178

Signed off by Nick Mills-Barrett nick@beeper.com
2022-03-09 14:53:28 +00:00
Patrick Cloke
d8bab6793c
Fix incorrect type hints for txredis. (#12042)
Some properties were marked as RedisProtocol instead of ConnectionHandler,
which wraps RedisProtocol instance(s).
2022-03-08 07:26:05 -05:00
Erik Johnston
423cca9efe
Spread out sending device lists to remote hosts (#12132) 2022-03-04 11:48:15 +00:00
Richard van der Hoff
e24ff8ebe3
Remove HomeServer.get_datastore() (#12031)
The presence of this method was confusing, and mostly present for backwards
compatibility. Let's get rid of it.

Part of #11733
2022-02-23 11:04:02 +00:00
Erik Johnston
6d14b3dabf
Better error message when failing to request from another process (#12060) 2022-02-22 15:52:08 +00:00
Patrick Cloke
d0e78af35e
Add missing type hints to synapse.replication. (#11938) 2022-02-08 11:03:08 -05:00
Patrick Cloke
6c0984e3f0
Remove unnecessary ignores due to Twisted upgrade. (#11939)
Twisted 22.1.0 fixed some internal type hints, allowing Synapse
to remove ignore calls for parameters to connectTCP.
2022-02-08 09:15:59 -05:00
Patrick Cloke
63d90f10ec
Add missing type hints to synapse.replication.http. (#11856) 2022-02-08 07:44:39 -05:00
Richard van der Hoff
2277275485
Stop reading from event_reference_hashes (#11794)
Preparation for dropping this table altogether. Part of #6574.
2022-01-21 09:18:10 +00:00
Patrick Cloke
10a88ba91c
Use auto_attribs/native type hints for attrs classes. (#11692) 2022-01-13 13:49:28 +00:00
Richard van der Hoff
2359ee3864
Remove redundant get_current_events_token (#11643)
* Push `get_room_{min,max_stream_ordering}` into StreamStore

Both implementations of this are identical, so we may as well push it down and
get rid of the abstract base class nonsense.

* Remove redundant `StreamStore` class

This is empty now

* Remove redundant `get_current_events_token`

This was an exact duplicate of `get_room_max_stream_ordering`, so let's get rid
of it.

* newsfile
2022-01-04 16:10:27 +00:00
Patrick Cloke
cbd82d0b2d
Convert all namedtuples to attrs. (#11665)
To improve type hints throughout the code.
2021-12-30 18:47:12 +00:00
Sean Quah
5305a5e881
Type hint the constructors of the data store classes (#11555) 2021-12-13 17:05:00 +00:00
Quentin Gliech
a15a893df8
Save the OIDC session ID (sid) with the device on login (#11482)
As a step towards allowing back-channel logout for OIDC.
2021-12-06 12:43:06 -05:00
Sean Quah
ffd858aa68
Add type hints to synapse/storage/databases/main/events_worker.py (#11411)
Also refactor the stream ID trackers/generators a bit and try to
document them better.
2021-11-26 18:41:31 +00:00
Patrick Cloke
5cace20bf1
Add missing type hints to synapse.app. (#11287) 2021-11-10 15:06:54 -05:00
Nick Barrett
af54167516
Enable passing typing stream writers as a list. (#11237)
This makes the typing stream writer config match the other stream writers
that only currently support a single worker.
2021-11-03 14:25:47 +00:00
Brendan Abolivier
c7a5e49664
Implement an on_new_event callback (#11126)
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
2021-10-26 15:17:36 +02:00
Sean Quah
2b82ec425f
Add type hints for most HomeServer parameters (#11095) 2021-10-22 18:15:41 +01:00
Sean Quah
6a67f3786a
Fix logging context warnings when losing replication connection (#10984)
Instead of triggering `__exit__` manually on the replication handler's
logging context, use it as a context manager so that there is an
`__enter__` call to balance the `__exit__`.
2021-10-15 13:10:58 +01:00
Sean Quah
6b18eb4430
Fix opentracing and Prometheus metrics for replication requests (#10996)
This commit fixes two bugs to do with decorators not instrumenting
`ReplicationEndpoint`'s `send_request` correctly. There are two
decorators on `send_request`: Prometheus' `Gauge.track_inprogress()`
and Synapse's `opentracing.trace`.

`Gauge.track_inprogress()` does not have any support for async
functions when used as a decorator. Since async functions behave like
regular functions that return coroutines, only the creation of the
coroutine was covered by the metric and none of the actual body of
`send_request`.

`Gauge.track_inprogress()` returns a regular, non-async function
wrapping `send_request`, which is the source of the next bug.
The `opentracing.trace` decorator would normally handle async functions
correctly, but since the wrapped `send_request` is a non-async function,
the decorator ends up suffering from the same issue as
`Gauge.track_inprogress()`: the opentracing span only measures the
creation of the coroutine and none of the actual function body.

Using `Gauge.track_inprogress()` as a context manager instead of a
decorator resolves both bugs.
2021-10-12 11:23:46 +01:00
David Robertson
51a5da74cc
Annotate synapse.storage.util (#10892)
Also mark `synapse.streams` as having has no untyped defs

Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
2021-10-08 14:25:16 +00:00
Patrick Cloke
f4b1a9a527
Require direct references to configuration variables. (#10985)
This removes the magic allowing accessing configurable
variables directly from the config object. It is now required
that a specific configuration class is used (e.g. `config.foo`
must be replaced with `config.server.foo`).
2021-10-06 10:47:41 -04:00
David Robertson
29364145b2
Pass str to twisted's IReactorTCP (#10895)
This follows a correction made in twisted/twisted#1664 and should fix our Twisted Trial CI job.

Until that change is in a twisted release, we'll have to ignore the type
of the `host` argument. I've raised #10899 to remind us to review the
issue in a few months' time.
2021-09-30 12:51:47 +01:00
Patrick Cloke
94b620a5ed
Use direct references for configuration variables (part 6). (#10916) 2021-09-29 06:44:15 -04:00
Patrick Cloke
bb7fdd821b
Use direct references for configuration variables (part 5). (#10897) 2021-09-24 07:25:21 -04:00
Patrick Cloke
01c88a09cd
Use direct references for some configuration variables (#10798)
Instead of proxying through the magic getter of the RootConfig
object. This should be more performant (and is more explicit).
2021-09-13 13:07:12 -04:00
Richard van der Hoff
1800aabfc2
Split FederationHandler in half (#10692)
The idea here is to take anything to do with incoming events and move it out to a separate handler, as a way of making FederationHandler smaller.
2021-08-26 21:41:44 +01:00
Andrew Morgan
84469bdac7
Remove the unused public_room_list_stream (#10565)
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
2021-08-17 14:02:50 +01:00
Richard van der Hoff
d9cb658c78
Fix up type hints for Twisted 21.7 (#10490)
Mostly this involves decorating a few Deferred declarations with extra type hints. We wrap the types in quotes to avoid runtime errors when running against older versions of Twisted that don't have generics on Deferred.
2021-07-28 12:04:11 +00:00