Commit Graph

825 Commits

Author SHA1 Message Date
Eric Eastwood
344a2f767c
Instrument FederationStateIdsServlet - /state_ids (#13499)
Instrument FederationStateIdsServlet - `/state_ids` so it's easier to follow what's going on in Jaeger when viewing a trace.
2022-08-15 19:41:23 +01:00
reivilibre
e9e6aacfbe
Faster Room Joins: prevent Synapse from answering federated join requests for a room which it has not fully joined yet. (#13416) 2022-08-04 16:27:04 +01:00
Eric Eastwood
92d21faf12
Instrument /messages for understandable traces in Jaeger (#13368)
In Jaeger:

 - Before: huge list of uncategorized database calls
 - After: nice and collapsible into units of work
2022-08-03 10:57:38 -05:00
Sean Quah
8d317f6da5
Fix error when out of servers to sync partial state with (#13432)
so that we raise the intended error instead.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-08-02 12:12:44 +01:00
reivilibre
e17e5c97e0
Faster Room Joins: don't leave a stuck room partial state flag if the join fails. (#13403) 2022-08-01 16:45:39 +00:00
David Teller
11f811470f
Uniformize spam-checker API, part 5: expand other spam-checker callbacks to return Tuple[Codes, dict] (#13044)
Signed-off-by: David Teller <davidt@element.io>
Co-authored-by: Brendan Abolivier <babolivier@matrix.org>
2022-07-11 16:52:10 +00:00
Sean Quah
1391a76cd2
Faster room joins: fix race in recalculation of current room state (#13151)
Bounce recalculation of current state to the correct event persister and
move recalculation of current state into the event persistence queue, to
avoid concurrent updates to a room's current state.

Also give recalculation of a room's current state a real stream
ordering.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-07-07 12:19:31 +00:00
Sean Quah
68db233f0c
Handle race between persisting an event and un-partial stating a room (#13100)
Whenever we want to persist an event, we first compute an event context,
which includes the state at the event and a flag indicating whether the
state is partial. After a lot of processing, we finally try to store the
event in the database, which can fail for partial state events when the
containing room has been un-partial stated in the meantime.

We detect the race as a foreign key constraint failure in the data store
layer and turn it into a special `PartialStateConflictError` exception,
which makes its way up to the method in which we computed the event
context.

To make things difficult, the exception needs to cross a replication
request: `/fed_send_events` for events coming over federation and
`/send_event` for events from clients. We transport the
`PartialStateConflictError` as a `409 Conflict` over replication and
turn `409`s back into `PartialStateConflictError`s on the worker making
the request.

All client events go through
`EventCreationHandler.handle_new_client_event`, which is called in
*a lot* of places. Instead of trying to update all the code which
creates client events, we turn the `PartialStateConflictError` into a
`429 Too Many Requests` in
`EventCreationHandler.handle_new_client_event` and hope that clients
take it as a hint to retry their request.

On the federation event side, there are 7 places which compute event
contexts. 4 of them use outlier event contexts:
`FederationEventHandler._auth_and_persist_outliers_inner`,
`FederationHandler.do_knock`, `FederationHandler.on_invite_request` and
`FederationHandler.do_remotely_reject_invite`. These events won't have
the partial state flag, so we do not need to do anything for then.

The remaining 3 paths which create events are
`FederationEventHandler.process_remote_join`,
`FederationEventHandler.on_send_membership_event` and
`FederationEventHandler._process_received_pdu`.

We can't experience the race in `process_remote_join`, unless we're
handling an additional join into a partial state room, which currently
blocks, so we make no attempt to handle it correctly.

`on_send_membership_event` is only called by
`FederationServer._on_send_membership_event`, so we catch the
`PartialStateConflictError` there and retry just once.

`_process_received_pdu` is called by `on_receive_pdu` for incoming
events and `_process_pulled_event` for backfill. The latter should never
try to persist partial state events, so we ignore it. We catch the
`PartialStateConflictError` in `on_receive_pdu` and retry just once.

Refering to the graph of code paths in
https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648
may make the above make more sense.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-07-05 16:12:52 +01:00
David Teller
a164a46038
Uniformize spam-checker API, part 4: port other spam-checker callbacks to return Union[Allow, Codes]. (#12857)
Co-authored-by: Brendan Abolivier <babolivier@matrix.org>
2022-06-13 18:16:16 +00:00
Richard van der Hoff
f68b5e5773 Merge branch 'rav/simplify_event_auth_interface' into develop 2022-06-13 11:34:59 +01:00
Richard van der Hoff
c1b28b8842 Remove redundant room_version param from check_auth_rules_from_context
It's now implied by the room_version property on the event.
2022-06-12 23:13:10 +01:00
Richard van der Hoff
68be42f6b6 Remove room_version param from validate_event_for_room_version
Instead, use the `room_version` property of the event we're validating.

The `room_version` was originally added as a parameter somewhere around #4482,
but really it's been redundant since #6875 added a `room_version` field to `EventBase`.
2022-06-12 23:13:09 +01:00
Richard van der Hoff
7c6b2204d1
Faster joins: add issue links to the TODOs (#13004)
... to help us keep track of these things
2022-06-09 10:13:03 +00:00
Erik Johnston
e3163e2e11
Reduce the amount of state we pull from the DB (#12811) 2022-06-06 09:24:12 +01:00
Erik Johnston
888a29f412
Wait for lazy join to complete when getting current state (#12872) 2022-06-01 16:02:53 +01:00
Sean Quah
641908f72f
Faster room joins: Resume state re-syncing after a Synapse restart (#12813)
Signed-off-by: Sean Quah <seanq@matrix.org>
2022-05-31 15:15:08 +00:00
Sean Quah
2fba1076c5
Faster room joins: Try other destinations when resyncing the state of a partial-state room (#12812)
Signed-off-by: Sean Quah <seanq@matrix.org>
2022-05-31 15:50:29 +01:00
Erik Johnston
1e453053cb
Rename storage classes (#12913) 2022-05-31 12:17:50 +00:00
Erik Johnston
4660d9fdcf
Fix up state_store naming (#12871) 2022-05-25 12:59:04 +01:00
Shay
71e8afe34d
Update EventContext get_current_event_ids and get_prev_event_ids to accept state filters and update calls where possible (#12791) 2022-05-20 09:54:12 +01:00
Erik Johnston
c72d26c1e1
Refactor EventContext (#12689)
Refactor how the `EventContext` class works, with the intention of reducing the amount of state we fetch from the DB during event processing.

The idea here is to get rid of the cached `current_state_ids` and `prev_state_ids` that live in the `EventContext`, and instead defer straight to the database (and its caching). 

One change that may have a noticeable effect is that we now no longer prefill the `get_current_state_ids` cache on a state change. However, that query is relatively light, since its just a case of reading a table from the DB (unlike fetching state at an event which is more heavyweight). For deployments with workers this cache isn't even used.


Part of #12684
2022-05-10 19:43:13 +00:00
andrew do
01e625513a
remove constantly lib use and switch to enums. (#12624) 2022-05-04 11:26:11 +00:00
Richard van der Hoff
17d99f758a
Optimise backfill calculation (#12522)
Try to avoid an OOM by checking fewer extremities.

Generally this is a big rewrite of _maybe_backfill, to try and fix some of the TODOs and other problems in it. It's best reviewed commit-by-commit.
2022-04-26 10:27:11 +01:00
Richard van der Hoff
320186319a
Resync state after partial-state join (#12394)
We work through all the events with partial state, updating the state at each
of them. Once it's done, we recalculate the state for the whole room, and then
mark the room as having complete state.
2022-04-12 13:23:43 +00:00
Sean Quah
800ba87cc8
Refactor and convert Linearizer to async (#12357)
Refactor and convert `Linearizer` to async. This makes a `Linearizer`
cancellation bug easier to fix.

Also refactor to use an async context manager, which eliminates an
unlikely footgun where code that doesn't immediately use the context
manager could forget to release the lock.

Signed-off-by: Sean Quah <seanq@element.io>
2022-04-05 15:43:52 +01:00
Richard van der Hoff
afa17f0eab
Return a 404 from /state for an outlier (#12087)
* Replace `get_state_for_pdu` with  `get_state_ids_for_pdu` and `get_events_as_list`.
* Return a 404 from `/state` and `/state_ids` for an outlier
2022-03-21 11:23:32 +00:00
Richard van der Hoff
dc8d825ef2
Skip attempt to get state at backwards-extremities (#12173)
We don't *have* the state at a backwards-extremity, so this is never going to
do anything useful.
2022-03-09 11:00:48 +00:00
Richard van der Hoff
e2e1d90a5e
Faster joins: persist to database (#12012)
When we get a partial_state response from send_join, store information in the
database about it:
 * store a record about the room as a whole having partial state, and stash the
   list of member servers too.
 * flag the join event itself as having partial state
 * also, for any new events whose prev-events are partial-stated, note that
   they will *also* be partial-stated.

We don't yet make any attempt to interpret this data, so API calls (and a bunch
of other things) are just going to get incorrect data.
2022-03-01 12:49:54 +00:00
Richard van der Hoff
e24ff8ebe3
Remove HomeServer.get_datastore() (#12031)
The presence of this method was confusing, and mostly present for backwards
compatibility. Let's get rid of it.

Part of #11733
2022-02-23 11:04:02 +00:00
Richard van der Hoff
81364db49b
Run _handle_queued_pdus as a background process (#12041)
... to ensure it gets a proper log context, mostly.
2022-02-22 13:33:22 +00:00
Richard van der Hoff
3070af4809
remote join processing: get create event from state, not auth_chain (#12039)
A follow-up to #12005, in which I apparently missed that there are a bunch of other places that assume the create event is in the auth chain.
2022-02-21 19:27:35 +00:00
Eric Eastwood
fef2e792be
Fix historical messages backfilling in random order on remote homeservers (MSC2716) (#11114)
Fix https://github.com/matrix-org/synapse/issues/11091
Fix https://github.com/matrix-org/synapse/issues/10764 (side-stepping the issue because we no longer have to deal with `fake_prev_event_id`)

 1. Made the `/backfill` response return messages in `(depth, stream_ordering)` order (previously only sorted by `depth`)
    - Technically, it shouldn't really matter how `/backfill` returns things but I'm just trying to make the `stream_ordering` a little more consistent from the origin to the remote homeservers in order to get the order of messages from `/messages` consistent ([sorted by `(topological_ordering, stream_ordering)`](https://github.com/matrix-org/synapse/blob/develop/docs/development/room-dag-concepts.md#depth-and-stream-ordering)).
    - Even now that we return backfilled messages in order, it still doesn't guarantee the same `stream_ordering` (and more importantly the [`/messages` order](https://github.com/matrix-org/synapse/blob/develop/docs/development/room-dag-concepts.md#depth-and-stream-ordering)) on the other server. For example, if a room has a bunch of history imported and someone visits a permalink to a historical message back in time, their homeserver will skip over the historical messages in between and insert the permalink as the next message in the `stream_order` and totally throw off the sort.
       - This will be even more the case when we add the [MSC3030 jump to date API endpoint](https://github.com/matrix-org/matrix-doc/pull/3030) so the static archives can navigate and jump to a certain date.
       - We're solving this in the future by switching to [online topological ordering](https://github.com/matrix-org/gomatrixserverlib/issues/187) and [chunking](https://github.com/matrix-org/synapse/issues/3785) which by its nature will apply retroactively to fix any inconsistencies introduced by people permalinking
 2. As we're navigating `prev_events` to return in `/backfill`, we order by `depth` first (newest -> oldest) and now also tie-break based on the `stream_ordering` (newest -> oldest). This is technically important because MSC2716 inserts a bunch of historical messages at the same `depth` so it's best to be prescriptive about which ones we should process first. In reality, I think the code already looped over the historical messages as expected because the database is already in order.
 3. Making the historical state chain and historical event chain float on their own by having no `prev_events` instead of a fake `prev_event` which caused backfill to get clogged with an unresolvable event. Fixes https://github.com/matrix-org/synapse/issues/11091 and https://github.com/matrix-org/synapse/issues/10764
 4. We no longer find connected insertion events by finding a potential `prev_event` connection to the current event we're iterating over. We now solely rely on marker events which when processed, add the insertion event as an extremity and the federating homeserver can ask about it when time calls.
    - Related discussion, https://github.com/matrix-org/synapse/pull/11114#discussion_r741514793


Before | After
--- | ---
![](https://user-images.githubusercontent.com/558581/139218681-b465c862-5c49-4702-a59e-466733b0cf45.png) | ![](https://user-images.githubusercontent.com/558581/146453159-a1609e0a-8324-439d-ae44-e4bce43ac6d1.png)



#### Why aren't we sorting topologically when receiving backfill events?

> The main reason we're going to opt to not sort topologically when receiving backfill events is because it's probably best to do whatever is easiest to make it just work. People will probably have opinions once they look at [MSC2716](https://github.com/matrix-org/matrix-doc/pull/2716) which could change whatever implementation anyway.
> 
> As mentioned, ideally we would do this but code necessary to make the fake edges but it gets confusing and gives an impression of “just whyyyy” (feels icky). This problem also dissolves with online topological ordering.
>
> -- https://github.com/matrix-org/synapse/pull/11114#discussion_r741517138

See https://github.com/matrix-org/synapse/pull/11114#discussion_r739610091 for the technical difficulties
2022-02-07 15:54:13 -06:00
Richard van der Hoff
251b5567ec
Remove log_function and its uses (#11761)
I've never found this terribly useful. I think it was added in the early days
of Synapse, without much thought as to what would actually be useful to log,
and has just been cargo-culted ever since.

Rather, it tends to clutter up debug logs with useless information.
2022-01-18 13:06:04 +00:00
Sean Quah
0147b3de20
Add missing type hints to synapse.logging.context (#11556) 2021-12-14 17:35:28 +00:00
Eric Eastwood
a6f1a3abec
Add MSC3030 experimental client and federation API endpoints to get the closest event to a given timestamp (#9445)
MSC3030: https://github.com/matrix-org/matrix-doc/pull/3030

Client API endpoint. This will also go and fetch from the federation API endpoint if unable to find an event locally or we found an extremity with possibly a closer event we don't know about.
```
GET /_matrix/client/unstable/org.matrix.msc3030/rooms/<roomID>/timestamp_to_event?ts=<timestamp>&dir=<direction>
{
    "event_id": ...
    "origin_server_ts": ...
}
```

Federation API endpoint:
```
GET /_matrix/federation/unstable/org.matrix.msc3030/timestamp_to_event/<roomID>?ts=<timestamp>&dir=<direction>
{
    "event_id": ...
    "origin_server_ts": ...
}
```

Co-authored-by: Erik Johnston <erik@matrix.org>
2021-12-02 01:02:20 -06:00
Richard van der Hoff
f3efa0036b
Move _persist_auth_tree into FederationEventHandler (#11115)
This is just a lift-and-shift, because it fits more naturally here. We do
rename it to `process_remote_join` at the same time though.
2021-10-19 10:24:09 +01:00
Richard van der Hoff
a5d2ea3d08
Check *all* auth events for room id and rejection (#11009)
This fixes a bug where we would accept an event whose `auth_events` include
rejected events, if the rejected event was shadowed by another `auth_event`
with same `(type, state_key)`.

The approach is to pass a list of auth events into
`check_auth_rules_for_event` instead of a dict, which of course means updating
the call sites.

This is an extension of #10956.
2021-10-18 18:28:30 +01:00
Eric Eastwood
daf498e099
Fix 500 error on /messages when we accumulate more than 5 backward extremities (#11027)
Found while working on the Gitter backfill script and noticed
it only happened after we sent 7 batches, https://gitlab.com/gitterHQ/webapp/-/merge_requests/2229#note_665906390

When there are more than 5 backward extremities for a given depth,
backfill will throw an error because we sliced the extremity list
to 5 but then try to iterate over the full list. This causes
us to look for state that we never fetched and we get a `KeyError`.

Before when calling `/messages` when there are more than 5 backward extremities:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/synapse/http/server.py", line 258, in _async_render_wrapper
    callback_return = await self._async_render(request)
  File "/usr/local/lib/python3.8/site-packages/synapse/http/server.py", line 446, in _async_render
    callback_return = await raw_callback_return
  File "/usr/local/lib/python3.8/site-packages/synapse/rest/client/room.py", line 580, in on_GET
    msgs = await self.pagination_handler.get_messages(
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/pagination.py", line 396, in get_messages
    await self.hs.get_federation_handler().maybe_backfill(
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 133, in maybe_backfill
    return await self._maybe_backfill_inner(room_id, current_depth, limit)
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 386, in _maybe_backfill_inner
    likely_extremeties_domains = get_domains_from_state(states[e_id])
KeyError: '$zpFflMEBtZdgcMQWTakaVItTLMjLFdKcRWUPHbbSZJl'
```
2021-10-14 18:53:45 -05:00
Patrick Cloke
eb9ddc8c2e
Remove the deprecated BaseHandler. (#11005)
The shared ratelimit function was replaced with a dedicated
RequestRatelimiter class (accessible from the HomeServer
object).

Other properties were copied to each sub-class that inherited
from BaseHandler.
2021-10-08 07:44:43 -04:00
Patrick Cloke
d1bf5f7c9d
Strip "join_authorised_via_users_server" from join events which do not need it. (#10933)
This fixes a "Event not signed by authorising server" error when
transition room member from join -> join, e.g. when updating a
display name or avatar URL for restricted rooms.
2021-09-30 11:13:59 -04:00
Richard van der Hoff
428174f902
Split event_auth.check into two parts (#10940)
Broadly, the existing `event_auth.check` function has two parts:
 * a validation section: checks that the event isn't too big, that it has the rught signatures, etc. 
   This bit is independent of the rest of the state in the room, and so need only be done once 
   for each event.
 * an auth section: ensures that the event is allowed, given the rest of the state in the room.
   This gets done multiple times, against various sets of room state, because it forms part of
   the state res algorithm.

Currently, this is implemented with `do_sig_check` and `do_size_check` parameters, but I think
that makes everything hard to follow. Instead, we split the function in two and call each part
separately where it is needed.
2021-09-29 18:59:15 +01:00
Patrick Cloke
94b620a5ed
Use direct references for configuration variables (part 6). (#10916) 2021-09-29 06:44:15 -04:00
Richard van der Hoff
5279b9161b
Use RoomVersion objects (#10934)
Various refactors to use `RoomVersion` objects instead of room version identifiers.
2021-09-29 10:57:10 +01:00
Patrick Cloke
bb7fdd821b
Use direct references for configuration variables (part 5). (#10897) 2021-09-24 07:25:21 -04:00
Andrew Morgan
aa2c027792
Remove unnecessary parentheses around tuples returned from methods (#10889) 2021-09-23 11:59:07 +01:00
Richard van der Hoff
26f2bfedbf
Factor out a separate EventContext.for_outlier (#10883)
Constructing an EventContext for an outlier is actually really simple, and
there's no sense in going via an `async` method in the `StateHandler`.

This also means that we can resolve a bunch of FIXMEs.
2021-09-22 17:58:57 +01:00
Richard van der Hoff
8f2a52766b
Ensure we mark sent knocks as outliers (#10873) 2021-09-22 15:20:18 +01:00
Patrick Cloke
b3590614da
Require type hints in the handlers module. (#10831)
Adds missing type hints to methods in the synapse.handlers
module and requires all methods to have type hints there.

This also removes the unused construct_auth_difference method
from the FederationHandler.
2021-09-20 08:56:23 -04:00
Patrick Cloke
01c88a09cd
Use direct references for some configuration variables (#10798)
Instead of proxying through the magic getter of the RootConfig
object. This should be more performant (and is more explicit).
2021-09-13 13:07:12 -04:00
Eric Eastwood
dc75fb7f05
Populate rooms.creator field for easy lookup (#10697)
Part of https://github.com/matrix-org/synapse/pull/10566

 - Fill in creator whenever we insert into the rooms table
 - Add background update to backfill any missing creator values
2021-09-01 16:27:58 +01:00
Richard van der Hoff
1800aabfc2
Split FederationHandler in half (#10692)
The idea here is to take anything to do with incoming events and move it out to a separate handler, as a way of making FederationHandler smaller.
2021-08-26 21:41:44 +01:00
Richard van der Hoff
96715d7633
Make backfill and get_missing_events use the same codepath (#10645)
Given that backfill and get_missing_events are basically the same thing, it's somewhat crazy that we have entirely separate code paths for them. This makes backfill use the existing get_missing_events code, and then clears up all the unused code.
2021-08-26 18:34:57 +01:00
Richard van der Hoff
e81d62009e
Split on_receive_pdu in half (#10640)
Here we split on_receive_pdu into two functions (on_receive_pdu and process_pulled_event), rather than having both cases in the same method. There's a tiny bit of overlap, but not that much.
2021-08-19 17:05:12 +00:00
Richard van der Hoff
50af1efe4b
Extract _resolve_state_at_missing_prevs (#10624)
This is a follow-up to #10615: it takes the code that constructs the state at a backwards extremity, and extracts it to a separate method.
2021-08-19 17:31:40 +01:00
Richard van der Hoff
964f29cb6f
Refactor on_receive_pdu code (#10615)
* drop room pdu linearizer sooner

No point holding onto it while we recheck the db

* move out `missing_prevs` calculation

we're going to need `missing_prevs` whatever we do, so we may as well calculate
it eagerly and just update it if it gets outdated.

* Add another `if missing_prevs` condition

this should be a no-op, since all the code inside the block already checks `if
missing_prevs`

* reorder if conditions

This shouldn't change the logic at all.

* Push down `min_depth` read

No point reading it from the database unless we're going to use it.

* Collect the sent_to_us_directly code together

Move the remaining `sent_to_us_directly` code inside the `if
sent_to_us_directly` block.

* Properly separate the `not sent_to_us_directly` branch

Since the only way this second block is now reachable is if we
*didn't* go into the `sent_to_us_directly` branch, we can replace it with a
simple `else`.

* changelog
2021-08-18 12:36:22 +01:00
Richard van der Hoff
272b89d547
Stop setting the outlier flag for things that aren't (#10614)
Marking things as outliers to inhibit pushes is a sledgehammer to crack a
nut. Move the test further down the stack so that we just inhibit the thing we
want.
2021-08-17 13:13:42 +01:00
Richard van der Hoff
2d9ca4ca77
Clean up some logging in the federation event handler (#10591)
* Include outlier status in `str(event)`

In places where we log event objects, knowing whether or not you're dealing
with an outlier is super useful.

* Remove duplicated logging in get_missing_events

When we process events received from get_missing_events, we log them twice
(once in `_get_missing_events_for_pdu`, and once in `on_receive_pdu`). Reduce
the duplication by removing the logging in `on_receive_pdu`, and ensuring the
call sites do sensible logging.

* log in `on_receive_pdu` when we already have the event

* Log which prev_events we are missing

* changelog
2021-08-16 13:19:02 +01:00
Richard van der Hoff
1bebc0b78c
Clean up federation event auth code (#10539)
* drop old-room hack

pretty sure we don't need this any more.

* Remove incorrect comment about modifying `context`

It doesn't look like the supplied context is ever modified.

* Stop `_auth_and_persist_event` modifying its parameters

This is only called in three places. Two of them don't pass `auth_events`, and
the third doesn't use the dict after passing it in, so this should be non-functional.

* Stop `_check_event_auth` modifying its parameters

`_check_event_auth` is only called in three places. `on_send_membership_event`
doesn't pass an `auth_events`, and `prep` and `_auth_and_persist_event` do not
use the map after passing it in.

* Stop `_update_auth_events_and_context_for_auth` modifying its parameters

Return the updated auth event dict, rather than modifying the parameter.

This is only called from `_check_event_auth`.

* Improve documentation on `_auth_and_persist_event`

Rename `auth_events` parameter to better reflect what it contains.

* Improve documentation on `_NewEventInfo`

* Improve documentation on `_check_event_auth`

rename `auth_events` parameter to better describe what it contains

* changelog
2021-08-06 13:54:23 +01:00
Eric Eastwood
684d19a11c
Add support for MSC2716 marker events (#10498)
* Make historical messages available to federated servers

Part of MSC2716: https://github.com/matrix-org/matrix-doc/pull/2716

Follow-up to https://github.com/matrix-org/synapse/pull/9247

* Debug message not available on federation

* Add base starting insertion point when no chunk ID is provided

* Fix messages from multiple senders in historical chunk

Follow-up to https://github.com/matrix-org/synapse/pull/9247

Part of MSC2716: https://github.com/matrix-org/matrix-doc/pull/2716

---

Previously, Synapse would throw a 403,
`Cannot force another user to join.`,
because we were trying to use `?user_id` from a single virtual user
which did not match with messages from other users in the chunk.

* Remove debug lines

* Messing with selecting insertion event extremeties

* Move db schema change to new version

* Add more better comments

* Make a fake requester with just what we need

See https://github.com/matrix-org/synapse/pull/10276#discussion_r660999080

* Store insertion events in table

* Make base insertion event float off on its own

See https://github.com/matrix-org/synapse/pull/10250#issuecomment-875711889

Conflicts:
	synapse/rest/client/v1/room.py

* Validate that the app service can actually control the given user

See https://github.com/matrix-org/synapse/pull/10276#issuecomment-876316455

Conflicts:
	synapse/rest/client/v1/room.py

* Add some better comments on what we're trying to check for

* Continue debugging

* Share validation logic

* Add inserted historical messages to /backfill response

* Remove debug sql queries

* Some marker event implemntation trials

* Clean up PR

* Rename insertion_event_id to just event_id

* Add some better sql comments

* More accurate description

* Add changelog

* Make it clear what MSC the change is part of

* Add more detail on which insertion event came through

* Address review and improve sql queries

* Only use event_id as unique constraint

* Fix test case where insertion event is already in the normal DAG

* Remove debug changes

* Add support for MSC2716 marker events

* Process markers when we receive it over federation

* WIP: make hs2 backfill historical messages after marker event

* hs2 to better ask for insertion event extremity

But running into the `sqlite3.IntegrityError: NOT NULL constraint failed: event_to_state_groups.state_group`
error

* Add insertion_event_extremities table

* Switch to chunk events so we can auth via power_levels

Previously, we were using `content.chunk_id` to connect one
chunk to another. But these events can be from any `sender`
and we can't tell who should be able to send historical events.
We know we only want the application service to do it but these
events have the sender of a real historical message, not the
application service user ID as the sender. Other federated homeservers
also have no indicator which senders are an application service on
the originating homeserver.

So we want to auth all of the MSC2716 events via power_levels
and have them be sent by the application service with proper
PL levels in the room.

* Switch to chunk events for federation

* Add unstable room version to support new historical PL

* Messy: Fix undefined state_group for federated historical events

```
2021-07-13 02:27:57,810 - synapse.handlers.federation - 1248 - ERROR - GET-4 - Failed to backfill from hs1 because NOT NULL constraint failed: event_to_state_groups.state_group
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 1216, in try_backfill
    await self.backfill(
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 1035, in backfill
    await self._auth_and_persist_event(dest, event, context, backfilled=True)
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 2222, in _auth_and_persist_event
    await self._run_push_actions_and_persist_event(event, context, backfilled)
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 2244, in _run_push_actions_and_persist_event
    await self.persist_events_and_notify(
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/federation.py", line 3290, in persist_events_and_notify
    events, max_stream_token = await self.storage.persistence.persist_events(
  File "/usr/local/lib/python3.8/site-packages/synapse/logging/opentracing.py", line 774, in _trace_inner
    return await func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/persist_events.py", line 320, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/persist_events.py", line 237, in handle_queue_loop
    ret = await self._per_item_callback(
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/persist_events.py", line 577, in _persist_event_batch
    await self.persist_events_store._persist_events_and_state_updates(
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/main/events.py", line 176, in _persist_events_and_state_updates
    await self.db_pool.runInteraction(
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 681, in runInteraction
    result = await self.runWithConnection(
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 770, in runWithConnection
    return await make_deferred_yieldable(
  File "/usr/local/lib/python3.8/site-packages/twisted/python/threadpool.py", line 238, in inContext
    result = inContext.theWork()  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.8/site-packages/twisted/python/threadpool.py", line 254, in <lambda>
    inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.8/site-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/twisted/python/context.py", line 83, in callWithContext
    return func(*args, **kw)
  File "/usr/local/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 293, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/local/lib/python3.8/site-packages/twisted/python/deprecate.py", line 298, in deprecatedFunction
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/twisted/python/compat.py", line 403, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/local/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 284, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 765, in inner_func
    return func(db_conn, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 549, in new_transaction
    r = func(cursor, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/synapse/logging/utils.py", line 69, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/main/events.py", line 385, in _persist_events_txn
    self._store_event_state_mappings_txn(txn, events_and_contexts)
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/main/events.py", line 2065, in _store_event_state_mappings_txn
    self.db_pool.simple_insert_many_txn(
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 923, in simple_insert_many_txn
    txn.execute_batch(sql, vals)
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 280, in execute_batch
    self.executemany(sql, args)
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 300, in executemany
    self._do_execute(self.txn.executemany, sql, *args)
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/database.py", line 330, in _do_execute
    return func(sql, *args)
sqlite3.IntegrityError: NOT NULL constraint failed: event_to_state_groups.state_group
```

* Revert "Messy: Fix undefined state_group for federated historical events"

This reverts commit 187ab28611546321e02770944c86f30ee2bc742a.

* Fix federated events being rejected for no state_groups

Add fix from https://github.com/matrix-org/synapse/pull/10439
until it merges.

* Adapting to experimental room version

* Some log cleanup

* Add better comments around extremity fetching code and why

* Rename to be more accurate to what the function returns

* Add changelog

* Ignore rejected events

* Use simplified upsert

* Add Erik's explanation of extra event checks

See https://github.com/matrix-org/synapse/pull/10498#discussion_r680880332

* Clarify that the depth is not directly correlated to the backwards extremity that we return

See https://github.com/matrix-org/synapse/pull/10498#discussion_r681725404

* lock only matters for sqlite

See https://github.com/matrix-org/synapse/pull/10498#discussion_r681728061

* Move new SQL changes to its own delta file

* Clean up upsert docstring

* Bump database schema version (62)
2021-08-04 12:07:57 -05:00
Eric Eastwood
d0b294ad97
Make historical events discoverable from backfill for servers without any scrollback history (MSC2716) (#10245)
* Make historical messages available to federated servers

Part of MSC2716: https://github.com/matrix-org/matrix-doc/pull/2716

Follow-up to https://github.com/matrix-org/synapse/pull/9247

* Debug message not available on federation

* Add base starting insertion point when no chunk ID is provided

* Fix messages from multiple senders in historical chunk

Follow-up to https://github.com/matrix-org/synapse/pull/9247

Part of MSC2716: https://github.com/matrix-org/matrix-doc/pull/2716

---

Previously, Synapse would throw a 403,
`Cannot force another user to join.`,
because we were trying to use `?user_id` from a single virtual user
which did not match with messages from other users in the chunk.

* Remove debug lines

* Messing with selecting insertion event extremeties

* Move db schema change to new version

* Add more better comments

* Make a fake requester with just what we need

See https://github.com/matrix-org/synapse/pull/10276#discussion_r660999080

* Store insertion events in table

* Make base insertion event float off on its own

See https://github.com/matrix-org/synapse/pull/10250#issuecomment-875711889

Conflicts:
	synapse/rest/client/v1/room.py

* Validate that the app service can actually control the given user

See https://github.com/matrix-org/synapse/pull/10276#issuecomment-876316455

Conflicts:
	synapse/rest/client/v1/room.py

* Add some better comments on what we're trying to check for

* Continue debugging

* Share validation logic

* Add inserted historical messages to /backfill response

* Remove debug sql queries

* Some marker event implemntation trials

* Clean up PR

* Rename insertion_event_id to just event_id

* Add some better sql comments

* More accurate description

* Add changelog

* Make it clear what MSC the change is part of

* Add more detail on which insertion event came through

* Address review and improve sql queries

* Only use event_id as unique constraint

* Fix test case where insertion event is already in the normal DAG

* Remove debug changes

* Switch to chunk events so we can auth via power_levels

Previously, we were using `content.chunk_id` to connect one
chunk to another. But these events can be from any `sender`
and we can't tell who should be able to send historical events.
We know we only want the application service to do it but these
events have the sender of a real historical message, not the
application service user ID as the sender. Other federated homeservers
also have no indicator which senders are an application service on
the originating homeserver.

So we want to auth all of the MSC2716 events via power_levels
and have them be sent by the application service with proper
PL levels in the room.

* Switch to chunk events for federation

* Add unstable room version to support new historical PL

* Fix federated events being rejected for no state_groups

Add fix from https://github.com/matrix-org/synapse/pull/10439
until it merges.

* Only connect base insertion event to prev_event_ids

Per discussion with @erikjohnston,
https://matrix.to/#/!UytJQHLQYfvYWsGrGY:jki.re/$12bTUiObDFdHLAYtT7E-BvYRp3k_xv8w0dUQHibasJk?via=jki.re&via=matrix.org

* Make it possible to get the room_version with txn

* Allow but ignore historical events in unsupported room version

See https://github.com/matrix-org/synapse/pull/10245#discussion_r675592489

We can't reject historical events on unsupported room versions because homeservers without knowledge of MSC2716 or the new room version don't reject historical events either.

Since we can't rely on the auth check here to stop historical events on unsupported room versions, I've added some additional checks in the processing/persisting code (`synapse/storage/databases/main/events.py` ->  `_handle_insertion_event` and `_handle_chunk_event`). I've had to do some refactoring so there is method to fetch the room version by `txn`.

* Move to unique index syntax

See https://github.com/matrix-org/synapse/pull/10245#discussion_r675638509

* High-level document how the insertion->chunk lookup works

* Remove create_event fallback for room_versions

See https://github.com/matrix-org/synapse/pull/10245/files#r677641879

* Use updated method name
2021-07-28 10:46:37 -05:00
Patrick Cloke
228decfce1
Update the MSC3083 support to verify if joins are from an authorized server. (#10254) 2021-07-26 12:17:00 -04:00
Brendan Abolivier
a743bf4694
Port the ThirdPartyEventRules module interface to the new generic interface (#10386)
Port the third-party event rules interface to the generic module interface introduced in v1.37.0
2021-07-20 12:39:46 +02:00
Jonathan de Jong
95e47b2e78
[pyupgrade] synapse/ (#10348)
This PR is tantamount to running 
```
pyupgrade --py36-plus --keep-percent-format `find synapse/ -type f -name "*.py"`
```

Part of #9744
2021-07-19 15:28:05 +01:00
Jonathan de Jong
98aec1cc9d
Use inline type hints in handlers/ and rest/. (#10382) 2021-07-16 18:22:36 +01:00
Erik Johnston
7695ca0618
Fix a number of logged errors caused by remote servers being down. (#10400) 2021-07-15 10:35:46 +01:00
Patrick Cloke
8d609435c0
Move methods involving event authentication to EventAuthHandler. (#10268)
Instead of mixing them with user authentication methods.
2021-07-01 14:25:37 -04:00
Richard van der Hoff
8165ba48b1
Return errors from send_join etc if the event is rejected (#10243)
Rather than persisting rejected events via `send_join` and friends, raise a 403 if someone tries to pull a fast one.
2021-06-24 16:00:08 +01:00
Richard van der Hoff
6e8fb42be7
Improve validation for send_{join,leave,knock} (#10225)
The idea here is to stop people sending things that aren't joins/leaves/knocks through these endpoints: previously you could send anything you liked through them. I wasn't able to find any security holes from doing so, but it doesn't sound like a good thing.
2021-06-24 15:30:49 +01:00
Richard van der Hoff
8beead66ae
Send out invite rejections and knocks over federation (#10223)
ensure that events sent via `send_leave` and `send_knock` are sent on to
the rest of the federation.
2021-06-23 12:54:50 +01:00
Andrew Morgan
182147195b
Check third party rules before persisting knocks over federation (#10212)
An accidental mis-ordering of operations during #6739 technically allowed an incoming knock event over federation in before checking it against any configured Third Party Access Rules modules.

This PR corrects that by performing the TPAR check *before* persisting the event.
2021-06-21 11:57:09 +01:00
Marcus
8070b893db
update black to 21.6b0 (#10197)
Reformat all files with the new version.

Signed-off-by: Marcus Hoffmann <bubu@bubu1.eu>
2021-06-17 15:20:06 +01:00
Eric Eastwood
a911dd768b
Add fields to better debug where events are being soft_failed (#10168)
Follow-up to https://github.com/matrix-org/synapse/pull/10156#discussion_r650292223
2021-06-17 14:59:45 +01:00
Patrick Cloke
9e5ab6dd58
Remove the experimental flag for knocking and use stable prefixes / endpoints. (#10167)
* Room version 7 for knocking.
* Stable prefixes and endpoints (both client and federation) for knocking.
* Removes the experimental configuration flag.
2021-06-15 07:45:14 -04:00
Eric Eastwood
b31daac01c
Add metrics to track how often events are soft_failed (#10156)
Spawned from missing messages we were seeing on `matrix.org` from a
federated Gtiter bridged room, https://gitlab.com/gitterHQ/webapp/-/issues/2770.
The underlying issue in Synapse is tracked by https://github.com/matrix-org/synapse/issues/10066
where the message and join event race and the message is `soft_failed` before the
`join` event reaches the remote federated server.

Less soft_failed events = better and usually this should only trigger for events
where people are doing bad things and trying to fuzz and fake everything.
2021-06-11 10:12:35 +01:00
Sorunome
d936371b69
Implement knock feature (#6739)
This PR aims to implement the knock feature as proposed in https://github.com/matrix-org/matrix-doc/pull/2403

Signed-off-by: Sorunome mail@sorunome.de
Signed-off-by: Andrew Morgan andrewm@element.io
2021-06-09 19:39:51 +01:00
Erik Johnston
a0101fc021
Handle /backfill returning no events (#10133)
Fixes #10123
2021-06-08 10:37:01 +01:00
Erik Johnston
a0cd8ae8cb
Don't try and backfill the same room in parallel. (#10116)
If backfilling is slow then the client may time out and retry, causing
Synapse to start a new `/backfill` before the existing backfill has
finished, duplicating work.
2021-06-04 10:47:58 +01:00
Erik Johnston
c96ab31dff
Limit number of events in a replication request (#10118)
Fixes #9956.
2021-06-04 10:35:47 +01:00
Richard van der Hoff
b4b2fd2ece
add a cache to have_seen_event (#9953)
Empirically, this helped my server considerably when handling gaps in Matrix HQ. The problem was that we would repeatedly call have_seen_events for the same set of (50K or so) auth_events, each of which would take many minutes to complete, even though it's only an index scan.
2021-06-01 12:04:47 +01:00
Brendan Abolivier
f828a70be3
Limit the number of events sent over replication when persisting events. (#10082) 2021-05-27 17:10:58 +01:00
Patrick Cloke
ac6bfcd52f
Refactor checking restricted join rules (#10007)
To be more consistent with similar code. The check now automatically
raises an AuthError instead of passing back a boolean. It also absorbs
some shared logic between callers.
2021-05-18 12:17:04 -04:00
Erik Johnston
2b2985b5cf
Improve performance of backfilling in large rooms. (#9935)
We were pulling the full auth chain for the room out of the DB each time
we backfilled, which can be *huge* for large rooms and is totally
unnecessary.
2021-05-10 13:29:02 +01:00
Erik Johnston
de8f0a03a3
Don't set the external cache if its been done recently (#9905) 2021-05-05 16:53:22 +01:00
Patrick Cloke
d924827da1
Check for space membership during a remote join of a restricted room (#9814)
When receiving a /send_join request for a room with join rules set to 'restricted',
check if the user is a member of the spaces defined in the 'allow' key of the join rules.

This only applies to an experimental room version, as defined in MSC3083.
2021-04-23 07:05:51 -04:00
Jonathan de Jong
495b214f4f
Fix (final) Bugbear violations (#9838) 2021-04-20 11:50:49 +01:00
Patrick Cloke
936e69825a
Separate creating an event context from persisting it in the federation handler (#9800)
This refactoring allows adding logic that uses the event context
before persisting it.
2021-04-14 12:35:28 -04:00
Patrick Cloke
e8816c6ace Revert "Check for space membership during a remote join of a restricted room. (#9763)"
This reverts commit cc51aaaa7a.

The PR was prematurely merged and not yet approved.
2021-04-14 12:33:37 -04:00
Patrick Cloke
cc51aaaa7a
Check for space membership during a remote join of a restricted room. (#9763)
When receiving a /send_join request for a room with join rules set to 'restricted',
check if the user is a member of the spaces defined in the 'allow' key of the join
rules.
    
This only applies to an experimental room version, as defined in MSC3083.
2021-04-14 12:32:20 -04:00
Jonathan de Jong
4b965c862d
Remove redundant "coding: utf-8" lines (#9786)
Part of #9744

Removes all redundant `# -*- coding: utf-8 -*-` lines from files, as python 3 automatically reads source code as utf-8 now.

`Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>`
2021-04-14 15:34:27 +01:00
Jonathan de Jong
2ca4e349e9
Bugbear: Add Mutable Parameter fixes (#9682)
Part of #9366

Adds in fixes for B006 and B008, both relating to mutable parameter lint errors.

Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>
2021-04-08 22:38:54 +01:00
Patrick Cloke
d959d28730
Add type hints to the federation handler and server. (#9743) 2021-04-06 07:21:57 -04:00
Erik Johnston
963f4309fe
Make RateLimiter class check for ratelimit overrides (#9711)
This should fix a class of bug where we forget to check if e.g. the appservice shouldn't be ratelimited.

We also check the `ratelimit_override` table to check if the user has ratelimiting disabled. That table is really only meant to override the event sender ratelimiting, so we don't use any values from it (as they might not make sense for different rate limits), but we do infer that if ratelimiting is disabled for the user we should disabled all ratelimits.

Fixes #9663
2021-03-30 12:06:09 +01:00
Richard van der Hoff
af2248f8bf
Optimise missing prev_event handling (#9601)
Background: When we receive incoming federation traffic, and notice that we are missing prev_events from 
the incoming traffic, first we do a `/get_missing_events` request, and then if we still have missing prev_events,
we set up new backwards-extremities. To do that, we need to make a `/state_ids` request to ask the remote
server for the state at those prev_events, and then we may need to then ask the remote server for any events
in that state which we don't already have, as well as the auth events for those missing state events, so that we
can auth them.

This PR attempts to optimise the processing of that state request. The `state_ids` API returns a list of the state
events, as well as a list of all the auth events for *all* of those state events. The optimisation comes from the
observation that we are currently loading all of those auth events into memory at the start of the operation, but
we almost certainly aren't going to need *all* of the auth events. Rather, we can check that we have them, and
leave the actual load into memory for later. (Ideally the federation API would tell us which auth events we're
actually going to need, but it doesn't.)

The effect of this is to reduce the number of events that I need to load for an event in Matrix HQ from about
60000 to about 22000, which means it can stay in my in-memory cache, whereas previously the sheer number
of events meant that all 60K events had to be loaded from db for each request, due to the amount of cache
churn. (NB I've already tripled the size of the cache from its default of 10K).

Unfortunately I've ended up basically C&Ping `_get_state_for_room` and `_get_events_from_store_or_dest` into
a new method, because `_get_state_for_room` is also called during backfill, which expects the auth events to be
returned, so the same tricks don't work. That said, I don't really know why that codepath is completely different
(ultimately we're doing the same thing in setting up a new backwards extremity) so I've left a TODO suggesting
that we clean it up.
2021-03-15 13:51:02 +00:00
Richard van der Hoff
2b328d7e02
Improve logging when processing incoming transactions (#9596)
Put the room id in the logcontext, to make it easier to understand what's going on.
2021-03-12 15:08:03 +00:00
Patrick Cloke
2a99cc6524
Use the chain cover index in get_auth_chain_ids. (#9576)
This uses a simplified version of get_chain_cover_difference to calculate
auth chain of events.
2021-03-10 09:57:59 -05:00
Eric Eastwood
0a00b7ff14
Update black, and run auto formatting over the codebase (#9381)
- Update black version to the latest
 - Run black auto formatting over the codebase
    - Run autoformatting according to [`docs/code_style.md
`](80d6dc9783/docs/code_style.md)
 - Update `code_style.md` docs around installing black to use the correct version
2021-02-16 22:32:34 +00:00
Andrew Morgan
594f2853e0
Remove dead handled_events set in invite_join (#9394)
This PR removes a set that was created and [initially used](1d2a0040cf (diff-0bc92da3d703202f5b9be2d3f845e375f5b1a6bc6ba61705a8af9be1121f5e42R435-R436)), but is no longer today.

May help cut down a bit on the time it takes to accept invites.
2021-02-12 22:15:50 +00:00
Erik Johnston
ff55300b91
Honour ratelimit flag for application services for invite ratelimiting (#9302) 2021-02-03 10:17:37 +00:00
Erik Johnston
f2c1560eca
Ratelimit invites by room and target user (#9258) 2021-01-29 16:38:29 +00:00
Erik Johnston
dd8da8c5f6
Precompute joined hosts and store in Redis (#9198) 2021-01-26 13:57:31 +00:00