Fixes https://github.com/element-hq/synapse/issues/17274, hopefully.
Basically, old versions of Synapse could advance streams without
persisting anything in the DB (fixed in #17229). On restart those
updates would get lost, and so the position of the stream would revert
to an older position. If this happened across an upgrade to a later
Synapse version which included #17215, then sync could get blocked
indefinitely (until the stream advanced to the position in the token).
We fix this by bounding the stream positions we'll wait for to the
maximum position of the underlying stream ID generator.
Fix bug where we don't get new to-device from remote if they resent a
message we've already persisted and have recorded in the DB twice.
`device_federation_inbox` table doesn't have a unique index, and so we
can race and store an entry in there twice. If we do so then
`simple_select_one_txn` will throw an error due to the query returning
more than one row. We should add an unique index, but it doesn't really
matter so lets just handle the case of multiple rows correctly for now.
Fixes up #17333, where we failed to actually send less data (the
`DISTINCT` didn't work due to `stream_id` being different).
We fix this by making it so that every device list outbound poke for a
given user ID has the same stream ID. We can't change the query to only
return e.g. max stream ID as the receivers look up the destinations to
send to by doing `SELECT WHERE stream_id = ?`
This is #17291 (which got reverted), with some added fixups, and change
so that tests actually pick up the error.
The problem was that we were not calculating any new chain IDs due to a
missing `not` in a condition.
Reduce the replication traffic of device lists, by not sending every
destination that needs to be sent the device list update over
replication. Instead a "hosts to send to have been calculated"
notification over replication, and then federation senders read the
destinations from the DB.
For non federation senders this should heavily reduce the impact of a
user in many large rooms changing a device.
The parse_integer function was previously made to reject negative values by
default in https://github.com/element-hq/synapse/pull/16920, but the
documentation stated otherwise. This fixes the documentation and also:
- Removes explicit negative=False parameters from call sites.
- Brings the negative default of parse_integer_from_args in alignment with
parse_integer.
This reverts commit bdf82efea5 (#17291)
This seems to have stopped persisting auth chains for new events, and so
is causing state res to fall back to the slow methods
We calculate the auth chain links outside of the main persist event
transaction to ensure that we do not block other event sending during
the calculation.
Sort is no longer configurable and we always sort rooms by the `stream_ordering` of the last event in the room or the point where the user can see up to in cases of leave/ban/invite/knock.
Add `event.internal_metadata.instance_name` (the worker instance that persisted the event) to go alongside the existing `event.internal_metadata.stream_ordering`.
`instance_name` is useful to properly compare and query for events with a token since you need to compare both the `stream_ordering` and `instance_name` against the vector clock/`instance_map` in the `RoomStreamToken`.
This is pre-requisite work and may be used in https://github.com/element-hq/synapse/pull/17293
Adding `event.internal_metadata.instance_name` was first mentioned in the initial Sliding Sync PR while pairing with @erikjohnston, see 09609cb0db (diff-5cd773fb307aa754bd3948871ba118b1ef0303f4d72d42a2d21e38242bf4e096R405-R410)
PR where this was introduced: https://github.com/matrix-org/synapse/pull/14817
### What does this affect?
`get_last_event_in_room_before_stream_ordering(...)` is used in Sync v2 in a lot of different state calculations.
`get_last_event_in_room_before_stream_ordering(...)` is also used in `/rooms/{roomId}/members`
https://github.com/matrix-org/matrix-spec-proposals/pull/4151
This is intended to be enabled by default for immediate use. When FCP is
complete, the unstable endpoint will be dropped and stable endpoint
supported instead - no backwards compatibility is expected for the
unstable endpoint.
Spawning from https://github.com/element-hq/synapse/pull/17187#discussion_r1619492779 around wanting to put `SlidingSyncBody` (parse the request in the rest layer), `SlidingSyncConfig` (from the rest layer, pass to the handler), `SlidingSyncResponse` (pass the response from the handler back to the rest layer to respond) somewhere that doesn't contaminate the imports and cause circular import issues.
- Moved Pydantic parsing models to `synapse/types/rest`
- Moved handler types to `synapse/types/handlers`
If the stream ID in the unconverted table is ahead of the device lists
ID gen, then it can break all /sync requests that had an ID from ahead
of the table.
The fix is to make sure we add the unconverted table to the list of
tables we check at start up.
Broke in https://github.com/element-hq/synapse/pull/17229
Fixes: #17013
Add logging for whether room keys are replaced
This is motivated by the Crypto team who need to diagnose crypto issues.
The existing opentracing logging is not enough because it is not enabled
for all users.
Based on [MSC3575](https://github.com/matrix-org/matrix-spec-proposals/pull/3575): Sliding Sync
This iteration only focuses on returning the list of room IDs in the sliding window API (without sorting/filtering).
Rooms appear in the Sliding sync response based on:
- `invite`, `join`, `knock`, `ban` membership events
- Kicks (`leave` membership events where `sender` is different from the `user_id`/`state_key`)
- `newly_left` (rooms that were left during the given token range, > `from_token` and <= `to_token`)
- In order for bans/kicks to not show up, you need to `/forget` those rooms. This doesn't modify the event itself though and only adds the `forgotten` flag to `room_memberships` in Synapse. There isn't a way to tell when a room was forgotten at the moment so we can't factor it into the from/to range.
### Example request
`POST http://localhost:8008/_matrix/client/unstable/org.matrix.msc3575/sync`
```json
{
"lists": {
"foo-list": {
"ranges": [ [0, 99] ],
"sort": [ "by_notification_level", "by_recency", "by_name" ],
"required_state": [
["m.room.join_rules", ""],
["m.room.history_visibility", ""],
["m.space.child", "*"]
],
"timeline_limit": 100
}
}
}
```
Response:
```json
{
"next_pos": "s58_224_0_13_10_1_1_16_0_1",
"lists": {
"foo-list": {
"count": 1,
"ops": [
{
"op": "SYNC",
"range": [0, 99],
"room_ids": [
"!MmgikIyFzsuvtnbvVG:my.synapse.linux.server"
]
}
]
}
},
"rooms": {},
"extensions": {}
}
```
Use fully-qualified `PersistedEventPosition` (`instance_name` and `stream_ordering`) when returning `RoomsForUser` to facilitate proper comparisons and `RoomStreamToken` generation.
Spawning from https://github.com/element-hq/synapse/pull/17187 where we want to utilize this change
Otherwise things will get confused.
An alternative would be to make sure that for lagging stream we don't
return anything (and make sure the returned next_batch token doesn't go
backwards). But that is a faff.
We try and deduplicate in two places: 1) really early on, and 2) just
before we persist the event. The first case was broken due to it
occuring before the profile information was added, and so it thought the
event contents were different.
The second case did catch it and handle it correctly, however doing so
creates a redundant state group leading to bloat.
Fixes#3791
Fixes up #17239
We need to keep the spam check within the `try/except` block. Also makes
it so that we don't enter the top span twice.
Also also ensures that we get the right thumbnail length.
There is a problem with `StreamIdGenerator` where it can go backwards
over restarts when a stream ID is requested but then not inserted into
the DB. This is problematic if we want to land #17215, and is generally
a potential cause for all sorts of nastiness.
Instead of trying to fix `StreamIdGenerator`, we may as well move to
`MultiWriterIdGenerator` that does not suffer from this problem (the
latest positions are stored in `stream_positions` table). This involves
adding SQLite support to the class.
This only changes id generators that were already using
`MultiWriterIdGenerator` under postgres, a separate PR will move the
rest of the uses of `StreamIdGenerator` over.
Currently sending a to-device message to a user ID with a dodgy
destination is accepted, but then ends up spamming the logs when we try
and send to the destination.
An alternative would be to reject the request, but I'm slightly nervous
that could break things.
When a module rejects a piece of media we end up trying to close the
same logging context twice.
Instead of fixing the existing code we refactor to use an async context
manager, which is easier to write correctly.
The log format is the same as the request log format, except:
- fields that are specific to HTTP requests have been removed
- the task's params are included at the end of the log line.
These log lines are emitted:
- when the task function finishes — both completion and failure (and I
suppose it is possible for a task to become schedulable again?)
- every 5 minutes whilst it is running
Closes#17217.
---------
Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
This PR ports the logic from the
[synapse_auto_accept_invite](https://github.com/matrix-org/synapse-auto-accept-invite)
module into synapse.
I went with the naive approach of injecting the "module" next to where
third party modules are currently loaded. If there is a better/preferred
way to handle this, I'm all ears. It wasn't obvious to me if there was a
better location to add this logic that would cleanly apply to all
incoming invite events.
Relies on https://github.com/element-hq/synapse/pull/17166 to fix linter
errors.
Re-introduces #17191, and includes #17197 and #17214
The basic idea is to stop calling `get_rooms_for_user` everywhere, and
instead use the table `device_lists_changes_in_room`.
Commits reviewable one-by-one.
Removed `request_key` from the `SyncConfig` (moved outside as its own function parameter) so it doesn't have to flow into `_generate_sync_entry_for_xxx` methods. This way we can separate the concerns of caching from generating the response and reuse the `_generate_sync_entry_for_xxx` functions as we see fit. Plus caching doesn't really have anything to do with the config of sync.
Split from https://github.com/element-hq/synapse/pull/17167
Spawning from https://github.com/element-hq/synapse/pull/17167#discussion_r1601497279
It's almost always more efficient to query the rooms that have device
list changes, rather than looking at the list of all users whose devices
have changed and then look for shared rooms.
This is to allow clients to query the configured federation whitelist.
Disabled by default.
---------
Co-authored-by: Devon Hudson <devonhudson@librem.one>
Co-authored-by: devonh <devon.dmytro@gmail.com>
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
Linter errors are showing up in #17147 that are unrelated to that PR.
The errors do not currently show up on develop.
This PR aims to resolve the linter errors separately from #17147.
When there have been lots of changes compared with the number of
entities, we can do a fast(er) path.
Locally I ran some benchmarking, and the comparison seems to give the
best determination of which method we use.
This change will apply the `email` & `picture` provided by OIDC to the
new user account when registering a new user via OIDC. If the user is
directed to the account details form, this change makes sure they have
been selected before applying them, otherwise they are omitted. In
particular, this change ensures the values are carried through when
Synapse has consent configured, and the redirect to the consent form/s
are followed.
I have tested everything manually. Including:
- with/without consent configured
- allowing/not allowing the use of email/avatar (via
`sso_auth_account_details.html`)
- with/without automatic account detail population (by un/commenting the
`localpart_template` option in synapse config).
### Pull Request Checklist
<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->
* [X] Pull request is based on the develop branch
* [X] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
- Use markdown where necessary, mostly for `code blocks`.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [X] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct
(run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
... when workers are unreachable, etc.
Fixes https://github.com/element-hq/synapse/issues/17117.
The general principle is just to make sure that we propagate any
exceptions to the JsonResource, so that we return an error code to the
sending server. That means that the sending server no longer considers
the message safely sent, so it will retry later.
In the issue, Erik mentions that an alternative solution would be to
persist the to-device messages into a table so that they can be retried.
This might be an improvement for performance, but even if we did that,
we still need this mechanism, since we might be unable to reach the
database. So, if we want to do that, it can be a later follow-up.
---------
Co-authored-by: Erik Johnston <erik@matrix.org>
Weakness in auth chain indexing allows DoS from remote room members
through disk fill and high CPU usage.
A remote Matrix user with malicious intent, sharing a room with Synapse
instances before 1.104.1, can dispatch specially crafted events to
exploit a weakness in how the auth chain cover index is calculated. This
can induce high CPU consumption and accumulate excessive data in the
database of such instances, resulting in a denial of service.
Servers in private federations, or those that do not federate, are not
affected.
Resurrecting https://github.com/matrix-org/synapse/pull/13918.
This should reduce IOPs incurred by joining to the events table to
lookup stream ordering, which happens in many receipt handling code
paths. Like the previous PR I believe sufficient time has passed between
the original migration in DB schema 72 and now to merge this as-is. It's
highly unlikely that both the migration is still ongoing AND (active)
users still have any receipts prior to that date.
In the unlikely event there is a receipt without a populated
`event_stream_ordering` synapse will behave just as it does now when
receipts exist for events that don't (yet): for push action calculation
the receipts are just ignored.
I've removed the validation on event IDs as this is already covered
here:
59ceabcb97/synapse/handlers/receipts.py (L189-L192)
PR #16942 removed an invalid optimisation that avoided pulling out state
for non-gappy syncs. This causes a large increase in DB usage. c.f.
#16941 for why that optimisation was wrong.
However, we can still optimise in the simple case where the events in
the timeline are a linear chain without any branching/merging of the
DAG.
cc. @richvdh
Before we were pulling out *all* read receipts for a user for every
event we pushed. Instead let's only pull out the relevant receipts.
This also pulled out the event rows for each receipt, causing load on
the events table.
This PR fixes a very, very niche edge-case, but I've got some more work
coming which will otherwise make the problem worse.
The bug happens when the syncing user leaves a room, and has a sync
filter which includes "left" rooms, but sets the timeline limit to 0. In
that case, the state returned in the `state` section is calculated
incorrectly.
The fix is to pass a token corresponding to the point that the user
leaves the room through to `compute_state_delta`.
Requests may require a User-Agent header, and the change in #16972
accidentally removed it, resulting in requests getting rejected causing
login to fail.
Fixes https://github.com/element-hq/synapse/issues/16680, as well as a
related bug, where servers which we had *never* successfully sent an
event to would not be retried.
In order to fix the case of pending to-device messages, we hook into the
existing `wake_destinations_needing_catchup` process, by extending it to
look for destinations that have pending to-device messages. The
federation transmission loop then attempts to send the pending to-device
messages as normal.
When a lot of locks are waiting for a single lock, notifying all locks
independently with `call_later` on each release is really costly and
incurs some kind of async contention, where the CPU is spinning a lot
for not much.
The included test is taking around 30s before the change, and 0.5s
after.
It was found following failing tests with
https://github.com/element-hq/synapse/pull/16827.
This PR aims to fix#16895, caused by a regression in #7 and not fixed
by #16903. The PR #16903 only fixes a starvation issue, where the CPU
isn't released. There is a second issue, where the execution is blocked.
This theory is supported by the flame graphs provided in #16895 and the
fact that I see the CPU usage reducing and far below the limit.
Since the changes in #7, the method `check_state_independent_auth_rules`
is called with the additional parameter `batched_auth_events`:
6fa13b4f92/synapse/handlers/federation_event.py (L1741-L1743)
It makes the execution enter this if clause, introduced with #151956fa13b4f92/synapse/event_auth.py (L178-L189)
There are two issues in the above code snippet.
First, there is the blocking issue. I'm not entirely sure if this is a
deadlock, starvation, or something different. In the beginning, I
thought the copy operation was responsible. It wasn't. Then I
investigated the nested `store.get_events` inside the function `update`.
This was also not causing the blocking issue. Only when I replaced the
set difference operation (`-` ) with a list comprehension, the blocking
was resolved. Creating and comparing sets with a very large amount of
events seems to be problematic.
This is how the flamegraph looks now while persisting outliers. As you
can see, the execution no longer locks up in the above function.
![output_2024-02-28_13-59-40](https://github.com/element-hq/synapse/assets/13143850/6db9c9ac-484f-47d0-bdde-70abfbd773ec)
Second, the copying here doesn't serve any purpose, because only a
shallow copy is created. This means the same objects from the original
dict are referenced. This fails the intention of protecting these
objects from mutation. The review of the original PR
https://github.com/matrix-org/synapse/pull/15195 had an extensive
discussion about this matter.
Various approaches to copying the auth_events were attempted:
1) Implementing a deepcopy caused issues due to
builtins.EventInternalMetadata not being pickleable.
2) Creating a dict with new objects akin to a deepcopy.
3) Creating a dict with new objects containing only necessary
attributes.
Concluding, there is no easy way to create an actual copy of the
objects. Opting for a deepcopy can significantly strain memory and CPU
resources, making it an inefficient choice. I don't see why the copy is
necessary in the first place. Therefore I'm proposing to remove it
altogether.
After these changes, I was able to successfully join these rooms,
without the main worker locking up:
- #synapse:matrix.org
- #element-android:matrix.org
- #element-web:matrix.org
- #ecips:matrix.org
- #ipfs-chatter:ipfs.io
- #python:matrix.org
- #matrix:matrix.org
Since Synapse 1.76.0, any module which registers a `on_new_event`
callback would brick the ability to join remote rooms.
This is because this callback tried to get the full state of the room,
which would end up in a deadlock.
Related:
https://github.com/matrix-org/synapse-auto-accept-invite/issues/18
The following module would brick the ability to join remote rooms:
```python
from typing import Any, Dict, Literal, Union
import logging
from synapse.module_api import ModuleApi, EventBase
logger = logging.getLogger(__name__)
class MyModule:
def __init__(self, config: None, api: ModuleApi):
self._api = api
self._config = config
self._api.register_third_party_rules_callbacks(
on_new_event=self.on_new_event,
)
async def on_new_event(self, event: EventBase, _state_map: Any) -> None:
logger.info(f"Received new event: {event}")
@staticmethod
def parse_config(_config: Dict[str, Any]) -> None:
return None
```
This is technically a breaking change, as we are now passing partial
state on the `on_new_event` callback.
However, this callback was broken for federated rooms since 1.76.0, and
local rooms have full state anyway, so it's unlikely that it would
change anything.