forked-synapse/synapse
Gerrit Gogel 1f88790764
Prevent locking up while processing batched_auth_events (#16968)
This PR aims to fix #16895, caused by a regression in #7 and not fixed
by #16903. PR #16903 only fixes a starvation issue, where the CPU
isn't released. There is a second issue, where the execution is blocked.
This theory is supported by the flame graphs provided in #16895 and by the
fact that I see CPU usage drop far below the limit.

Since the changes in #7, the method `check_state_independent_auth_rules`
is called with the additional parameter `batched_auth_events`:


6fa13b4f92/synapse/handlers/federation_event.py (L1741-L1743)


This makes the execution enter the following `if` clause, introduced with #15195:


6fa13b4f92/synapse/event_auth.py (L178-L189)

There are two issues in the above code snippet.

First, there is the blocking issue. I'm not entirely sure if this is a
deadlock, starvation, or something different. In the beginning, I
thought the copy operation was responsible. It wasn't. Then I
investigated the nested `store.get_events` inside the function `update`.
This was also not causing the blocking issue. Only when I replaced the
set difference operation (`-`) with a list comprehension was the blocking
resolved. Creating and comparing sets with a very large number of
events seems to be problematic.
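
As an illustration of the change described above, here is a minimal sketch of the two ways to compute the auth event IDs that still have to be fetched. The function and variable names are hypothetical and the events are stand-in objects; this is not the actual Synapse code, only the shape of the set difference versus the list comprehension.

```python
from typing import Dict, Iterable, List, Set


def needed_ids_set_difference(
    auth_event_ids: Iterable[str], batched: Dict[str, object]
) -> Set[str]:
    # Previous style: build a set of the IDs and subtract the dict's key view.
    return set(auth_event_ids) - batched.keys()


def needed_ids_list_comprehension(
    auth_event_ids: Iterable[str], batched: Dict[str, object]
) -> List[str]:
    # Replacement style: one membership test per ID against the dict,
    # without building an intermediate set.
    return [event_id for event_id in auth_event_ids if event_id not in batched]


# Hypothetical data: two events are already available in the batch.
batched_events = {"$a": object(), "$b": object()}
auth_ids = ["$a", "$c", "$b", "$d"]

missing_as_set = needed_ids_set_difference(auth_ids, batched_events)
missing_as_list = needed_ids_list_comprehension(auth_ids, batched_events)
```

Both variants select the same IDs. The list comprehension keeps the original order and any duplicates, which the set-based version does not.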

This is how the flamegraph looks now while persisting outliers. As you
can see, the execution no longer locks up in the above function.

![output_2024-02-28_13-59-40](https://github.com/element-hq/synapse/assets/13143850/6db9c9ac-484f-47d0-bdde-70abfbd773ec)

Second, the copying here doesn't serve any purpose, because only a
shallow copy is created. This means the same objects from the original
dict are referenced. This defeats the intention of protecting these
objects from mutation. The review of the original PR
https://github.com/matrix-org/synapse/pull/15195 had an extensive
discussion about this matter.
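
The shallow-copy behaviour can be shown with a small, self-contained sketch. `FakeEvent` and its `rejected` attribute are hypothetical stand-ins, not Synapse's event class; the point is only that `dict(...)` copies the mapping, not the objects it holds.

```python
class FakeEvent:
    """Hypothetical stand-in for an event object with mutable state."""

    def __init__(self, event_id: str) -> None:
        self.event_id = event_id
        self.rejected = False


batched_auth_events = {"$a": FakeEvent("$a")}

# Shallow copy: a new dict whose values are the very same objects.
auth_events = dict(batched_auth_events)

# Adding a key to the copy does not touch the original dict ...
auth_events["$b"] = FakeEvent("$b")
original_has_new_key = "$b" in batched_auth_events

# ... but mutating a shared object is visible through both dicts.
auth_events["$a"].rejected = True
original_was_mutated = batched_auth_events["$a"].rejected
```

So the copy only guards against keys being added to or removed from the original dict, not against mutation of the events themselves.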

Various approaches to copying the auth_events were attempted:
1) Implementing a deepcopy caused issues due to
builtins.EventInternalMetadata not being pickleable.
2) Creating a dict with new objects akin to a deepcopy.
3) Creating a dict with new objects containing only necessary
attributes.
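
To illustrate why the first approach fails, the sketch below mimics an object that does not support pickling. `copy.deepcopy` falls back to the pickle protocol (`__reduce_ex__`) for objects it has no dedicated handler for, so an unpicklable attribute makes the whole deep copy fail. `UnpicklableMetadata` and `FakeEvent` are hypothetical classes used only for this demonstration.

```python
import copy


class UnpicklableMetadata:
    """Hypothetical stand-in for a native type without pickle support."""

    def __reduce_ex__(self, protocol):
        raise TypeError("cannot pickle 'UnpicklableMetadata' object")


class FakeEvent:
    """Hypothetical event that carries the unpicklable metadata."""

    def __init__(self) -> None:
        self.internal_metadata = UnpicklableMetadata()


auth_events = {"$a": FakeEvent()}

try:
    copy.deepcopy(auth_events)
except TypeError as exc:
    # deepcopy reaches the metadata object and its reduce hook raises.
    deepcopy_error = str(exc)
else:
    deepcopy_error = None
```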

In conclusion, there is no easy way to create an actual copy of the
objects. Opting for a deepcopy can significantly strain memory and CPU
resources, making it an inefficient choice. I don't see why the copy is
necessary in the first place. Therefore, I'm proposing to remove it
altogether.

After these changes, I was able to successfully join these rooms,
without the main worker locking up:
- #synapse:matrix.org
- #element-android:matrix.org
- #element-web:matrix.org
- #ecips:matrix.org
- #ipfs-chatter:ipfs.io
- #python:matrix.org
- #matrix:matrix.org
2024-03-12 15:07:36 +00:00
_scripts Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
api Stabilize support for Retry-After header (MSC4014) (#16947) 2024-03-08 09:33:46 +00:00
app Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
appservice Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
config Stabilize support for Retry-After header (MSC4014) (#16947) 2024-03-08 09:33:46 +00:00
crypto Only do one concurrent fetch per server in keyring (#16894) 2024-02-09 10:51:11 +00:00
events Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
federation Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
handlers Don't lock up when joining large rooms (#16903) 2024-02-20 14:29:18 +00:00
http Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
logging Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
media Bump lxml-stubs from 0.4.0 to 0.5.1 (#16885) 2024-02-06 09:29:17 +00:00
metrics Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
module_api Fix joining remote rooms when a on_new_event callback is registered (#16973) 2024-03-06 16:00:20 +01:00
push Revert "Improve DB performance of calculating badge counts for push. (#16756)" (#16979) 2024-03-05 12:27:27 +00:00
replication Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
res Use oEmbed for YouTube Shorts (#15025) 2023-05-03 12:54:42 -04:00
rest deactivated flag refactored to filter deactivated users. (#16874) 2024-03-11 16:08:04 +00:00
server_notices Merge remote-tracking branch 'gitlab/clokep/license-license' into new_develop 2023-12-13 15:11:56 +00:00
spam_checker_api Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
state Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
static Update link to the clients webpage, fix #15825 (#15874) 2023-07-06 17:28:09 +02:00
storage deactivated flag refactored to filter deactivated users. (#16874) 2024-03-11 16:08:04 +00:00
streams Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
synapse_rust Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
types Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
util Don't invalidate the entire event cache when we purge history (#16905) 2024-02-13 13:24:11 +00:00
__init__.py Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
_pydantic_compat.py Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
event_auth.py Prevent locking up while processing batched_auth_events (#16968) 2024-03-12 15:07:36 +00:00
notifier.py Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
py.typed
server.py Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00
visibility.py Correctly mention previous copyright (#16820) 2024-01-23 11:26:48 +00:00