Try to avoid an OOM by checking fewer extremities.
Generally this is a big rewrite of _maybe_backfill, to try and fix some of the TODOs and other problems in it. It's best reviewed commit-by-commit.
Multiple calls to `EventsWorkerStore._get_events_from_cache_or_db` can
reuse the same database fetch, which is initiated by the first call.
Ensure that cancelling the first call doesn't cancel the other calls
sharing the same database fetch.
Signed-off-by: Sean Quah <seanq@element.io>
When we join a room via the faster-joins mechanism, we end up with "partial
state" at some points on the event DAG. Many parts of the codebase need to
wait for the full state to load. So, we implement a mechanism to keep track of
which events have partial state, and wait for them to be fully-populated.
We work through all the events with partial state, updating the state at each
of them. Once it's done, we recalculate the state for the whole room, and then
mark the room as having complete state.
* Add some type hints to datastore
* newsfile
* change `Collection` to `List`
* refactor return type of `select_users_txn`
* correct type hint in `stream.py`
* Remove `Optional` in `select_users_txn`
* remove not needed return type in `__init__`
* Revert change in `get_stream_id_for_event_txn`
* Remove import from `Literal`
Consider the requester's ignored users when calculating the
bundled aggregations.
See #12285 / 4df10d3214
for corresponding changes for the `/relations` endpoint.
Principally, `prometheus_client.REGISTRY.register` now requires its argument to
extend `prometheus_client.Collector`.
Additionally, `Gauge.set` is now annotated so that passing `Optional[int]`
causes an error.
Refactor and convert `Linearizer` to async. This makes a `Linearizer`
cancellation bug easier to fix.
Also refactor to use an async context manager, which eliminates an
unlikely footgun where code that doesn't immediately use the context
manager could forget to release the lock.
Signed-off-by: Sean Quah <seanq@element.io>
This is a first step in dealing with #7721.
The idea is basically that rather than calculating the full set of users a device list update needs to be sent to up front, we instead simply record the rooms the user was in at the time of the change. This will allow a few things:
1. we can defer calculating the set of remote servers that need to be poked about the change; and
2. during `/sync` and `/keys/changes` we can avoid also avoid calculating users who share rooms with other users, and instead just look at the rooms that have changed.
However, care needs to be taken to correctly handle server downgrades. As such this PR writes to both `device_lists_changes_in_room` and the `device_lists_outbound_pokes` table synchronously. In a future release we can then bump the database schema compat version to `69` and then we can assume that the new `device_lists_changes_in_room` exists and is handled.
There is a temporary option to disable writing to `device_lists_outbound_pokes` synchronously, allowing us to test the new code path does work (and by implication upgrading to a future release and downgrading to this one will work correctly).
Note: Ideally we'd do the calculation of room to servers on a worker (e.g. the background worker), but currently only master can write to the `device_list_outbound_pokes` table.
Switching to a sequence means there's no need to track `last_txn` on the
AS state table to generate new TXN IDs. This also means that there is
no longer contention between the AS scheduler and AS handler on updates
to the `application_services_state` table, which will prevent serialization
errors during the complete AS txn transaction.
It seems like calling `_get_state_group_for_events` for an event where the
state is unknown is an error. Accordingly, let's raise an exception rather than
silently returning an empty result.
If we're missing most of the events in the room state, then we may as well call the /state endpoint, instead of individually requesting each and every event.
The PaginationChunk class attempted to bundle some properties
together, but really just caused callers to jump through hoops and
hid implementation details.
This endpoint was removed from MSC2675 before it was approved.
It is currently unspecified (even in any MSCs) and therefore subject to
removal. It is not implemented by any known clients.
This also changes the bundled aggregation format for `m.annotation`,
which previously included pagination tokens for the `/aggregations`
endpoint, which are no longer useful.
When we are processing a `/backfill` request from a remote server, exclude any
outliers from consideration early on. We can't return outliers anyway (since we
don't know the state at the outlier), and filtering them out earlier means that
we won't attempt to calulate the state for them.
This should speed up push rule calculations for rooms with large numbers of local users when the main push rule cache fails.
Co-authored-by: reivilibre <oliverw@matrix.org>
We fetch the thread summary in two phases:
1. The summary that is shared by all users (count of messages and latest event).
2. Whether the requesting user has participated in the thread.
There's no use in attempting step 2 for events which did not return a summary
from step 1.
* Formally type the UserProfile in user searches
* export UserProfile in synapse.module_api
* Update docs
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
The unstable identifiers are still supported if the experimental configuration
flag is enabled. The unstable identifiers will be removed in a future release.
This is allowed per MSC2675, although the original implementation did
not allow for it and would return an empty chunk / not bundle aggregations.
The main thing to improve is that the various caches get cleared properly
when an event is redacted, and that edits must not leak if the original
event is redacted (as that would presumably leak something similar to
the original event content).
* `@cached` can now take an `uncached_args` which is an iterable of names to not use in the cache key.
* Requires `@cached`, @cachedList` and `@lru_cache` to use keyword arguments for clarity.
* Asserts that keyword-only arguments in cached functions are not accepted. (I tested this briefly and I don't believe this works properly.)