This fixes a bug in _delete_existing_rows_txn which was introduced in #3435
(though it's been on matrix-org-hotfixes for *years*). This code is only called
when there is some sort of conflict the first time we try to persist an event,
so it only happens rarely. Still, the exceptions are annoying.
We need to run the errback in the sentinel context to avoid losing our own
context.
Also: add logging to runInteraction to help identify where "Starting db
connection from sentinel context" warnings are coming from
We incorrectly asserted that all contexts must have a non None state
group without consider outliers. This would usually be fine as the
assertion would never be hit, as there is a shortcut during persistence
if the forward extremities don't change.
However, if the outlier is being persisted with non-outlier events, the
function would be called and the assertion would be hit.
Fixes#3601
We don't want to bother pulling out the current state from the DB since
until we know we have to. Checking the context for state is just an
optimisation.
This fixes#3518, and ensures that we get useful logs and metrics for lots of
things that happen in the background.
(There are certainly more things that happen in the background; these are just
the common ones I've found running a single-process synapse locally).
This is in preparation for using contexts that may or may not have the
current_state_ids set. This will allow us to avoid unnecessarily pulling
out state for an event on the master process when using workers.
We also add a check to see if the state groups of the old extremities
are the same as the new ones.
It turns out that most of the time we were calling have_events, we were only
using half of the result. Replace have_events with have_seen_events and
get_rejection_reasons, so that we can see what's going on a bit more clearly.
This will allow us to measure how often we calculate state deltas in
event persistence that we would have been able to calculate at the same
time we calculated the state for the event.
The race happens when the user joins a room at the same time as doing a
sync. We fetch the current token and then get the rooms the user is in.
If the join happens after the current token, but before we get the rooms
we end up sending down a partial room entry in the sync.
This is fixed by looking at the stream ordering of the membership
returned by get_rooms_for_user, and handling the case when that stream
ordering is after the current token.
* Split state group persist into seperate storage func
* Add per database engine code for state group id gen
* Move store_state_group to StateReadStore
This allows other workers to use it, and so resolve state.
* Hook up store_state_group
* Fix tests
* Rename _store_mult_state_groups_txn
* Rename StateGroupReadStore
* Remove redundant _have_persisted_state_group_txn
* Update comments
* Comment compute_event_context
* Set start val for state_group_id_seq
... otherwise we try to recreate old state groups
* Update comments
* Don't store state for outliers
* Update comment
* Update docstring as state groups are ints
1. use `deferred.errback()` instead of `deferred.errback(e)`, which means that
a Failure object will be constructed using the current exception state,
*including* its stack trace - so the stack trace is saved in the Failure,
leading to better exception reports.
2. Set `consumeErrors=True` on the ObservableDeferred, because we know that
there will always be at least one observer - which avoids a spurious "CRITICAL:
unhandled exception in Deferred" error in the logs
ObserveableDeferred expects its callbacks to be called without any
logcontexts, whereas it turns out we were calling them with the logcontext of
the request which initiated the persistence loop.
It seems wrong that we are attributing work done in the persistence loop to the
request that happened to initiate it, so let's solve this by dropping the
logcontext for it.
(I'm not sure this actually causes any real problems other than messages in the
debug log, but let's clean it up anyway)
Add db_conn parameters to the `__init__` methods of the *Store classes, so that
they are all consistent, which makes the multiple inheritance work correctly
(and so that we can later extract mixins which can be used in the slavedstores)
This is due to the fact that we prefilled caches using txn.call_after,
which always gets called including on error.
We fix this by making txn.call_after only fire when a transaction
completes successfully, which is what we want most of the time anyway.