First some background: StreamChangeCache is used to keep track of what "entities" have
changed since a given stream ID. So for example, we might use it to keep track of when the last
to-device message for a given user was received [1], and hence whether we need to pull any to-device messages from the database on a sync [2].
Now, it turns out that StreamChangeCache didn't support more than one thing being changed at
a given stream_id (this was part of the problem with #7206). However, it's entirely valid to send
to-device messages to more than one user at a time.
As it turns out, this did in fact work, because *some* methods of StreamChangeCache coped
ok with having multiple things changing on the same stream ID, and it seems we never actually
use the methods which don't work on the stream change caches where we allow multiple
changes at the same stream ID. But that feels horribly fragile, hence: let's update
StreamChangeCache to properly support this, and add some typing and some more tests while
we're at it.
[1]: https://github.com/matrix-org/synapse/blob/release-v1.12.3/synapse/storage/data_stores/main/deviceinbox.py#L301
[2]: https://github.com/matrix-org/synapse/blob/release-v1.12.3/synapse/storage/data_stores/main/deviceinbox.py#L47-L51
Other parts of the code (such as the StreamChangeCache) assume that there will
not be multiple changes with the same stream id.
This code was introduced in #7024, and I hope this fixes#7206.
* Pull Sentinel out of LoggingContext
... and drop a few unnecessary references to it
* Factor out LoggingContext.current_context
move `current_context` and `set_context` out to top-level functions.
Mostly this means that I can more easily trace what's actually referring to
LoggingContext, but I think it's generally neater.
* move copy-to-parent into `stop`
this really just makes `start` and `stop` more symetric. It also means that it
behaves correctly if you manually `set_log_context` rather than using the
context manager.
* Replace `LoggingContext.alive` with `finished`
Turn `alive` into `finished` and make it a bit better defined.
A lot of the things we log at INFO are now a bit superfluous, so lets
make them DEBUG logs to reduce the amount we log by default.
Co-Authored-By: Brendan Abolivier <babolivier@matrix.org>
Co-authored-by: Brendan Abolivier <github@brendanabolivier.com>
`Measure` incorrectly assumed that it was the only thing being done by the parent `LoggingContext`. For instance, during a "renew group attestations" operation, hundreds of `outbound_request` calls could take place in parallel, all using the same `LoggingContext`. This would mean that any resources used during *any* of those calls would be reported against *all* of them, producing wildly inaccurate results.
Instead, we now give each `Measure` block its own `LoggingContext` (using the parent `LoggingContext` mechanism to ensure that the log lines look correct and that the metrics are ultimately propogated to the top level for reporting against requests/backgrond tasks).
This makes it easier to use in an async/await world.
Also fixes a bug where cache descriptors would occaisonally return a raw
value rather than a deferred.
We have set the max retry interval to a value larger than a postgres or
sqlite int can hold, which caused exceptions when updating the
destinations table.
To fix postgres we need to change the column to a bigint, and for sqlite
we lower the max interval to 2**62 (which is still incredibly long).
Fixes a bug where the default attribute maps were prioritised over
user-specified ones, resulting in incorrect mappings.
The problem is that if you call SPConfig.load() multiple times, it adds new
attribute mappers to a list. So by calling it with the default config first,
and then the user-specified config, we would always get the default mappers
before the user-specified mappers.
To solve this, let's merge the config dicts first, and then pass them to
SPConfig.
This gives a bit of a grace period where we can attempt to refetch a
remote `well-known`, while still using the cached result if that fails.
Hopefully this will make the well-known resolution a bit more torelant
of failures, rather than it immediately treating failures as "no result"
and caching that for an hour.
There was some inconsistent behaviour in the caching layer around how
exceptions were handled - particularly synchronously-thrown ones.
This seems to be most easily handled by pushing the creation of
ObservableDeferreds down from CacheDescriptor to the Cache.
The version of a module isn't going to change over the lifetime of the
process (assuming no funky hot reloading is going on, which it isn't),
so let's just cache the result to avoid spawning lots of git
subprocesses.
Fixes#5672.
- Put the default window_size back to 1000ms (broken by #5181)
- Make the `rc_federation` config actually do something
- fix an off-by-one error in the 'concurrent' limit
- Avoid creating an unused `_PerHostRatelimiter` object for every single
incoming request
Closes#4583
Does slightly less than #5045, which prevented a room from being upgraded multiple times, one after another. This PR still allows that, but just prevents two from happening at the same time.
Mostly just to mitigate the fact that servers are slow and it can take a moment for the room upgrade to actually complete. We don't want people sending another request to upgrade the room when really they just thought the first didn't go through.
Wrap calls to deferToThread() in a thing which uses a child logcontext to
attribute CPU usage to the right request.
While we're in the area, remove the logcontext_tracer stuff, which is never
used, and afaik doesn't work.
Fixes#4064
If a looping call function errors, then it kills the loop entirely.
Currently it throws away the exception logs, so we should make it
actually log them.
Fixes#3929
when processing incoming transactions, it can be hard to see what's going on,
because we process a bunch of stuff in parallel, and because we may end up
recursively working our way through a chain of three or four events.
This commit creates a way to use logcontexts to add the relevant event ids to
the log lines.
It used to try and produce an estimate, which was sometimes negative.
This caused metrics to be sad, so lets always just calculate it from
scratch.
(This appears to have been a longstanding bug, but one which has been made more
of a problem by #3932 and #3933).
(This was originally done by Erik as part of #3933. I'm cherry-picking it
because really it's a fix in its own right)
ExpiringCache required that `start()` be called before it would actually
start expiring entries. A number of places didn't do that.
This PR removes `start` from ExpiringCache, and automatically starts
backround reaping process on creation instead.
Let's try to rationalise the logging that happens when we are processing an
incoming transaction, to make it easier to figure out what is going wrong when
they take ages. In particular:
- make everything start with a [room_id event_id] prefix
- make sure we log a warning when catching exceptions rather than just turning
them into other, more cryptic, exceptions.
The existing deferred timeout helper function (and the one into twisted)
suffer from a bug when a deferred's canceller throws an exception, #3842.
The new helper function doesn't suffer from this problem.
Turns out that cancellation of inlineDeferreds didn't really work properly
until Twisted 18.7. This commit refactors Linearizer.queue to avoid
inlineCallbacks.
It turns out that looping_call does check the deferred returned by its
callback, and (at least in the case of client_ips), we were relying on this,
and I broke it in #3604.
Update run_as_background_process to return the deferred, and make sure we
return it to clock.looping_call.
This is more involved than it might otherwise be, because the current
implementation just drops its logcontexts and runs everything in the sentinel
context.
It turns out that we aren't actually using a bunch of the functionality here
(notably suppress_failures and the fact that Distributor.fire returns a
deferred), so the easiest way to fix this is actually by simplifying a bunch of
code.
This fixes#3518, and ensures that we get useful logs and metrics for lots of
things that happen in the background.
(There are certainly more things that happen in the background; these are just
the common ones I've found running a single-process synapse locally).
The get_entities_changed function was changed to return all changed
entities since the given stream position, rather than only those changed
from a given list of entities. This resulted in the function incorrectly
returning large numbers of entities that, for example, caused large
increases in database usage.
The stream cache keeps track of all entities that have changed since
a particular stream position, so get_entities_changed does not need to
return unknown entites when given a larger stream position.
This makes it consistent with the behaviour of has_entity_changed.
This line shows up as about 5% of cpu time on a synchrotron:
not_known_entities = set(entities) - set(self._entity_to_key)
Presumably the problem here is that _entity_to_key can be largeish, and
building a set for its keys every time this function is called is slow.
Here we rewrite the logic to avoid building so many sets.
When _get_state_for_groups is given a wildcard filter, just do a complete
lookup. Hopefully this will give us the best of both worlds by not filling up
the ram if we only need one or two keys, but also making the cache still work
for the federation reader usecase.
When we finish processing a request, log the number of events we fetched from
the database to handle it.
[I'm trying to figure out which requests are responsible for large amounts of
event cache churn. It may turn out to be more helpful to add counts to the
prometheus per-request/block metrics, but that is an extension to this code
anyway.]
There's a frequent idiom I noticed where an iterable is split up into a
number of chunks/batches. Unfortunately that method does not work with
iterators like dict.keys() in python3. This implementation works with
iterators.
Signed-off-by: Adrian Tschira <nota@notafile.com>
So, it turns out that if you have a first `Deferred` `D1`, you can add a
callback which returns another `Deferred` `D2`, and `D2` must then complete
before any further callbacks on `D1` will execute (and later callbacks on `D1`
get the *result* of `D2` rather than `D2` itself).
So, `D1` might have `called=True` (as in, it has started running its
callbacks), but any new callbacks added to `D1` won't get run until `D2`
completes - so if you `yield D1` in an `inlineCallbacks` function, your `yield`
will 'block'.
In conclusion: some of our assumptions in `logcontext` were invalid. We need to
make sure that we don't optimise out the logcontext juggling when this
situation happens. Fortunately, it is easy to detect by checking `D1.paused`.
While I was going through uses of preserve_fn for other PRs, I converted places
which only use the wrapped function once to use run_in_background, to avoid
creating the function object.
There were a bunch of places where we fire off a process to happen in the
background, but don't have any exception handling on it - instead relying on
the unhandled error being logged when the relevent deferred gets
garbage-collected.
This is unsatisfactory for a number of reasons:
- logging on garbage collection is best-effort and may happen some time after
the error, if at all
- it can be hard to figure out where the error actually happened.
- it is logged as a scary CRITICAL error which (a) I always forget to grep for
and (b) it's not really CRITICAL if a background process we don't care about
fails.
So this is an attempt to add exception handling to everything we fire off into
the background.