A lot of the things we log at INFO are now a bit superfluous, so lets
make them DEBUG logs to reduce the amount we log by default.
Co-Authored-By: Brendan Abolivier <babolivier@matrix.org>
Co-authored-by: Brendan Abolivier <github@brendanabolivier.com>
This gives a bit of a grace period where we can attempt to refetch a
remote `well-known`, while still using the cached result if that fails.
Hopefully this will make the well-known resolution a bit more torelant
of failures, rather than it immediately treating failures as "no result"
and caching that for an hour.
There was some inconsistent behaviour in the caching layer around how
exceptions were handled - particularly synchronously-thrown ones.
This seems to be most easily handled by pushing the creation of
ObservableDeferreds down from CacheDescriptor to the Cache.
Closes#4583
Does slightly less than #5045, which prevented a room from being upgraded multiple times, one after another. This PR still allows that, but just prevents two from happening at the same time.
Mostly just to mitigate the fact that servers are slow and it can take a moment for the room upgrade to actually complete. We don't want people sending another request to upgrade the room when really they just thought the first didn't go through.
It used to try and produce an estimate, which was sometimes negative.
This caused metrics to be sad, so lets always just calculate it from
scratch.
(This appears to have been a longstanding bug, but one which has been made more
of a problem by #3932 and #3933).
(This was originally done by Erik as part of #3933. I'm cherry-picking it
because really it's a fix in its own right)
ExpiringCache required that `start()` be called before it would actually
start expiring entries. A number of places didn't do that.
This PR removes `start` from ExpiringCache, and automatically starts
backround reaping process on creation instead.
It turns out that looping_call does check the deferred returned by its
callback, and (at least in the case of client_ips), we were relying on this,
and I broke it in #3604.
Update run_as_background_process to return the deferred, and make sure we
return it to clock.looping_call.
This fixes#3518, and ensures that we get useful logs and metrics for lots of
things that happen in the background.
(There are certainly more things that happen in the background; these are just
the common ones I've found running a single-process synapse locally).
The get_entities_changed function was changed to return all changed
entities since the given stream position, rather than only those changed
from a given list of entities. This resulted in the function incorrectly
returning large numbers of entities that, for example, caused large
increases in database usage.
The stream cache keeps track of all entities that have changed since
a particular stream position, so get_entities_changed does not need to
return unknown entites when given a larger stream position.
This makes it consistent with the behaviour of has_entity_changed.
This line shows up as about 5% of cpu time on a synchrotron:
not_known_entities = set(entities) - set(self._entity_to_key)
Presumably the problem here is that _entity_to_key can be largeish, and
building a set for its keys every time this function is called is slow.
Here we rewrite the logic to avoid building so many sets.
When _get_state_for_groups is given a wildcard filter, just do a complete
lookup. Hopefully this will give us the best of both worlds by not filling up
the ram if we only need one or two keys, but also making the cache still work
for the federation reader usecase.