Commit Graph

472 Commits

Author SHA1 Message Date
Erik Johnston
fdd1a62e8d Add a five minute cache to get_destination_retry_timings
Hopefully helps with #3931
2018-09-21 14:56:12 +01:00
Erik Johnston
79eded1ae4 Make ExpiringCache slightly more performant 2018-09-21 14:52:21 +01:00
Erik Johnston
8601c24287 Fix some instances of ExpiringCache not expiring cache items
ExpiringCache required that `start()` be called before it would actually
start expiring entries. A number of places didn't do that.

This PR removes `start` from ExpiringCache, and automatically starts
backround reaping process on creation instead.
2018-09-21 14:19:46 +01:00
Richard van der Hoff
642199570c
Improve the logging when handling a federation transaction (#3904)
Let's try to rationalise the logging that happens when we are processing an
incoming transaction, to make it easier to figure out what is going wrong when
they take ages. In particular:

- make everything start with a [room_id event_id] prefix
- make sure we log a warning when catching exceptions rather than just turning
  them into other, more cryptic, exceptions.
2018-09-19 17:28:18 +01:00
Erik Johnston
9407bcf37a Replace custom DeferredTimeoutError with defer.TimeoutError 2018-09-19 11:07:29 +01:00
Erik Johnston
6c48aa0256 Run canceller first to allow it to generate correct error 2018-09-19 11:07:27 +01:00
Erik Johnston
a334e1cace Update to use new timeout function everywhere.
The existing deferred timeout helper function (and the one into twisted)
suffer from a bug when a deferred's canceller throws an exception, #3842.

The new helper function doesn't suffer from this problem.
2018-09-19 10:39:40 +01:00
Erik Johnston
24efb2a70d Fix timeout function
Turns out deferred.cancel sometimes throws, so we do that last to ensure
that we always do resolve the new deferred.
2018-09-15 11:38:39 +01:00
Erik Johnston
fcfe7a850d Add an awful secondary timeout to fix wedged requests
This is an attempt to mitigate #3842 by adding yet-another-timeout
2018-09-14 19:23:07 +01:00
Erik Johnston
0a81038ea0 Add in flight real time metrics for Measure blocks 2018-09-14 15:08:37 +01:00
Erik Johnston
9e05c8d309 Change the manhole SSH key to have more bits
Newer versions of openssh client refuse to connect to the old key due to
its length.
2018-09-11 10:42:10 +01:00
Richard van der Hoff
be6527325a Fix exceptions when a connection is closed before we read the headers
This fixes bugs introduced in #3700, by making sure that we behave sanely
when an incoming connection is closed before the headers are read.
2018-08-20 18:21:10 +01:00
Richard van der Hoff
55e6bdf287 Robustness fix for logcontext filter
Make the logcontext filter not explode if it somehow ends up with a logcontext
of None, since that infinite-loops the whole logging system.
2018-08-20 18:20:07 +01:00
Amber Brown
324525f40c
Port over enough to get some sytests running on Python 3 (#3668) 2018-08-20 23:54:49 +10:00
Richard van der Hoff
c31793a784 Merge branch 'rav/fix_linearizer_cancellation' into develop 2018-08-10 14:57:27 +01:00
Amber Brown
b37c472419
Rename async to async_helpers because async is a keyword on Python 3.7 (#3678) 2018-08-10 23:50:21 +10:00
Richard van der Hoff
638d35ef08 Fix linearizer cancellation on twisted < 18.7
Turns out that cancellation of inlineDeferreds didn't really work properly
until Twisted 18.7. This commit refactors Linearizer.queue to avoid
inlineCallbacks.
2018-08-10 10:59:09 +01:00
Amber Brown
da7785147d
Python 3: Convert some unicode/bytes uses (#3569) 2018-08-02 00:54:06 +10:00
Richard van der Hoff
a8cbce0ced fix invalidation 2018-07-27 16:17:17 +01:00
Richard van der Hoff
f102c05856 Rewrite cache list decorator
Because it was complicated and annoyed me. I suspect this will be more
efficient too.
2018-07-27 13:47:04 +01:00
Richard van der Hoff
03751a6420 Fix some looping_call calls which were broken in #3604
It turns out that looping_call does check the deferred returned by its
callback, and (at least in the case of client_ips), we were relying on this,
and I broke it in #3604.

Update run_as_background_process to return the deferred, and make sure we
return it to clock.looping_call.
2018-07-26 11:48:08 +01:00
Richard van der Hoff
3d6df84658 Test and fix support for cancellation in Linearizer 2018-07-20 13:59:55 +01:00
Richard van der Hoff
7c712f95bb Combine Limiter and Linearizer
Linearizer was effectively a Limiter with max_count=1, so rather than
maintaining two sets of code, let's combine them.
2018-07-20 13:11:43 +01:00
Richard van der Hoff
8462c26485 Improvements to the Limiter
* give them names, to improve logging
* use a deque rather than a list for efficiency
2018-07-20 12:50:27 +01:00
Richard van der Hoff
d7275eecf3 Add a sleep to the Limiter to fix stack overflows.
Fixes #3570
2018-07-20 12:37:12 +01:00
Amber Brown
95ccb6e2ec
Don't spew errors because we can't save metrics (#3563) 2018-07-19 20:58:18 +10:00
Richard van der Hoff
8c69b735e3 Make Distributor run its processes as a background process
This is more involved than it might otherwise be, because the current
implementation just drops its logcontexts and runs everything in the sentinel
context.

It turns out that we aren't actually using a bunch of the functionality here
(notably suppress_failures and the fact that Distributor.fire returns a
deferred), so the easiest way to fix this is actually by simplifying a bunch of
code.
2018-07-18 20:55:05 +01:00
Richard van der Hoff
667fba68f3 Run things as background processes
This fixes #3518, and ensures that we get useful logs and metrics for lots of
things that happen in the background.

(There are certainly more things that happen in the background; these are just
the common ones I've found running a single-process synapse locally).
2018-07-18 20:55:05 +01:00
Erik Johnston
b2aa05a8d6 Use efficient .intersection 2018-07-17 11:07:04 +01:00
Erik Johnston
547b1355d3 Fix perf regression in PR #3530
The get_entities_changed function was changed to return all changed
entities since the given stream position, rather than only those changed
from a given list of entities. This resulted in the function incorrectly
returning large numbers of entities that, for example, caused large
increases in database usage.
2018-07-17 10:27:51 +01:00
Amber Brown
3fe0938b76
Merge pull request #3530 from matrix-org/erikj/stream_cache
Don't return unknown entities in get_entities_changed
2018-07-17 13:44:46 +10:00
Richard van der Hoff
33b40d0a25 Make FederationRateLimiter queue requests properly
popitem removes the *most recent* item by default [1]. We want the oldest.

Fixes #3524

[1]: https://docs.python.org/2/library/collections.html#collections.OrderedDict.popitem
2018-07-13 16:19:40 +01:00
Erik Johnston
77b692e65d Don't return unknown entities in get_entities_changed
The stream cache keeps track of all entities that have changed since
a particular stream position, so get_entities_changed does not need to
return unknown entites when given a larger stream position.

This makes it consistent with the behaviour of has_entity_changed.
2018-07-13 15:26:10 +01:00
Richard van der Hoff
fa5c2bc082 Reduce set building in get_entities_changed
This line shows up as about 5% of cpu time on a synchrotron:

    not_known_entities = set(entities) - set(self._entity_to_key)

Presumably the problem here is that _entity_to_key can be largeish, and
building a set for its keys every time this function is called is slow.

Here we rewrite the logic to avoid building so many sets.
2018-07-12 11:37:44 +01:00
Richard van der Hoff
c3c29aa196
Attempt to include db threads in cpu usage stats (#3496)
Let's try to include time spent in the DB threads in the per-request/block cpu
usage metrics.
2018-07-10 16:12:36 +01:00
Richard van der Hoff
55370331da
Refactor logcontext resource usage tracking (#3501)
Factor out the resource usage tracking out to a separate object, which can be
passed around and copied independently of the logcontext itself.
2018-07-10 13:56:07 +01:00
Amber Brown
49af402019 run isort 2018-07-09 16:09:20 +10:00
Amber Brown
6350bf925e
Attempt to be more performant on PyPy (#3462) 2018-06-28 14:49:57 +01:00
Amber Brown
72d2143ea8
Revert "Revert "Try to not use as much CPU in the StreamChangeCache"" (#3454) 2018-06-28 11:04:18 +01:00
Matthew Hodgson
8057489b26
Revert "Try to not use as much CPU in the StreamChangeCache" 2018-06-26 18:09:01 +01:00
Amber Brown
1202508067 fixes 2018-06-26 17:29:01 +01:00
Amber Brown
bd3d329c88 fixes 2018-06-26 17:28:12 +01:00
Amber Brown
abfe4b2957 try and make loading items from the cache faster 2018-06-26 17:25:34 +01:00
Amber Brown
07cad26d65
Remove all global reactor imports & pass it around explicitly (#3424) 2018-06-25 14:08:28 +01:00
Richard van der Hoff
43e02c409d Disable partial state group caching for wildcard lookups
When _get_state_for_groups is given a wildcard filter, just do a complete
lookup. Hopefully this will give us the best of both worlds by not filling up
the ram if we only need one or two keys, but also making the cache still work
for the federation reader usecase.
2018-06-22 11:52:07 +01:00
Richard van der Hoff
70e6501913
Merge pull request #3419 from matrix-org/rav/events_per_request
Log number of events fetched from DB
2018-06-22 11:17:56 +01:00
Richard van der Hoff
0495fe0035 Indirect evt_count updates via method call
so that we can stub it for the sentinel and not have a billion failing UTs
2018-06-22 10:42:28 +01:00
Amber Brown
77ac14b960
Pass around the reactor explicitly (#3385) 2018-06-22 09:37:10 +01:00
Richard van der Hoff
b088aafcae Log number of events fetched from DB
When we finish processing a request, log the number of events we fetched from
the database to handle it.

[I'm trying to figure out which requests are responsible for large amounts of
event cache churn. It may turn out to be more helpful to add counts to the
prometheus per-request/block metrics, but that is an extension to this code
anyway.]
2018-06-21 06:15:03 +01:00
Amber Brown
a61738b316
Remove run_on_reactor (#3395) 2018-06-14 18:27:37 +10:00