Commit Graph

487 Commits

Author SHA1 Message Date
Richard van der Hoff
5c445114d3
Correctly account for cpu usage by background threads (#4074)
Wrap calls to deferToThread() in a thing which uses a child logcontext to
attribute CPU usage to the right request.

While we're in the area, remove the logcontext_tracer stuff, which is never
used, and afaik doesn't work.

Fixes #4064
2018-10-23 13:12:32 +01:00
Amber Brown
e1728dfcbe
Make scripts/ and scripts-dev/ pass pyflakes (and the rest of the codebase on py3) (#4068) 2018-10-20 11:16:55 +11:00
Amber Brown
e404ba9aac
Fix manhole on py3 (pt 2) (#4067) 2018-10-19 22:26:00 +11:00
Amber Brown
a36b0ec195 make a bytestring 2018-10-19 09:24:00 +11:00
Erik Johnston
6982320572 Remove unnecessary extra function call layer 2018-10-08 14:06:19 +01:00
Erik Johnston
8a1817f0d2 Use errback pattern and catch async failures 2018-10-08 13:29:47 +01:00
Erik Johnston
f7199e8734 Log looping call exceptions
If a looping call function errors, then it kills the loop entirely.
Currently it throws away the exception logs, so we should make it
actually log them.

Fixes #3929
2018-10-05 11:24:12 +01:00
Erik Johnston
4f3e3ac192 Correctly match 'dict.pop' api 2018-10-01 12:25:27 +01:00
Erik Johnston
8ea887856c Don't update eviction metrics on explicit removal 2018-10-01 12:00:58 +01:00
Richard van der Hoff
9c8cec5dab Merge remote-tracking branch 'origin/develop' into erikj/destination_retry_cache 2018-09-28 10:51:09 +01:00
Richard van der Hoff
4a15a3e4d5
Include eventid in log lines when processing incoming federation transactions (#3959)
when processing incoming transactions, it can be hard to see what's going on,
because we process a bunch of stuff in parallel, and because we may end up
recursively working our way through a chain of three or four events.

This commit creates a way to use logcontexts to add the relevant event ids to
the log lines.
2018-09-27 11:25:34 +01:00
Richard van der Hoff
5b4028fa78 Merge branch 'rav/fix_expiring_cache_len' into erikj/destination_retry_cache 2018-09-26 12:55:53 +01:00
Richard van der Hoff
7ee94fc1ba Log which cache is throwing exceptions 2018-09-26 12:43:08 +01:00
Erik Johnston
3baf6e1667 Fix ExpiringCache.__len__ to be accurate
It used to try and produce an estimate, which was sometimes negative.
This caused metrics to be sad, so lets always just calculate it from
scratch.

(This appears to have been a longstanding bug, but one which has been made more
of a problem by #3932 and #3933).

(This was originally done by Erik as part of #3933. I'm cherry-picking it
because really it's a fix in its own right)
2018-09-26 12:32:29 +01:00
Erik Johnston
19dc676d1a Fix ExpiringCache.__len__ to be accurate
It used to try and produce an estimate, which was sometimes negative.
This caused metrics to be sad, so lets always just calculate it from
scratch.
2018-09-21 16:25:42 +01:00
Erik Johnston
fdd1a62e8d Add a five minute cache to get_destination_retry_timings
Hopefully helps with #3931
2018-09-21 14:56:12 +01:00
Erik Johnston
79eded1ae4 Make ExpiringCache slightly more performant 2018-09-21 14:52:21 +01:00
Erik Johnston
8601c24287 Fix some instances of ExpiringCache not expiring cache items
ExpiringCache required that `start()` be called before it would actually
start expiring entries. A number of places didn't do that.

This PR removes `start` from ExpiringCache, and automatically starts
backround reaping process on creation instead.
2018-09-21 14:19:46 +01:00
Richard van der Hoff
642199570c
Improve the logging when handling a federation transaction (#3904)
Let's try to rationalise the logging that happens when we are processing an
incoming transaction, to make it easier to figure out what is going wrong when
they take ages. In particular:

- make everything start with a [room_id event_id] prefix
- make sure we log a warning when catching exceptions rather than just turning
  them into other, more cryptic, exceptions.
2018-09-19 17:28:18 +01:00
Erik Johnston
9407bcf37a Replace custom DeferredTimeoutError with defer.TimeoutError 2018-09-19 11:07:29 +01:00
Erik Johnston
6c48aa0256 Run canceller first to allow it to generate correct error 2018-09-19 11:07:27 +01:00
Erik Johnston
a334e1cace Update to use new timeout function everywhere.
The existing deferred timeout helper function (and the one into twisted)
suffer from a bug when a deferred's canceller throws an exception, #3842.

The new helper function doesn't suffer from this problem.
2018-09-19 10:39:40 +01:00
Erik Johnston
24efb2a70d Fix timeout function
Turns out deferred.cancel sometimes throws, so we do that last to ensure
that we always do resolve the new deferred.
2018-09-15 11:38:39 +01:00
Erik Johnston
fcfe7a850d Add an awful secondary timeout to fix wedged requests
This is an attempt to mitigate #3842 by adding yet-another-timeout
2018-09-14 19:23:07 +01:00
Erik Johnston
0a81038ea0 Add in flight real time metrics for Measure blocks 2018-09-14 15:08:37 +01:00
Erik Johnston
9e05c8d309 Change the manhole SSH key to have more bits
Newer versions of openssh client refuse to connect to the old key due to
its length.
2018-09-11 10:42:10 +01:00
Richard van der Hoff
be6527325a Fix exceptions when a connection is closed before we read the headers
This fixes bugs introduced in #3700, by making sure that we behave sanely
when an incoming connection is closed before the headers are read.
2018-08-20 18:21:10 +01:00
Richard van der Hoff
55e6bdf287 Robustness fix for logcontext filter
Make the logcontext filter not explode if it somehow ends up with a logcontext
of None, since that infinite-loops the whole logging system.
2018-08-20 18:20:07 +01:00
Amber Brown
324525f40c
Port over enough to get some sytests running on Python 3 (#3668) 2018-08-20 23:54:49 +10:00
Richard van der Hoff
c31793a784 Merge branch 'rav/fix_linearizer_cancellation' into develop 2018-08-10 14:57:27 +01:00
Amber Brown
b37c472419
Rename async to async_helpers because async is a keyword on Python 3.7 (#3678) 2018-08-10 23:50:21 +10:00
Richard van der Hoff
638d35ef08 Fix linearizer cancellation on twisted < 18.7
Turns out that cancellation of inlineDeferreds didn't really work properly
until Twisted 18.7. This commit refactors Linearizer.queue to avoid
inlineCallbacks.
2018-08-10 10:59:09 +01:00
Amber Brown
da7785147d
Python 3: Convert some unicode/bytes uses (#3569) 2018-08-02 00:54:06 +10:00
Richard van der Hoff
a8cbce0ced fix invalidation 2018-07-27 16:17:17 +01:00
Richard van der Hoff
f102c05856 Rewrite cache list decorator
Because it was complicated and annoyed me. I suspect this will be more
efficient too.
2018-07-27 13:47:04 +01:00
Richard van der Hoff
03751a6420 Fix some looping_call calls which were broken in #3604
It turns out that looping_call does check the deferred returned by its
callback, and (at least in the case of client_ips), we were relying on this,
and I broke it in #3604.

Update run_as_background_process to return the deferred, and make sure we
return it to clock.looping_call.
2018-07-26 11:48:08 +01:00
Richard van der Hoff
3d6df84658 Test and fix support for cancellation in Linearizer 2018-07-20 13:59:55 +01:00
Richard van der Hoff
7c712f95bb Combine Limiter and Linearizer
Linearizer was effectively a Limiter with max_count=1, so rather than
maintaining two sets of code, let's combine them.
2018-07-20 13:11:43 +01:00
Richard van der Hoff
8462c26485 Improvements to the Limiter
* give them names, to improve logging
* use a deque rather than a list for efficiency
2018-07-20 12:50:27 +01:00
Richard van der Hoff
d7275eecf3 Add a sleep to the Limiter to fix stack overflows.
Fixes #3570
2018-07-20 12:37:12 +01:00
Amber Brown
95ccb6e2ec
Don't spew errors because we can't save metrics (#3563) 2018-07-19 20:58:18 +10:00
Richard van der Hoff
8c69b735e3 Make Distributor run its processes as a background process
This is more involved than it might otherwise be, because the current
implementation just drops its logcontexts and runs everything in the sentinel
context.

It turns out that we aren't actually using a bunch of the functionality here
(notably suppress_failures and the fact that Distributor.fire returns a
deferred), so the easiest way to fix this is actually by simplifying a bunch of
code.
2018-07-18 20:55:05 +01:00
Richard van der Hoff
667fba68f3 Run things as background processes
This fixes #3518, and ensures that we get useful logs and metrics for lots of
things that happen in the background.

(There are certainly more things that happen in the background; these are just
the common ones I've found running a single-process synapse locally).
2018-07-18 20:55:05 +01:00
Erik Johnston
b2aa05a8d6 Use efficient .intersection 2018-07-17 11:07:04 +01:00
Erik Johnston
547b1355d3 Fix perf regression in PR #3530
The get_entities_changed function was changed to return all changed
entities since the given stream position, rather than only those changed
from a given list of entities. This resulted in the function incorrectly
returning large numbers of entities that, for example, caused large
increases in database usage.
2018-07-17 10:27:51 +01:00
Amber Brown
3fe0938b76
Merge pull request #3530 from matrix-org/erikj/stream_cache
Don't return unknown entities in get_entities_changed
2018-07-17 13:44:46 +10:00
Richard van der Hoff
33b40d0a25 Make FederationRateLimiter queue requests properly
popitem removes the *most recent* item by default [1]. We want the oldest.

Fixes #3524

[1]: https://docs.python.org/2/library/collections.html#collections.OrderedDict.popitem
2018-07-13 16:19:40 +01:00
Erik Johnston
77b692e65d Don't return unknown entities in get_entities_changed
The stream cache keeps track of all entities that have changed since
a particular stream position, so get_entities_changed does not need to
return unknown entites when given a larger stream position.

This makes it consistent with the behaviour of has_entity_changed.
2018-07-13 15:26:10 +01:00
Richard van der Hoff
fa5c2bc082 Reduce set building in get_entities_changed
This line shows up as about 5% of cpu time on a synchrotron:

    not_known_entities = set(entities) - set(self._entity_to_key)

Presumably the problem here is that _entity_to_key can be largeish, and
building a set for its keys every time this function is called is slow.

Here we rewrite the logic to avoid building so many sets.
2018-07-12 11:37:44 +01:00
Richard van der Hoff
c3c29aa196
Attempt to include db threads in cpu usage stats (#3496)
Let's try to include time spent in the DB threads in the per-request/block cpu
usage metrics.
2018-07-10 16:12:36 +01:00