Commit Graph

741 Commits

Author SHA1 Message Date
Richard van der Hoff
916cf2d439
re-implement daemonize (#8011)
This has long been something I've wanted to do. Basically the `Daemonize` code
is both too flexible and not flexible enough, in that it offers a bunch of
features that we don't use (changing UID, closing FDs in the child, logging to
syslog) and doesn't offer a bunch that we could do with (redirecting stdout/err
to a file instead of /dev/null; having the parent not exit until the child is
running).

As a first step, I've lifted the Daemonize code and removed the bits we don't
use. This should be a non-functional change. Fixing everything else will come
later.
2020-08-04 10:03:41 +01:00
Karthikeyan Singaravelan
a7b06a81f0
Fix deprecation warning: import ABC from collections.abc (#7892) 2020-07-20 13:33:04 -04:00
Patrick Cloke
6b3ac3b8cd
Convert device handler to async/await (#7871) 2020-07-17 07:09:25 -04:00
Patrick Cloke
38e1fac886
Fix some spelling mistakes / typos. (#7811) 2020-07-09 09:52:58 -04:00
Dirk Klimpel
21a212f8e5
Fix inconsistent handling of upper and lower cases of email addresses. (#7021)
fixes #7016
2020-07-03 14:03:13 +01:00
Patrick Cloke
231252516c
Fix "argument of type 'ObservableDeferred' is not iterable" error (#7708) 2020-06-16 12:01:18 -04:00
Dagfinn Ilmari Mannsåker
a3f11567d9
Replace all remaining six usage with native Python 3 equivalents (#7704) 2020-06-16 08:51:47 -04:00
Patrick Cloke
bd6dc17221
Replace iteritems/itervalues/iterkeys with native versions. (#7692) 2020-06-15 07:03:36 -04:00
Andrew Morgan
f4e6495b5d
Performance improvements and refactor of Ratelimiter (#7595)
While working on https://github.com/matrix-org/synapse/issues/5665 I found myself digging into the `Ratelimiter` class and seeing that it was both:

* Rather undocumented, and
* causing a *lot* of config checks

This PR attempts to refactor and comment the `Ratelimiter` class, as well as encourage config file accesses to only be done at instantiation. 

Best to be reviewed commit-by-commit.
2020-06-05 10:47:20 +01:00
Erik Johnston
35c308731d Speed up processing of federation stream RDATA rows.
Instead of storing and sending an ACK for every single row we send
synchronously, we instead do it asynchronously while batching up
updates.
2020-05-27 19:34:07 +01:00
Erik Johnston
eefc6b3a0d
Don't apply cache factor to event cache. (#7578)
This is already correctly done when we instansiate the cache, but wasn't
when it got reloaded (which always happens at least once on startup).
2020-05-27 12:04:37 +01:00
Richard van der Hoff
a0f99f81b3
Fix stacktrace mangling in patch_inline_callbacks (#7554)
`Failure()` is more cunning than `Failure(e)`.
2020-05-22 10:17:36 +01:00
Richard van der Hoff
d4676910c9 remove miscellaneous PY2 code 2020-05-15 19:37:41 +01:00
Richard van der Hoff
65902e08c3 remove to_ascii
this is a no-op on python 3.
2020-05-15 19:12:03 +01:00
Richard van der Hoff
08fa96f030 Remove exception_to_unicode
this is a no-op on python 3.
2020-05-15 19:07:24 +01:00
Patrick Cloke
56b66db78a
Strictly enforce canonicaljson requirements in a new room version (#7381) 2020-05-14 13:24:01 -04:00
Amber Brown
7cb8b4bc67
Allow configuration of Synapse's cache without using synctl or environment variables (#6391) 2020-05-11 18:45:23 +01:00
Erik Johnston
f9073893af Speed up fetching device lists changes in sync.
Currently we copy `users_who_share_room` needlessly about three times,
which is expensive when the set is large (which it can easily be).
2020-05-05 17:40:29 +01:00
Richard van der Hoff
13683a3a22
Extend StreamChangeCache to support multiple entities per stream ID (#7303)
First some background: StreamChangeCache is used to keep track of what "entities" have 
changed since a given stream ID. So for example, we might use it to keep track of when the last
to-device message for a given user was received [1], and hence whether we need to pull any to-device messages from the database on a sync [2].

Now, it turns out that StreamChangeCache didn't support more than one thing being changed at
a given stream_id (this was part of the problem with #7206). However, it's entirely valid to send
to-device messages to more than one user at a time.

As it turns out, this did in fact work, because *some* methods of StreamChangeCache coped
ok with having multiple things changing on the same stream ID, and it seems we never actually
use the methods which don't work on the stream change caches where we allow multiple
changes at the same stream ID. But that feels horribly fragile, hence: let's update
StreamChangeCache to properly support this, and add some typing and some more tests while
we're at it.

[1]: https://github.com/matrix-org/synapse/blob/release-v1.12.3/synapse/storage/data_stores/main/deviceinbox.py#L301
[2]: https://github.com/matrix-org/synapse/blob/release-v1.12.3/synapse/storage/data_stores/main/deviceinbox.py#L47-L51
2020-04-22 13:45:40 +01:00
Richard van der Hoff
0f8f02bc39
On catchup, process each row with its own stream id (#7286)
Other parts of the code (such as the StreamChangeCache) assume that there will
not be multiple changes with the same stream id.

This code was introduced in #7024, and I hope this fixes #7206.
2020-04-20 11:43:29 +01:00
Richard van der Hoff
7966a1cde9
Rewrite prune_old_outbound_device_pokes for efficiency (#7159)
make sure we clear out all but one update for the user
2020-03-30 19:06:52 +01:00
Richard van der Hoff
39230d2171
Clean up some LoggingContext stuff (#7120)
* Pull Sentinel out of LoggingContext

... and drop a few unnecessary references to it

* Factor out LoggingContext.current_context

move `current_context` and `set_context` out to top-level functions.

Mostly this means that I can more easily trace what's actually referring to
LoggingContext, but I think it's generally neater.

* move copy-to-parent into `stop`

this really just makes `start` and `stop` more symetric. It also means that it
behaves correctly if you manually `set_log_context` rather than using the
context manager.

* Replace `LoggingContext.alive` with `finished`

Turn `alive` into `finished` and make it a bit better defined.
2020-03-24 14:45:33 +00:00
Patrick Cloke
509e381afa
Clarify list/set/dict/tuple comprehensions and enforce via flake8 (#6957)
Ensure good comprehension hygiene using flake8-comprehensions.
2020-02-21 07:15:07 -05:00
Erik Johnston
ed630ea17c
Reduce amount of logging at INFO level. (#6862)
A lot of the things we log at INFO are now a bit superfluous, so lets
make them DEBUG logs to reduce the amount we log by default.

Co-Authored-By: Brendan Abolivier <babolivier@matrix.org>
Co-authored-by: Brendan Abolivier <github@brendanabolivier.com>
2020-02-06 13:31:05 +00:00
Erik Johnston
ae5b3104f0
Fix stacktraces when using ObservableDeferred and async/await (#6836) 2020-02-03 17:10:54 +00:00
Andrew Morgan
9f7aaf90b5
Validate client_secret parameter (#6767) 2020-01-24 14:28:40 +00:00
Richard van der Hoff
acc7820574 Log saml assertions rather than the whole response
... since the whole response is huge.

We even need to break up the assertions, since kibana otherwise truncates them.
2020-01-16 22:26:34 +00:00
Richard van der Hoff
14d8f342d5 move batch_iter to a separate module 2020-01-16 22:25:32 +00:00
Richard van der Hoff
01243b98e1 Handle config not being set for synapse plugin modules
Some modules don't need any config, so having to define a `config` property
just to keep the loader happy is a bit annoying.
2020-01-12 21:34:36 +00:00
Richard van der Hoff
bc7de87650
Persist auth/state events at backwards extremities when we fetch them (#6526)
The main point here is to make sure that the state returned by _get_state_in_room has been authed before we try to use it as state in the room.
2019-12-16 12:26:28 +00:00
Hubert Chathi
cb2db17994
look up cross-signing keys from the DB in bulk (#6486) 2019-12-12 12:03:28 -05:00
Erik Johnston
f166a8d1f5 Remove SnapshotCache in favour of ResponseCache 2019-12-09 13:42:49 +00:00
Richard van der Hoff
18660a34d8
Fix inaccurate per-block metrics (#6491)
`Measure` incorrectly assumed that it was the only thing being done by the parent `LoggingContext`. For instance, during a "renew group attestations" operation, hundreds of `outbound_request` calls could take place in parallel, all using the same `LoggingContext`. This would mean that any resources used during *any* of those calls would be reported against *all* of them, producing wildly inaccurate results.

Instead, we now give each `Measure` block its own `LoggingContext` (using the parent `LoggingContext` mechanism to ensure that the log lines look correct and that the metrics are ultimately propogated to the top level for reporting against requests/backgrond tasks).
2019-12-09 11:55:30 +00:00
Erik Johnston
8437e2383e Port SyncHandler to async/await 2019-12-05 17:58:25 +00:00
Andrew Morgan
bc29a19731 Replace instance variations of homeserver with correct case/spacing 2019-11-12 13:08:12 +00:00
V02460
affcc2cc36 Fix LruCache callback deduplication (#6213) 2019-11-07 09:43:51 +00:00
Andrew Morgan
54fef094b3
Remove usage of deprecated logger.warn method from codebase (#6271)
Replace every instance of `logger.warn` with `logger.warning` as the former is deprecated.
2019-10-31 10:23:24 +00:00
Erik Johnston
6e677403b7 Clarify docstring 2019-10-30 11:52:04 +00:00
Erik Johnston
326b3dace7 Make ObservableDeferred.observe() always return deferred.
This makes it easier to use in an async/await world.

Also fixes a bug where cache descriptors would occaisonally return a raw
value rather than a deferred.
2019-10-30 11:35:46 +00:00
Andrew Morgan
b39ca49db1
Handle FileNotFound error in checking git repository version (#6284) 2019-10-30 11:00:15 +00:00
Erik Johnston
09a135b039 Make concurrently_execute work with async/await 2019-10-29 15:02:23 +00:00
Erik Johnston
e6c7e239ef Update docstring 2019-10-29 11:48:30 +00:00
Erik Johnston
d0d8a22c13 Quick fix to ensure cache descriptors always return deferreds 2019-10-28 13:33:04 +00:00
Erik Johnston
3c2d6c708c Add maybe_awaitable and fix __init__ bugs 2019-10-11 15:26:09 +01:00
Erik Johnston
fe1c1e6c28
Fixup comments
Co-Authored-By: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
2019-10-10 13:17:19 +01:00
Erik Johnston
59e0ed8306 Fix py3.5 2019-10-10 12:47:07 +01:00
Erik Johnston
c349e3ebaf Fix py3.5 2019-10-10 12:29:38 +01:00
Erik Johnston
f735aeec65 sort 2019-10-10 12:20:29 +01:00
Erik Johnston
941edad583 Appease mypy 2019-10-10 12:15:17 +01:00
Erik Johnston
791a8c559b Add coments 2019-10-10 11:53:57 +01:00
Erik Johnston
ec0596f2ab Log correct context 2019-10-10 11:11:38 +01:00
Erik Johnston
3e4272961a Test for sentinel commit 2019-10-10 10:58:32 +01:00
Erik Johnston
1d6dd1c294 Move patch_inline_callbacks into synapse/ 2019-10-10 10:53:06 +01:00
Richard van der Hoff
66537e10ce
add some metrics on the federation sender (#6160) 2019-10-03 17:47:20 +01:00
Amber Brown
864f144543
Fix up some typechecking (#6150)
* type checking fixes

* changelog
2019-10-02 05:29:01 -07:00
Erik Johnston
f44f1d2e83 Fix errors storing large retry intervals.
We have set the max retry interval to a value larger than a postgres or
sqlite int can hold, which caused exceptions when updating the
destinations table.

To fix postgres we need to change the column to a bigint, and for sqlite
we lower the max interval to 2**62 (which is still incredibly long).
2019-10-02 10:36:27 +01:00
Richard van der Hoff
284e1cb027 Merge branch 'develop' into rav/fix_attribute_mapping 2019-09-19 20:32:25 +01:00
Richard van der Hoff
b74606ea22 Fix a bug with saml attribute maps.
Fixes a bug where the default attribute maps were prioritised over
user-specified ones, resulting in incorrect mappings.

The problem is that if you call SPConfig.load() multiple times, it adds new
attribute mappers to a list. So by calling it with the default config first,
and then the user-specified config, we would always get the default mappers
before the user-specified mappers.

To solve this, let's merge the config dicts first, and then pass them to
SPConfig.
2019-09-19 20:32:14 +01:00
Richard van der Hoff
1e19ce00bf
Add 'failure_ts' column to 'destinations' table (#6016)
Track the time that a server started failing at, for general analysis purposes.
2019-09-17 11:41:54 +01:00
Richard van der Hoff
3d882a7ba5
Remove the cap on federation retry interval. (#6026)
Essentially the intention here is to end up blacklisting servers which never
respond to federation requests.

Fixes https://github.com/matrix-org/synapse/issues/5113.
2019-09-12 13:00:13 +01:00
Richard van der Hoff
0388beafe4
Fix bug in calculating the federation retry backoff period (#6025)
This was intended to introduce an element of jitter; instead it gave you a
30/60 chance of resetting to zero.
2019-09-12 12:59:43 +01:00
Andrew Morgan
9fc71dc5ee
Use the v2 Identity Service API for lookups (MSC2134 + MSC2140) (#5976)
This is a redo of https://github.com/matrix-org/synapse/pull/5897 but with `id_access_token` accepted.

Implements [MSC2134](https://github.com/matrix-org/matrix-doc/pull/2134) plus Identity Service v2 authentication ala [MSC2140](https://github.com/matrix-org/matrix-doc/pull/2140).

Identity lookup-related functions were also moved from `RoomMemberHandler` to `IdentityHandler`.
2019-09-11 16:02:42 +01:00
Richard van der Hoff
7902bf1e1d
Clean up some code in the retry logic (#6017)
* remove some unused code
* make things which were constants into constants for efficiency and clarity
2019-09-11 15:14:56 +01:00
Andrew Morgan
3057095a5d Revert "Use the v2 lookup API for 3PID invites (#5897)" (#5937)
This reverts commit 71fc04069a.

This broke 3PID invites as #5892 was required for it to work correctly.
2019-08-30 12:00:20 +01:00
Andrew Morgan
71fc04069a
Use the v2 lookup API for 3PID invites (#5897)
Fixes https://github.com/matrix-org/synapse/issues/5861

Adds support for the v2 lookup API as defined in [MSC2134](https://github.com/matrix-org/matrix-doc/pull/2134). Currently this is only used for 3PID invites.

Sytest PR: https://github.com/matrix-org/sytest/pull/679
2019-08-28 14:59:26 +02:00
Erik Johnston
17e1e80726 Retry well-known lookup before expiry.
This gives a bit of a grace period where we can attempt to refetch a
remote `well-known`, while still using the cached result if that fails.

Hopefully this will make the well-known resolution a bit more torelant
of failures, rather than it immediately treating failures as "no result"
and caching that for an hour.
2019-08-13 16:20:38 +01:00
Brendan Abolivier
244953be3f
Add kwargs and doc 2019-07-29 10:03:14 +02:00
Brendan Abolivier
08352d44f8
Add ability to pass arguments to looping calls 2019-07-29 09:54:37 +02:00
Richard van der Hoff
618bd1ee76
Fix some error cases in the caching layer. (#5749)
There was some inconsistent behaviour in the caching layer around how
exceptions were handled - particularly synchronously-thrown ones.

This seems to be most easily handled by pushing the creation of
ObservableDeferreds down from CacheDescriptor to the Cache.
2019-07-25 15:59:45 +01:00
Richard van der Hoff
418635e68a
Add a prometheus metric for active cache lookups. (#5750)
* Add a prometheus metric for active cache lookups.

* changelog
2019-07-24 11:33:13 +01:00
Amber Brown
4806651744
Replace returnValue with return (#5736) 2019-07-23 23:00:55 +10:00
Erik Johnston
5ea773c505 Cache get_version_string.
The version of a module isn't going to change over the lifetime of the
process (assuming no funky hot reloading is going on, which it isn't),
so let's just cache the result to avoid spawning lots of git
subprocesses.

Fixes #5672.
2019-07-22 13:15:08 +01:00
Richard van der Hoff
9481707a52
Fixes to the federation rate limiter (#5621)
- Put the default window_size back to 1000ms (broken by #5181)
- Make the `rc_federation` config actually do something
- fix an off-by-one error in the 'concurrent' limit
- Avoid creating an unused `_PerHostRatelimiter` object for every single
  incoming request
2019-07-05 11:10:19 +01:00
Amber Brown
1ee268d33d
Improve the backwards compatibility re-exports of synapse.logging.context (#5617)
* Improve the backwards compatibility re-exports of synapse.logging.context.

* reexport logformatter too
2019-07-05 02:32:02 +10:00
Amber Brown
463b072b12
Move logging utilities out of the side drawer of util/ and into logging/ (#5606) 2019-07-04 00:07:04 +10:00
Richard van der Hoff
cb8d568cf9 Fix 'utime went backwards' errors on daemonization. (#5609)
* Fix 'utime went backwards' errors on daemonization.

Fixes #5608

* remove spurious debug
2019-07-03 22:40:45 +10:00
Richard van der Hoff
91753cae59
Fix a number of "Starting txn from sentinel context" warnings (#5605)
Fixes #5602, #5603
2019-07-03 09:31:27 +01:00
Amber Brown
0ee9076ffe Fix media repo breaking (#5593) 2019-07-02 19:01:28 +01:00
Andrew Morgan
ef8c62758c
Prevent multiple upgrades on the same room at once (#5051)
Closes #4583

Does slightly less than #5045, which prevented a room from being upgraded multiple times, one after another. This PR still allows that, but just prevents two from happening at the same time.

Mostly just to mitigate the fact that servers are slow and it can take a moment for the room upgrade to actually complete. We don't want people sending another request to upgrade the room when really they just thought the first didn't go through.
2019-06-25 14:19:21 +01:00
Richard van der Hoff
dc94773e60 Avoid raising exceptions in metrics
Sentry will catch the errors if they happen, so that should be good enough, and
woun't make things explode if we hit the error condition.
2019-06-24 10:01:16 +01:00
Richard van der Hoff
5097aee740 Merge branch 'develop' into rav/cleanup_metrics 2019-06-24 10:00:13 +01:00
Amber Brown
32e7c9e7f2
Run Black. (#5482) 2019-06-20 19:32:02 +10:00
Richard van der Hoff
fe641df770 Sanity-checking for metrics updates
Check that our clocks go forward.
2019-06-19 21:18:38 +01:00
Richard van der Hoff
aa530e6800
Call RetryLimiter correctly (#5340)
Fixes a regression introduced in #5335.
2019-06-04 22:02:53 +01:00
Richard van der Hoff
dce6e9e0c1 Avoid rapidly backing-off a server if we ignore the retry interval 2019-06-03 23:58:42 +01:00
Richard van der Hoff
3dcf2feba8
Improve logging for logcontext leaks. (#5288) 2019-05-29 19:27:50 +01:00
Amber Brown
f1e5b41388
Make all the rate limiting options more consistent (#5181) 2019-05-15 12:06:04 -05:00
Erik Johnston
0aba6c8251
Merge pull request #5183 from matrix-org/erikj/async_serialize_event
Allow client event serialization to be async
2019-05-15 10:36:30 +01:00
Erik Johnston
8ed2f182f7
Update docstring with correct return type
Co-Authored-By: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
2019-05-15 09:52:52 +01:00
Richard van der Hoff
daa2fb6317 comment about user_joined_room 2019-05-14 18:53:09 +01:00
Erik Johnston
b54b03f9e1 Allow client event serialization to be async 2019-05-14 11:58:01 +01:00
Richard van der Hoff
836d3adcce Merge branch 'master' into develop 2019-05-03 19:25:01 +01:00
Richard van der Hoff
247dc1bd0b Use SystemRandom for token generation 2019-05-03 13:02:55 +01:00
Andrew Morgan
caa76e6021
Remove periods from copyright headers (#5046) 2019-04-11 17:08:13 +01:00
Richard van der Hoff
329688c161
Fix disappearing exceptions in manhole. (#5035)
Avoid sending syntax errors from the manhole to sentry.
2019-04-10 07:23:48 +01:00
Richard van der Hoff
bc5f6e1797
Add a caching layer to .well-known responses (#4516) 2019-01-30 10:55:25 +00:00
Richard van der Hoff
457fbfaf22
Merge pull request #4486 from xperimental/workaround-4216
Implement workaround for login error.
2019-01-30 07:06:11 +00:00
Robert Jacob
2a7f0b8953 Implement workaround for login error.
Signed-off-by: Robert Jacob <xperimental@solidproject.de>
2019-01-30 01:06:39 +01:00
Amber Brown
f815bd7feb
Make linearizer more quiet (#4507) 2019-01-29 11:05:31 +00:00
Richard van der Hoff
676cf2ee26
Fix incorrect logcontexts after a Deferred was cancelled (#4407) 2019-01-17 14:00:23 +00:00
Richard van der Hoff
ecc23188f4
Fix UnicodeDecodeError when postgres is not configured in english (#4253)
This is a bit of a half-assed effort at fixing https://github.com/matrix-org/synapse/issues/4252. Fundamentally the right answer is to drop support for Python 2.
2018-12-04 11:55:52 +01:00
Erik Johnston
b94a43d5b5 Merge branch 'develop' of github.com:matrix-org/synapse into erikj/alias_disallow_list 2018-10-25 15:25:31 +01:00
Richard van der Hoff
5c445114d3
Correctly account for cpu usage by background threads (#4074)
Wrap calls to deferToThread() in a thing which uses a child logcontext to
attribute CPU usage to the right request.

While we're in the area, remove the logcontext_tracer stuff, which is never
used, and afaik doesn't work.

Fixes #4064
2018-10-23 13:12:32 +01:00
Amber Brown
e1728dfcbe
Make scripts/ and scripts-dev/ pass pyflakes (and the rest of the codebase on py3) (#4068) 2018-10-20 11:16:55 +11:00
Amber Brown
e404ba9aac
Fix manhole on py3 (pt 2) (#4067) 2018-10-19 22:26:00 +11:00
Erik Johnston
9fafdfa97d Anchor returned regex to start and end of string 2018-10-19 10:22:45 +01:00
Erik Johnston
084046456e Add config option to control alias creation 2018-10-19 10:22:45 +01:00
Amber Brown
a36b0ec195 make a bytestring 2018-10-19 09:24:00 +11:00
Erik Johnston
6982320572 Remove unnecessary extra function call layer 2018-10-08 14:06:19 +01:00
Erik Johnston
8a1817f0d2 Use errback pattern and catch async failures 2018-10-08 13:29:47 +01:00
Erik Johnston
f7199e8734 Log looping call exceptions
If a looping call function errors, then it kills the loop entirely.
Currently it throws away the exception logs, so we should make it
actually log them.

Fixes #3929
2018-10-05 11:24:12 +01:00
Erik Johnston
4f3e3ac192 Correctly match 'dict.pop' api 2018-10-01 12:25:27 +01:00
Erik Johnston
8ea887856c Don't update eviction metrics on explicit removal 2018-10-01 12:00:58 +01:00
Richard van der Hoff
9c8cec5dab Merge remote-tracking branch 'origin/develop' into erikj/destination_retry_cache 2018-09-28 10:51:09 +01:00
Richard van der Hoff
4a15a3e4d5
Include eventid in log lines when processing incoming federation transactions (#3959)
when processing incoming transactions, it can be hard to see what's going on,
because we process a bunch of stuff in parallel, and because we may end up
recursively working our way through a chain of three or four events.

This commit creates a way to use logcontexts to add the relevant event ids to
the log lines.
2018-09-27 11:25:34 +01:00
Richard van der Hoff
5b4028fa78 Merge branch 'rav/fix_expiring_cache_len' into erikj/destination_retry_cache 2018-09-26 12:55:53 +01:00
Richard van der Hoff
7ee94fc1ba Log which cache is throwing exceptions 2018-09-26 12:43:08 +01:00
Erik Johnston
3baf6e1667 Fix ExpiringCache.__len__ to be accurate
It used to try and produce an estimate, which was sometimes negative.
This caused metrics to be sad, so lets always just calculate it from
scratch.

(This appears to have been a longstanding bug, but one which has been made more
of a problem by #3932 and #3933).

(This was originally done by Erik as part of #3933. I'm cherry-picking it
because really it's a fix in its own right)
2018-09-26 12:32:29 +01:00
Erik Johnston
19dc676d1a Fix ExpiringCache.__len__ to be accurate
It used to try and produce an estimate, which was sometimes negative.
This caused metrics to be sad, so lets always just calculate it from
scratch.
2018-09-21 16:25:42 +01:00
Erik Johnston
fdd1a62e8d Add a five minute cache to get_destination_retry_timings
Hopefully helps with #3931
2018-09-21 14:56:12 +01:00
Erik Johnston
79eded1ae4 Make ExpiringCache slightly more performant 2018-09-21 14:52:21 +01:00
Erik Johnston
8601c24287 Fix some instances of ExpiringCache not expiring cache items
ExpiringCache required that `start()` be called before it would actually
start expiring entries. A number of places didn't do that.

This PR removes `start` from ExpiringCache, and automatically starts
backround reaping process on creation instead.
2018-09-21 14:19:46 +01:00
Richard van der Hoff
642199570c
Improve the logging when handling a federation transaction (#3904)
Let's try to rationalise the logging that happens when we are processing an
incoming transaction, to make it easier to figure out what is going wrong when
they take ages. In particular:

- make everything start with a [room_id event_id] prefix
- make sure we log a warning when catching exceptions rather than just turning
  them into other, more cryptic, exceptions.
2018-09-19 17:28:18 +01:00
Erik Johnston
9407bcf37a Replace custom DeferredTimeoutError with defer.TimeoutError 2018-09-19 11:07:29 +01:00
Erik Johnston
6c48aa0256 Run canceller first to allow it to generate correct error 2018-09-19 11:07:27 +01:00
Erik Johnston
a334e1cace Update to use new timeout function everywhere.
The existing deferred timeout helper function (and the one into twisted)
suffer from a bug when a deferred's canceller throws an exception, #3842.

The new helper function doesn't suffer from this problem.
2018-09-19 10:39:40 +01:00
Erik Johnston
24efb2a70d Fix timeout function
Turns out deferred.cancel sometimes throws, so we do that last to ensure
that we always do resolve the new deferred.
2018-09-15 11:38:39 +01:00
Erik Johnston
fcfe7a850d Add an awful secondary timeout to fix wedged requests
This is an attempt to mitigate #3842 by adding yet-another-timeout
2018-09-14 19:23:07 +01:00
Erik Johnston
0a81038ea0 Add in flight real time metrics for Measure blocks 2018-09-14 15:08:37 +01:00
Erik Johnston
9e05c8d309 Change the manhole SSH key to have more bits
Newer versions of openssh client refuse to connect to the old key due to
its length.
2018-09-11 10:42:10 +01:00
Richard van der Hoff
be6527325a Fix exceptions when a connection is closed before we read the headers
This fixes bugs introduced in #3700, by making sure that we behave sanely
when an incoming connection is closed before the headers are read.
2018-08-20 18:21:10 +01:00
Richard van der Hoff
55e6bdf287 Robustness fix for logcontext filter
Make the logcontext filter not explode if it somehow ends up with a logcontext
of None, since that infinite-loops the whole logging system.
2018-08-20 18:20:07 +01:00
Amber Brown
324525f40c
Port over enough to get some sytests running on Python 3 (#3668) 2018-08-20 23:54:49 +10:00
Richard van der Hoff
c31793a784 Merge branch 'rav/fix_linearizer_cancellation' into develop 2018-08-10 14:57:27 +01:00
Amber Brown
b37c472419
Rename async to async_helpers because async is a keyword on Python 3.7 (#3678) 2018-08-10 23:50:21 +10:00
Richard van der Hoff
638d35ef08 Fix linearizer cancellation on twisted < 18.7
Turns out that cancellation of inlineDeferreds didn't really work properly
until Twisted 18.7. This commit refactors Linearizer.queue to avoid
inlineCallbacks.
2018-08-10 10:59:09 +01:00
Amber Brown
da7785147d
Python 3: Convert some unicode/bytes uses (#3569) 2018-08-02 00:54:06 +10:00
Richard van der Hoff
a8cbce0ced fix invalidation 2018-07-27 16:17:17 +01:00
Richard van der Hoff
f102c05856 Rewrite cache list decorator
Because it was complicated and annoyed me. I suspect this will be more
efficient too.
2018-07-27 13:47:04 +01:00
Richard van der Hoff
03751a6420 Fix some looping_call calls which were broken in #3604
It turns out that looping_call does check the deferred returned by its
callback, and (at least in the case of client_ips), we were relying on this,
and I broke it in #3604.

Update run_as_background_process to return the deferred, and make sure we
return it to clock.looping_call.
2018-07-26 11:48:08 +01:00
Richard van der Hoff
3d6df84658 Test and fix support for cancellation in Linearizer 2018-07-20 13:59:55 +01:00
Richard van der Hoff
7c712f95bb Combine Limiter and Linearizer
Linearizer was effectively a Limiter with max_count=1, so rather than
maintaining two sets of code, let's combine them.
2018-07-20 13:11:43 +01:00
Richard van der Hoff
8462c26485 Improvements to the Limiter
* give them names, to improve logging
* use a deque rather than a list for efficiency
2018-07-20 12:50:27 +01:00
Richard van der Hoff
d7275eecf3 Add a sleep to the Limiter to fix stack overflows.
Fixes #3570
2018-07-20 12:37:12 +01:00
Amber Brown
95ccb6e2ec
Don't spew errors because we can't save metrics (#3563) 2018-07-19 20:58:18 +10:00
Richard van der Hoff
8c69b735e3 Make Distributor run its processes as a background process
This is more involved than it might otherwise be, because the current
implementation just drops its logcontexts and runs everything in the sentinel
context.

It turns out that we aren't actually using a bunch of the functionality here
(notably suppress_failures and the fact that Distributor.fire returns a
deferred), so the easiest way to fix this is actually by simplifying a bunch of
code.
2018-07-18 20:55:05 +01:00
Richard van der Hoff
667fba68f3 Run things as background processes
This fixes #3518, and ensures that we get useful logs and metrics for lots of
things that happen in the background.

(There are certainly more things that happen in the background; these are just
the common ones I've found running a single-process synapse locally).
2018-07-18 20:55:05 +01:00
Erik Johnston
b2aa05a8d6 Use efficient .intersection 2018-07-17 11:07:04 +01:00
Erik Johnston
547b1355d3 Fix perf regression in PR #3530
The get_entities_changed function was changed to return all changed
entities since the given stream position, rather than only those changed
from a given list of entities. This resulted in the function incorrectly
returning large numbers of entities that, for example, caused large
increases in database usage.
2018-07-17 10:27:51 +01:00
Amber Brown
3fe0938b76
Merge pull request #3530 from matrix-org/erikj/stream_cache
Don't return unknown entities in get_entities_changed
2018-07-17 13:44:46 +10:00
Richard van der Hoff
33b40d0a25 Make FederationRateLimiter queue requests properly
popitem removes the *most recent* item by default [1]. We want the oldest.

Fixes #3524

[1]: https://docs.python.org/2/library/collections.html#collections.OrderedDict.popitem
2018-07-13 16:19:40 +01:00
Erik Johnston
77b692e65d Don't return unknown entities in get_entities_changed
The stream cache keeps track of all entities that have changed since
a particular stream position, so get_entities_changed does not need to
return unknown entites when given a larger stream position.

This makes it consistent with the behaviour of has_entity_changed.
2018-07-13 15:26:10 +01:00
Richard van der Hoff
fa5c2bc082 Reduce set building in get_entities_changed
This line shows up as about 5% of cpu time on a synchrotron:

    not_known_entities = set(entities) - set(self._entity_to_key)

Presumably the problem here is that _entity_to_key can be largeish, and
building a set for its keys every time this function is called is slow.

Here we rewrite the logic to avoid building so many sets.
2018-07-12 11:37:44 +01:00
Richard van der Hoff
c3c29aa196
Attempt to include db threads in cpu usage stats (#3496)
Let's try to include time spent in the DB threads in the per-request/block cpu
usage metrics.
2018-07-10 16:12:36 +01:00
Richard van der Hoff
55370331da
Refactor logcontext resource usage tracking (#3501)
Factor out the resource usage tracking out to a separate object, which can be
passed around and copied independently of the logcontext itself.
2018-07-10 13:56:07 +01:00
Amber Brown
49af402019 run isort 2018-07-09 16:09:20 +10:00
Amber Brown
6350bf925e
Attempt to be more performant on PyPy (#3462) 2018-06-28 14:49:57 +01:00
Amber Brown
72d2143ea8
Revert "Revert "Try to not use as much CPU in the StreamChangeCache"" (#3454) 2018-06-28 11:04:18 +01:00
Matthew Hodgson
8057489b26
Revert "Try to not use as much CPU in the StreamChangeCache" 2018-06-26 18:09:01 +01:00
Amber Brown
1202508067 fixes 2018-06-26 17:29:01 +01:00
Amber Brown
bd3d329c88 fixes 2018-06-26 17:28:12 +01:00
Amber Brown
abfe4b2957 try and make loading items from the cache faster 2018-06-26 17:25:34 +01:00
Amber Brown
07cad26d65
Remove all global reactor imports & pass it around explicitly (#3424) 2018-06-25 14:08:28 +01:00
Richard van der Hoff
43e02c409d Disable partial state group caching for wildcard lookups
When _get_state_for_groups is given a wildcard filter, just do a complete
lookup. Hopefully this will give us the best of both worlds by not filling up
the ram if we only need one or two keys, but also making the cache still work
for the federation reader usecase.
2018-06-22 11:52:07 +01:00
Richard van der Hoff
70e6501913
Merge pull request #3419 from matrix-org/rav/events_per_request
Log number of events fetched from DB
2018-06-22 11:17:56 +01:00
Richard van der Hoff
0495fe0035 Indirect evt_count updates via method call
so that we can stub it for the sentinel and not have a billion failing UTs
2018-06-22 10:42:28 +01:00
Amber Brown
77ac14b960
Pass around the reactor explicitly (#3385) 2018-06-22 09:37:10 +01:00
Richard van der Hoff
b088aafcae Log number of events fetched from DB
When we finish processing a request, log the number of events we fetched from
the database to handle it.

[I'm trying to figure out which requests are responsible for large amounts of
event cache churn. It may turn out to be more helpful to add counts to the
prometheus per-request/block metrics, but that is an extension to this code
anyway.]
2018-06-21 06:15:03 +01:00
Amber Brown
a61738b316
Remove run_on_reactor (#3395) 2018-06-14 18:27:37 +10:00
Amber Brown
f7869f8f8b
Port to sortedcontainers (with tests!) (#3332) 2018-06-06 00:13:57 +10:00
Erik Johnston
042eedfa2b Add hacky cache factor override system 2018-06-04 15:39:28 +01:00
Amber Brown
c936a52a9e
Consistently use six's iteritems and wrap lazy keys/values in list() if they're not meant to be lazy (#3307) 2018-05-31 19:03:47 +10:00
Amber Brown
debff7ae09
Merge pull request #3281 from NotAFile/py3-six-isinstance
remaining isintance fixes
2018-05-30 12:44:46 +10:00
Adrian Tschira
7873cde526 pep8 2018-05-29 17:35:55 +02:00
Amber Brown
57ad76fa4a fix up tests 2018-05-28 19:51:53 +10:00
Amber Brown
3ef5cd74a6 update to more consistently use seconds in any metrics or logging 2018-05-28 19:39:27 +10:00
Amber Brown
357c74a50f add comment about why unreg 2018-05-28 19:14:41 +10:00
Amber Brown
754826a830 Merge remote-tracking branch 'origin/develop' into 3218-official-prom 2018-05-28 18:57:23 +10:00
Adrian Tschira
4ee4450d66 fix recursion error 2018-05-24 21:44:10 +02:00
Adrian Tschira
dd068ca979 remaining isintance fixes
Signed-off-by: Adrian Tschira <nota@notafile.com>
2018-05-24 20:55:08 +02:00
Amber Brown
36501068d8
Merge pull request #3247 from NotAFile/py3-misc
Misc Python3 fixes
2018-05-24 12:58:37 -05:00
Amber Brown
2aff6eab6d
Merge pull request #3245 from NotAFile/batch-iter
Add batch_iter to utils
2018-05-24 12:54:12 -05:00
Amber Brown
53cc2cde1f cleanup 2018-05-22 17:32:57 -05:00
Amber Brown
071206304d cleanup pep8 errors 2018-05-22 16:54:22 -05:00
Amber Brown
85ba83eb51 fixes 2018-05-22 16:28:23 -05:00
Amber Brown
a8990fa2ec Merge remote-tracking branch 'origin/develop' into 3218-official-prom 2018-05-22 10:50:26 -05:00
Erik Johnston
7948ecf234 Comment 2018-05-22 11:39:43 +01:00
Erik Johnston
020377a550 Fix logcontext resource usage tracking 2018-05-22 11:16:07 +01:00
Amber Brown
df9f72d9e5 replacing portions 2018-05-21 19:47:37 -05:00
Adrian Tschira
45b55e23d3 Add batch_iter to utils
There's a frequent idiom I noticed where an iterable is split up into a
number of chunks/batches. Unfortunately that method does not work with
iterators like dict.keys() in python3. This implementation works with
iterators.

Signed-off-by: Adrian Tschira <nota@notafile.com>
2018-05-19 17:48:30 +02:00
Adrian Tschira
73cbdef5f7 fix py3 intern and remove unnecessary py3 encode
Signed-off-by: Adrian Tschira <nota@notafile.com>
2018-05-19 17:35:31 +02:00
Richard van der Hoff
093d8c415a Merge remote-tracking branch 'origin/develop' into rav/warn_on_logcontext_fail 2018-05-03 14:59:29 +01:00
Richard van der Hoff
a7fe62f0cb Fix logcontext leaks in rate limiter 2018-05-03 12:31:59 +01:00
Richard van der Hoff
415c6b672e Merge branch 'develop' into rav/more_logcontext_leaks 2018-05-02 16:16:01 +01:00
Richard van der Hoff
f22e7cda2c Fix a class of logcontext leaks
So, it turns out that if you have a first `Deferred` `D1`, you can add a
callback which returns another `Deferred` `D2`, and `D2` must then complete
before any further callbacks on `D1` will execute (and later callbacks on `D1`
get the *result* of `D2` rather than `D2` itself).

So, `D1` might have `called=True` (as in, it has started running its
callbacks), but any new callbacks added to `D1` won't get run until `D2`
completes - so if you `yield D1` in an `inlineCallbacks` function, your `yield`
will 'block'.

In conclusion: some of our assumptions in `logcontext` were invalid. We need to
make sure that we don't optimise out the logcontext juggling when this
situation happens. Fortunately, it is easy to detect by checking `D1.paused`.
2018-05-02 11:58:00 +01:00
Richard van der Hoff
e482f8cd85 Fix incorrect reference to StringIO
This was introduced in 4f2f5171
2018-05-02 09:12:26 +01:00
Richard van der Hoff
fdb6849b81
Merge pull request #3144 from matrix-org/rav/run_in_background_exception_handling
Trap exceptions thrown within run_in_background
2018-04-30 10:23:02 +01:00
Richard van der Hoff
db75c86e84
Merge branch 'develop' into py3-xrange-1 2018-04-30 01:02:25 +01:00
Richard van der Hoff
049b0b5af2
Merge pull request #3154 from NotAFile/py3-stringio
Replace stringIO imports with six
2018-04-30 00:59:04 +01:00
Richard van der Hoff
dbf6f28d64
Merge pull request #3155 from NotAFile/py3-bytes-1
more bytes strings
2018-04-30 00:38:21 +01:00