Commit Graph

635 Commits

Author SHA1 Message Date
Erik Johnston
f112cfe5bb
Fix MultiWriteIdGenerator's handling of restarts. (#8374)
On startup `MultiWriteIdGenerator` fetches the maximum stream ID for
each instance from the table and uses that as its initial "current
position" for each writer. This is problematic as a) it involves either
a scan of events table or an index (neither of which is ideal), and b)
if rows are being persisted out of order elsewhere while the process
restarts then using the maximum stream ID is not correct. This could
theoretically lead to race conditions where e.g. events that are
persisted out of order are not sent down sync streams.

We fix this by creating a new table that tracks the current positions of
each writer to the stream, and update it each time we finish persisting
a new entry. This is a relatively small overhead when persisting events.
However for the cache invalidation stream this is a much bigger relative
overhead, so instead we note that for invalidation we don't actually
care about reliability over restarts (as there's no caches to
invalidate) and simply don't bother reading and writing to the new table
in that particular case.
2020-09-24 16:53:51 +01:00
Erik Johnston
ac11fcbbb8
Add EventStreamPosition type (#8388)
The idea is to remove some of the places we pass around `int`, where it can represent one of two things:

1. the position of an event in the stream; or
2. a token that partitions the stream, used as part of the stream tokens.

The valid operations are then:

1. did a position happen before or after a token;
2. get all events that happened before or after a token; and
3. get all events between two tokens.

(Note that we don't want to allow other operations as we want to change the tokens to be vector clocks rather than simple ints)
2020-09-24 13:24:17 +01:00
Patrick Cloke
8a4a4186de
Simplify super() calls to Python 3 syntax. (#8344)
This converts calls like super(Foo, self) -> super().

Generated with:

    sed -i "" -Ee 's/super\([^\(]+\)/super()/g' **/*.py
2020-09-18 09:56:44 -04:00
Jonathan de Jong
a3f124b821
Switch metaclass initialization to python 3-compatible syntax (#8326) 2020-09-16 15:15:55 -04:00
Patrick Cloke
aec294ee0d
Use slots in attrs classes where possible (#8296)
slots use less memory (and attribute access is faster) while slightly
limiting the flexibility of the class attributes. This focuses on objects
which are instantiated "often" and for short periods of time.
2020-09-14 12:50:06 -04:00
Patrick Cloke
d2a3eb04a4 Fix typos in comments. 2020-09-14 11:46:58 -04:00
Erik Johnston
04cc249b43
Add experimental support for sharding event persister. Again. (#8294)
This is *not* ready for production yet. Caveats:

1. We should write some tests...
2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.
2020-09-14 10:16:41 +01:00
Erik Johnston
5d3e306d9f
Clean up Notifier.on_new_room_event code path (#8288)
The idea here is that we pass the `max_stream_id` to everything, and only use the stream ID of the particular event to figure out *when* the max stream position has caught up to the event and we can notify people about it.

This is to maintain the distinction between the position of an item in the stream (i.e. event A has stream ID 513) and a token that can be used to partition the stream (i.e. give me all events after stream ID 352). This distinction becomes important when the tokens are more complicated than a single number, which they will be once we start tracking the position of multiple writers in the tokens.

The valid operations here are:

1. Is a position before or after a token
2. Fetching all events between two tokens
3. Merging multiple tokens to get the "max", i.e. `C = max(A, B)` means that for all positions P where P is before A *or* before B, then P is before C.

Future PR will change the token type to a dedicated type.
2020-09-10 13:24:43 +01:00
Patrick Cloke
2ea1c68249
Remove some unused distributor signals (#8216)
Removes the `user_joined_room` and stops calling it since there are no observers.

Also cleans-up some other unused signals and related code.
2020-09-09 12:22:00 -04:00
Erik Johnston
c9dbee50ae
Fixup pusher pool notifications (#8287)
`pusher_pool.on_new_notifications` expected a min and max stream ID, however that was not what we were passing in. Instead, let's just pass it the current max stream ID and have it track the last stream ID it got passed.

I believe that it mostly worked as we called the function for every event. However, it would break for events that got persisted out of order, i.e, that were persisted but the max stream ID wasn't incremented as not all preceding events had finished persisting, and push for that event would be delayed until another event got pushed to the effected users.
2020-09-09 16:56:08 +01:00
Erik Johnston
dc9dcdbd59 Revert "Fixup pusher pool notifications"
This reverts commit e7fd336a53.
2020-09-09 16:19:22 +01:00
Erik Johnston
e7fd336a53 Fixup pusher pool notifications 2020-09-09 16:17:50 +01:00
Patrick Cloke
c619253db8
Stop sub-classing object (#8249) 2020-09-04 06:54:56 -04:00
Brendan Abolivier
9f8abdcc38
Revert "Add experimental support for sharding event persister. (#8170)" (#8242)
* Revert "Add experimental support for sharding event persister. (#8170)"

This reverts commit 82c1ee1c22.

* Changelog
2020-09-04 10:19:42 +01:00
Erik Johnston
82c1ee1c22
Add experimental support for sharding event persister. (#8170)
This is *not* ready for production yet. Caveats:

1. We should write some tests...
2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.
2020-09-02 15:48:37 +01:00
Richard van der Hoff
aa07c37cf0
Move and rename get_devices_with_keys_by_user (#8204)
* Move `get_devices_with_keys_by_user` to `EndToEndKeyWorkerStore`

this seems a better fit for it.

This commit simply moves the existing code: no other changes at all.

* Rename `get_devices_with_keys_by_user`

to better reflect what it does.

* get_device_stream_token abstract method

To avoid referencing fields which are declared in the derived classes, make
`get_device_stream_token` abstract, and define that in the classes which define
`_device_list_id_gen`.
2020-09-01 12:41:21 +01:00
Erik Johnston
3b4556cf87
Fix wait_for_stream_position for multiple waiters. (#8196)
This fixes a bug where having multiple callers waiting on the same
stream and position will cause it to try and compare two deferreds,
which fails (due to the sorted list having an entry of `Tuple[int,
Deferred]`).
2020-08-28 17:12:45 +01:00
Erik Johnston
e3c91a3c55
Make SlavedIdTracker.advance have same interface as MultiWriterIDGenerator (#8171) 2020-08-26 13:15:20 +01:00
Erik Johnston
c9c544cda5
Remove ChainedIdGenerator. (#8123)
It's just a thin wrapper around two ID gens to make `get_current_token`
and `get_next` return tuples. This can easily be replaced by calling the
appropriate methods on the underlying ID gens directly.
2020-08-19 13:41:51 +01:00
Patrick Cloke
eebf52be06
Be stricter about JSON that is accepted by Synapse (#8106) 2020-08-19 07:26:03 -04:00
Erik Johnston
76d21d14a0
Separate get_current_token into two. (#8113)
The function is used for two purposes: 1) for subscribers of streams to
get a token they can use to get further updates with, and 2) for
replication to track position of the writers of the stream.

For streams with a single writer the two scenarios produce the same
result, however the situation becomes complicated for streams with
multiple writers. The current `MultiWriterIdGenerator` does not
correctly handle the first case (which is not an issue as its only used
for the `caches` stream which nothing subscribes to outside of
replication).
2020-08-19 10:39:31 +01:00
Patrick Cloke
ac77cdb64e
Add a shadow-banned flag to users. (#8092) 2020-08-14 12:37:59 -04:00
David Vo
4dd27e6d11
Reduce unnecessary whitespace in JSON. (#7372) 2020-08-07 08:02:55 -04:00
Patrick Cloke
d4a7829b12
Convert synapse.api to async/await (#8031) 2020-08-06 08:30:06 -04:00
Erik Johnston
a7bdf98d01
Rename database classes to make some sense (#8033) 2020-08-05 21:38:57 +01:00
Patrick Cloke
3b415e23a5
Convert replication code to async/await. (#7987) 2020-08-03 07:12:55 -04:00
Richard van der Hoff
349119a340 Synapse 1.18.0rc2 (2020-07-28)
==============================
 
 Bugfixes
 --------
 
 - Fix an `AssertionError` exception introduced in v1.18.0rc1. ([\#7876](https://github.com/matrix-org/synapse/issues/7876))
 - Fix experimental support for moving typing off master when worker is restarted, which is broken in v1.18.0rc1. ([\#7967](https://github.com/matrix-org/synapse/issues/7967))
 
 Internal Changes
 ----------------
 
 - Further optimise queueing of inbound replication commands. ([\#7876](https://github.com/matrix-org/synapse/issues/7876))
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEv27Axt/F4vrTL/8QOSor00I9eP8FAl8f/f8ACgkQOSor00I9
 eP8/Uwf8CiVWvrBsmFZMvxJDkUWm0/f1kN4IQdm8ibDtyNyvFUx+Y1K8KOQS+VwG
 a3bZqSC2Vv2sO9O9kR+V2tk831l+ujO0Nlaohuqyvhcl9lzh04rRYI9x9IHlAq2H
 WPb0NMLwMufL6YkXDBwZT/G9TVW1vLRGASu4f7X2rXqek34VNVgYbg1hB2dp4dDa
 wjKk3iBZ6h34IhKPgu0sLBUcyvX4U5xdOHjEG3HXvNnvDNO0HMD8rGB7065vFMD6
 PH4nUK/h+RL0UBs2sJOMK1ZazFUODdURwANJQNAQ6pNvf9/RWgw2okka2bYIcmQQ
 UT7tiwMsBvKdy4PER5fcDX3COY16qw==
 =Q+bI
 -----END PGP SIGNATURE-----

Merge tag 'v1.18.0rc2' into develop

Synapse 1.18.0rc2 (2020-07-28)
==============================

Bugfixes
--------

- Fix an `AssertionError` exception introduced in v1.18.0rc1. ([\#7876](https://github.com/matrix-org/synapse/issues/7876))
- Fix experimental support for moving typing off master when worker is restarted, which is broken in v1.18.0rc1. ([\#7967](https://github.com/matrix-org/synapse/issues/7967))

Internal Changes
----------------

- Further optimise queueing of inbound replication commands. ([\#7876](https://github.com/matrix-org/synapse/issues/7876))
2020-07-28 11:31:31 +01:00
Erik Johnston
a8f7ed28c6
Typing worker needs to handle stream update requests (#7967)
IIRC this doesn't break tests because its only hit on reconnection, or something.

Basically, when a process needs to fetch missing updates for the `typing` stream it needs to query the writer instance via HTTP (as we don't write typing notifications to the DB), the problem was that the endpoint (`streams`) was only registered on master and specifically not on the typing writer worker.
2020-07-28 11:04:53 +01:00
Richard van der Hoff
f57b99af22
Handle replication commands synchronously where possible (#7876)
Most of the stuff we do for replication commands can be done synchronously. There's no point spinning up background processes if we're not going to need them.
2020-07-27 18:54:43 +01:00
Patrick Cloke
8553f46498
Convert a synapse.events to async/await. (#7949) 2020-07-27 13:40:22 -04:00
Erik Johnston
84d099ae11
Fix typing replication not being handled on master (#7959)
Handling of incoming typing stream updates from replication was not
hooked up on master, effecting set ups where typing was handled on a
different worker.

This is really only a problem if the master process is also handling
sync requests, which is unlikely for those that are at the stage of
moving typing off.

The other observable effect is that if a worker restarts or a
replication connect drops then the typing worker will issue a
`POSITION typing`, triggering master process to try and stream *all*
typing updates from position 0.

Fixes #7907
2020-07-27 14:10:53 +01:00
Richard van der Hoff
931b026844
Remove an unused prometheus metric (#7878) 2020-07-22 00:40:55 +01:00
Richard van der Hoff
05060e0223
Track command processing as a background process (#7879)
I'm going to be doing more stuff synchronously, and I don't want to lose the
CPU metrics down the sofa.
2020-07-22 00:40:42 +01:00
Karthikeyan Singaravelan
a7b06a81f0
Fix deprecation warning: import ABC from collections.abc (#7892) 2020-07-20 13:33:04 -04:00
Erik Johnston
2d2acc1cf2
Stop using 'device_max_stream_id' (#7882)
It serves no purpose and updating everytime we write to the device inbox
stream means all such transactions will conflict, causing lots of
transaction failures and retries.
2020-07-17 17:03:27 +01:00
Richard van der Hoff
e5300063ed
Optimise queueing of inbound replication commands (#7861)
When we get behind on replication, we tend to stack up background processes
behind a linearizer. Bg processes are heavy (particularly with respect to
prometheus metrics) and linearizers aren't terribly efficient once the queue
gets long either.

A better approach is to maintain a queue of requests to be processed, and
nominate a single process to work its way through the queue.

Fixes: #7444
2020-07-16 15:49:37 +01:00
Erik Johnston
f2e38ca867
Allow moving typing off master (#7869) 2020-07-16 15:12:54 +01:00
Erik Johnston
f299441cc6
Add ability to shard the federation sender (#7798) 2020-07-10 18:26:36 +01:00
Patrick Cloke
38e1fac886
Fix some spelling mistakes / typos. (#7811) 2020-07-09 09:52:58 -04:00
Richard van der Hoff
2ab0b021f1
Generate real events when we reject invites (#7804)
Fixes #2181. 

The basic premise is that, when we
fail to reject an invite via the remote server, we can generate our own
out-of-band leave event and persist it as an outlier, so that we have something
to send to the client.
2020-07-09 10:40:19 +01:00
Patrick Cloke
e7efd8f827
Do not use simplejson in Synapse. (#7800) 2020-07-08 07:15:08 -04:00
Erik Johnston
67d7756fcf
Refactor getting replication updates from database v2. (#7740) 2020-07-07 12:11:35 +01:00
Will Hunt
62b1ce8539
isort 5 compatibility (#7786)
The CI appears to use the latest version of isort, which is a problem when isort gets a major version bump. Rather than try to pin the version, I've done the necessary to make isort5 happy with synapse.
2020-07-05 16:32:02 +01:00
Erik Johnston
5cdca53aa0
Merge different Resource implementation classes (#7732) 2020-07-03 19:02:19 +01:00
Richard van der Hoff
f01e2ca039
Use symbolic names for replication stream names (#7768)
This makes it much easier to find where streams are referenced.
2020-07-01 16:35:40 +01:00
Erik Johnston
f6f7511a4c
Refactor getting replication updates from database. (#7636)
The aim here is to make it easier to reason about when streams are limited and when they're not, by moving the logic into the database functions themselves. This should mean we can kill of `db_query_to_update_function` function.
2020-06-16 17:10:28 +01:00
Dagfinn Ilmari Mannsåker
a3f11567d9
Replace all remaining six usage with native Python 3 equivalents (#7704) 2020-06-16 08:51:47 -04:00
Patrick Cloke
7d2532be36
Discard RDATA from already seen positions. (#7648) 2020-06-15 08:44:54 -04:00
Erik Johnston
664409b169
Fix bug in account data replication stream. (#7656)
* Ensure account data stream IDs are unique.

The account data stream is shared between three tables, and the maximum
allocated ID was tracked in a dedicated table. Updating the max ID
happened outside the transaction that allocated the ID, leading to a
race where if the server was restarted then the same ID could be
allocated but the max ID failed to be updated, leading it to be reused.

The ID generators have support for tracking across multiple tables, so
we may as well use that instead of a dedicated table.

* Fix bug in account data replication stream.

If the same stream ID was used in both global and room account data then
the getting updates for the replication stream would fail due to
`heapq.merge(..)` trying to compare a `str` with a `None`. (This is
because you'd have two rows like `(534, '!room')` and `(534, None)` from
the room and global account data tables).

Fix is just to order by stream ID, since we don't rely on the ordering
beyond that. The bug where stream IDs can be reused should be fixed now,
so this case shouldn't happen going forward.

Fixes #7617
2020-06-09 16:28:57 +01:00
Patrick Cloke
f1e61ef85c Typo fixes. 2020-06-05 08:43:21 -04:00
Erik Johnston
9bac5d62b3
Ensure ReplicationStreamer is always started when replication enabled. (#7579)
Fixes #7566.
2020-05-27 11:44:19 +01:00
Erik Johnston
e5c67d04db
Add option to move event persistence off master (#7517) 2020-05-22 16:11:35 +01:00
Erik Johnston
1531b214fc
Add ability to wait for replication streams (#7542)
The idea here is that if an instance persists an event via the replication HTTP API it can return before we receive that event over replication, which can lead to races where code assumes that persisting an event immediately updates various caches (e.g. current state of the room).

Most of Synapse doesn't hit such races, so we don't do the waiting automagically, instead we do so where necessary to avoid unnecessary delays. We may decide to change our minds here if it turns out there are a lot of subtle races going on.

People probably want to look at this commit by commit.
2020-05-22 14:21:54 +01:00
Erik Johnston
51055c8c44
Allow ReplicationRestResource to be added to workers (#7515)
This allows workers to talk to each other over HTTP replication.
2020-05-18 12:24:48 +01:00
Richard van der Hoff
4d1afb1dfe
Merge pull request #7519 from matrix-org/rav/kill_py2_code
Kill off some old python 2 code
2020-05-18 10:45:30 +01:00
Richard van der Hoff
91f51c611c remove redundant __func__
this is a no-op under python 3
2020-05-15 19:37:41 +01:00
Richard van der Hoff
6c1f7c722f
Fix limit logic for AccountDataStream (#7384)
Make sure that the AccountDataStream presents complete updates, in the right
order.

This is much the same fix as #7337 and #7358, but applied to a different stream.
2020-05-15 19:03:25 +01:00
Erik Johnston
1f36ff69e8
Move event stream handling out of slave store. (#7491)
This allows us to have the logic on both master and workers, which is necessary to move event persistence off master.

We also combine the instantiation of ID generators from DataStore and slave stores to the base worker stores. This allows us to select which process writes events independently of the master/worker splits.
2020-05-15 16:43:59 +01:00
Erik Johnston
4734a7bbe4
Move EventStream handling into default ReplicationDataHandler (#7493)
This is so that the logic can happen on both master and workers when we move event persistence out.
2020-05-14 14:01:39 +01:00
Erik Johnston
1de36407d1
Add instance_map config and route replication calls (#7495) 2020-05-14 14:00:58 +01:00
Erik Johnston
7ee24c5674
Have all instances correctly respond to REPLICATE command. (#7475)
Before all streams were only written to from master, so only master needed to respond to `REPLICATE` commands.

Before all instances wrote to the cache invalidation stream, but didn't respond to `REPLICATE`. This was a bug, which could lead to missed rows from cache invalidation stream if an instance is restarted, however all the caches would be empty in that case so it wasn't a problem.
2020-05-13 10:27:02 +01:00
Erik Johnston
8ca79613e6
Fix Redis reconnection logic (#7482)
Proactively send out `POSITION` commands (as if we had just received a `REPLICATE`) when we connect to Redis. This is important as other instances won't notice we've connected to issue a `REPLICATE` command (unlike for direct TCP connections). This is only currently an issue if master process reconnects without restarting (if it restarts then it won't have written anything and so other instances probably won't have missed anything).
2020-05-13 09:57:15 +01:00
Amber Brown
7cb8b4bc67
Allow configuration of Synapse's cache without using synctl or environment variables (#6391) 2020-05-11 18:45:23 +01:00
Andrew Morgan
5cf758cdd6 Merge branch 'release-v1.13.0' into develop
* release-v1.13.0:
  Don't UPGRADE database rows
  RST indenting
  Put rollback instructions in upgrade notes
  Fix changelog typo
  Oh yeah, RST
  Absolute URL it is then
  Fix upgrade notes link
  Provide summary of upgrade issues in changelog. Fix )
  Move next version notes from changelog to upgrade notes
  Changelog fixes
  1.13.0rc1
  Documentation on setting up redis (#7446)
  Rework UI Auth session validation for registration (#7455)
  Fix errors from malformed log line (#7454)
  Drop support for redis.dbid (#7450)
2020-05-11 16:46:33 +01:00
Richard van der Hoff
aa5aa6f96a
Fix errors from malformed log line (#7454) 2020-05-07 19:51:38 +01:00
Richard van der Hoff
da9b2db3af
Drop support for redis.dbid (#7450)
Since we only use pubsub, the dbid is irrelevant.
2020-05-07 16:46:15 +01:00
Erik Johnston
d7983b63a6
Support any process writing to cache invalidation stream. (#7436) 2020-05-07 13:51:08 +01:00
Richard van der Hoff
62ee862119 Merge branch 'release-v1.13.0' into develop 2020-05-06 15:56:03 +01:00
Richard van der Hoff
2e0c46ca07 Merge branch 'release-v1.13.0' into develop 2020-05-06 11:58:31 +01:00
Richard van der Hoff
a8c17da245 Merge branch 'release-v1.13.0' into rav/fix_dropped_messages 2020-05-05 23:01:12 +01:00
Richard van der Hoff
1242267316 Merge branch 'release-v1.13.0' into rav/fix_dropped_messages 2020-05-05 22:38:44 +01:00
Richard van der Hoff
7f7eedbebb Wait for a POSITION on the right connection before accepting RDATA
... otherwise we can believe we're up to date when we're not.
2020-05-05 22:38:16 +01:00
Brendan Abolivier
5b8023dc7f
Move logs about discarded RDATA to debug (#7421) 2020-05-05 21:07:33 +02:00
Richard van der Hoff
d78265af0c Wait to subscribe before sending REPLICATE 2020-05-05 19:31:37 +01:00
Richard van der Hoff
d5aa7d93ed
Fix catchup-on-reconnect for the Federation Stream (#7374)
looks like we managed to break this during the refactorathon.
2020-05-05 14:15:57 +01:00
Erik Johnston
350421e058
Fix redis password support. (#7401)
We forgot to set the password on the subscriber connection, as well as
not calling super methods for overridden connectionMade/connectionLost
functions.
2020-05-04 14:04:09 +01:00
Erik Johnston
0e719f2398
Thread through instance name to replication client. (#7369)
For in memory streams when fetching updates on workers we need to query the source of the stream, which currently is hard coded to be master. This PR threads through the source instance we received via `POSITION` through to the update function in each stream, which can then be passed to the replication client for in memory streams.
2020-05-01 17:19:56 +01:00
Erik Johnston
3085cde577
Use stream.current_token() and remove stream_positions() (#7172)
We move the processing of typing and federation replication traffic into their handlers so that `Stream.current_token()` points to a valid token. This allows us to remove `get_streams_to_replicate()` and `stream_positions()`.
2020-05-01 15:21:35 +01:00
Richard van der Hoff
b2dba06079
Workaround for assertion errors from db_query_to_update_function (#7378)
Hopefully this is no worse than what we have on master...
2020-05-01 09:25:16 +01:00
Erik Johnston
37f6823f5b
Add instance name to RDATA/POSITION commands (#7364)
This is primarily for allowing us to send those commands from workers, but for now simply allows us to ignore echoed RDATA/POSITION commands that we sent (we get echoes of sent commands when using redis). Currently we log a WARNING on the master process every time we receive an echoed RDATA.
2020-04-29 16:23:08 +01:00
Erik Johnston
3eab76ad43
Don't relay REMOTE_SERVER_UP cmds to same conn. (#7352)
For direct TCP connections we need the master to relay REMOTE_SERVER_UP
commands to the other connections so that all instances get notified
about it. The old implementation just relayed to all connections,
assuming that sending back to the original sender of the command was
safe. This is not true for redis, where commands sent get echoed back to
the sender, which was causing master to effectively infinite loop
sending and then re-receiving REMOTE_SERVER_UP commands that it sent.

The fix is to ensure that we only relay to *other* connections and not
to the connection we received the notification from.

Fixes #7334.
2020-04-29 14:10:59 +01:00
Richard van der Hoff
c2e1a2110f
Fix limit logic for EventsStream (#7358)
* Factor out functions for injecting events into database

I want to add some more flexibility to the tools for injecting events into the
database, and I don't want to clutter up HomeserverTestCase with them, so let's
factor them out to a new file.

* Rework TestReplicationDataHandler

This wasn't very easy to work with: the mock wrapping was largely superfluous,
and it's useful to be able to inspect the received rows, and clear out the
received list.

* Fix AssertionErrors being thrown by EventsStream

Part of the problem was that there was an off-by-one error in the assertion,
but also the limit logic was too simple. Fix it all up and add some tests.
2020-04-29 12:30:36 +01:00
Erik Johnston
38919b521e
Run replication streamers on workers (#7146)
Currently we never write to streams from workers, but that will change soon
2020-04-28 13:34:12 +01:00
Richard van der Hoff
ce428a1abe Fix EventsStream raising assertions when it falls behind
Figuring out how to correctly limit updates from this stream without dropping
entries is far more complicated than just counting the number of rows being
returned. We need to consider each query separately and, if any one query hits
the limit, truncate the results from the others.

I think this also fixes some potentially long-standing bugs where events or
state changes could get missed if we hit the limit on either query.
2020-04-24 13:59:21 +01:00
Richard van der Hoff
9cbdfb3a2f Make it clear that the limit for an update_function is a target 2020-04-23 15:45:12 +01:00
Richard van der Hoff
23b28266ac Remove 'limit' param from get_repl_stream_updates API
there doesn't seem to be much point in passing this limit all around, since
both sides agree it's meant to be 100.
2020-04-23 15:44:35 +01:00
Richard van der Hoff
71a1abb8a1
Stop the master relaying USER_SYNC for other workers (#7318)
Long story short: if we're handling presence on the current worker, we shouldn't be sending USER_SYNC commands over replication.

In an attempt to figure out what is going on here, I ended up refactoring some bits of the presencehandler code, so the first 4 commits here are non-functional refactors to move this code slightly closer to sanity. (There's still plenty to do here :/). Suggest reviewing individual commits.

Fixes (I hope) #7257.
2020-04-22 22:39:04 +01:00
Erik Johnston
841c581c40
Fix replication metrics when using redis (#7325) 2020-04-22 16:26:19 +01:00
Richard van der Hoff
82d8b1dd1f
Another go at fixing one-word commands (#7326)
I messed this up last time I tried (#7239 / e13c6c7).
2020-04-22 14:34:31 +01:00
Erik Johnston
51f7eaf908
Add ability to run replication protocol over redis. (#7040)
This is configured via the `redis` config options.
2020-04-22 13:07:41 +01:00
Richard van der Hoff
0f8f02bc39
On catchup, process each row with its own stream id (#7286)
Other parts of the code (such as the StreamChangeCache) assume that there will
not be multiple changes with the same stream id.

This code was introduced in #7024, and I hope this fixes #7206.
2020-04-20 11:43:29 +01:00
Richard van der Hoff
67ff7b8ba0
Improve type checking in replication.tcp.Stream (#7291)
The general idea here is to get rid of the type: ignore annotations on all of the current_token and update_function assignments, which would have caught #7290.

After a bit of experimentation, it seems like the least-awful way to do this is to pass the offending functions in as parameters to the Stream constructor. Unfortunately that means that the concrete implementations no longer have the same constructor signature as Stream itself, which means that it gets hard to correctly annotate STREAMS_MAP.

I've also introduced a couple of new types, to take out some duplication.
2020-04-17 14:49:55 +01:00
Richard van der Hoff
d7d42387f5
Fix 'generator object is not subscriptable' error (#7290)
Some of the query functions return generators rather than lists, so we can't
index into the result. Happily we already have a copy of the results.

(think this was introduced in #7024)
2020-04-16 14:37:06 +01:00
Richard van der Hoff
e13c6c7a96 Handle one-word replication commands correctly
`REPLICATE` is now a valid command, and it's nice if you can issue it from the
console without remembering to call it `REPLICATE ` with a trailing space.
2020-04-07 17:43:46 +01:00
Richard van der Hoff
c3e4b4edb2 Fix warnings about not calling superclass constructor
Separate `SimpleCommand` from `Command`, so that things which don't want to use
the `data` property don't have to, and thus fix the warnings PyCharm was giving
me about not calling `__init__` in the base class.
2020-04-07 17:40:22 +01:00
Richard van der Hoff
6a519a0ca0 Remove vestigal references to SYNC replication command
We've ripped pretty much all of this out: let's remove the remains.
2020-04-07 17:40:07 +01:00
Erik Johnston
ce72355d7f
Fix race in replication (#7226)
Fixes a race between handling `POSITION` and `RDATA` commands. We do this by simply linearizing handling of them.
2020-04-07 11:01:04 +01:00
Erik Johnston
82498ee901
Move server command handling out of TCP protocol (#7187)
This completes the merging of server and client command processing.
2020-04-07 10:51:07 +01:00
Erik Johnston
5016b162fc
Move client command handling out of TCP protocol (#7185)
The aim here is to move the command handling out of the TCP protocol classes and to also merge the client and server command handling (so that we can reuse them for redis protocol). This PR simply moves the client paths to the new `ReplicationCommandHandler`, a future PR will move the server paths too.
2020-04-06 09:58:42 +01:00
Erik Johnston
dfa0782254
Remove connections per replication stream metric. (#7195)
This broke in a recent PR (#7024) and is no longer useful due to all
replication clients implicitly subscribing to all streams, so let's
just remove it.
2020-04-01 10:40:46 +01:00
Erik Johnston
4f21c33be3
Remove usage of "conn_id" for presence. (#7128)
* Remove `conn_id` usage for UserSyncCommand.

Each tcp replication connection is assigned a "conn_id", which is used
to give an ID to a remotely connected worker. In a redis world, there
will no longer be a one to one mapping between connection and instance,
so instead we need to replace such usages with an ID generated by the
remote instances and included in the replicaiton commands.

This really only effects UserSyncCommand.

* Add CLEAR_USER_SYNCS command that is sent on shutdown.

This should help with the case where a synchrotron gets restarted
gracefully, rather than rely on 5 minute timeout.
2020-03-30 16:37:24 +01:00
Erik Johnston
4cff617df1
Move catchup of replication streams to worker. (#7024)
This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
2020-03-25 14:54:01 +00:00
Richard van der Hoff
a564b92d37
Convert *StreamRow classes to inner classes (#7116)
This just helps keep the rows closer to their streams, so that it's easier to
see what the format of each stream is.
2020-03-23 13:59:11 +00:00
Richard van der Hoff
b3cee0ce67
Fix processing of groups stream, and use symbolic names for streams (#7117)
`groups` != `receipts`

Introduced in #6964
2020-03-23 11:39:36 +00:00
Erik Johnston
fdb1344716
Remove concept of a non-limited stream. (#7011) 2020-03-20 14:40:47 +00:00
Erik Johnston
a319cb1dd1
Change device list streams to have one row per ID (#7010)
* Add 'device_lists_outbound_pokes' as extra table.

This makes sure we check all the relevant tables to get the current max
stream ID.

Currently not doing so isn't problematic as the max stream ID in
`device_lists_outbound_pokes` is the same as in `device_lists_stream`,
however that will change.

* Change device lists stream to have one row per id.

This will make it possible to process the streams more incrementally,
avoiding having to process large chunks at once.

* Change device list replication to match new semantics.

Instead of sending down batches of user ID/host tuples, send down a row
per entity (user ID or host).

* Newsfile

* Remove handling of multiple rows per ID

* Fix worker handling

* Comments from review
2020-03-19 11:36:53 +00:00
Erik Johnston
6e6476ef07 Comments from review 2020-03-18 10:13:55 +00:00
Richard van der Hoff
78a15b1f9d
Store room_versions in EventBase objects (#6875)
This is a bit fiddly because it all has to be done on one fell swoop:

* Wherever we create a new event, pass in the room version (and check it matches the format version)
* When we prune an event, use the room version of the unpruned event to create the pruned version.
* When we pass an event over the replication protocol, pass the room version over alongside it, and use it when deserialising the event again.
2020-03-05 15:46:44 +00:00
Erik Johnston
9ce4e344a8 Change device list replication to match new semantics.
Instead of sending down batches of user ID/host tuples, send down a row
per entity (user ID or host).
2020-02-28 11:25:34 +00:00
Erik Johnston
c3c6c0e622 Add 'device_lists_outbound_pokes' as extra table.
This makes sure we check all the relevant tables to get the current max
stream ID.

Currently not doing so isn't problematic as the max stream ID in
`device_lists_outbound_pokes` is the same as in `device_lists_stream`,
however that will change.
2020-02-28 11:15:11 +00:00
Richard van der Hoff
3e99528f2b
Store room version on invite (#6983)
When we get an invite over federation, store the room version in the rooms table.

The general idea here is that, when we pull the invite out again, we'll want to know what room_version it belongs to (so that we can later redact it if need be). So we need to store it somewhere...
2020-02-26 16:58:33 +00:00
Erik Johnston
1f773eec91
Port PresenceHandler to async/await (#6991) 2020-02-26 15:33:26 +00:00
Erik Johnston
bbf8886a05
Merge worker apps into one. (#6964) 2020-02-25 16:56:55 +00:00
Erik Johnston
0bd8cf435e Increase MAX_EVENTS_BEHIND for replication clients 2020-02-21 09:04:33 +00:00
Erik Johnston
de2d267375
Allow moving group read APIs to workers (#6866) 2020-02-07 11:14:19 +00:00
Erik Johnston
c3d4ad8afd
Fix sending server up commands from workers (#6811)
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
2020-01-30 16:42:11 +00:00
Erik Johnston
e17a110661
Detect unknown remote devices and mark cache as stale (#6776)
We just mark the fact that the cache may be stale in the database for
now.
2020-01-28 14:43:21 +00:00
Erik Johnston
d5275fc55f
Propagate cache invalidates from workers to other workers. (#6748)
Currently if a worker invalidates a cache it will be streamed to master, which then didn't forward those to other workers.
2020-01-27 13:47:50 +00:00
Erik Johnston
5d7a6ad223
Allow streaming cache invalidate all to workers. (#6749) 2020-01-22 10:37:00 +00:00
Erik Johnston
a8a50f5b57
Wake up transaction queue when remote server comes back online (#6706)
This will be used to retry outbound transactions to a remote server if
we think it might have come back up.
2020-01-17 10:27:19 +00:00
Erik Johnston
48c3a96886
Port synapse.replication.tcp to async/await (#6666)
* Port synapse.replication.tcp to async/await

* Newsfile

* Correctly document type of on_<FOO> functions as async

* Don't be overenthusiastic with the asyncing....
2020-01-16 09:16:12 +00:00
Erik Johnston
28c98e51ff
Add local_current_membership table (#6655)
Currently we rely on `current_state_events` to figure out what rooms a
user was in and their last membership event in there. However, if the
server leaves the room then the table may be cleaned up and that
information is lost. So lets add a table that separately holds that
information.
2020-01-15 14:59:33 +00:00
Erik Johnston
e8b68a4e4b
Fixup synapse.replication to pass mypy checks (#6667) 2020-01-14 14:08:06 +00:00
Richard van der Hoff
6964ea095b
Reduce the reconnect time when replication fails. (#6617) 2020-01-03 14:19:09 +00:00
Erik Johnston
fa780e9721
Change EventContext to use the Storage class (#6564) 2019-12-20 10:32:02 +00:00
Erik Johnston
9a4fb457cf Change DataStores to accept 'database' param. 2019-12-06 13:30:06 +00:00
Erik Johnston
a7f20500ff _CURRENT_STATE_CACHE_NAME is public 2019-12-04 15:45:42 +00:00
Erik Johnston
1056d6885a Move cache invalidation to main data store 2019-12-04 15:21:14 +00:00
Erik Johnston
2173785f0d Propagate reason in remotely rejected invites 2019-11-28 11:31:56 +00:00
Andrew Morgan
a8175d0f96
Prevent account_data content from being sent over TCP replication (#6333) 2019-11-26 13:58:39 +00:00
Erik Johnston
f9f1c8acbb
Merge pull request #6332 from matrix-org/erikj/query_devices_fix
Fix caching devices for remote servers in worker.
2019-11-26 12:56:05 +00:00
Erik Johnston
35f9165e96 Fixup docs 2019-11-26 12:04:48 +00:00
Andrew Morgan
cd96b4586f lint 2019-11-08 15:45:45 +00:00
Andrew Morgan
c4bdf2d785 Remove content from being sent for account data rdata stream 2019-11-08 15:44:02 +00:00
Andrew Morgan
1fe3cc2c9c Address review comments 2019-11-06 14:54:24 +00:00
Andrew Morgan
4059d61e26 Don't forget to ratelimit calls outside of RegistrationHandler 2019-11-06 12:01:54 +00:00
Erik Johnston
c16e192e2f Fix caching devices for remote servers in worker.
When the `/keys/query` API is hit on client_reader worker Synapse may
decide that it needs to resync some remote deivces. Usually this happens
on master, and then gets cached. However, that fails on workers and so
it falls back to fetching devices from remotes directly, which may in
turn fail if the remote is down.
2019-11-05 15:49:43 +00:00
Richard van der Hoff
cc6243b4c0
document the REPLICATE command a bit better (#6305)
since I found myself wonder how it works
2019-11-04 12:40:18 +00:00
Hubert Chathi
9c94b48bf1 Merge branch 'develop' into uhoreg/cross_signing_fix_workers_notify 2019-10-31 12:32:07 -04:00
Hubert Chathi
f7e4a582ef clean up code a bit 2019-10-31 12:01:00 -04:00
Andrew Morgan
54fef094b3
Remove usage of deprecated logger.warn method from codebase (#6271)
Replace every instance of `logger.warn` with `logger.warning` as the former is deprecated.
2019-10-31 10:23:24 +00:00
Hubert Chathi
998f7fe7d4 make user signatures a separate stream 2019-10-30 17:22:52 -04:00
Hubert Chathi
670972c0e1 Merge branch 'develop' into uhoreg/cross_signing_fix_workers_notify 2019-10-30 16:46:31 -04:00
Erik Johnston
e577a4b2ad Port replication http server endpoints to async/await 2019-10-29 13:00:51 +00:00
Hubert Chathi
8ac766c44a make notification of signatures work with workers 2019-10-24 22:14:58 -04:00
Erik Johnston
bb6264be0b Merge branch 'develop' of github.com:matrix-org/synapse into erikj/refactor_stores 2019-10-22 10:41:18 +01:00
Erik Johnston
c66a06ac6b Move storage classes into a main "data store".
This is in preparation for having multiple data stores that offer
different functionality, e.g. splitting out state or event storage.
2019-10-21 16:05:06 +01:00
Hubert Chathi
8e86f5b65c Merge branch 'develop' into uhoreg/e2e_cross-signing_merged 2019-09-07 13:20:34 -04:00
Jorik Schellekens
f7c873a643
Trace how long it takes for the send trasaction to complete, including retrys (#5986) 2019-09-05 17:44:55 +01:00
Jorik Schellekens
909827b422
Add opentracing to all client servlets (#5983) 2019-09-05 14:46:04 +01:00
Hubert Chathi
a22d58c96c add user signature stream change cache to slaved device store 2019-09-04 19:32:35 -04:00
Andrew Morgan
b736c6cd3a
Remove bind_email and bind_msisdn (#5964)
Removes the `bind_email` and `bind_msisdn` parameters from the `/register` C/S API endpoint as per [MSC2140: Terms of Service for ISes and IMs](https://github.com/matrix-org/matrix-doc/pull/2140/files#diff-c03a26de5ac40fb532de19cb7fc2aaf7R107).
2019-09-04 18:24:23 +01:00
Andrew Morgan
4548d1f87e
Remove unnecessary parentheses around return statements (#5931)
Python will return a tuple whether there are parentheses around the returned values or not.

I'm just sick of my editor complaining about this all over the place :)
2019-08-30 16:28:26 +01:00
Jorik Schellekens
812ed6b0d5
Opentracing across workers (#5771)
Propagate opentracing contexts across workers


Also includes some Convenience modifications to opentracing for servlets, notably:
- Add boolean to skip the whitelisting check on inject
  extract methods. - useful when injecting into carriers
  locally. Otherwise we'd always have to include our
  own servername and whitelist our servername
- start_active_span_from_request instead of header
- Add boolean to decide whether to extract context
  from a request to a servlet
2019-08-22 18:08:07 +01:00
Brendan Abolivier
1c5b8c6222 Revert "Add "require_consent" parameter for registration"
This reverts commit 3320aaab3a.
2019-08-22 14:47:34 +01:00
Half-Shot
3320aaab3a Add "require_consent" parameter for registration 2019-08-22 14:21:54 +01:00
Andrew Morgan
baf081cd3b Bugfixes
--------
 
 - Fix a regression introduced in v1.2.0rc1 which led to incorrect labels on some prometheus metrics. ([\#5734](https://github.com/matrix-org/synapse/issues/5734))
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEgQG31Z317NrSMt0QiISIDS7+X/QFAl04Ur0THGFuZHJld0Bh
 bW9yZ2FuLnh5egAKCRCIhIgNLv5f9F4oD/0TY6S/SEd2uAmzor64ojmbX5BOwPzf
 j/wzUTrfvuf40EvkNPDpnejNZSvy/ysbaGQaQusv0SQKlV3xrvdn4RuMvnOWVWck
 kBsO+lvzOaUTR0KHDxN4y9F5eI2NdPbub4847PPVzyqSIHAd+kolxXS8kSBBhwpL
 yfaICWV/AOy5L7xN+JZ9IQpnegVAvUj5DmgXzDHd6VdeiHDVJuARaBgrR5uCkwVS
 ZoLRqZ95XV/qiguMAUvPOwyEqht2mwO64989MswP16YYm8oMkB5QA6I5nYnACsTP
 qk9YcN/oNvEfQXUhttku6MxK1/4yUMPUhEoDBDH7ebc0440QDtWN+IHTdA6oPVZB
 IuStL9YGY16m7Ltx37ZUA4URfNMiSeLHo3zKc/mCAcwxN4HyOjJewtxbG5zKQAOZ
 SMs8UcDwGR4zL1hnt8ZDNYtWwfzJBQIdGjoHvjXJEY7/1csTv2lmAwewFTXiqSAr
 30GW5ews94kotqBK53zZT6V0F5gHNqgGHniOz1ZpqLLxYLqO3LSAGe97CrqlWUdX
 GkhA9tZyweknociD9fyyBmKdcFJ4mL4a+oGI5CMnSMph8UvCY8Y5XMb1T+iYEABI
 tA9G3mBvgkLPj+5V+8QggNkBafSigW2Q4FX7enGsDmiiskZOtfeKrAcVkapD4ooi
 3I7IW5aetZr2IQ==
 =+JBn
 -----END PGP SIGNATURE-----

Merge tag 'v1.2.0rc2' into develop

Bugfixes
--------

- Fix a regression introduced in v1.2.0rc1 which led to incorrect labels on some prometheus metrics. ([\#5734](https://github.com/matrix-org/synapse/issues/5734))
2019-07-24 13:47:51 +01:00
Jorik Schellekens
cf2972c818
Fix servlet metric names (#5734)
* Fix servlet metric names

Co-Authored-By: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>

* Remove redundant check

* Cover all return paths
2019-07-24 13:07:35 +01:00
Amber Brown
4806651744
Replace returnValue with return (#5736) 2019-07-23 23:00:55 +10:00
Richard van der Hoff
824707383b
Remove access-token support from RegistrationHandler.register (#5641)
Nothing uses this now, so we can remove the dead code, and clean up the
API.

Since we're changing the shape of the return value anyway, we take the
opportunity to give the method a better name.
2019-07-08 19:01:08 +01:00
Richard van der Hoff
80cc82a445
Remove support for invite_3pid_guest. (#5625)
This has never been documented, and I'm not sure it's ever been used outside
sytest.

It's quite a lot of poorly-maintained code, so I'd like to get rid of it.

For now I haven't removed the database table; I suggest we leave that for a
future clearout.
2019-07-05 16:47:58 +01:00
Amber Brown
463b072b12
Move logging utilities out of the side drawer of util/ and into logging/ (#5606) 2019-07-04 00:07:04 +10:00
Amber Brown
32e7c9e7f2
Run Black. (#5482) 2019-06-20 19:32:02 +10:00
Erik Johnston
6745b7de6d Handle failing to talk to master over replication 2019-06-07 10:47:31 +01:00
Erik Johnston
5dbff34509 Fixup bsaed on review comments 2019-05-17 15:48:04 +01:00
Erik Johnston
d46aab3fa8 Add basic editing support 2019-05-16 16:54:45 +01:00
Erik Johnston
b5c62c6b26 Fix relations in worker mode 2019-05-16 10:38:13 +01:00
Richard van der Hoff
f50efcb65d Replace SlavedKeyStore with a shim
since we're pulling everything out of KeyStore anyway, we may as well simplify
it.
2019-04-08 23:59:07 +01:00
Richard van der Hoff
3352baac4b
Remove unused server_tls_certificates functions (#5028)
These have been unused since #4120, and with the demise of perspectives, it is
unlikely that they will ever be used again.
2019-04-08 21:50:18 +01:00
Neil Johnson
e8419554ff
Remove presence lists (#4989)
Remove presence list support as per MSC 1819
2019-04-03 11:11:15 +01:00
Richard van der Hoff
297bf2547e
Fix sync bug when accepting invites (#4956)
Hopefully this time we really will fix #4422.

We need to make sure that the cache on
`get_rooms_for_user_with_stream_ordering` is invalidated *before* the
SyncHandler is notified for the new events, and we can now do so reliably via
the `events` stream.
2019-04-02 12:42:39 +01:00
Richard van der Hoff
4b91c313a9 Combine the CurrentStateDeltaStream into the EventStream 2019-03-27 22:07:05 +00:00
Richard van der Hoff
1f6d6f918a Make EventStream rows have a type
... as a precursor to combining it with the CurrentStateDelta stream.
2019-03-27 22:07:05 +00:00
Richard van der Hoff
015b3622eb Skip building a ROW_TYPE when building updates
We're about to turn it straight into a JSON object anyway so building a
ROW_TYPE is a bit pointless, and reduces flexibility in the update_function.
2019-03-27 21:58:03 +00:00
Richard van der Hoff
f570916a3e Add parse_row method to replication stream class
This will allow individual stream classes to override how a row is parsed.
2019-03-27 21:32:33 +00:00
Richard van der Hoff
71dcb275f1 move FederationStream out to its own file 2019-03-27 21:13:14 +00:00
Richard van der Hoff
aa1e017864 move EventsStream out to its own file 2019-03-27 21:13:14 +00:00
Richard van der Hoff
a5798de067 Move replication.tcp.streams into a package 2019-03-27 21:13:14 +00:00
Richard van der Hoff
acaa18f7dd
Fix/improve some docstrings in the replication code. (#4949) 2019-03-27 21:12:36 +00:00
Richard van der Hoff
8cbbedaa2b
Fix ClientReplicationStreamProtocol.__str__ (#4929)
`__str__` depended on `self.addr`, which was absent from
ClientReplicationStreamProtocol, so attempting to call str on such an object
would raise an exception.

We can calculate the peer addr from the transport, so there is no need for addr
anyway.
2019-03-25 16:41:51 +00:00
Richard van der Hoff
9bde730ef8
Fix bug where read-receipts lost their timestamps (#4927)
Make sure that they are sent correctly over the replication stream.

Fixes: #4898
2019-03-25 16:38:05 +00:00
Richard van der Hoff
cdb8036161
Add a config option for torture-testing worker replication. (#4902)
Setting this to 50 or so makes a bunch of sytests fail in worker mode.
2019-03-20 16:04:35 +00:00
Erik Johnston
face0c5b3c Prefill client IPs cache on workers 2019-03-06 17:39:32 +00:00
Andrew Morgan
7b8a157b79
Merge pull request #4792 from matrix-org/anoa/replication_tokens
Support batch updates in the worker sender
2019-03-06 15:48:29 +00:00
Brendan Abolivier
a4c3a361b7
Add rate-limiting on registration (#4735)
* Rate-limiting for registration

* Add unit test for registration rate limiting

* Add config parameters for rate limiting on auth endpoints

* Doc

* Fix doc of rate limiting function

Co-Authored-By: babolivier <contact@brendanabolivier.com>

* Incorporate review

* Fix config parsing

* Fix linting errors

* Set default config for auth rate limiting

* Fix tests

* Add changelog

* Advance reactor instead of mocked clock

* Move parameters to registration specific config and give them more sensible default values

* Remove unused config options

* Don't mock the rate limiter un MAU tests

* Rename _register_with_store into register_with_store

* Make CI happy

* Remove unused import

* Update sample config

* Fix ratelimiting test for py2

* Add non-guest test
2019-03-05 14:25:33 +00:00
Andrew Morgan
b9f6163092 Simplify token replication logic 2019-03-05 13:58:30 +00:00
Erik Johnston
a84b8d56c2 Fixup slave stores 2019-03-04 18:04:57 +00:00
Andrew Morgan
fe7bd23a85 Clean up logic and add comments 2019-03-04 15:08:15 +00:00
Andrew Morgan
9f7cdf3da1 Clearer branching, fix missing list clear 2019-03-04 14:36:52 +00:00
Andrew Morgan
5f0c449dd5 Prevent replication wedging 2019-03-04 14:03:18 +00:00
Erik Johnston
1e315017d3 When presence is enabled don't send over replication 2019-02-27 13:53:46 +00:00
Erik Johnston
7590e9fa28
Merge pull request #4749 from matrix-org/erikj/replication_connection_backoff
Fix tightloop over connecting to replication server
2019-02-27 11:00:59 +00:00
Erik Johnston
6bb1c028f1 Limit cache invalidation replication line length (#4748) 2019-02-27 10:28:37 +00:00
Erik Johnston
6870fc496f Move connecting logic into ClientReplicationStreamProtocol 2019-02-27 10:23:51 +00:00
Erik Johnston
25814921f1 Increase the max delay between retry attempts
Otherwise if you have many workers they can easily take out master with
their connection attempts
2019-02-26 15:12:33 +00:00
Erik Johnston
313987187e Fix tightloop over connecting to replication server
If the client failed to process incoming commands during the initial set
up of the replication connection it would immediately disconnect and
reconnect, resulting in a tightloop.

This can happen, for example, when subscribing to a stream that has a
row that is too long in the backlog.

The fix here is to not consider the connection successfully set up until
the client has succesfully subscribed and caught up with the streams.
This ensures that the retry logic timers aren't reset until then,
meaning that if an error does happen during start up the client will
continue backing off before retrying again.
2019-02-26 15:05:41 +00:00
Erik Johnston
80467bbac3 Fix state cache invalidation on workers 2019-02-22 14:38:14 +00:00
Erik Johnston
dbdc565dfd Fix registration on workers (#4682)
* Move RegistrationHandler init to HomeServer

* Move post registration actions to RegistrationHandler

* Add post regisration replication endpoint

* Newsfile
2019-02-20 18:47:31 +11:00
Erik Johnston
a9b5ea6fc1 Batch cache invalidation over replication
Currently whenever the current state changes in a room invalidate a lot
of caches, which cause *a lot* of traffic over replication. Instead,
lets batch up all those invalidations and send a single poke down
the replication streams.

Hopefully this will reduce load on the master process by substantially
reducing traffic.
2019-02-18 17:53:31 +00:00
Erik Johnston
af691e415c Move register_device into handler 2019-02-18 16:49:38 +00:00