Commit Graph

409 Commits

Author SHA1 Message Date
Richard van der Hoff
7f7eedbebb Wait for a POSITION on the right connection before accepting RDATA
... otherwise we can believe we're up to date when we're not.
2020-05-05 22:38:16 +01:00
Richard van der Hoff
d78265af0c Wait to subscribe before sending REPLICATE 2020-05-05 19:31:37 +01:00
Richard van der Hoff
b2dba06079
Workaround for assertion errors from db_query_to_update_function (#7378)
Hopefully this is no worse than what we have on master...
2020-05-01 09:25:16 +01:00
Erik Johnston
37f6823f5b
Add instance name to RDATA/POSITION commands (#7364)
This is primarily for allowing us to send those commands from workers, but for now simply allows us to ignore echoed RDATA/POSITION commands that we sent (we get echoes of sent commands when using redis). Currently we log a WARNING on the master process every time we receive an echoed RDATA.
2020-04-29 16:23:08 +01:00
Erik Johnston
3eab76ad43
Don't relay REMOTE_SERVER_UP cmds to same conn. (#7352)
For direct TCP connections we need the master to relay REMOTE_SERVER_UP
commands to the other connections so that all instances get notified
about it. The old implementation just relayed to all connections,
assuming that sending back to the original sender of the command was
safe. This is not true for redis, where commands sent get echoed back to
the sender, which was causing master to effectively infinite loop
sending and then re-receiving REMOTE_SERVER_UP commands that it sent.

The fix is to ensure that we only relay to *other* connections and not
to the connection we received the notification from.

Fixes #7334.
2020-04-29 14:10:59 +01:00
Richard van der Hoff
c2e1a2110f
Fix limit logic for EventsStream (#7358)
* Factor out functions for injecting events into database

I want to add some more flexibility to the tools for injecting events into the
database, and I don't want to clutter up HomeserverTestCase with them, so let's
factor them out to a new file.

* Rework TestReplicationDataHandler

This wasn't very easy to work with: the mock wrapping was largely superfluous,
and it's useful to be able to inspect the received rows, and clear out the
received list.

* Fix AssertionErrors being thrown by EventsStream

Part of the problem was that there was an off-by-one error in the assertion,
but also the limit logic was too simple. Fix it all up and add some tests.
2020-04-29 12:30:36 +01:00
Erik Johnston
38919b521e
Run replication streamers on workers (#7146)
Currently we never write to streams from workers, but that will change soon
2020-04-28 13:34:12 +01:00
Richard van der Hoff
ce428a1abe Fix EventsStream raising assertions when it falls behind
Figuring out how to correctly limit updates from this stream without dropping
entries is far more complicated than just counting the number of rows being
returned. We need to consider each query separately and, if any one query hits
the limit, truncate the results from the others.

I think this also fixes some potentially long-standing bugs where events or
state changes could get missed if we hit the limit on either query.
2020-04-24 13:59:21 +01:00
Richard van der Hoff
9cbdfb3a2f Make it clear that the limit for an update_function is a target 2020-04-23 15:45:12 +01:00
Richard van der Hoff
23b28266ac Remove 'limit' param from get_repl_stream_updates API
there doesn't seem to be much point in passing this limit all around, since
both sides agree it's meant to be 100.
2020-04-23 15:44:35 +01:00
Richard van der Hoff
71a1abb8a1
Stop the master relaying USER_SYNC for other workers (#7318)
Long story short: if we're handling presence on the current worker, we shouldn't be sending USER_SYNC commands over replication.

In an attempt to figure out what is going on here, I ended up refactoring some bits of the presencehandler code, so the first 4 commits here are non-functional refactors to move this code slightly closer to sanity. (There's still plenty to do here :/). Suggest reviewing individual commits.

Fixes (I hope) #7257.
2020-04-22 22:39:04 +01:00
Erik Johnston
841c581c40
Fix replication metrics when using redis (#7325) 2020-04-22 16:26:19 +01:00
Richard van der Hoff
82d8b1dd1f
Another go at fixing one-word commands (#7326)
I messed this up last time I tried (#7239 / e13c6c7).
2020-04-22 14:34:31 +01:00
Erik Johnston
51f7eaf908
Add ability to run replication protocol over redis. (#7040)
This is configured via the `redis` config options.
2020-04-22 13:07:41 +01:00
Richard van der Hoff
0f8f02bc39
On catchup, process each row with its own stream id (#7286)
Other parts of the code (such as the StreamChangeCache) assume that there will
not be multiple changes with the same stream id.

This code was introduced in #7024, and I hope this fixes #7206.
2020-04-20 11:43:29 +01:00
Richard van der Hoff
67ff7b8ba0
Improve type checking in replication.tcp.Stream (#7291)
The general idea here is to get rid of the type: ignore annotations on all of the current_token and update_function assignments, which would have caught #7290.

After a bit of experimentation, it seems like the least-awful way to do this is to pass the offending functions in as parameters to the Stream constructor. Unfortunately that means that the concrete implementations no longer have the same constructor signature as Stream itself, which means that it gets hard to correctly annotate STREAMS_MAP.

I've also introduced a couple of new types, to take out some duplication.
2020-04-17 14:49:55 +01:00
Richard van der Hoff
d7d42387f5
Fix 'generator object is not subscriptable' error (#7290)
Some of the query functions return generators rather than lists, so we can't
index into the result. Happily we already have a copy of the results.

(think this was introduced in #7024)
2020-04-16 14:37:06 +01:00
Richard van der Hoff
e13c6c7a96 Handle one-word replication commands correctly
`REPLICATE` is now a valid command, and it's nice if you can issue it from the
console without remembering to call it `REPLICATE ` with a trailing space.
2020-04-07 17:43:46 +01:00
Richard van der Hoff
c3e4b4edb2 Fix warnings about not calling superclass constructor
Separate `SimpleCommand` from `Command`, so that things which don't want to use
the `data` property don't have to, and thus fix the warnings PyCharm was giving
me about not calling `__init__` in the base class.
2020-04-07 17:40:22 +01:00
Richard van der Hoff
6a519a0ca0 Remove vestigal references to SYNC replication command
We've ripped pretty much all of this out: let's remove the remains.
2020-04-07 17:40:07 +01:00
Erik Johnston
ce72355d7f
Fix race in replication (#7226)
Fixes a race between handling `POSITION` and `RDATA` commands. We do this by simply linearizing handling of them.
2020-04-07 11:01:04 +01:00
Erik Johnston
82498ee901
Move server command handling out of TCP protocol (#7187)
This completes the merging of server and client command processing.
2020-04-07 10:51:07 +01:00
Erik Johnston
5016b162fc
Move client command handling out of TCP protocol (#7185)
The aim here is to move the command handling out of the TCP protocol classes and to also merge the client and server command handling (so that we can reuse them for redis protocol). This PR simply moves the client paths to the new `ReplicationCommandHandler`, a future PR will move the server paths too.
2020-04-06 09:58:42 +01:00
Erik Johnston
dfa0782254
Remove connections per replication stream metric. (#7195)
This broke in a recent PR (#7024) and is no longer useful due to all
replication clients implicitly subscribing to all streams, so let's
just remove it.
2020-04-01 10:40:46 +01:00
Erik Johnston
4f21c33be3
Remove usage of "conn_id" for presence. (#7128)
* Remove `conn_id` usage for UserSyncCommand.

Each tcp replication connection is assigned a "conn_id", which is used
to give an ID to a remotely connected worker. In a redis world, there
will no longer be a one to one mapping between connection and instance,
so instead we need to replace such usages with an ID generated by the
remote instances and included in the replicaiton commands.

This really only effects UserSyncCommand.

* Add CLEAR_USER_SYNCS command that is sent on shutdown.

This should help with the case where a synchrotron gets restarted
gracefully, rather than rely on 5 minute timeout.
2020-03-30 16:37:24 +01:00
Erik Johnston
4cff617df1
Move catchup of replication streams to worker. (#7024)
This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
2020-03-25 14:54:01 +00:00
Richard van der Hoff
a564b92d37
Convert *StreamRow classes to inner classes (#7116)
This just helps keep the rows closer to their streams, so that it's easier to
see what the format of each stream is.
2020-03-23 13:59:11 +00:00
Richard van der Hoff
b3cee0ce67
Fix processing of groups stream, and use symbolic names for streams (#7117)
`groups` != `receipts`

Introduced in #6964
2020-03-23 11:39:36 +00:00
Erik Johnston
fdb1344716
Remove concept of a non-limited stream. (#7011) 2020-03-20 14:40:47 +00:00
Erik Johnston
a319cb1dd1
Change device list streams to have one row per ID (#7010)
* Add 'device_lists_outbound_pokes' as extra table.

This makes sure we check all the relevant tables to get the current max
stream ID.

Currently not doing so isn't problematic as the max stream ID in
`device_lists_outbound_pokes` is the same as in `device_lists_stream`,
however that will change.

* Change device lists stream to have one row per id.

This will make it possible to process the streams more incrementally,
avoiding having to process large chunks at once.

* Change device list replication to match new semantics.

Instead of sending down batches of user ID/host tuples, send down a row
per entity (user ID or host).

* Newsfile

* Remove handling of multiple rows per ID

* Fix worker handling

* Comments from review
2020-03-19 11:36:53 +00:00
Erik Johnston
6e6476ef07 Comments from review 2020-03-18 10:13:55 +00:00
Richard van der Hoff
78a15b1f9d
Store room_versions in EventBase objects (#6875)
This is a bit fiddly because it all has to be done on one fell swoop:

* Wherever we create a new event, pass in the room version (and check it matches the format version)
* When we prune an event, use the room version of the unpruned event to create the pruned version.
* When we pass an event over the replication protocol, pass the room version over alongside it, and use it when deserialising the event again.
2020-03-05 15:46:44 +00:00
Erik Johnston
9ce4e344a8 Change device list replication to match new semantics.
Instead of sending down batches of user ID/host tuples, send down a row
per entity (user ID or host).
2020-02-28 11:25:34 +00:00
Erik Johnston
c3c6c0e622 Add 'device_lists_outbound_pokes' as extra table.
This makes sure we check all the relevant tables to get the current max
stream ID.

Currently not doing so isn't problematic as the max stream ID in
`device_lists_outbound_pokes` is the same as in `device_lists_stream`,
however that will change.
2020-02-28 11:15:11 +00:00
Richard van der Hoff
3e99528f2b
Store room version on invite (#6983)
When we get an invite over federation, store the room version in the rooms table.

The general idea here is that, when we pull the invite out again, we'll want to know what room_version it belongs to (so that we can later redact it if need be). So we need to store it somewhere...
2020-02-26 16:58:33 +00:00
Erik Johnston
1f773eec91
Port PresenceHandler to async/await (#6991) 2020-02-26 15:33:26 +00:00
Erik Johnston
bbf8886a05
Merge worker apps into one. (#6964) 2020-02-25 16:56:55 +00:00
Erik Johnston
0bd8cf435e Increase MAX_EVENTS_BEHIND for replication clients 2020-02-21 09:04:33 +00:00
Erik Johnston
de2d267375
Allow moving group read APIs to workers (#6866) 2020-02-07 11:14:19 +00:00
Erik Johnston
c3d4ad8afd
Fix sending server up commands from workers (#6811)
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
2020-01-30 16:42:11 +00:00
Erik Johnston
e17a110661
Detect unknown remote devices and mark cache as stale (#6776)
We just mark the fact that the cache may be stale in the database for
now.
2020-01-28 14:43:21 +00:00
Erik Johnston
d5275fc55f
Propagate cache invalidates from workers to other workers. (#6748)
Currently if a worker invalidates a cache it will be streamed to master, which then didn't forward those to other workers.
2020-01-27 13:47:50 +00:00
Erik Johnston
5d7a6ad223
Allow streaming cache invalidate all to workers. (#6749) 2020-01-22 10:37:00 +00:00
Erik Johnston
a8a50f5b57
Wake up transaction queue when remote server comes back online (#6706)
This will be used to retry outbound transactions to a remote server if
we think it might have come back up.
2020-01-17 10:27:19 +00:00
Erik Johnston
48c3a96886
Port synapse.replication.tcp to async/await (#6666)
* Port synapse.replication.tcp to async/await

* Newsfile

* Correctly document type of on_<FOO> functions as async

* Don't be overenthusiastic with the asyncing....
2020-01-16 09:16:12 +00:00
Erik Johnston
28c98e51ff
Add local_current_membership table (#6655)
Currently we rely on `current_state_events` to figure out what rooms a
user was in and their last membership event in there. However, if the
server leaves the room then the table may be cleaned up and that
information is lost. So lets add a table that separately holds that
information.
2020-01-15 14:59:33 +00:00
Erik Johnston
e8b68a4e4b
Fixup synapse.replication to pass mypy checks (#6667) 2020-01-14 14:08:06 +00:00
Richard van der Hoff
6964ea095b
Reduce the reconnect time when replication fails. (#6617) 2020-01-03 14:19:09 +00:00
Erik Johnston
fa780e9721
Change EventContext to use the Storage class (#6564) 2019-12-20 10:32:02 +00:00
Erik Johnston
9a4fb457cf Change DataStores to accept 'database' param. 2019-12-06 13:30:06 +00:00