The idea here is that we pass the `max_stream_id` to everything, and only use the stream ID of the particular event to figure out *when* the max stream position has caught up to the event and we can notify people about it.
This is to maintain the distinction between the position of an item in the stream (i.e. event A has stream ID 513) and a token that can be used to partition the stream (i.e. give me all events after stream ID 352). This distinction becomes important when the tokens are more complicated than a single number, which they will be once we start tracking the position of multiple writers in the tokens.
The valid operations here are (sketched in the code below):
1. Checking whether a position is before or after a token.
2. Fetching all events between two tokens.
3. Merging multiple tokens to get the "max": `C = max(A, B)` means that every position P that is before A *or* before B is also before C.
A future PR will change the token type to a dedicated type.
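A minimal sketch of the three operations, assuming a single-integer token and events that expose a `stream_id` (the `StreamToken` class here is illustrative, not the dedicated type the future PR will add):

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass(frozen=True)
class StreamToken:
    # Illustrative single-integer token; real tokens get more complicated
    # once they track the positions of multiple writers.
    stream: int

    def is_before(self, position: int) -> bool:
        # Operation 1: is a stream position before (covered by) this token?
        return position <= self.stream

    @staticmethod
    def merge(a: "StreamToken", b: "StreamToken") -> "StreamToken":
        # Operation 3: C = max(A, B) -- every position before A *or*
        # before B is also before C.
        return StreamToken(max(a.stream, b.stream))

def events_between(events: Iterable, lower: StreamToken, upper: StreamToken) -> List:
    # Operation 2: fetch all events strictly after `lower` and up to `upper`.
    return [e for e in events if lower.stream < e.stream_id <= upper.stream]
```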
This removes `SourcePaginationConfig` and `get_pagination_rows`. The reasoning is that these generic classes/functions erased the types of the IDs they used: instead of passing around a `StreamToken` they'd pass in e.g. `token.room_key`, and those keys don't have uniform types.
For direct TCP connections we need the master to relay REMOTE_SERVER_UP
commands to the other connections so that all instances get notified
about it. The old implementation just relayed to all connections,
assuming that sending back to the original sender of the command was
safe. This is not true for Redis, where sent commands get echoed back to
the sender, which caused the master to loop indefinitely, sending and
then re-receiving REMOTE_SERVER_UP commands that it had itself sent.
The fix is to ensure that we only relay to *other* connections, and not
to the connection we received the notification from.
Fixes #7334.
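The shape of the fix, as a hedged sketch (the handler and attribute names here are illustrative, not the actual replication classes):

```python
def on_REMOTE_SERVER_UP(self, conn, cmd):
    # Relay to every connection *except* the one the command arrived on.
    # With Redis, anything we publish gets echoed back to us, so relaying
    # to the sender would make the master loop forever.
    for connection in self.connections:
        if connection is not conn:
            connection.send_command(cmd)
```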
Python will return a tuple whether there are parentheses around the returned values or not.
I'm just sick of my editor complaining about this all over the place :)
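For example, these two functions return the same tuple:

```python
def with_parens():
    return (1, 2)

def without_parens():
    return 1, 2  # same tuple; the parentheses are redundant

assert with_parens() == without_parens() == (1, 2)
```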
This ensures that its resource usage metrics get recorded somewhere rather than
getting lost.
(It also fixes an error when called from a nested logging context which
completes before the background process.)
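The pattern presumably looks something like the following, using Synapse's `run_as_background_process` helper (`_do_cleanup` is a hypothetical example, not the actual call site):

```python
from synapse.metrics.background_process_metrics import run_as_background_process

def start_cleanup(self):
    # Wrapping the work as a named background process means its CPU and
    # database usage is attributed to "do_cleanup" in the metrics, and it
    # runs in its own logcontext rather than a nested one that may finish
    # before the work does.
    run_as_background_process("do_cleanup", self._do_cleanup)
```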
The existing deferred timeout helper function (and the one built into Twisted)
suffer from a bug when a deferred's canceller throws an exception (#3842).
The new helper function doesn't suffer from this problem.
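A simplified sketch of the new helper's approach, assuming a Twisted reactor (not the exact implementation):

```python
from twisted.internet import defer
from twisted.python import failure

def timeout_deferred(deferred, timeout, reactor):
    """Return a new deferred that errbacks with TimeoutError after
    `timeout` seconds, even if cancelling the original deferred raises."""
    new_d = defer.Deferred()
    timed_out = [False]

    def time_it_out():
        timed_out[0] = True
        try:
            deferred.cancel()
        except Exception:
            pass  # a broken canceller must not swallow the timeout
        if not new_d.called:
            new_d.errback(defer.TimeoutError("Timed out"))

    delayed_call = reactor.callLater(timeout, time_it_out)

    def pass_through(result):
        if not timed_out[0]:
            delayed_call.cancel()
            if isinstance(result, failure.Failure):
                new_d.errback(result)
            else:
                new_d.callback(result)
        # if we already timed out, the (late) result is simply dropped

    deferred.addBoth(pass_through)
    return new_d
```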
Previously we queued up the poke correctly when the device was deleted,
but then the actual EDU wouldn't get sent, as the device was no longer known.
Instead, we now send EDUs for deleted devices too if there's a poke for them.
There were a bunch of places where we fire off a process to happen in the
background, but don't have any exception handling on it - instead relying on
the unhandled error being logged when the relevant deferred gets
garbage-collected.
This is unsatisfactory for a number of reasons:
- logging on garbage collection is best-effort and may happen some time after
the error, if at all
- it can be hard to figure out where the error actually happened
- it is logged as a scary CRITICAL error which (a) I always forget to grep for,
and (b) isn't really CRITICAL if a background process we don't care about
fails
So this is an attempt to add exception handling to everything we fire off into
the background.
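As a sketch, the shape this takes (the helper name is illustrative):

```python
import logging

from twisted.internet import defer

logger = logging.getLogger(__name__)

def fire_and_forget(f, *args, **kwargs):
    """Run f in the background, logging any exception immediately at
    ERROR rather than waiting for garbage collection to report it."""
    d = defer.maybeDeferred(f, *args, **kwargs)

    def log_failure(f_):
        logger.error("Background process failed: %s", f_.getTraceback())

    d.addErrback(log_failure)
    return d
```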
These processes take a long time compared to the request, so there is lots of
"Entering|Restoring dead context" in the logs. Let's try to shut it up a bit.
In `Notifier._on_new_room_event`, wrap `preserve_fn` around its subroutines
which return deferreds, so that it is safe to call it with an active logcontext.
Remove `@preserve_fn` decorators on `on_new_room_event`,
`_notify_pending_new_room_events`, `_on_new_room_event`, `on_new_event`, and
`on_new_replication_data` - none of these functions return a deferred, and the
decorator does nothing unless the wrapped function returns a deferred, so the
decorator was a no-op.
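Illustratively, the call sites in `_on_new_room_event` end up looking something like this (a sketch, not the exact code; `notify_interested_services` is one of the deferred-returning subroutines):

```python
from synapse.util.logcontext import preserve_fn

def _on_new_room_event(self, event, room_stream_id, extra_users=()):
    # preserve_fn wraps a deferred-returning function so it can safely be
    # fired off while a logcontext is active, without that context being
    # lost or marked as finished underneath the deferred.
    preserve_fn(self.appservice_handler.notify_interested_services)(
        room_stream_id
    )
```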
There was a race condition that caused the notifier to 'miss' the
timeout notification; since there were no other checks for the timeout,
this caused listeners to get stuck in a loop until something happened.
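The shape of the fix, sketched (names such as `clock` and `listener.wait` are illustrative):

```python
from twisted.internet import defer

@defer.inlineCallbacks
def wait_for_events(self, listener, timeout_ms):
    end_time = self.clock.time_msec() + timeout_ms
    result = None
    while not result:
        now = self.clock.time_msec()
        if now >= end_time:
            # Check the deadline on every pass round the loop: even if the
            # timeout notification itself is missed, we still stop waiting.
            break
        result = yield listener.wait(end_time - now)
    defer.returnValue(result)
```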
This is for two reasons:
1. Suppresses duplicates correctly, as the notifier doesn't do any
duplicate suppression.
2. Makes it easier to connect the AppserviceHandler to the replication
stream.
Change how the notifier updates the map from room_id to user streams on
receiving a join event: make it update the map when it notifies for the
join event, rather than using the "user_joined_room" distributor signal.
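Sketched, the notifier-side update looks something like this (the map attribute names are illustrative):

```python
from synapse.api.constants import EventTypes, Membership

def _maybe_update_room_streams(self, event):
    # Update the room -> user-stream map at the point we notify for the
    # join event, instead of relying on the "user_joined_room" signal.
    if (
        event.type == EventTypes.Member
        and event.content.get("membership") == Membership.JOIN
    ):
        user_stream = self.user_to_user_stream.get(event.state_key)
        if user_stream is not None:
            self.room_to_user_streams.setdefault(event.room_id, set()).add(
                user_stream
            )
```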
Access it directly from the homeserver itself. It wasn't inheriting
from BaseHandler, and storing it on the Handlers object was already
somewhat dubious.
PyCharm supports them, so there is no need to use the other format.
We might as well convert the existing strings to reduce the risk of
people accidentally cargo-culting the wrong docstring format.
This is necessary for the data in synapse to be visible to a separate
service, because presence and typing notifications aren't stored in a
database, so they won't otherwise be visible to another process.
This API can be used either to get the raw data, by requesting the tables
themselves, or just to receive notifications of updates, by following the
streams meta-stream.
The API returns updates for each requested table as a JSON array of arrays,
with one array per row in the table.
Each table is prefixed by a header row giving the name of the table, the
current stream_id position for the table, the number of rows, the number
of columns, and the names of the columns.
This is followed by the rows that have been added to the server since the
requester last asked.
The API has a timeout and is hooked up to the notifier so that a slave
can long poll for updates.
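For illustration, the payload for one table might look like this (hypothetical table, columns, and values; shown as the equivalent Python structure):

```python
# Header row: table name, current stream_id position, number of rows,
# number of columns, then the column names...
# ...followed by one array per row added since the requester last asked.
updates = [
    ["events", 5312, 2, 3, "stream_id", "room_id", "type"],
    [5311, "!abc:example.com", "m.room.message"],
    [5312, "!def:example.com", "m.room.member"],
]
```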
Erroring causes problems when people make illegal requests, because they
don't know what limit parameter they should pass.
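Presumably the replacement is to clamp rather than reject, along these lines (a sketch; the cap value and helper name are illustrative):

```python
MAX_LIMIT = 1000  # illustrative server-side cap

def parse_limit(args) -> int:
    limit = int(args.get("limit", 10))
    # Clamp instead of erroring: clients have no way of knowing the
    # server's cap, so an over-large limit just gets the maximum.
    return min(limit, MAX_LIMIT)
```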
This is definitely buggy. It leaks message counts for rooms people don't
have permission to see, via tokens. But apparently we already
consciously decided to allow that as a team, so this preserves that
behaviour.