There were a bunch of places where we fire off a process to happen in the
background, but don't have any exception handling on it - instead relying on
the unhandled error being logged when the relevent deferred gets
garbage-collected.
This is unsatisfactory for a number of reasons:
- logging on garbage collection is best-effort and may happen some time after
the error, if at all
- it can be hard to figure out where the error actually happened.
- it is logged as a scary CRITICAL error which (a) I always forget to grep for
and (b) it's not really CRITICAL if a background process we don't care about
fails.
So this is an attempt to add exception handling to everything we fire off into
the background.
These processes take a long time compared to the request, so there is lots of
"Entering|Restoring dead context" in the logs. Let's try to shut it up a bit.
In `Notifier._on_new_room_event`, `preserve_fn` around its subroutines which
return deferreds, so that it is safe to call it with an active logcontext.
Remove `@preserve_fn` decorators on `on_new_room_event`,
`_notify_pending_new_room_events`, `_on_new_room_event`, `on_new_event`, and
`on_new_replication_data` - none of these functions return a deferred, and the
decorator does nothing unless the wrapped function returns a deferred, so the
decorator was a no-op.
There was a race condition that caused the notifier to 'miss' the
timeout notification, since there were no other checks for the timeout
this caused listeners to get stuck in a loop until something happened.
This is for two reasons:
1. Suppresses duplicates correctly, as the notifier doesn't do any
duplicate suppression.
2. Makes it easier to connect the AppserviceHandler to the replication
stream.
Change how the notifier updates the map from room_id to user streams on
receiving a join event. Make it update the map when it notifies for the
join event, rather than using the "user_joined_room" distributor signal
Access it directly from the homeserver itself. It already wasn't
inheriting from BaseHandler storing it on the Handlers object was
already somewhat dubious.
pycharm supports them so there is no need to use the other format.
Might as well convert the existing strings to reduce the risk of
people accidentally cargo culting the wrong doc string format.
synapse
This is necessary for replicating the data in synapse to be visible to a
separate service because presence and typing notifications aren't stored
in a database so won't be visible to another process.
This API can be used to either get the raw data by requesting the tables
themselves or to just receive notifications for updates by following the
streams meta-stream.
Returns updates for each table requested a JSON array of arrays with a
row for each row in the table.
Each table is prefixed by a header row with the: name of the table,
current stream_id position for the table, number of rows, number of
columns and the names of the columns.
This is followed by the rows that have been added to the server since
the requester last asked.
The API has a timeout and is hooked up to the notifier so that a slave
can long poll for updates.
Erroring causes problems when people make illegal requests, because they
don't know what limit parameter they should pass.
This is definitely buggy. It leaks message counts for rooms people don't
have permission to see, via tokens. But apparently we already
consciously decided to allow that as a team, so this preserves that
behaviour.
This requires the notifier to have knowledge of appservice listeners so it can
do the regex checks on incoming invites to see if the state_key matches. It
isn't enough to just rely on the room listeners and store.get_app_service_rooms
as the room will initially not exist or won't be on the ASes radar due to
having none of its users in the room.
This is so that the DeferredLists actually consume the error instead of
propogating down the non-existent errback chain. This should reduce the
number of unhandled errors we are seeing.