For direct TCP connections we need the master to relay REMOTE_SERVER_UP
commands to the other connections so that all instances get notified
about it. The old implementation just relayed to all connections,
assuming that sending back to the original sender of the command was
safe. This is not true for redis, where commands sent get echoed back to
the sender, which was causing master to effectively infinite loop
sending and then re-receiving REMOTE_SERVER_UP commands that it sent.
The fix is to ensure that we only relay to *other* connections and not
to the connection we received the notification from.
Fixes#7334.
Python will return a tuple whether there are parentheses around the returned values or not.
I'm just sick of my editor complaining about this all over the place :)
This ensures that its resource usage metrics get recorded somewhere rather than
getting lost.
(It also fixes an error when called from a nested logging context which
completes before the bg process)
The existing deferred timeout helper function (and the one into twisted)
suffer from a bug when a deferred's canceller throws an exception, #3842.
The new helper function doesn't suffer from this problem.
Previously we queued up the poke correctly when the device was deleted,
but then the actual EDU wouldn't get sent, as the device was no longer known.
Instead, we now send EDUs for deleted devices too if there's a poke for them.
There were a bunch of places where we fire off a process to happen in the
background, but don't have any exception handling on it - instead relying on
the unhandled error being logged when the relevent deferred gets
garbage-collected.
This is unsatisfactory for a number of reasons:
- logging on garbage collection is best-effort and may happen some time after
the error, if at all
- it can be hard to figure out where the error actually happened.
- it is logged as a scary CRITICAL error which (a) I always forget to grep for
and (b) it's not really CRITICAL if a background process we don't care about
fails.
So this is an attempt to add exception handling to everything we fire off into
the background.
These processes take a long time compared to the request, so there is lots of
"Entering|Restoring dead context" in the logs. Let's try to shut it up a bit.
In `Notifier._on_new_room_event`, `preserve_fn` around its subroutines which
return deferreds, so that it is safe to call it with an active logcontext.
Remove `@preserve_fn` decorators on `on_new_room_event`,
`_notify_pending_new_room_events`, `_on_new_room_event`, `on_new_event`, and
`on_new_replication_data` - none of these functions return a deferred, and the
decorator does nothing unless the wrapped function returns a deferred, so the
decorator was a no-op.
There was a race condition that caused the notifier to 'miss' the
timeout notification, since there were no other checks for the timeout
this caused listeners to get stuck in a loop until something happened.