Hopefully this will fix the occasional failures we were seeing in the room directory.
The problem was that events are not necessarily persisted (and `current_state_delta_stream` updated) in the same order as their stream_id. So for instance current_state_delta 9 might be persisted *before* current_state_delta 8. Then, when the room stats saw stream_id 9, it assumed it had done everything up to 9, and never came back to do stream_id 8.
We can solve this easily by only processing up to the stream_id where we know all events have been persisted.
More often than not passing bytes to `txn.execute` is a bug (where we
meant to pass a string) that just happens to work if `BYTEA_OUTPUT` is
set to `ESCAPE`. However, this is a bit of a footgun so we want to
instead error when this happens, and force using `bytearray` if we
actually want to use bytes.
We incorrectly used `room_id` as to bound the result set, even though we
order by `joined_members, room_id`, leading to incorrect results after
pagination.
Copy push rules during a room upgrade from the old room to the new room, instead of deleting them from the old room.
For instance, we've defined upgrading of a room multiple times to be possible, and push rules won't be transferred on the second upgrade if they're deleted during the first.
Also fix some missing yields that probably broke things quite a bit.
We have set the max retry interval to a value larger than a postgres or
sqlite int can hold, which caused exceptions when updating the
destinations table.
To fix postgres we need to change the column to a bigint, and for sqlite
we lower the max interval to 2**62 (which is still incredibly long).
Joining against `events` and ordering by `stream_ordering` is
inefficient as it forced scanning the entirety of the redactions table.
This isn't the case if we use `redactions.received_ts` column as we can
then use an index.
Currently we don't set `have_censored` column if we don't have the
target event of a redaction, which means we repeatedly attempt to censor
the same non-existant event.
When we persist non-redacted events we unset the `have_censored` column
for any redactions that target said event.
This is a) simpler than querying user_ips directly and b) means we can
purge older entries from user_ips without losing the required info.
The storage functions now no longer return the access_token, since it
was unused.
Implements MSC2290. This PR adds two new endpoints, /unstable/account/3pid/add and /unstable/account/3pid/bind. Depending on the progress of that MSC the unstable prefix may go away.
This PR also removes the blacklist on some 3PID tests which occurs in #6042, as the corresponding Sytest PR changes them to use the new endpoints.
Finally, it also modifies the account deactivation code such that it doesn't just try to deactivate 3PIDs that were bound to the user's account, but any 3PIDs that were bound through the homeserver on that user's account.
This is a partial revert of #5893. The problem is that if we drop these tables
in the same release as removing the code that writes to them, it prevents users
users from being able to roll back to a previous release.
So let's leave the tables in place for now, and remember to drop them in a
subsequent release.
(Note that these tables haven't been *read* for *years*, so any missing rows
resulting from a temporary upgrade to vNext won't cause a problem.)
This is a potential solution to https://github.com/vector-im/riot-web/issues/3374
and https://github.com/vector-im/riot-web/issues/5953
as raised by Mozilla at https://github.com/vector-im/riot-web/issues/10868.
This lets you define a push rule action which increases the badge count (unread notification)
count on a given room, but doesn't actually send a push for that notification via email or HTTP.
We might want to define this as the default behaviour for group chats in future
to solve https://github.com/vector-im/riot-web/issues/3268 at last.
This is implemented as a string action rather than a tweak because:
* Other pushers don't care about the tweak, given they won't ever get pushed
* The DB can store the tweak more efficiently using the existing `notify` table.
* It avoids breaking the default_notif/highlight_action optimisations.
Clients which generate their own notifs (e.g. desktop notifs from Riot/Web
would need to be aware of the new push action) to uphold it.
An alternative way to do this would be to maintain a `msg_count` alongside
`highlight_count` and `notification_count` in `unread_notifications` in sync responses.
However, doing this by counting the rows in `events` since the `stream_position`
of the user's last read receipt turns out to be painfully slow (~200ms), perhaps
due to the size of the events table. So instead, we use the highly optimised
existing event_push_actions (and event_push_actions_staging) table to maintain
the counts - using the code paths which already exist for tracking unread
notification counts efficiently. These queries are typically ~3ms or so.
The biggest issues I see here are:
* We're slightly repurposing the `notif` field on `event_push_actions` to
track whether a given action actually sent a `push` or not. This doesn't
seem unreasonable, but it's slightly naughty given that previously the
field explicitly tracked whether `notify` was true for the action (and
as a result, it was uselessly always set to 1 in the DB).
* We're going to put more load on the `event_push_actions` table for all the
random group chats which people had previously muted. In practice i don't
think there are many of these though.
* There isn't an MSC for this yet (although this comment could become one).
We want to assign unique mxids to saml users based on an incrementing
suffix. For that to work, we need to record the allocated mxid in a separate
table.
Previously if the first registered user was a "support" or "bot" user,
when the first real user registers, the auto-join rooms were not
created.
Fix to exclude non-real (ie users with a special user type) users
when counting how many users there are to determine whether we should
auto-create a room.
Signed-off-by: Jason Robinson <jasonr@matrix.org>
Remove all the "double return" statements which were a result of us removing all the instances of
```
defer.returnValue(...)
return
```
statements when we switched to python3 fully.
Python will return a tuple whether there are parentheses around the returned values or not.
I'm just sick of my editor complaining about this all over the place :)
Some of the caches on worker processes were not being correctly invalidated
when a room's state was changed in a way that did not affect the membership
list of the room.
We need to make sure we send out cache invalidations even when no memberships
are changing.
* allow devices to be marked as "hidden"
This is a prerequisite for cross-signing, as it allows us to create other things
that live within the device namespace, so they can be used for signatures.
When persisting events we calculate new stream orderings up front.
Before we notify about an event all events with lower stream orderings
must have finished being persisted.
This PR moves the assignment of stream ordering till *after* calculated
the new current state and split the batch of events into separate chunks
for persistence. This means that if it takes a long time to calculate
new current state then it will not block events in other rooms being
notified about.
This should help reduce some global pauses in the events stream which
can last for tens of seconds (if not longer), caused by some
particularly expensive state resolutions.
Annoyingly, `current_state_events` table can include rejected events,
in which case the membership column will be null. To work around this
lets just always filter out null membership for now.
This is a prerequisite for cross-signing, as it allows us to create other things
that live within the device namespace, so they can be used for signatures.
`None` is not a valid event id, so queuing up a database fetch for it seems
like a silly thing to do.
I considered making `get_event` return `None` if `event_id is None`, but then
its interaction with `allow_none` seemed uninituitive, and strong typing ftw.
This will allow us to efficiently filter out rooms that have been
forgotten in other queries without having to join against the
`room_memberships` table.
We can now use `_get_events_from_cache_or_db` rather than going right back to
the database, which means that (a) we can benefit from caching, and (b) it
opens the way forward to more extensive checks on the original event.
We now always require the original event to exist before we will serve up a
redaction.
Ensures that redactions are correctly authenticated for recent room versions.
There are a few things going on here:
* `_fetch_event_rows` is updated to return a dict rather than a list of rows.
* Rather than returning multiple copies of an event which was redacted
multiple times, it returns the redactions as a list within the dict.
* It also returns the actual rejection reason, rather than merely the fact
that it was rejected, so that we don't have to query the table again in
`_get_event_from_row`.
* The redaction handling is factored out of `_get_event_from_row`, and now
checks if any of the redactions are valid.
A couple of changes here:
* get rid of a redundant `allow_rejected` condition - we should already have filtered out any rejected
events before we get to that point in the code, and the redundancy is confusing. Instead, let's stick in
an assertion just to make double-sure we aren't leaking rejected events by mistake.
* factor out a `_get_events_from_cache_or_db` method, which is going to be important for a
forthcoming fix to redactions.
When asking for the relations of an event, include the original event in the response. This will mostly be used for efficiently showing edit history, but could be useful in other circumstances.
This has never been documented, and I'm not sure it's ever been used outside
sytest.
It's quite a lot of poorly-maintained code, so I'd like to get rid of it.
For now I haven't removed the database table; I suggest we leave that for a
future clearout.
When a client asks for users whose devices have changed since a token we
used to pull *all* users from the database since the token, which could
easily be thousands of rows for old tokens.
This PR changes this to only check for changes for users the client is
actually interested in.
Fixes#5553
There is a README.txt which always sets off this warning, which is a bit
alarming when you first start synapse. I don't think we need to warn about
this.
Fixes intermittent errors observed on Apple hardware which were caused by
time.clock() appearing to go backwards when called from different threads.
Also fixes a bug where database activity times were logged as 1/1000 of their
correct ratio due to confusion between milliseconds and seconds.
Adds new config option `cleanup_extremities_with_dummy_events` which
periodically sends dummy events to rooms with more than 10 extremities.
THIS IS REALLY EXPERIMENTAL.
If we try and send a transaction with lots of EDUs and we run out of
space, we call get_new_device_msgs_for_remote with a limit of 0, which
then failed.
Some keys are stored in the synapse database with a null valid_until_ms
which caused an exception to be thrown when using that key. We fix this
by treating nulls as zeroes, i.e. they keys will match verification
requests with a minimum_valid_until_ms of zero (i.e. don't validate ts)
but will not match requests with a non-zero minimum_valid_until_ms.
Fixes#5391.
Sends password reset emails from the homeserver instead of proxying to the identity server. This is now the default behaviour for security reasons. If you wish to continue proxying password reset requests to the identity server you must now enable the email.trust_identity_server_for_password_resets option.
This PR is a culmination of 3 smaller PRs which have each been separately reviewed:
* #5308
* #5345
* #5368
* Fix background updates to handle redactions/rejections
In background updates based on current state delta stream we need to
handle that we may not have all the events (or at least that
`get_events` may raise an exception).
We have to do this by re-inserting a background update and recreating
tables, as the tables only get created during a background update and
will later be deleted.
We also make sure that we remove any entries that should have been
removed but weren't due to a race that has been fixed in a previous
commit.
When we receive a soft failed event we, correctly, *do not* update the
forward extremity table with the event. However, if we later receive an
event that references the soft failed event we then need to remove the
soft failed events prev events from the forward extremities table,
otherwise we just build up forward extremities.
Fixes#5269
When enabling the account validity feature, Synapse will look at startup for registered account without an expiration date, and will set one equals to 'now + validity_period' for them. On large servers, it can mean that a large number of users will have the same expiration date, which means that they will all be sent a renewal email at the same time, which isn't ideal.
In order to mitigate this, this PR allows server admins to define a 'max_delta' so that the expiration date is a random value in the [now + validity_period ; now + validity_period + max_delta] range. This allows renewal emails to be progressively sent over a configured period instead of being sent all in one big batch.
This is a first step to checking that the key is valid at the required moment.
The idea here is that, rather than passing VerifyKey objects in and out of the
storage layer, we instead pass FetchKeyResult objects, which simply wrap the
VerifyKey and add a valid_until_ts field.
Storing server keys hammered the database a bit. This replaces the
implementation which stored a single key, with one which can do many updates at
once.
I was staring at this function trying to figure out wtf it was actually
doing. This is (hopefully) a non-functional refactor which makes it a bit
clearer.
If account validity is enabled in the server's configuration, this job will run at startup as a background job and will stick an expiration date to any registered account missing one.
We need to drop tables in the correct order due to foreign table
constraints (on `application_services`), otherwise the DROP TABLE
command will fail.
Introduced in #4992.
We start all pushers on start up and immediately start a background
process to fetch push to send. This makes start up incredibly painful
when dealing with many pushers.
Instead, let's do a quick fast DB check to see if there *may* be push to
send and only start the background processes for those pushers. We also
stagger starting up and doing those checks so that we don't try and
handle all pushers at once.
Hopefully this time we really will fix#4422.
We need to make sure that the cache on
`get_rooms_for_user_with_stream_ordering` is invalidated *before* the
SyncHandler is notified for the new events, and we can now do so reliably via
the `events` stream.
We assume, as we did before, that users bound their threepid to one of
the trusted identity servers. So we simply fill the new table with all
threepids in `user_threepids` joined with the trusted identity servers.
Currently whenever the current state changes in a room invalidate a lot
of caches, which cause *a lot* of traffic over replication. Instead,
lets batch up all those invalidations and send a single poke down
the replication streams.
Hopefully this will reduce load on the master process by substantially
reducing traffic.
This allows registration to be handled by a worker, though the actual
write to the database still happens on master.
Note: due to the in-memory session map all registration requests must be
handled by the same worker.
Due to the table locks taken out by the naive upsert, the table
statistics may be out of date. During deduplication it is important that
the correct index is used as otherwise a full table scan may be
incorrectly used, which can end up thrashing the database badly.
The background update to remove duplicate rows naively deleted and
reinserted the duplicates. For large tables with a large number of
duplicates this causes a lot of bloat (with postgres), as the inserted
rows are appended to the table, since deleted rows will not be
overwritten until a VACUUM has happened.
This should hopefully also help ensure that the query in the last batch
uses the correct index, as inserting a large number of new rows without
analyzing will upset the query planner.
Add more tables to the list of tables which need a background update to
complete before we can upsert into them, which fixes a race against the
background updates.
Currently they're stored as non-outliers even though the server isn't in
the room, which can be problematic in places where the code assumes it
has the state for all non outlier events.
In particular, there is an edge case where persisting the leave event
triggers a state resolution, which requires looking up the room version
from state. Since the server doesn't have the state, this causes an
exception to be thrown.
Currently we only have the one event format version defined, but this
adds the necessary infrastructure to persist and fetch the format
versions alongside the events.
We specify the format version rather than the room version as:
1. We don't necessarily know the room version, existing events may be
either v1 or v2.
2. We'd need to be careful to prevent/handle correctly if different
events in the same room reported to be of different versions, which
sounds annoying.
This was caused by accidentally overwritting a `last_seen` variable
in a for loop, causing the wrong value to be written to the progress
table. The result of which was that we didn't scan sections of the table
when searching for duplicates, and so some duplicates did not get
deleted.
* Fix race when persisting create event
When persisting a chunk of DAG it is sometimes requried to do a state
resolution, which requires knowledge of the room version. If this
happens while we're persisting the create event then we need to use that
event rather than attempting to look it up in the database.