Commit Graph

4745 Commits

Author SHA1 Message Date
Eric Eastwood
fa8616e65c
Fix MSC3030 /timestamp_to_event returning outliers that it has no idea whether are near a gap or not (#14215)
Fix MSC3030 `/timestamp_to_event` endpoint returning `outliers` that it has no idea whether are near a gap or not (and therefore unable to determine whether it's actually the closest event). The reason Synapse doesn't know whether an `outlier` is next to a gap is because our gap checks rely on entries in the `event_edges`, `event_forward_extremeties`, and `event_backward_extremities` tables which is [not the case for `outliers`](2c63cdcc3f/docs/development/room-dag-concepts.md (outliers)).

Also fixes MSC3030 Complement `can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint` test flake.  Although this acted flakey in Complement, if `sync_partial_state` raced and beat us before `/timestamp_to_event`, then even if we retried the failing `/context` request it wouldn't work until we made this Synapse change. With this PR, Synapse will never return an `outlier` event so that test will always go and ask over federation.

Fix  https://github.com/matrix-org/synapse/issues/13944


### Why did this fail before? Why was it flakey?

Sleuthing the server logs on the [CI failure](https://github.com/matrix-org/synapse/actions/runs/3149623842/jobs/5121449357#step:5:5805), it looks like `hs2:/timestamp_to_event` found `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` event locally. Then when we went and asked for it via `/context`, since it's an `outlier`, it was filtered out of the results -> `You don't have permission to access that event.`

This is reproducible when `sync_partial_state` races and persists `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` before we evaluate `get_event_for_timestamp(...)`. To consistently reproduce locally, just add a delay at the [start of `get_event_for_timestamp(...)`](cb20b885cb/synapse/handlers/room.py (L1470-L1496)) so it always runs after `sync_partial_state` completes.

```py
from twisted.internet import task as twisted_task
d = twisted_task.deferLater(self.hs.get_reactor(), 3.5)
await d
```

In a run where it passes, on `hs2`, `get_event_for_timestamp(...)` finds a different event locally which is next to a gap and we request from a closer one from `hs1` which gets backfilled. And since the backfilled event is not an `outlier`, it's returned as expected during `/context`.

With this PR, Synapse will never return an `outlier` event so that test will always go and ask over federation.
2022-10-18 19:46:25 -05:00
Aaron Raimist
2a76a7369f
Fix hiding devices names over federation (#10015)
And don't include blank opentracing stuff in device list updates.

Signed-off-by: Aaron Raimist <aaron@raim.ist>
2022-10-18 20:54:27 +00:00
Patrick Cloke
dbf18f514e
Update the thread_id right before use (in case the bg update hasn't finished) (#14222)
This avoids running a forced-update of a null thread_id rows.

An index is added (in the background) to hopefully make this
easier in the future.
2022-10-18 14:55:41 +00:00
David Robertson
c3a4780080
When restarting a partial join resync, prioritise the server which actioned a partial join (#14126) 2022-10-18 12:33:18 +01:00
Andrew Morgan
dc02d9f8c5
Avoid checking the event cache when backfilling events (#14164) 2022-10-18 10:33:35 +01:00
Andrew Morgan
828b5502cf
Remove _get_events_cache check optimisation from _have_seen_events_dict (#14161) 2022-10-18 10:33:21 +01:00
Patrick Cloke
4283bd1cf9
Support filtering the /messages API by relation type (MSC3874). (#14148)
Gated behind an experimental configuration flag.
2022-10-17 11:32:11 -04:00
Nick Mills-Barrett
2c2c3f8b2c
Invalidate rooms for user caches when receiving membership events (#14155)
This should fix a race where the event notification comes in over
replication before the state replication, leaving a window during
which a sync may get an incorrect list of rooms for the user.
2022-10-17 13:27:51 +01:00
Eric Eastwood
40bb37eb27
Stop getting missing prev_events after we already know their signature is invalid (#13816)
While https://github.com/matrix-org/synapse/pull/13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place.

Related to
 - https://github.com/matrix-org/synapse/issues/13622
    - https://github.com/matrix-org/synapse/pull/13635
 - https://github.com/matrix-org/synapse/issues/13676

Part of https://github.com/matrix-org/synapse/issues/13356

Follow-up to https://github.com/matrix-org/synapse/pull/13815 which tracks event signature failures.

With this PR, we avoid the call to the costly `_get_state_ids_after_missing_prev_event` because the signature failure will count as an attempt before and we filter events based on the backoff before calling `_get_state_ids_after_missing_prev_event` now.

For example, this will save us 156s out of the 185s total that this `matrix.org` `/messages` request. If you want to see the full Jaeger trace of this, you can drag and drop this `trace.json` into your own Jaeger, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761

To explain this exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid.

 - `backfill`
    - `outgoing-federation-request` `/backfill`
    - `_check_sigs_and_hash_and_fetch`
       - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
          -  `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
          -  `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
   - `_process_pulled_events`
      - `_process_pulled_event` for each validated event
         -  Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it
            - `_get_state_ids_after_missing_prev_event`
               - `outgoing-federation-request` `/state_ids`
               -  `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again
               -  `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check
2022-10-15 00:36:49 -05:00
Patrick Cloke
bc2bd92b93 Merge remote-tracking branch 'origin/release-v1.69' into develop 2022-10-14 14:11:27 -04:00
Patrick Cloke
d1bdeccb50
Accept threaded receipts for events related to the root event. (#14174)
The root node of a thread (and events related to it) are considered
"part of a thread" when validating receipts. This allows clients which
show the root node in both the main timeline and the threaded timeline
to easily send receipts in either.

Note that threaded notifications are not created for these events, these
events created notifications on the main timeline.
2022-10-14 18:05:25 +00:00
Erik Johnston
d241a1350d
Fix background update to use an index (#14181) 2022-10-14 13:46:23 +00:00
Patrick Cloke
126a15794c
Do not allow a None-limit on PaginationConfig. (#14146)
The callers either set a default limit or manually handle a None-limit
later on (by setting a default value).

Update the callers to always instantiate PaginationConfig with a default
limit and then assume the limit is non-None.
2022-10-14 12:30:05 +00:00
Patrick Cloke
9ff4155f6c
Properly invalidate get_thread_id cache. (#14163)
This was missed in 2b6d41ebd6 (#13824).
2022-10-14 07:10:44 -04:00
David Robertson
16c5d95b59
Optimise the event_push_backfill_thread_id bg job (#14172)
Co-authored-by: Erik Johnston <erik@matrix.org>
2022-10-13 17:32:16 +00:00
Patrick Cloke
2019b60f3b
Fix sqlite syntax for upserts. (#14171) 2022-10-13 12:53:24 -04:00
Patrick Cloke
7d59a515bb Properly return the thread ID down sync. (#14159)
Fix a broken conflict in e6e876b9b1,
by not stomping over a field right after creating it.
2022-10-13 12:15:41 -04:00
Patrick Cloke
3bbe532abb
Add an API for listing threads in a room. (#13394)
Implement the /threads endpoint from MSC3856.

This is currently unstable and behind an experimental configuration
flag.

It includes a background update to backfill data, results from
the /threads endpoint will be partial until that finishes.
2022-10-13 08:02:11 -04:00
Patrick Cloke
e6e876b9b1 Return the thread ID properly down sync. (#14159)
A receipt's thread ID, if one exists, should be added to the
body of a receipt.
2022-10-12 12:18:34 -04:00
Patrick Cloke
87099b6ea5
Return the main timeline for events which are not part of a thread. (#14140)
Fixes a bug where threaded receipts could not be sent for the
main timeline.
2022-10-12 12:15:52 -04:00
Nick Mills-Barrett
f9bc5428c4
Batch up calls to get_rooms_for_users (#14109) 2022-10-12 11:36:22 +01:00
Patrick Cloke
09be8ab5f9
Remove the experimental implementation of MSC3772. (#14094)
MSC3772 has been abandoned.
2022-10-12 06:26:39 -04:00
Shay
a86b2f6837
Fix a bug where redactions were not being sent over federation if we did not have the original event. (#13813) 2022-10-11 11:18:45 -07:00
Erik Johnston
02086e1da0
Fix rotating existing notifications in push summary (#14138)
Broke by #14045. Fixes #14120.

Introduced in v1.69.0rc2.
2022-10-11 15:13:32 +00:00
Patrick Cloke
ab8047b4bf
Apply & bundle edits for non-message events. (#14034)
Fixes two related bugs:

* No edit information was bundled for events which aren't `m.room.message`.
* `m.new_content` was not applied for those events.
2022-10-07 15:27:50 +00:00
Patrick Cloke
0b037d6c91
Fix handling of public rooms filter with a network tuple. (#14053)
Fixes two related bugs:

* The handling of `[null]` for a `room_types` filter was incorrect.
* The ordering of arguments when providing both a network tuple
  and room type field was incorrect.
2022-10-05 12:49:52 +00:00
Patrick Cloke
e3d4755454
Fix backwards compatibility with upcoming threads schema changes. (#14045)
Ensure that the upsert will work properly by first updating any existing
rows (in the same way that the background update to backfill data works).
2022-10-05 07:56:05 -04:00
Patrick Cloke
dcced5a8d7
Use threaded receipts when fetching events for push. (#13878)
Update the HTTP and email pushers to consider threaded read receipts
when fetching unread events.
2022-10-04 12:07:02 -04:00
Patrick Cloke
2b6d41ebd6
Recursively fetch the thread for receipts & notifications. (#13824)
Consider an event to be part of a thread if you can follow a
chain of relations up to a thread root.

Part of MSC3773 & MSC3771.
2022-10-04 11:36:16 -04:00
Patrick Cloke
a7ba457b2b
Mark events as read using threaded read receipts from MSC3771. (#13877)
Applies the proper logic for unthreaded and threaded receipts to either
apply to all events in the room or only events in the same thread, respectively.
2022-10-04 10:46:42 -04:00
Patrick Cloke
b4ec4f5e71
Track notification counts per thread (implement MSC3773). (#13776)
When retrieving counts of notifications segment the results based on the
thread ID, but choose whether to return them as individual threads or as
a single summed field by letting the client opt-in via a sync flag.

The summarization code is also updated to be per thread, instead of per
room.
2022-10-04 09:47:04 -04:00
Patrick Cloke
e70c6b720e
Disable pushing for server ACL events (MSC3786). (#13997)
Switches to the stable identifier for MSC3786 and enables it
by default.

This disables pushes of m.room.server_acl events.
2022-10-04 07:08:27 -04:00
Erik Johnston
5a6d025246
Clear out old rows from event_push_actions_staging (#14020)
On matrix.org we have ~5 million stale rows in `event_push_actions_staging`, let's add a background job to make sure we clear them out.
2022-10-03 18:44:44 +01:00
Erik Johnston
2c237debd3
Fix bug where we didn't delete staging push actions (#14014)
Introduced in #13719
2022-10-03 13:45:19 +00:00
Erik Johnston
606b2d9009
Add cache to get_partial_state_servers_at_join (#14013) 2022-10-03 13:13:11 +00:00
Sean Quah
d65862c41f
Refactor _get_e2e_device_keys_txn to split large queries (#13956)
Instead of running a single large query, run a single query for
user-only lookups and additional queries for batches of user device
lookups.

Resolves #13580.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-10-03 13:46:36 +01:00
David Robertson
285d72556b
Update mypy and mypy-zope, attempt 3 (#13993)
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
2022-09-30 17:36:28 +01:00
David Robertson
8e52cb0bce
Revert "Update mypy and mypy-zope (#13925)"
This reverts commit 6d543d6d9f.
2022-09-30 16:37:48 +01:00
David Robertson
6d543d6d9f
Update mypy and mypy-zope (#13925)
* Update mypy and mypy-zope

* Unignore assigning to LogRecord attributes

Presumably https://github.com/python/typeshed/pull/8064 makes this ok

Cherry-picked from #13521

* Remove unused ignores due to mypy ParamSpec fixes

https://github.com/python/mypy/pull/12668

Cherry-picked from #13521

* Remove additional unused ignores

* Fix new mypy complaints related to `assertGreater`

Presumably due to https://github.com/python/typeshed/pull/8077

* Changelog

* Reword changelog

Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>

Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
2022-09-30 16:34:47 +01:00
Erik Johnston
3dfc4a08dc
Fix performance regression in get_users_in_room (#13972)
Fixes #13942. Introduced in #13575.

Basically, let's only get the ordered set of hosts out of the DB if we need an ordered set of hosts. Since we split the function up the caching won't be as good, but I think it will still be fine as e.g. multiple backfill requests for the same room will hit the cache.
2022-09-30 13:15:32 +01:00
David Robertson
e8f30a76ca
Fix overflows in /messages backfill calculation (#13936)
* Reproduce bug
* Compute `least_function` first
* Substitute `least_function` with an f-string
* Bugfix: avoid overflow

Co-authored-by: Eric Eastwood <erice@element.io>
2022-09-30 11:54:53 +01:00
David Robertson
15754d720f
Update UPSERT comment now that native upserts are the default (#13924) 2022-09-29 19:10:47 +01:00
Nick Mills-Barrett
a466164647
Optimise get_rooms_for_user (drop with_stream_ordering) (#13787) 2022-09-29 13:55:12 +00:00
Brendan Abolivier
be76cd8200
Allow admins to require a manual approval process before new accounts can be used (using MSC3866) (#13556) 2022-09-29 15:23:24 +02:00
Patrick Cloke
8625ad8099
Explicit cast to enforce type hints. (#13939) 2022-09-29 07:22:41 -04:00
Patrick Cloke
568016929f
Clarify that a method returns only unthreaded receipts. (#13937)
By renaming it and updating the docstring.

Additionally, refactors a method which is used only by tests.
2022-09-29 07:07:31 -04:00
Erik Johnston
5f659d4a88
Handle local device list updates during partial join (#13934) 2022-09-28 23:22:35 +01:00
Eric Eastwood
df8b91ed2b
Limit and filter the number of backfill points to get from the database (#13879)
There is no need to grab thousands of backfill points when we only need 5 to make the `/backfill` request with. We need to grab a few extra in case the first few aren't visible in the history.

Previously, we grabbed thousands of backfill points from the database, then sorted and filtered them in the app. Fetching the 4.6k backfill points for `#matrix:matrix.org` from the database takes ~50ms - ~570ms so it's not like this saves a lot of time 🤷. But it might save us more time now that `get_backfill_points_in_room`/`get_insertion_event_backward_extremities_in_room` are more complicated after https://github.com/matrix-org/synapse/pull/13635 

This PR moves the filtering and limiting to the SQL query so we just have less data to work with in the first place.

Part of https://github.com/matrix-org/synapse/issues/13356
2022-09-28 15:26:16 -05:00
Patrick Cloke
1386ce4735
Revert "Stop returning an unused column when handling new receipts. (#13933)" (#13935)
This reverts commit 7766bd5b35 (#13933).

The unused column is actually used, but much further down in the function.
2022-09-28 11:01:41 -04:00
Patrick Cloke
7766bd5b35
Stop returning an unused column when handling new receipts. (#13933) 2022-09-28 10:58:25 -04:00
Erik Johnston
4b17a5ace8
Handle remote device list updates during partial join (#13913)
c.f. #12993 (comment), point 3

This stores all device list updates that we receive while partial joins are ongoing, and processes them once we have the full state.

Note: We don't actually process the device lists in the same ways as if we weren't partially joined. Instead of updating the device list remote cache, we simply notify local users that a change in the remote user's devices has happened. I think this is safe as if the local user requests the keys for the remote user and we don't have them we'll simply fetch them as normal.
2022-09-28 13:42:43 +00:00
Kateřina Churanová
6caa303083
fix: Push notifications for invite over federation (#13719) 2022-09-28 12:31:53 +00:00
Eric Eastwood
29269d9d3f
Fix have_seen_event cache not being invalidated (#13863)
Fix https://github.com/matrix-org/synapse/issues/13856
Fix https://github.com/matrix-org/synapse/issues/13865

> Discovered while trying to make Synapse fast enough for [this MSC2716 test for importing many batches](https://github.com/matrix-org/complement/pull/214#discussion_r741678240). As an example, disabling the `have_seen_event` cache saves 10 seconds for each `/messages` request in that MSC2716 Complement test because we're not making as many federation requests for `/state` (speeding up `have_seen_event` itself is related to https://github.com/matrix-org/synapse/issues/13625) 
> 
> But this will also make `/messages` faster in general so we can include it in the [faster `/messages` milestone](https://github.com/matrix-org/synapse/milestone/11).
> 
> *-- https://github.com/matrix-org/synapse/issues/13856*


### The problem

`_invalidate_caches_for_event` doesn't run in monolith mode which means we never even tried to clear the `have_seen_event` and other caches. And even in worker mode, it only runs on the workers, not the master (AFAICT).

Additionally there was bug with the key being wrong so `_invalidate_caches_for_event` never invalidates the `have_seen_event` cache even when it does run.

Because we were using the `@cachedList` wrong, it was putting items in the cache under keys like `((room_id, event_id),)` with a `set` in a `set` (ex. `(('!TnCIJPKzdQdUlIyXdQ:test', '$Iu0eqEBN7qcyF1S9B3oNB3I91v2o5YOgRNPwi_78s-k'),)`) and we we're trying to invalidate with just `(room_id, event_id)` which did nothing.
2022-09-27 15:55:43 -05:00
David Robertson
f5aaa55e27
Add new columns tracking when we partial-joined (#13892) 2022-09-27 17:26:35 +01:00
Erik Johnston
e8318a4333
Handle the case of remote users leaving a partial join room for device lists (#13885) 2022-09-27 13:01:08 +01:00
Patrick Cloke
2fae1a3f78
Improve tests for get_unread_push_actions_for_user_in_range_*. (#13893)
* Adds a docstring.
* Reduces a small amount of duplicated code.
* Improves tests.
2022-09-26 18:28:12 +00:00
David Robertson
0a38c7ec6d
Snapshot schema 72 (#13873)
Including another batch of fixes to the schema dump script
2022-09-26 18:28:32 +01:00
Nick Mills-Barrett
6b4593a80f
Simplify cache invalidation after event persist txn (#13796)
This moves all the invalidations into a single place and de-duplicates
the code involved in invalidating caches for a given event by using
the base class method.
2022-09-26 16:26:35 +01:00
Eric Eastwood
ac1a31740b
Only try to backfill event if we haven't tried before recently (#13635)
Only try to backfill event if we haven't tried before recently (exponential backoff). No need to keep trying the same backfill point that fails over and over.

Fix https://github.com/matrix-org/synapse/issues/13622
Fix https://github.com/matrix-org/synapse/issues/8451

Follow-up to https://github.com/matrix-org/synapse/pull/13589

Part of https://github.com/matrix-org/synapse/issues/13356
2022-09-23 14:01:29 -05:00
Sean Quah
f49f73c0da
Faster room joins: Avoid blocking /keys/changes (#13888)
Part of the work for #12993.

Once #12993 is fully resolved, we expect `/keys/changes` to behave
sensibly when joined to a room with partial state.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-09-23 17:55:15 +01:00
Patrick Cloke
efd108b45d
Accept & store thread IDs for receipts (implement MSC3771). (#13782)
Updates the `/receipts` endpoint and receipt EDU handler to parse a
`thread_id` from the body and insert it in the database.
2022-09-23 14:33:28 +00:00
Sean Quah
03c2bfb7f8
Send device list updates out to servers in partially joined rooms (#13874)
Use the provided list of servers in the room from the `/send_join`
response, since we will not know which users are in the room.  This
isn't sufficient to ensure that all remote servers receive the right
device list updates, since the `/send_join` response may be inaccurate
or we may calculate the membership state of new users in the room
incorrectly.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-09-23 13:44:03 +01:00
Patrick Cloke
b7272b73aa
Properly paginate forward in the /relations API. (#13840)
This fixes a bug where the `/relations` API with `dir=f` would
skip the first item of each page (except the first page), causing
incomplete data to be returned to the client.
2022-09-22 12:47:49 +00:00
Brendan Abolivier
ccca14140a
Track device IDs for pushers (#13831)
Second half of the MSC3881 implementation
2022-09-21 15:31:53 +00:00
Brendan Abolivier
8ae42ab8fa
Support enabling/disabling pushers (from MSC3881) (#13799)
Partial implementation of MSC3881
2022-09-21 14:39:01 +00:00
Mathieu Velten
6bd8763804
Add cache invalidation across workers to module API (#13667)
Signed-off-by: Mathieu Velten <mathieuv@matrix.org>
2022-09-21 15:32:01 +02:00
David Robertson
fff9b955fa
Generate separate snapshots for logical databases (#13792)
* Generate separate snapshots for sqlite, postgres and common
* Cleanup postgres dbs in the TRAP
* Say which logical DB we're applying updates to
* Run background updates on the state DB
* Add new option for accepting a SCHEMA_NUMBER
2022-09-20 14:14:12 +01:00
Erik Johnston
42d261c32f
Port the push rule classes to Rust. (#13768) 2022-09-20 12:10:31 +01:00
Eric Eastwood
44be42338e
Add support to purge rows from MSC2716 and other tables when purging a room (#13825)
`event_failed_pull_attempts` added in https://github.com/matrix-org/synapse/pull/13589

MSC2716 related tables added in:

 - https://github.com/matrix-org/synapse/pull/10245/files#diff-3d42dfb44d02f7de3aada105e0bdc1cc9dd7f953cbf0f36c5d0f50827bf0320aR1
    - Renamed in https://github.com/matrix-org/synapse/pull/10838/files#diff-2730bfbe9e688b55e46f9371aefe67dac2bd2b2b7d9d6b92774eea1fcfae156dR1
 - https://github.com/matrix-org/synapse/pull/10498/files#diff-c52bbfbb5921a3f6f023b24343668479d966fac164f13b7c39d2197ce3afa7a5R1
2022-09-16 10:56:56 -05:00
Patrick Cloke
b2b0c85279
Support providing an index predicate for upserts. (#13822)
This is useful to upsert against a table which has a unique
partial index while avoiding conflicts.
2022-09-15 18:28:48 +00:00
Eric Eastwood
957e3d74fc
Keep track when we try and fail to process a pulled event (#13589)
We can follow-up this PR with:

 1. Only try to backfill from an event if we haven't tried recently -> https://github.com/matrix-org/synapse/issues/13622
 1. When we decide to backfill that event again, process it in the background so it doesn't block and make `/messages` slow when we know it will probably fail again -> https://github.com/matrix-org/synapse/issues/13623
 1. Generally track failures everywhere we try and fail to pull an event over federation -> https://github.com/matrix-org/synapse/issues/13700

Fix https://github.com/matrix-org/synapse/issues/13621

Part of https://github.com/matrix-org/synapse/issues/13356

Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.qv7cj51sv9i5)
2022-09-14 13:57:50 -05:00
Patrick Cloke
666ae87729
Update event push action and receipt tables to support threads. (#13753)
Adds a `thread_id` column to the `event_push_actions`, `event_push_actions_staging`,
and `event_push_summary` tables. This will notifications to be segmented by the thread
in a future pull request. The `thread_id` column stores the root event ID or the special
value `"main"`.

The `thread_id` column for `event_push_actions` and `event_push_summary` is
backfilled with `"main"` for all existing rows. New entries into `event_push_actions`
and `event_push_actions_staging` will get the proper thread ID.

`receipts_linearized` and `receipts_graph` also gain a `thread_id` column, which is similar,
except `NULL` is a special value meaning the receipt is "unthreaded".

See MSC3771 and MSC3773 for where this data will be useful.
2022-09-14 17:11:16 +00:00
Patrick Cloke
f2d12ccabe
Use partial indices on SQLIte. (#13802)
Partial indices have been supported since SQLite 3.8, but Synapse
now requires >= 3.27, so we can enable support for them.

This requires rebuilding previous indices which were partial on
PostgreSQL, but not on SQLite.
2022-09-14 12:01:42 -04:00
reivilibre
6302753012
Deduplicate is_server_notices_room. (#13780) 2022-09-14 15:53:18 +00:00
David Robertson
51a77e990b
Remove incorrect migration file from state logical DB (#13788)
* Remove incorrect migration file from `state` logical DB

The table `ex_outlier_stream` is part of the `main` logical DB; it
should not have been created in the `state` logical DB. We remove this
migration now as a tidy-up.

Note: we cannot `DROP TABLE IF EXISTS ex_outlier_stream` in a new
migration, because some (most) instances of Synapse host both of these
logical DBs on the same DB cluster.

* Changelog
2022-09-14 14:16:12 +01:00
Sean Quah
c73774467e
Fix bug in device list caching when remote users leave rooms (#13749)
When a remote user leaves the last room shared with the homeserver, we
have to mark their device list as unsubscribed, otherwise we would hold
on to a stale device list in our cache. Crucially, the device list would
remain cached even after the remote user rejoined the room, which could
lead to E2EE failures until the next change to the remote user's device
list.

Fixes #13651.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-09-14 10:42:57 +01:00
Mathieu Velten
12dacecabd
Make sequence cache_invalidation_stream_seq begin at 2 (#13766)
Signed-off-by: Mathieu Velten <mathieuv@matrix.org>
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
2022-09-13 16:14:28 +02:00
David Robertson
b60d47ab2c
Updates to the schema dump script (#13770) 2022-09-13 10:53:11 +01:00
Nick Mills-Barrett
cdbb641232
Add receipts event stream ordering (#13703) 2022-09-13 08:16:37 +01:00
Nick Mills-Barrett
da41a7cd61
Remove check current state membership up to date (#13745)
* Remove checks for membership column in current_state_events
* Add schema script to force through the
  `current_state_events_membership` background job

Contributed by Nick @ Beeper (@fizzadar).
2022-09-12 12:58:33 +01:00
Patrick Cloke
3d9f82efcb
Use an upsert for receipts_graph. (#13752)
Instead of a delete, then insert.

This was previously done for `receipts_linearized` in
2dc430d36e (#7607).
2022-09-09 07:08:41 -04:00
David Robertson
f2d2481e56
Require SQLite >= 3.27.0 (#13760) 2022-09-09 11:14:10 +01:00
Dirk Klimpel
f799eac7ea
Add timestamp to user's consent (#13741)
Co-authored-by: reivilibre <olivier@librepush.net>
2022-09-08 15:41:48 +00:00
Sean Quah
906cead9ca
Update docstrings to explain the impact of partial state (#13750)
Update the docstrings for `get_users_in_room` and
`get_current_hosts_in_room` to explain the impact of partial state.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-09-08 15:55:29 +01:00
Sean Quah
89e8b98b65
Avoid raising errors due to malformed IDs in get_current_hosts_in_room (#13748)
Handle malformed user IDs with no colons in `get_current_hosts_in_room`.
It's not currently possible for a malformed user ID to join a room, so
this error would never be hit.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-09-08 15:55:03 +01:00
Eric Eastwood
d4d3249ded
Instrument get_metadata_for_events for tracing (#13730)
When backfilling, `_get_state_ids_after_missing_prev_event` calls [`get_metadata_for_events`](26bc26586b/synapse/handlers/federation_event.py (L1133)). For `#matrix:matrix.org`, it's called with 77k `state_events` which means 77 calls to the database and takes 28 seconds.
2022-09-07 11:41:52 -05:00
reivilibre
d3d9ca156e
Cancel the processing of key query requests when they time out. (#13680) 2022-09-07 12:03:32 +01:00
reivilibre
c2fe48a6ff
Rename the EventFormatVersions enum values so that they line up with room version numbers. (#13706) 2022-09-07 11:08:20 +01:00
Patrick Cloke
48a5c47a9f
Add a schema delta to drop unstable private read receipts. (#13692)
Otherwise they'll be leaked due to the filtering code only respecting
the stable identifiers for private read receipts.
2022-09-01 14:57:47 -04:00
Erik Johnston
9d2823ab70
Cache is_partial_state_room (#13693)
Fixes #13613.
2022-09-01 16:07:01 +01:00
Šimon Brandner
0e99f07952
Remove support for unstable private read receipts (#13653)
Signed-off-by: Šimon Brandner <simon.bra.ag@gmail.com>
2022-09-01 13:31:54 +01:00
Nick Mills-Barrett
42b11d5565
Remove cached wrap on _get_joined_users_from_context method (#13569)
The method doesn't actually do any data fetching and the method that
does, `_get_joined_profile_from_event_id`, has its own cache.

Signed off by Nick @ Beeper (@Fizzadar).
2022-08-31 12:19:39 +01:00
David Robertson
a160406d24
Fix admin List Room API return type on sqlite (#13509) 2022-08-31 10:38:16 +00:00
Eric Eastwood
92c5817e34
Give the correct next event when the message timestamps are the same - MSC3030 (#13658)
Discovered while working on https://github.com/matrix-org/synapse/pull/13589 and I had all the messages at the same timestamp in the tests.

Part of https://github.com/matrix-org/matrix-spec-proposals/pull/3030

Complement tests: https://github.com/matrix-org/complement/pull/457
2022-08-30 14:50:06 -05:00
Shay
20c76cecb9
Drop unused column application_services_state.last_txn (#13627) 2022-08-30 10:29:16 -07:00
Patrick Cloke
20df96a7a7
Speed up inserting event_push_actions_staging. (#13634)
By using `execute_values` instead of `execute_batch`.
2022-08-30 07:12:48 -04:00
Eric Eastwood
51d732db3b
Optimize how we calculate likely_domains during backfill (#13575)
Optimize how we calculate `likely_domains` during backfill because I've seen this take 17s in production just to `get_current_state` which is used to `get_domains_from_state` (see case [*2. Loading tons of events* in the `/messages` investigation issue](https://github.com/matrix-org/synapse/issues/13356)).

There are 3 ways we currently calculate hosts that are in the room:

 1. `get_current_state` -> `get_domains_from_state`
    - Used in `backfill` to calculate `likely_domains` and `/timestamp_to_event` because it was cargo-culted from `backfill`
    - This one is being eliminated in favor of `get_current_hosts_in_room` in this PR 🕳
 1. `get_current_hosts_in_room`
    - Used for other federation things like sending read receipts and typing indicators
 1. `get_hosts_in_room_at_events`
    - Used when pushing out events over federation to other servers in the `_process_event_queue_loop`

Fix https://github.com/matrix-org/synapse/issues/13626

Part of https://github.com/matrix-org/synapse/issues/13356

Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.2tvwz3yhcafh)


### Query performance

#### Before

The query from `get_current_state` sucks just because we have to get all 80k events. And we see almost the exact same performance locally trying to get all of these events (16s vs 17s):
```
synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
Time: 16035.612 ms (00:16.036)

synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
Time: 4243.237 ms (00:04.243)
```

But what about `get_current_hosts_in_room`: When there is 8M rows in the `current_state_events` table, the previous query in `get_current_hosts_in_room` took 13s from complete freshness (when the events were first added). But takes 930ms after a Postgres restart or 390ms if running back to back to back.

```sh
$ psql synapse
synapse=# \timing on
synapse=# SELECT COUNT(DISTINCT substring(state_key FROM '@[^:]*:(.*)$'))
FROM current_state_events
WHERE
    type = 'm.room.member'
    AND membership = 'join'
    AND room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
 count
-------
  4130
(1 row)

Time: 13181.598 ms (00:13.182)

synapse=# SELECT COUNT(*) from current_state_events where room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
 count
-------
 80814

synapse=# SELECT COUNT(*) from current_state_events;
  count
---------
 8162847

synapse=# SELECT pg_size_pretty( pg_total_relation_size('current_state_events') );
 pg_size_pretty
----------------
 4702 MB
```

#### After

I'm not sure how long it takes from complete freshness as I only really get that opportunity once (maybe restarting computer but that's cumbersome) and it's not really relevant to normal operating times. Maybe you get closer to the fresh times the more access variability there is so that Postgres caches aren't as exact. Update: The longest I've seen this run for is 6.4s and 4.5s after a computer restart.

After a Postgres restart, it takes 330ms and running back to back takes 260ms.

```sh
$ psql synapse
synapse=# \timing on
Timing is on.
synapse=# SELECT
    substring(c.state_key FROM '@[^:]*:(.*)$') as host
FROM current_state_events c
/* Get the depth of the event from the events table */
INNER JOIN events AS e USING (event_id)
WHERE
    c.type = 'm.room.member'
    AND c.membership = 'join'
    AND c.room_id = '!OGEhHVWSdvArJzumhm:matrix.org'
GROUP BY host
ORDER BY min(e.depth) ASC;
Time: 333.800 ms
```

#### Going further

To improve things further we could add a `limit` parameter to `get_current_hosts_in_room`. Realistically, we don't need 4k domains to choose from because there is no way we're going to query that many before we a) probably get an answer or b) we give up. 

Another thing we can do is optimize the query to use a index skip scan:

 - https://wiki.postgresql.org/wiki/Loose_indexscan
 - Index Skip Scan, https://commitfest.postgresql.org/37/1741/
 - https://www.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
2022-08-30 01:38:14 -05:00
Eric Eastwood
d58615c82c
Directly lookup local membership instead of getting all members in a room first (get_users_in_room mis-use) (#13608)
See https://github.com/matrix-org/synapse/pull/13575#discussion_r953023755
2022-08-24 14:13:12 -05:00
Eric Eastwood
b93bd95e8a
When loading current ids, sort by stream_id to avoid incorrect overwrite and avoid errors caused by sorting alphabetical instance name which can be null (#13585)
When loading current ids, sort by stream ID so that we don't want to overwrite the `current_position` of an instance to a lower stream ID than we're actually at ([discussion](https://github.com/matrix-org/synapse/pull/13585#discussion_r951795379)). Previously, it sorted alphabetically by instance name which can be `null` and throw errors but more importantly, accomplishes nothing.

Fixes the following startup error which is why I started looking into this area:

```
$ poetry run synapse_homeserver --config-path homeserver.yaml
****************************************************************
 Error during initialisation:
    '<' not supported between instances of 'NoneType' and 'str'
 There may be more information in the logs.
****************************************************************
```

Somehow my database ended up looking like the following, notice the `instance_name` is `null` in the db, and we can't sort `NoneType` things. Another question is why do we see the `instance_name` as `null` sometimes instead of `master` in monolith mode?
```
$ psql synapse
synapse=# SELECT * FROM stream_positions;
   stream_name   | instance_name | stream_id
-----------------+---------------+-----------
 account_data    | master        |      1242
 events          | master        |      1787
 to_device       | master        |        58
 presence_stream | master        |    485638
 receipts        | master        |       341
 backfill        | master        |   -139106
(6 rows)
synapse=# SELECT instance_name, stream_id FROM receipts_linearized;
 instance_name | stream_id
---------------+-----------
               |       211
               |         3
               |         4
               |       212
               |       213
               |       224
               |       228
               |       164
               |       313
               |       253
               |        38
               |       321
               |       324
               |       189
               |       192
               |       193
               |       194
               |       195
               |       197
               |       198
               |       275
               |        79
               |       339
               |       340
               |        82
               |       341
               |        84
               |        85
               |        91
               |       119
```
2022-08-24 12:53:46 -05:00
Nick Mills-Barrett
b687010f89
Rewrite get push actions queries (#13597) 2022-08-24 10:12:51 +01:00