When we get behind on replication, we tend to stack up background processes
behind a linearizer. Background processes are heavy (particularly with respect to
Prometheus metrics) and linearizers aren't terribly efficient once the queue
gets long either.
A better approach is to maintain a queue of requests to be processed, and
nominate a single process to work its way through the queue.
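Roughly, the pattern looks like this (a hypothetical sketch written with asyncio for brevity rather than the actual Twisted-based Synapse code; `QueueProcessor` and `process_item` are made-up names):
```python
import asyncio
from collections import deque
from typing import Awaitable, Callable, Deque


class QueueProcessor:
    """Queue up work items and let a single background task drain the queue,
    rather than stacking one heavyweight background process per item behind
    a lock."""

    def __init__(self, process_item: Callable[[str], Awaitable[None]]) -> None:
        self._process_item = process_item
        self._queue: Deque[str] = deque()
        self._running = False

    def enqueue(self, item: str) -> None:
        self._queue.append(item)
        if not self._running:
            # Nominate this caller to spawn the (single) worker task.
            self._running = True
            asyncio.create_task(self._run())

    async def _run(self) -> None:
        try:
            while self._queue:
                await self._process_item(self._queue.popleft())
        finally:
            self._running = False
```
Every caller just appends to the queue; only the caller that finds no worker running starts one, so at most one task is ever doing the expensive processing.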
Fixes: #7444
When considering rooms to clean up in `delete_old_current_state_events`, skip
rooms which we are creating, which otherwise look a bit like rooms we have
left.
Fixes #7834.
As far as I can tell from the sentry logs, the only time this has actually done
anything in the last two years is when we had two master workers running at
once, and even then, it made a bit of a mess of it (see
https://github.com/matrix-org/synapse/issues/7845#issuecomment-658238739).
Generally I feel like this code is doing more harm than good.
The replication client requires that arguments are given as keyword
arguments, which was not done in this case. We also pull out the logic
so that we can catch and handle any exceptions raised, rather than
leaving them unhandled.
When fetching the state of a room over federation we receive the event
IDs of the state and auth chain. We then fetch those events that we
don't already have.
However, we used a function that recursively fetched any missing auth
events for the fetched events, which can lead to a lot of recursion if
the server is missing most of the auth chain. This work is entirely
pointless because we would have already queued up the missing events in the auth
chain to be fetched.
Let's just disable the recursion, since it only gets called from one
place anyway.
Fixes #2181.
The basic premise is that, when we
fail to reject an invite via the remote server, we can generate our own
out-of-band leave event and persist it as an outlier, so that we have something
to send to the client.
* Starting with apt 1.6, https support has moved into the main package and apt-transport-https has become a transitional dummy package.
Signed-off-by: Dirk Heinrichs <dirk.heinrichs@altum.de>
This table is no longer used, so we may as well stop populating it. Removing it
would prevent people rolling back to older releases of Synapse, so that can
happen in a future release.
* Fix spec compliance; tweaks without values are valid
(default to True, which is only concretely specified for
`highlight`, but it seems only reasonable to generalise)
* Changelog for 7766.
* Add documentation to `tweaks_for_actions`
May as well tidy up when I'm here.
* Add a test for `tweaks_for_actions`
Fixes https://github.com/matrix-org/synapse/issues/7641
The package was pinned to <0.8.0 without obvious reasoning in
7ad1d7635
in https://github.com/matrix-org/synapse/pull/5636;
the version selection looks like it just tries to exclude an arbitrary
next minor version number that might introduce API-breaking changes.
Excluding the next minor version might be a conservative choice, but downstream
distributions have already reported success patching out the version
requirements.
This also fixes the integration of upgraded packages into openSUSE packages,
e.g. for openSUSE Tumbleweed, which already ships prometheus_client >= 0.8.
Signed-off-by: Oliver Kurz <okurz@suse.de>
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
The CI appears to use the latest version of isort, which is a problem when isort gets a major version bump. Rather than try to pin the version, I've done the necessary work to make isort 5 happy with synapse.
Merge tag 'v1.16.0rc2' into develop
Synapse 1.16.0rc2 (2020-07-02)
==============================
Synapse 1.16.0rc2 includes the security fixes released with Synapse 1.15.2.
Please see [below](https://github.com/matrix-org/synapse/blob/master/CHANGES.md#synapse-1152-2020-07-02) for more details.
Improved Documentation
----------------------
- Update postgres image in example `docker-compose.yaml` to tag `12-alpine`. ([\#7696](https://github.com/matrix-org/synapse/issues/7696))
Internal Changes
----------------
- Add some metrics for inbound and outbound federation latencies: `synapse_federation_server_pdu_process_time` and `synapse_event_processing_lag_by_event`. ([\#7771](https://github.com/matrix-org/synapse/issues/7771))
- Remove the requirement for a specific version of Python
- Move dep comment to a separate line, as Tox 3.7.0 doesn't like trailing ones
Signed-off-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
State res v2 across large data sets can be very CPU intensive, and if
all the relevant events are in the cache the algorithm will run from
start to finish within a single reactor tick. This can result in
blocking the reactor tick for several seconds, which can have major
repercussions on other requests.
To fix this we simply add the occasional `sleep(0)` during iterations to
yield execution until the next reactor tick. The aim is to only do this
for large data sets so that we don't impact otherwise quick resolutions.
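The pattern, sketched with `asyncio.sleep(0)` standing in for yielding back to the Twisted reactor (the iteration count and helper are illustrative, not the values used in Synapse):
```python
import asyncio

_YIELD_AFTER_ITERATIONS = 100  # illustrative value only


def _cpu_heavy_step(event):
    # stand-in for a chunk of state-resolution work
    return event


async def resolve_events(events):
    result = []
    for i, event in enumerate(events, start=1):
        result.append(_cpu_heavy_step(event))
        # Periodically yield control back to the event loop so that a large
        # resolution doesn't block everything else for several seconds.
        if i % _YIELD_AFTER_ITERATIONS == 0:
            await asyncio.sleep(0)
    return result


# asyncio.run(resolve_events(range(100_000)))
```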
HTTP requires the response to contain a Content-Length header unless chunked encoding is being used.
Prometheus metrics endpoint did not set this, causing software such as prometheus-proxy to not be able to scrape synapse for metrics.
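For reference, a minimal Twisted resource that sets the header explicitly might look like this (a sketch, not Synapse's actual metrics resource; it assumes the standard `prometheus_client` exposition helpers):
```python
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest
from twisted.web.resource import Resource


class MetricsResourceSketch(Resource):
    isLeaf = True

    def render_GET(self, request):
        body = generate_latest()
        request.setHeader(b"Content-Type", CONTENT_TYPE_LATEST.encode("ascii"))
        # Without an explicit Content-Length (and without chunked transfer
        # encoding), scrapers such as prometheus-proxy can fail to read the body.
        request.setHeader(b"Content-Length", str(len(body)).encode("ascii"))
        return body
```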
Signed-off-by: Christian Svensson <blue@cmd.nu>
Older versions of the `parameterized` package have no `parameterized_class` decorator. This decorator is used in tests.
Signed-off-by: Oleg Girko <ol@infoserver.lv>
* Always return an unread_count in get_unread_event_push_actions_by_room_for_user
* Don't always expect unread_count to be there so we don't take out sync entirely if something goes wrong
This requires a new config option to specify which media repo should be
responsible for running background jobs to e.g. clear out expired URL
preview caches.
The aim here is to make it easier to reason about when streams are limited and when they're not, by moving the logic into the database functions themselves. This should mean we can kill off the `db_query_to_update_function` function.
This ended up being a bit more invasive than I'd hoped for (not helped by
generic_worker duplicating some of the code from homeserver), but hopefully
it's an improvement.
The idea is that, rather than storing unstructured `dict`s in the config for
the listener configurations, we instead parse it into a structured
`ListenerConfig` object.
Fixes https://github.com/matrix-org/synapse/issues/7683
Broke in: #7649
We had a `yield` acting on a coroutine. To be fair this one is a bit difficult to notice as there's a function in the middle that just passes the coroutine along.
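A contrived sketch of the failure mode, using Twisted's `inlineCallbacks` directly (all names here are made up for illustration):
```python
from twisted.internet import defer


async def get_count():
    return 3  # stand-in for an async database call


def passthrough():
    # A middle function that just hands the coroutine along, which is what
    # makes the missing `ensureDeferred`/`await` hard to spot.
    return get_count()


@defer.inlineCallbacks
def broken():
    # BUG: inlineCallbacks hands non-Deferred values straight back, so `count`
    # ends up being the coroutine object itself and the coroutine never runs
    # (Python also warns "coroutine ... was never awaited").
    count = yield passthrough()
    return count


@defer.inlineCallbacks
def fixed():
    # Wrapping the coroutine in a Deferred makes it actually execute.
    count = yield defer.ensureDeferred(passthrough())
    return count


fixed().addCallback(print)  # prints 3
```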
The spec [states](https://matrix.org/docs/spec/client_server/r0.6.1#phone-number) that `m.id.phone` requires the fields `country` and `phone`.
In Synapse, we've been enforcing `country` and `number`.
I am not currently sure whether this affects any client implementations.
This issue was introduced in #1994.
Fixes https://github.com/matrix-org/synapse/issues/2431
Adds config option `encryption_enabled_by_default_for_room_type`, which determines whether encryption should be enabled with the default encryption algorithm in private or public rooms upon creation. Whether the room is private or public is decided based upon the room creation preset that is used.
Part of this PR is also pulling out all of the individual instances of `m.megolm.v1.aes-sha2` into a constant variable to eliminate typos ala https://github.com/matrix-org/synapse/pull/7637
Based on #7637
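As a sketch of what that looks like (hypothetical names; the exact constant and module in Synapse may differ):
```python
class RoomEncryptionAlgorithms:
    # Single source of truth for the algorithm identifier, so the string
    # "m.megolm.v1.aes-sha2" can't be mistyped at individual call sites.
    MEGOLM_V1_AES_SHA2 = "m.megolm.v1.aes-sha2"
    DEFAULT = MEGOLM_V1_AES_SHA2
```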
* Ensure account data stream IDs are unique.
The account data stream is shared between three tables, and the maximum
allocated ID was tracked in a dedicated table. Updating the max ID
happened outside the transaction that allocated the ID, leading to a
race where if the server was restarted then the same ID could be
allocated but the max ID failed to be updated, leading it to be reused.
The ID generators have support for tracking across multiple tables, so
we may as well use that instead of a dedicated table.
* Fix bug in account data replication stream.
If the same stream ID was used in both global and room account data then
the getting updates for the replication stream would fail due to
`heapq.merge(..)` trying to compare a `str` with a `None`. (This is
because you'd have two rows like `(534, '!room')` and `(534, None)` from
the room and global account data tables).
Fix is just to order by stream ID, since we don't rely on the ordering
beyond that. The bug where stream IDs can be reused should be fixed now,
so this case shouldn't happen going forward.
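To illustrate the failure with made-up rows shaped like the ones above:
```python
import heapq

# Rows are (stream_id, room_id); global account data rows have room_id = None.
global_rows = [(533, None), (534, None)]
room_rows = [(534, "!room:example.com")]

try:
    # Merging on the whole tuple compares the second elements when the stream
    # IDs tie, and None is not orderable against str, so this raises TypeError.
    list(heapq.merge(global_rows, room_rows))
except TypeError as e:
    print(e)

# Keying the merge on the stream ID alone avoids the comparison entirely.
print(list(heapq.merge(global_rows, room_rows, key=lambda row: row[0])))
```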
Fixes #7617
While working on https://github.com/matrix-org/synapse/issues/5665 I found myself digging into the `Ratelimiter` class and seeing that it was both:
* Rather undocumented, and
* causing a *lot* of config checks
This PR attempts to refactor and comment the `Ratelimiter` class, as well as encourage config file accesses to only be done at instantiation.
Best to be reviewed commit-by-commit.
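As a rough illustration of the intended shape (a hypothetical sketch, not the actual `Ratelimiter` implementation), the limits are read from the config once and passed in at construction rather than being looked up on every call:
```python
import time


class RatelimiterSketch:
    def __init__(self, rate_hz: float, burst_count: int) -> None:
        # Config is consulted once, here, instead of inside can_do_action.
        self.rate_hz = rate_hz
        self.burst_count = burst_count
        # key -> (outstanding action count, time of last update)
        self._actions = {}

    def can_do_action(self, key: str) -> bool:
        now = time.monotonic()
        count, last = self._actions.get(key, (0.0, now))
        # Leaky-bucket style decay since the last recorded action.
        count = max(0.0, count - (now - last) * self.rate_hz)
        if count >= self.burst_count:
            return False
        self._actions[key] = (count + 1.0, now)
        return True


# limiter = RatelimiterSketch(rate_hz=0.17, burst_count=3)  # values from config
```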
Calls `self.get_success` on all deferred methods instead of abusing `self.pump()`. This has the benefit of working with coroutines, as well as checking that method execution completed successfully.
There are also a few small cleanups that I made in the process.
* Expose `return_html_error`, and allow it to take a Jinja2 template instead of a raw string
* Clean up exception handling in SAML2ResponseResource
* use the existing code in `return_html_error` instead of re-implementing it
(giving it a jinja2 template rather than inventing a new form of template)
* do the exception-catching in the REST layer rather than in the handler
layer, to make sure we catch all exceptions.
It looks like `user_device_resync` was ignoring cross-signing keys from the results received from the remote server. This patch fixes this by processing these keys using the same process `_handle_signing_key_updates` does (effectively factoring that part out of that function).
The query keeps showing up in my slow query log.
This changes the plan under the top-level Sort node from
```
WindowAgg (cost=280335.88..292963.15 rows=561212 width=80) (actual time=138.651..160.562 rows=27112 loops=1)
-> Sort (cost=280335.88..281738.91 rows=561212 width=84) (actual time=138.597..140.622 rows=27112 loops=1)
Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
Sort Method: quicksort Memory: 4581kB
-> Nested Loop (cost=2.83..226745.22 rows=561212 width=84) (actual time=21.548..47.657 rows=27112 loops=1)
-> HashAggregate (cost=2.27..3.28 rows=101 width=8) (actual time=21.526..21.535 rows=20 loops=1)
Group Key: state.state_group
-> CTE Scan on state (cost=0.00..2.02 rows=101 width=8) (actual time=21.280..21.493 rows=20 loops=1)
-> Index Scan using state_groups_state_type_idx on state_groups_state (cost=0.56..2189.40 rows=5557 width=84) (actual time=0.005..0.991 rows=1356 loops=20)
Index Cond: (state_group = state.state_group)
```
to
```
Nested Loop (cost=2.83..226745.22 rows=561212 width=84) (actual time=24.194..52.834 rows=27112 loops=1)
-> HashAggregate (cost=2.27..3.28 rows=101 width=8) (actual time=24.130..24.138 rows=20 loops=1)
Group Key: state.state_group
-> CTE Scan on state (cost=0.00..2.02 rows=101 width=8) (actual time=23.887..24.113 rows=20 loops=1)
-> Index Scan using state_groups_state_type_idx on state_groups_state (cost=0.56..2189.40 rows=5557 width=84) (actual time=0.016..1.159 rows=1356 loops=20)
Index Cond: (state_group = state.state_group)
```
This cuts the execution time from ~190ms to ~130ms, i.e. a reduction
of ~30%.
The full plans are visualised at https://explain.depesz.com/s/WpbT and
https://explain.depesz.com/s/KlEk
Signed-off-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Without this patch, if an error happens which isn't caught by `user_device_resync`, then `_maybe_retry_device_resync` would fail, without retrying the next users in the iteration. This patch fixes this so that it now only logs an error in this case.
Synapse was added to the ports tree in Nov, 2019 by Renaud Allard (https://marc.info/?l=openbsd-ports&m=157417848805329).
With the release of OpenBSD 6.7 on May 22, 2020 a pre-compiled binary is available as well.
The commented-out value for 'client_auth_method' was erroneously 'client_auth_basic',
when the code and docstring say it should be 'client_secret_basic'.
Signed-off-by: Jason Robinson <jasonr@matrix.org>
The bg update never managed to complete, because it kept being interrupted by
transactions which want to take a lock.
Just doing it in the foreground isn't that bad, and is a good deal simpler.
A couple of changes of significance:
* remove the `_last_ack < federation_position` condition, so that
updates will still be correctly processed after restart
* Correctly wire up send_federation_ack to the right class.
we can use `make_in_list_sql_clause` rather than doing our own half-baked
equivalent, which has the benefit of working just fine with empty lists.
(This has quite a lot of tests, so I think it's pretty safe)
The idea here is that if an instance persists an event via the replication HTTP API it can return before we receive that event over replication, which can lead to races where code assumes that persisting an event immediately updates various caches (e.g. current state of the room).
Most of Synapse doesn't hit such races, so we don't do the waiting automagically, instead we do so where necessary to avoid unnecessary delays. We may decide to change our minds here if it turns out there are a lot of subtle races going on.
People probably want to look at this commit by commit.
Instead of doing a complicated dance of deleting and moving aliases one
by one, which sends a canonical alias update into the old room for each
one, let's do it all in one go.
This also changes the function to move *all* local alias events to the new
room, however that happens later on anyway.
PyPy's gc.get_stats() returns an object containing detailed allocator statistics
which could be beneficial to collect as metrics.
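A guarded sketch of how a collector might probe for this (the exact fields returned are PyPy-version specific, so this just dumps what is available):
```python
import gc
import platform

# Only PyPy's gc.get_stats() carries the detailed allocator statistics;
# CPython's gc.get_stats() returns per-generation collection counts instead.
if platform.python_implementation() == "PyPy":
    stats = gc.get_stats()
    # Inspect the repr to see which values are worth wiring up as gauges.
    print(stats)
```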
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
When a call to `user_device_resync` fails, we don't currently mark the remote user's device list as out of sync, nor do we retry to sync it.
https://github.com/matrix-org/synapse/pull/6776 introduced some code infrastructure to mark device lists as stale/out of sync.
This commit uses that code infrastructure to mark device lists as out of sync if processing an incoming device list update makes the device handler realise that the device list is out of sync, but we can't resync right now.
It also adds a looping call to retry all failed resyncs every 30s. This shouldn't cause too much spam in the logs, as this commit also removes the "Failed to handle device list update for..." warning logs when catching `NotRetryingDestination`.
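In Twisted terms the retry loop boils down to something like this (a sketch; `retry_all_failed_resyncs` is a hypothetical stand-in for the real resync logic, and Synapse would typically go through its own clock helper rather than `LoopingCall` directly):
```python
from twisted.internet import task

RETRY_INTERVAL_SECONDS = 30


def start_resync_retry_loop(retry_all_failed_resyncs):
    loop = task.LoopingCall(retry_all_failed_resyncs)
    # now=False: wait a full interval before the first retry attempt.
    loop.start(RETRY_INTERVAL_SECONDS, now=False)
    return loop
```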
Fixes #7418
We don't really make any promises about returning accurate presence data when
presence is disabled, so we may as well just return a static response, rather
than making the master handle a request.
`_is_server_still_joined` will throw if it is given state updates with non-user-ID state keys alongside local user leaves. This is actually rarely a problem since local leaves almost always get persisted by themselves.
(I discovered this on a branch that was otherwise broken, so I haven't seen this in the wild)
* If an error occurs when stopping a process synctl now logs a warning.
* During a restart, synctl will avoid attempting to start Synapse if an error
occurs during stopping Synapse.
Make sure that the AccountDataStream presents complete updates, in the right
order.
This is much the same fix as #7337 and #7358, but applied to a different stream.
This is required as both event persistence and the background update need access to this function. It should be perfectly safe for two workers to write to that table at the same time.
This allows us to have the logic on both master and workers, which is necessary to move event persistence off master.
We also combine the instantiation of ID generators from DataStore and slave stores to the base worker stores. This allows us to select which process writes events independently of the master/worker splits.
The specific headers that are passed using this new configuration format
are Host and X-Forwarded-For, which should be all that's required.
Note that for production another matcher should be added in the first
section to properly handle the base_url lookup:
reverse_proxy /.well-known/matrix/* http://localhost:8008
Signed-off-by: Jeff Peeler <jpeeler@gmail.com>
The aim here is to get to a stage where we have a `PersistEventStore` that holds all the write methods used during event persistence, so that we can take that class out of the `DataStore` mixin and instantiate it separately. This will allow us to instantiate it on processes other than master, while also ensuring it is only available on processes that are configured to write to the events stream.
This is a bit of an architectural change, where we end up with multiple classes per data store (rather than one per data store we have now). We end up having:
1. Storage classes that provide high level APIs that can talk to multiple data stores.
2. Data store modules that consist of classes that must point at the same database instance.
3. Classes in a data store that can be instantiated on processes depending on config.
Before, all streams were only written to from master, so only master needed to respond to `REPLICATE` commands.
Before, all instances wrote to the cache invalidation stream but didn't respond to `REPLICATE`. This was a bug which could lead to missed rows from the cache invalidation stream if an instance was restarted; however, all the caches would be empty in that case, so it wasn't a problem.
Proactively send out `POSITION` commands (as if we had just received a `REPLICATE`) when we connect to Redis. This is important as other instances won't notice we've connected to issue a `REPLICATE` command (unlike for direct TCP connections). This is only currently an issue if master process reconnects without restarting (if it restarts then it won't have written anything and so other instances probably won't have missed anything).
* release-v1.13.0:
Don't UPGRADE database rows
RST indenting
Put rollback instructions in upgrade notes
Fix changelog typo
Oh yeah, RST
Absolute URL it is then
Fix upgrade notes link
Provide summary of upgrade issues in changelog. Fix )
Move next version notes from changelog to upgrade notes
Changelog fixes
1.13.0rc1
Documentation on setting up redis (#7446)
Rework UI Auth session validation for registration (#7455)
Fix errors from malformed log line (#7454)
Drop support for redis.dbid (#7450)