* Remove redundant WrappedConnection
The matrix federation client uses an HTTP connection pool, which times out its
idle HTTP connections, so there is no need for any of this business.
* Correctly retry and back off if we get a HTTPerror response
* Refactor request sending to have better excpetions
MatrixFederationHttpClient blindly reraised exceptions to the caller
without differentiating "expected" failures (e.g. connection timeouts
etc) versus more severe problems (e.g. programming errors).
This commit adds a RequestSendFailed exception that is raised when
"expected" failures happen, allowing the TransactionQueue to log them as
warnings while allowing us to log other exceptions as actual exceptions.
- Improve logging: log things in the right order, include destination and txids
in all log lines, don't log successful responses twice
- Fix the docstring on TransportLayerClient.send_transaction
- Don't use treq.request, which is overcomplicated for our purposes: just use a
twisted.web.client.Agent.
- simplify the logic for setting up the bodyProducer
- fix bytes/str confusions
If a connection is lost before a request is read from Request, Twisted
sets `method` (and `uri`) attributes to dummy values. These dummy values
have incorrect types (i.e. they're not bytes), and so things like
`__repr__` would raise an exception.
To fix this we had a helper method to return the method with a
consistent type.
If a HTTP handler throws an exception while processing a request we
automatically write a JSON error response. If the handler had already
started writing a response twisted throws an exception.
We should check for this case and simple abort the connection if there
was an error after the response had started being written.
The existing deferred timeout helper function (and the one into twisted)
suffer from a bug when a deferred's canceller throws an exception, #3842.
The new helper function doesn't suffer from this problem.
We want to wait until we have read the response body before we log the request
as complete, otherwise a confusing thing happens where the request appears to
have completed, but we later fail it.
To do this, we factor the salient details of a request out to a separate
object, which can then keep track of the txn_id, so that it can be logged.
The problem with dumping all of the json response into the Request object at
once is that doing so starts the timeout for the next request to be received:
so if it takes longer than 60s to stream back the response to the client, the
client never gets it.
The correct solution is to use a Producer; then the timeout is only started
once all of the content is sent over the TCP connection.
This commit moves a bunch of the logic for deciding when to log the receipt and
completion of HTTP requests into SynapseRequest, rather than in the request
handling wrappers.
Advantages of this are:
* we get logs for *all* requests (including OPTIONS and HEADs), rather than
just those that end up hitting handlers we've remembered to decorate
correctly.
* when a request handler wires up a Producer (as the media stuff does
currently, and as other things will do soon), we log at the point that all
of the traffic has been sent to the client.
This code brings the SimpleHttpClient into line with the
MatrixFederationHttpClient by having it raise HttpResponseExceptions when a
request fails (rather than trying to parse for matrix errors and maybe raising
MatrixCodeMessageException).
Then, whenever we were checking for MatrixCodeMessageException and turning them
into SynapseErrors, we now need to check for HttpResponseExceptions and call
to_synapse_error.
We really shouldn't be sending all CodeMessageExceptions back over the C-S API;
it will include things like 401s which we shouldn't proxy.
That means that we need to explicitly turn a few HttpResponseExceptions into
SynapseErrors in the federation layer.
The effect of the latter is that the matrix errcode will get passed through
correctly to calling clients, which might help with some of the random
M_UNKNOWN errors when trying to join rooms.
We need to do a bit more validation when we get a server name, but don't want
to be re-doing it all over the shop, so factor out a separate
parse_and_validate_server_name, and do the extra validation.
Also, use it to verify the server name in the config file.
Make sure that server_names used in auth headers are sane, and reject them with
a sensible error code, before they disappear off into the depths of the system.
otherwise we explode with:
```
Traceback (most recent call last):
File /usr/lib/python2.7/logging/handlers.py, line 78, in emit
logging.FileHandler.emit(self, record)
File /usr/lib/python2.7/logging/__init__.py, line 950, in emit
StreamHandler.emit(self, record)
File /usr/lib/python2.7/logging/__init__.py, line 887, in emit
self.handleError(record)
File /usr/lib/python2.7/logging/__init__.py, line 810, in handleError
None, sys.stderr)
File /usr/lib/python2.7/traceback.py, line 124, in print_exception
_print(file, 'Traceback (most recent call last):')
File /usr/lib/python2.7/traceback.py, line 13, in _print
file.write(str+terminator)
File /home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/logger/_io.py, line 170, in write
self.log.emit(self.level, format=u{log_io}, log_io=line)
File /home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/logger/_logger.py, line 144, in emit
self.observer(event)
File /home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/logger/_observer.py, line 136, in __call__
errorLogger = self._errorLoggerForObserver(brokenObserver)
File /home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/logger/_observer.py, line 156, in _errorLoggerForObserver
if obs is not observer
File /home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/logger/_observer.py, line 81, in __init__
self.log = Logger(observer=self)
File /home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/logger/_logger.py, line 64, in __init__
namespace = self._namespaceFromCallingContext()
File /home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/logger/_logger.py, line 42, in _namespaceFromCallingContext
return currentframe(2).f_globals[__name__]
File /home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/compat.py, line 93, in currentframe
for x in range(n + 1):
RuntimeError: maximum recursion depth exceeded while calling a Python object
Logged from file site.py, line 129
File /usr/lib/python2.7/logging/__init__.py, line 859, in emit
msg = self.format(record)
File /usr/lib/python2.7/logging/__init__.py, line 732, in format
return fmt.format(record)
File /usr/lib/python2.7/logging/__init__.py, line 471, in format
record.message = record.getMessage()
File /usr/lib/python2.7/logging/__init__.py, line 335, in getMessage
msg = msg % self.args
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
Logged from file site.py, line 129
```
...where the logger apparently recurses whilst trying to log the error, hitting the
maximum recursion depth and killing everything badly.
When we finish processing a request, log the number of events we fetched from
the database to handle it.
[I'm trying to figure out which requests are responsible for large amounts of
event cache churn. It may turn out to be more helpful to add counts to the
prometheus per-request/block metrics, but that is an extension to this code
anyway.]
(instead of everywhere that writes a response. Or rather, the subset of places
which write responses where we haven't forgotten it).
This also means that we don't have to have the mysterious version_string
attribute in anything with a request handler.
Unfortunately it does mean that we have to pass the version string wherever we
instantiate a SynapseSite, which has been c&ped 150 times, but that is code
that ought to be cleaned up anyway really.
This is useful in its own right, because server.py is full of stuff; but more
importantly, I want to do some refactoring that will cause a circular reference
as it is.
It is especially important that sync requests don't get cached, as if a
sync returns the same token given then the client will call sync with
the same parameters again. If the previous response was cached it will
get reused, resulting in the client tight looping making the same
request and never making any progress.
In general, clients will expect to get up to date data when requesting
APIs, and so its safer to do a blanket no cache policy than only
whitelisting APIs that we know will break things if they get cached.
It's useful to know when there are peaks in incoming requests - which isn't
quite the same as there being peaks in outgoing responses, due to the time
taken to handle requests.
Add federation_domain_whitelist
gives a way to restrict which domains your HS is allowed to federate with.
useful mainly for gracefully preventing a private but internet-connected HS from trying to federate to the wider public Matrix network
Avoid throwing a (harmless) exception when we try to write an error response to
an http request where the client has disconnected.
This comes up as a CRITICAL error in the logs which tends to mislead people
into thinking there's an actual problem
For each request, track the amount of time spent waiting for a db
connection. This entails adding it to the LoggingContext and we may as well add
metrics for it while we are passing.
In order to circumvent the number of duplicate foo:count metrics increasing
without bounds, it's time for a rearrangement.
The following are all deprecated, and replaced with synapse_util_metrics_block_count:
synapse_util_metrics_block_timer:count
synapse_util_metrics_block_ru_utime:count
synapse_util_metrics_block_ru_stime:count
synapse_util_metrics_block_db_txn_count:count
synapse_util_metrics_block_db_txn_duration:count
The following are all deprecated, and replaced with synapse_http_server_response_count:
synapse_http_server_requests
synapse_http_server_response_time:count
synapse_http_server_response_ru_utime:count
synapse_http_server_response_ru_stime:count
synapse_http_server_response_db_txn_count:count
synapse_http_server_response_db_txn_duration:count
The following are renamed (the old metrics are kept for now, but deprecated):
synapse_util_metrics_block_timer:total ->
synapse_util_metrics_block_time_seconds
synapse_util_metrics_block_ru_utime:total ->
synapse_util_metrics_block_ru_utime_seconds
synapse_util_metrics_block_ru_stime:total ->
synapse_util_metrics_block_ru_stime_seconds
synapse_util_metrics_block_db_txn_count:total ->
synapse_util_metrics_block_db_txn_count
synapse_util_metrics_block_db_txn_duration:total ->
synapse_util_metrics_block_db_txn_duration_seconds
synapse_http_server_response_time:total ->
synapse_http_server_response_time_seconds
synapse_http_server_response_ru_utime:total ->
synapse_http_server_response_ru_utime_seconds
synapse_http_server_response_ru_stime:total ->
synapse_http_server_response_ru_stime_seconds
synapse_http_server_response_db_txn_count:total ->
synapse_http_server_response_db_txn_count
synapse_http_server_response_db_txn_duration:total
synapse_http_server_response_db_txn_duration_seconds
If somebody sends us a request where the the body is invalid utf-8, we should
return a 400 rather than a 500. (json.loads throws a UnicodeError in this
situation)
We might as well catch all Exceptions here: it seems very unlikely that we
would get a request that *isn't caused by invalid json.
preserve_context_over_fn uses a ContextPreservingDeferred, which only restores
context for the duration of its callbacks, which isn't really correct, and
means that subsequent operations in the same request can end up without their
logcontexts.
When we proxy a media request to a remote server, add a query-param, which will
tell the remote server to 404 if it doesn't recognise the server_name.
This should fix a routing loop where the server keeps forwarding back to
itself.
Also improves the error handling on remote media fetches, so that we don't
always return a rather obscure 502.
The abort() method calls loseConnection() which tries to shutdown the
TLS connection cleanly. We now call abortConnection() directly which
should promptly close both the TLS connection and the underlying TCP
connection.
I also added some TODO markers to consider cancelling the old previous
timeout rather than checking time.time(). But given how urgently we want
to get this code released I'd rather leave the existing code with the
duplicate timeouts and the time.time() check.
Add a timeout parameter for controlling how long synapse will wait
for responses from remote servers. For servers that fail include how
they failed to make it easier to debug.
Fetch keys from different servers in parallel rather than in series.
Set the default timeout to 10s.
Always set the config key with an empty list, even if a list isn't specified.
This means that the codepaths are the same for both the empty list and
for a missing key. Since the behaviour is the same for both cases this
makes the code somewhat easier to reason about.