synapse.http.federation.srv_resolver.SrvResolver.resolve_service isn't able to "timeout" properly, and thus stalls federation #9774

@matrixbot


This issue has been migrated from #9774.


Originally discovered on #synapse:matrix.org by @LTangaF

On Joel's server, the following DNS query times out:

root@5d0681f56cda:/# dig _matrix._tcp.matrix.lion.fm SRV

; <<>> DiG 9.11.5-P4-5.1+deb10u3-Debian <<>> _matrix._tcp.matrix.lion.fm SRV
;; global options: +cmd
;; connection timed out; no servers could be reached

A query for a valid SRV record, by contrast, does not time out:

root@5d0681f56cda:/# dig _matrix._tcp.jboi.nl SRV

; <<>> DiG 9.11.5-P4-5.1+deb10u3-Debian <<>> _matrix._tcp.jboi.nl SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 560
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;_matrix._tcp.jboi.nl.          IN      SRV

;; ANSWER SECTION:
_matrix._tcp.jboi.nl.   120     IN      SRV     0 0 443 matrix.jboi.nl.

;; Query time: 40 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Thu Apr 08 20:50:06 UTC 2021
;; MSG SIZE  rcvd: 83

This is already odd, but Synapse currently doesn't specify a timeout when looking up SRV records.

The offending snippet is this:

https://github.com/matrix-org/synapse/blob/c619253db80c8d1c606dc40756dd3c9e3a55a9fb/synapse/http/federation/srv_resolver.py#L137-L139

When the underlying DNS query times out, this lookup never completes, which causes the federation transmission loop to "time out" the whole request and put it on catch-up.

Twisted has the following interface for lookupService:

    def lookupService(name: str, timeout: Sequence[int]) -> "Deferred":
        """
        Perform an SRV record lookup.

        @param name: DNS name to resolve.
        @param timeout: Number of seconds after which to reissue the query.
            When the last timeout expires, the query is considered failed.

        @return: A L{Deferred} which fires with a three-tuple of lists of
            L{twisted.names.dns.RRHeader} instances.  The first element of the
            tuple gives answers.  The second element of the tuple gives
            authorities.  The third element of the tuple gives additional
            information.  The L{Deferred} may instead fail with one of the
            exceptions defined in L{twisted.names.error} or with
            C{NotImplementedError}.
        """

The optional timeout parameter defines that timeout. However, Synapse does not pass one, so the lookup never times out, or at least the timeout it ends up with is not strict enough.

I propose adding a 15-second timeout by passing timeout=(15,) in the SrvResolver.resolve_service snippet.
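A minimal sketch of what forwarding an explicit timeout could look like. This is not the actual Synapse code: StubResolver is a hypothetical stand-in so the snippet runs without Twisted or network access, and SRV_LOOKUP_TIMEOUT is an assumed name for the proposed value. Only the lookupService method name and its timeout argument come from the interface quoted above.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed constant name; the value is the one proposed in this issue:
# a single attempt that is considered failed after 15 seconds.
SRV_LOOKUP_TIMEOUT: Tuple[int, ...] = (15,)


class StubResolver:
    """Hypothetical stand-in for the resolver used by SrvResolver.

    It only records the arguments it receives. The real lookupService
    returns a Deferred; this stub returns the result tuple directly.
    """

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Optional[Sequence[int]]]] = []

    def lookupService(self, name: str, timeout: Optional[Sequence[int]] = None):
        self.calls.append((name, timeout))
        return ([], [], [])  # (answers, authorities, additional)


def resolve_service(resolver, service_name: str):
    # Pass an explicit timeout instead of relying on the resolver default.
    answers, _authorities, _additional = resolver.lookupService(
        service_name, timeout=SRV_LOOKUP_TIMEOUT
    )
    return answers
```

With the stub, calling resolve_service(StubResolver(), "_matrix._tcp.example.com") records a single lookup with timeout=(15,). Against a real Twisted resolver the return value would be a Deferred that has to be awaited or yielded.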

Edit: The default resolver defines timeouts of (1, 3, 11, 45). However, it adds these together, so it effectively tries to resolve DNS for exactly 60 seconds before giving up. It then ends up in a "timeout race" with the previously established HTTP agent timeout (also 60 seconds), so the DNS query never promptly times out before the overlying HTTP request timeout does.
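The arithmetic behind that race can be checked with a few lines of Python. This is a simplified model that takes the description above at face value (each timeout is waited out in turn and the total is their sum); the function and constant names are hypothetical and do not come from Twisted or Synapse.

```python
from itertools import accumulate
from typing import List, Sequence

DEFAULT_DNS_TIMEOUTS = (1, 3, 11, 45)  # resolver defaults quoted in this issue
HTTP_REQUEST_TIMEOUT = 60              # HTTP agent timeout quoted in this issue


def retry_deadlines(timeouts: Sequence[int]) -> List[int]:
    """Seconds since the first query at which each attempt expires."""
    return list(accumulate(timeouts))


def dns_fails_first(timeouts: Sequence[int], http_timeout: int) -> bool:
    """True if the DNS lookup gives up strictly before the HTTP timeout."""
    return sum(timeouts) < http_timeout


print(retry_deadlines(DEFAULT_DNS_TIMEOUTS))                        # [1, 4, 15, 60]
print(dns_fails_first(DEFAULT_DNS_TIMEOUTS, HTTP_REQUEST_TIMEOUT))  # False
print(dns_fails_first((15,), HTTP_REQUEST_TIMEOUT))                 # True
```

Under this model the default schedule expires at the same moment as the HTTP timeout, whereas the proposed (15,) leaves 45 seconds of headroom.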
