Conversation

@MadLittleMods (Contributor) commented Aug 20, 2025

Try dumping the whole response in Request.write directly

Spawning from #17722 (comment) in order to try to reproduce the problem described by the following comment.

The current reasoning for why we use a Twisted Producer is in this comment. The comment was introduced in matrix-org/synapse#10905, but the knowledge actually comes from the PR description of matrix-org/synapse#3701.

# The problem with dumping all of the response into the `Request` object at
# once (via `Request.write`) is that doing so starts the timeout for the
# next request to be received: so if it takes longer than 60s to stream back
# the response to the client, the client never gets it.

Originally, we introduced the _ByteProducer so that we could iteratively encode JSON and send it over the wire, avoiding blocking the Twisted reactor while serializing large JSON objects. It was initially a pull producer and was later changed to a push producer. In matrix-org/synapse#10905, we stopped using the iterative JSON encoder because of its slow Python implementation and now use the standard JsonEncoder.encode functions (which are backed by a C library) in a background thread so we don't block the Twisted reactor.
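
For illustration, this is roughly the shape of the background-thread approach (a minimal sketch only, not Synapse's actual implementation; the helper name respond_with_json_sketch is made up for this example):

    import json

    from twisted.internet import defer, threads


    @defer.inlineCallbacks
    def respond_with_json_sketch(request, json_object):
        # Serialize off the reactor thread so a large object doesn't block
        # the event loop; json.dumps is backed by the C encoder.
        json_bytes = yield threads.deferToThread(
            lambda: json.dumps(json_object).encode("utf-8")
        )
        request.setHeader(b"Content-Type", b"application/json")
        request.write(json_bytes)
        request.finish()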

So it seems like the only reasoning left for using a Twisted Producer is the remaining comment above. This behavior was described as a "bug" by @glyph (founder of the Twisted project), so if we can get this problem solved in Twisted, we can eventually remove our Producer usage.

There's definitely a bug where request.write(big) should not time you out and close the connection if it takes a long time to write big. Have you reported that one?

-- @glyph, #17722 (comment)

This is useful because it removes complexity and eliminates one variable that may be contributing to #17722.

But overall, using a Producer doesn't seem like a completely wrong thing given this description in the docs:

The purpose of this guide is to describe the Twisted producer and consumer system. The producer system allows applications to stream large amounts of data in a manner which is both memory and CPU efficient, and which does not introduce a source of unacceptable latency into the reactor.

-- https://docs.twisted.org/en/twisted-25.5.0/core/howto/producers.html
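
For reference, the producer pattern in question looks roughly like the following: a minimal sketch of a push producer that streams a byte payload in chunks (the class name and chunk size are invented for this example; Synapse's actual _ByteProducer is more involved):

    from zope.interface import implementer

    from twisted.internet.interfaces import IPushProducer


    @implementer(IPushProducer)
    class ChunkedBytesProducer:
        """Sketch: stream `payload` to `request` in chunks, pausing when the
        transport's buffer fills up and only calling finish() at the end."""

        CHUNK_SIZE = 64 * 1024

        def __init__(self, request, payload: bytes):
            self._request = request
            self._payload = payload
            self._offset = 0
            self._paused = False
            request.registerProducer(self, True)  # True => push producer
            self.resumeProducing()

        def resumeProducing(self):
            self._paused = False
            while not self._paused and self._offset < len(self._payload):
                chunk = self._payload[self._offset : self._offset + self.CHUNK_SIZE]
                self._offset += len(chunk)
                self._request.write(chunk)
            if self._offset >= len(self._payload):
                self._request.unregisterProducer()
                self._request.finish()

        def pauseProducing(self):
            self._paused = True

        def stopProducing(self):
            # The client went away; stop writing.
            self._offset = len(self._payload)

Because the transport pauses the producer whenever its buffer fills, by the time finish() runs only a small amount of data is still queued, rather than the entire response body.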

Update: This "bug" behavior is now tracked as a bug in the Twisted repo -> twisted/twisted#12498 (Idle connection timeout incorrectly enforced while sending large response with Request.write(...))

Dev notes

Commit history:

Other smoke:


Twisted Producers: https://docs.twisted.org/en/stable/core/howto/producers.html

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@@ -909,6 +909,9 @@ def _write_bytes_to_request(request: Request, bytes_to_write: bytes) -> None:
large response bodies.
"""

@MadLittleMods (Contributor, Author) commented Aug 21, 2025


What timeout is this comment referring to?

# The problem with dumping all of the response into the `Request` object at
# once (via `Request.write`) is that doing so starts the timeout for the
# next request to be received: so if it takes longer than 60s to stream back
# the response to the client, the client never gets it.

cc @richvdh where this idea originally came from in matrix-org/synapse#3693 / matrix-org/synapse#3701
cc @erikjohnston for introducing this comment in the codebase matrix-org/synapse#10844 / matrix-org/synapse#10905

(see PR description for more context)

As far as I can tell, we don't time out at all at the Synapse level (and I don't think Twisted does either).

Is this some sort of reverse-proxy layer timeout interaction we're talking about here?

For example with nginx:

proxy_connect_timeout       60;
proxy_send_timeout          60;
proxy_read_timeout          60;
send_timeout                60;

Or HAProxy:

# wait for 5s when connecting to a server
timeout connect 5s

# ... but if there is a backlog of requests, wait for 60s before returning a 500
timeout queue 60s

# close client connections 5m after the last request
# (as recommended by https://support.cloudflare.com/hc/en-us/articles/212794707-General-Best-Practices-for-Load-Balancing-with-CloudFlare)
timeout client 900s

# give clients 5m between requests (otherwise it defaults to the value of 'timeout http-request')
timeout http-keep-alive 900s

# give clients 10s to complete a request (either time between handshake and first request, or time spent sending headers)
timeout http-request 10s

# time out server responses after 180s
timeout server 180s

@richvdh (Member) commented Aug 21, 2025


I think it's the request timeout within Twisted. From the Twisted docs:

The initial value of timeOut, which defines the idle connection timeout in seconds, or None to disable the idle timeout.

That is then passed into HTTPChannel, which starts and stops the timeout at appropriate times.

Looking again at what we had before matrix-org/synapse#3701, which was effectively:

    request.write(json_bytes)
    request.finish()

request.write passes its data down into ITransport.write, which says

No data will ever be lost, although (obviously) the connection may be closed before it all gets through.

In other words, if json_bytes is large, it may get buffered, in memory, within the transport.

request.finish() then re-starts Twisted's idle connection timer for the HTTPChannel. So, it's entirely possible that the idle timer elapses before we have finished flushing out all the response bytes. Hence, the problem that PR set out to fix.

Of course, it's possible this has changed in the intervening 7 years, but I don't think so.

(And yes, I probably should have reported it to Twisted at the time. My bad.)

@MadLittleMods (Contributor, Author)


Thank you @richvdh 🙏

I was able to make a reproduction case for this problem and created a bug report in the Twisted repo -> twisted/twisted#12498
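
For context, the reproduction is roughly the following shape (a sketch only; the actual script attached to twisted/twisted#12498 may differ): a resource that hands a large body to Request.write(...) in one go behind a short idle timeout, paired with a deliberately slow-reading client.

    from twisted.internet import reactor
    from twisted.web.resource import Resource
    from twisted.web.server import NOT_DONE_YET, Site


    class BigResponse(Resource):
        isLeaf = True

        def render_GET(self, request):
            # The whole body is handed to the transport at once; it gets
            # buffered in memory and trickles out as fast as the client reads.
            request.write(b"x" * (256 * 1024 * 1024))
            # finish() restarts the HTTPChannel idle timer even though the
            # buffered bytes haven't reached the client yet.
            request.finish()
            return NOT_DONE_YET


    # A 5s idle timeout makes the disconnect easy to hit when the client
    # reads slowly, e.g.:
    #   curl --limit-rate 1M http://localhost:8080/ -o /dev/null
    reactor.listenTCP(8080, Site(BigResponse(), timeout=5))
    reactor.run()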

@MadLittleMods (Contributor, Author) commented Aug 21, 2025

Closing, as the point of this pull request was to try to reproduce the bug described by this comment:

# The problem with dumping all of the response into the `Request` object at
# once (via `Request.write`) is that doing so starts the timeout for the
# next request to be received: so if it takes longer than 60s to stream back
# the response to the client, the client never gets it.

I was able to make a minimal reproduction case and reported it upstream as a bug in the Twisted repo -> twisted/twisted#12498 (updated our comment in the code to reference this in #18855)

We will have to wait for a resolution in Twisted, then wait for a release, and then most likely several more years after that, since our deprecation policy dictates that we support old versions, as far as I can tell.

@glyph (Contributor) commented Aug 21, 2025

We will have to wait for a resolution in Twisted, then wait for a release, and then most likely several more years after that, since our deprecation policy dictates that we support old versions, as far as I can tell.

For what it's worth, Twisted does not support patch releases of old versions. If you want security support from us, you need to upgrade to the more recent versions. So supporting years-old versions of Twisted does encourage people to run unpatched infrastructure. (We don't have a lot of security patches but nonetheless, the idea is that our compatibility policy is such that if you're not getting lots of deprecation warnings from your tests, upgrading should be painless. There's always a week window for new releases too, where you can report any incompatibilities you experience, and we can fix them before finalizing the release candidate.)

@richvdh (Member) commented Aug 22, 2025

So supporting years-old versions of Twisted does encourage people to run unpatched infrastructure.

The reason Synapse supports old versions of Twisted is really just to help downstream packagers (Debian, Fedora, etc). Those distros often contain older versions of Twisted, so if Synapse required the bleeding edge of Twisted, it would be much harder to package.

The idea is that the distros, by choosing to package an old version of Twisted, take on the responsibility for backporting security fixes. Of course, there's debate about whether that's a good approach, particularly given Twisted's backwards-compatibility guarantees, but it's where we are.

If there's a compelling reason to require a specific version of a specific dependency, then we might bump the requirement anyway, but to be honest we've been on the receiving end of quite a lot of shouting from downstream packagers about unnecessary dependencies, so we tend to be pretty conservative about requiring specific versions.

Element's own Debian and Docker packages for Synapse, which include all of the python dependencies, do use the most recent release of Twisted, and nowadays we don't really encourage managing your own installation with pip or poetry or uv or whatever's in fashion this month, unless you're a developer (in which case, you get to keep both parts when it breaks).

The last security fix in Twisted that I can remember affecting us was the request-smuggling one. It looks like we built new packages for Synapse; meanwhile Debian backported the fixes.

@richvdh (Member) commented Aug 22, 2025

[Having said all that, Synapse is no longer included in Debian stable, mostly because nobody had time to backport Synapse's critical fixes. I haven't looked at the situation in Fedora or other distros recently, but maybe it's time to reconsider the policy of being tolerant of old versions of dependencies.]

MadLittleMods added a commit that referenced this pull request Aug 27, 2025
…ed while sending large response with `Request.write(...)` (#18855)

Link upstream Twisted bug ->
twisted/twisted#12498

Spawning from #18852
MadLittleMods added a commit that referenced this pull request Sep 15, 2025
Clarify Python dependency constraints

Spawning from
#18852 (comment)
as I don't actually know the exact rule of thumb. It's unclear to me
what we care about exactly. Our [deprecation
policy](https://element-hq.github.io/synapse/latest/deprecation_policy.html)
mentions Debian oldstable support at least for the version of SQLite.
But then we only refer to Debian stable for the Twisted dependency.