Conversation


@tonyjongyoonan tonyjongyoonan commented Jun 13, 2023

The original TSAN data race described in #10228 was fixed by rewriting a test case that used a PipeSocket which triggered a bug in Java's Socket as described in #10281. Doing so uncovered the "original" data race bug: #10284.

@tonyjongyoonan tonyjongyoonan requested a review from ejona86 June 13, 2023 18:29
@tonyjongyoonan tonyjongyoonan linked an issue Jun 13, 2023 that may be closed by this pull request

@ejona86 ejona86 left a comment


Looks fair. One question.

config.handshakerSocketFactory.handshake(bareSocket, Attributes.EMPTY);
Socket socket = result.socket;
config.handshakerSocketFactory.handshake(socket, Attributes.EMPTY);
socket.setTcpNoDelay(true);
Member

Move this back before handshake()? Given we talked about this, I assume something didn't like it.

Member

Oh, I see #10278 (comment) now. Let me look.

Member
@ejona86 ejona86 Jun 13, 2023

That TSAN failure looks like a bug in Java's Socket. Glancing at the JDK, the implementation is synchronized now, but that was a recent change; it was fixed quite recently, as blame shows: https://bugs.openjdk.org/browse/JDK-8278326

So that failure looks like it is already fixed upstream; it just needs to sync internally. But it also looks like all previous versions of Socket were broken. This makes shutdown overall broken. Eww...

Let's move the setTcpNoDelay(true) up before handshake(), but put it in a synchronized block. That will then either 1) initialize the socket and guarantee it doesn't run concurrently with shutdown(), or 2) throw, because shutdown() already closed the socket. And we should have a comment linking to that JDK issue.

Looking back at #10228 , the same problem was happening there. Could you run a test with this.socket = result.socket; commented out to see if the "synchronized setTcpNoDelay" fixes the problem by itself? Commenting out this.socket = result.socket; is the equivalent of swapping back to the "final bareSocket". It'd take some work for me to find the TLS race I saw before, and I wonder if the problem was actually just always this Socket bug.
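
For illustration, here is a runnable sketch of the behavior this suggestion relies on, using a plain java.net.Socket rather than the actual grpc-java transport code (the lock name and structure are illustrative only): inside the synchronized block, setTcpNoDelay() either initializes the socket, or throws because close() already ran.

```java
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;

public class Main {
  public static void main(String[] args) throws Exception {
    final Object lock = new Object();
    InetAddress loop = InetAddress.getLoopbackAddress();
    try (ServerSocket server = new ServerSocket(0, 1, loop);
         Socket client = new Socket(loop, server.getLocalPort());
         Socket socket = server.accept()) {
      // Case 1: the lock is held, so this can't run concurrently with a
      // shutdown() that also takes the lock; initialization succeeds.
      synchronized (lock) {
        socket.setTcpNoDelay(true);
      }
      System.out.println("tcpNoDelay=" + socket.getTcpNoDelay());

      // Case 2: the socket was already closed (as shutdown() would do),
      // so setTcpNoDelay() throws instead of racing with the close.
      socket.close();
      try {
        synchronized (lock) {
          socket.setTcpNoDelay(true);
        }
      } catch (SocketException expected) {
        System.out.println("after close: threw SocketException");
      }
    }
  }
}
```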

@tonyjongyoonan tonyjongyoonan Jun 13, 2023

Yes, keeping the socket.setTcpNoDelay(true) before handshake() results in other data race errors.

@tonyjongyoonan tonyjongyoonan Jun 13, 2023

Neither seems to fix the issue.

Commenting out this.socket = result.socket, with the synchronized lock around setTcpNoDelay (moved back up), gives us 112/1000 failures.

And keeping that line in, with the synchronized lock around setTcpNoDelay (moved back up), gives us 127/1000 failures.

In both cases, a synchronized lock on setTcpNoDelay(true) seems to bring the failures down from ~300/1000 to ~100/1000.

It seems like the original code with the bareSocket (as well as my revised one) passes all test cases when the setTcpNoDelay(true) is either moved down or removed entirely.

Member

> And keeping that line in with the synchronized lock to the setTcpNoDelay (moved back up) gives us 127/1000 failures.

Ugh. I'm realizing that failure is a bug in the test. The race isn't with shutdown there: the "client" (in the test) calls close, and TSAN considers that a race. In real life that close would come from the kernel. What's even worse, PipeSocket doesn't actually use any Socket resources; it is a fake. We'll probably need to replace it with a real socket. That also invalidates some of the testing you've done up to this point.

Try to reproduce a failure using #10281 . Assuming that makes things look better, I can send that out as a PR. See go/grpc-prod/java/importing#cherry-picking-prs for how to copy it to your client.

We first noticed the TSAN failure when I saw a TLS handshake failure. Later, Ivy created the issue she assigned to you with a recent failure, but it was a different failure than what I saw. The failure in that issue may be a test problem, and so assuming my change fixes that test problem you'll want to run some more tests to see if you can trigger other failures. (We can chat about this over GVC)

> It seems like the original code with the bareSocket (as well as my revised one) passes all test cases when the setTcpNoDelay(true) is either moved down or removed entirely.

That's probably because handshake() does very little in these tests.
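
As a self-contained illustration of the "real socket instead of a fake" idea (names here are hypothetical; #10281 has the actual change), a test can build a loopback socket pair so that close() and shutdown semantics come from the OS rather than from a fake:

```java
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class Main {
  // Illustrative stand-in for a fake PipeSocket: a real loopback socket
  // pair, so the peer observes close() the way a kernel socket would.
  static Socket[] loopbackSocketPair() throws Exception {
    InetAddress loop = InetAddress.getLoopbackAddress();
    try (ServerSocket server = new ServerSocket(0, 1, loop)) {
      Socket client = new Socket(loop, server.getLocalPort());
      Socket serverSide = server.accept();
      return new Socket[] {client, serverSide};
    }
  }

  public static void main(String[] args) throws Exception {
    Socket[] pair = loopbackSocketPair();
    pair[0].getOutputStream().write(42);
    System.out.println("received=" + pair[1].getInputStream().read());
    // After the client closes, the server side sees EOF, as with a real
    // kernel socket; a fake may not model this faithfully.
    pair[0].close();
    System.out.println("eof=" + (pair[1].getInputStream().read() == -1));
    pair[1].close();
  }
}
```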

Member

We've been barking up the wrong tree. #10281 will fix the TSAN failure you were looking at. The problem there shouldn't be fixed by changing the OkHttpServerTransport. However, I reproduced #10284 and that issue does need a fix like we have here.


tonyjongyoonan commented Jun 15, 2023

It seems like this change doesn't fully fix the data races, but reduces them from occurring ~70% of the time to ~9%.

@tonyjongyoonan tonyjongyoonan linked an issue Jun 15, 2023 that may be closed by this pull request

ejona86 commented Jun 15, 2023

That second link (to the 9%) looks wrong.

@tonyjongyoonan

> That second link (to the 9%) looks wrong.

Sorry, it's fixed now.


ejona86 commented Jun 16, 2023

> It seems like this change doesn't fully fix the data races, but reduces them from occurring ~70% of the time to ~9%.

That is worrying, because it is an issue in the SSLSocketImpl.close() flow that we were swapping to. But it appears to be pre-existing and on a different code path; run 5 in the 70% run had the same javax.crypto.Cipher race. With the more frequent problem fixed, we're now able to notice a rarer problem that was previously lost in the noise. We can file an issue for that and track it separately.


@YifeiZhuang YifeiZhuang left a comment


As discussed offline, having a separate issue tracking the new problem sounds fair to me.

private void startIo(SerializingExecutor serializingExecutor) {
try {
bareSocket.setTcpNoDelay(true);
synchronized (lock) {
Member

Do we still need this synchronized? If we do, we're going to need a comment as to why (probably a link to the Java bug).

Contributor Author

Yes, and I just added the comment.

Member

Let's explain a bit more.

The socket implementation is lazily initialized, but had broken thread-safety for that laziness (https://bugs.openjdk.org/browse/JDK-8278326). As a workaround, we lock to synchronize initialization with shutdown().
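
Put together, the workaround might look like the following self-contained sketch. The class, field, and method names are illustrative only, not the actual OkHttpServerTransport code; the point is that startIo() and shutdown() take the same lock, so lazy initialization and close can't race.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;

public class Main {
  private final Object lock = new Object();
  private final Socket bareSocket;

  Main(Socket socket) {
    this.bareSocket = socket;
  }

  void startIo() throws IOException {
    // The socket implementation is lazily initialized, but had broken
    // thread-safety for that laziness (JDK-8278326). As a workaround, we
    // hold the lock so initialization can't run concurrently with shutdown().
    synchronized (lock) {
      bareSocket.setTcpNoDelay(true);
    }
  }

  void shutdown() throws IOException {
    // Taking the same lock pairs the close with the lazy initialization.
    synchronized (lock) {
      bareSocket.close();
    }
  }

  public static void main(String[] args) throws Exception {
    InetAddress loop = InetAddress.getLoopbackAddress();
    try (ServerSocket server = new ServerSocket(0, 1, loop);
         Socket client = new Socket(loop, server.getLocalPort());
         Socket accepted = server.accept()) {
      Main transport = new Main(accepted);
      transport.startIo();
      System.out.println("started=" + accepted.getTcpNoDelay());
      transport.shutdown();
      try {
        transport.startIo();
      } catch (SocketException e) {
        System.out.println("startIo after shutdown: threw");
      }
    }
  }
}
```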


@tonyjongyoonan tonyjongyoonan merged commit 2effae6 into grpc:master Jun 22, 2023
@tonyjongyoonan tonyjongyoonan deleted the okhttpservertransport-socket-data-race branch June 22, 2023 17:46
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023

Development

Successfully merging this pull request may close these issues.

OkHttpServer TlsTest TSAN race

3 participants