Conversation


@tonyjongyoonan tonyjongyoonan commented Jun 13, 2023

The original TSAN data race described in #10228 was fixed by rewriting a test case that used a PipeSocket which triggered a bug in Java's Socket as described in #10281. Doing so uncovered the "original" data race bug: #10284.

@tonyjongyoonan tonyjongyoonan requested a review from ejona86 June 13, 2023 18:29
@tonyjongyoonan tonyjongyoonan linked an issue Jun 13, 2023 that may be closed by this pull request

@ejona86 ejona86 left a comment


Looks fair. One question.

config.handshakerSocketFactory.handshake(bareSocket, Attributes.EMPTY);
Socket socket = result.socket;
config.handshakerSocketFactory.handshake(socket, Attributes.EMPTY);
socket.setTcpNoDelay(true);
Member

Move this back before handshake()? Given we talked about this, I assume something didn't like it.

Member

Oh, I see #10278 (comment) now. Let me look.

Member
@ejona86 ejona86 Jun 13, 2023

That TSAN failure looks like a bug in Java's Socket. Glancing at the JDK, the implementation is synchronized now, but that was a recent change; it was fixed quite recently, as blame shows: https://bugs.openjdk.org/browse/JDK-8278326

So that failure looks like it is already fixed upstream; it just needs to sync internally. But it also looks like all previous versions of Socket were broken. This makes shutdown overall broken. Eww...

Let's move the setTcpNoDelay(true) up before handshake(), but put it in a synchronized block. That will then either 1) initialize the socket and guarantee it doesn't run concurrently with shutdown(), or 2) throw, because shutdown() already closed the socket. And we should have a comment linking to that JDK issue.

Looking back at #10228 , the same problem was happening there. Could you run a test with this.socket = result.socket; commented out to see if the "synchronized setTcpNoDelay" fixes the problem by itself? Commenting out this.socket = result.socket; is the equivalent of swapping back to the "final bareSocket". It'd take some work for me to find the TLS race I saw before, and I wonder if the problem was actually just always this Socket bug.
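
For illustration, here is a runnable sketch of the behavior this suggestion relies on, using a plain java.net.Socket rather than the actual grpc-java transport code (the lock name and structure are illustrative only): inside the synchronized block, setTcpNoDelay() either initializes the socket, or throws because close() already ran.

```java
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;

public class Main {
  public static void main(String[] args) throws Exception {
    final Object lock = new Object();
    InetAddress loop = InetAddress.getLoopbackAddress();
    try (ServerSocket server = new ServerSocket(0, 1, loop);
         Socket client = new Socket(loop, server.getLocalPort());
         Socket socket = server.accept()) {
      // Case 1: the lock is held, so this can't run concurrently with a
      // shutdown() that also takes the lock; initialization succeeds.
      synchronized (lock) {
        socket.setTcpNoDelay(true);
      }
      System.out.println("tcpNoDelay=" + socket.getTcpNoDelay());

      // Case 2: the socket was already closed (as shutdown() would do),
      // so setTcpNoDelay() throws instead of racing with the close.
      socket.close();
      try {
        synchronized (lock) {
          socket.setTcpNoDelay(true);
        }
      } catch (SocketException expected) {
        System.out.println("after close: threw SocketException");
      }
    }
  }
}
```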

@tonyjongyoonan tonyjongyoonan Jun 13, 2023

Yes, keeping the socket.setTcpNoDelay(true) before handshake() results in other data race errors.

@tonyjongyoonan tonyjongyoonan Jun 13, 2023

Neither seems to fix the issue.

Commenting out this.socket = result.socket, with the synchronized lock around setTcpNoDelay (moved back up), gives us 112/1000 failures.

And keeping that line in, with the synchronized lock around setTcpNoDelay (moved back up), gives us 127/1000 failures.

In both cases, a synchronized lock on setTcpNoDelay(true) seems to bring the failures down from ~300/1000 to ~100/1000.

It seems like the original code with the bareSocket (as well as my revised one) passes all test cases when the setTcpNoDelay(true) is either moved down or removed entirely.

Member

> And keeping that line in with the synchronized lock to the setTcpNoDelay (moved back up) gives us 127/1000 failures.

Ugh. I'm realizing that failure is a bug in the test. The race isn't with shutdown there: the "client" (in the test) calls close, and TSAN considers that a race. In real life that close would come from the kernel. What's even worse, PipeSocket doesn't actually use any Socket resources; it is a fake. We'll probably need to replace it with a real socket. That also invalidates some of the testing you've done up to this point.

Try to reproduce a failure using #10281 . Assuming that makes things look better, I can send that out as a PR. See go/grpc-prod/java/importing#cherry-picking-prs for how to copy it to your client.

We first noticed the TSAN failure when I saw a TLS handshake failure. Later, Ivy created the issue she assigned to you with a recent failure, but it was a different failure than what I saw. The failure in that issue may be a test problem, and so assuming my change fixes that test problem you'll want to run some more tests to see if you can trigger other failures. (We can chat about this over GVC)

> It seems like the original code with the bareSocket (as well as my revised one) passes all test cases when the setTcpNoDelay(true) is either moved down or removed entirely.

That's probably because handshake() does very little in these tests.
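
As a self-contained illustration of the "real socket instead of a fake" idea (names here are hypothetical; #10281 has the actual change), a test can build a loopback socket pair so that close() and shutdown semantics come from the OS rather than from a fake:

```java
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class Main {
  // Illustrative stand-in for a fake PipeSocket: a real loopback socket
  // pair, so the peer observes close() the way a kernel socket would.
  static Socket[] loopbackSocketPair() throws Exception {
    InetAddress loop = InetAddress.getLoopbackAddress();
    try (ServerSocket server = new ServerSocket(0, 1, loop)) {
      Socket client = new Socket(loop, server.getLocalPort());
      Socket serverSide = server.accept();
      return new Socket[] {client, serverSide};
    }
  }

  public static void main(String[] args) throws Exception {
    Socket[] pair = loopbackSocketPair();
    pair[0].getOutputStream().write(42);
    System.out.println("received=" + pair[1].getInputStream().read());
    // After the client closes, the server side sees EOF, as with a real
    // kernel socket; a fake may not model this faithfully.
    pair[0].close();
    System.out.println("eof=" + (pair[1].getInputStream().read() == -1));
    pair[1].close();
  }
}
```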

Member

We've been barking up the wrong tree. #10281 will fix the TSAN failure you were looking at. The problem there shouldn't be fixed by changing the OkHttpServerTransport. However, I reproduced #10284 and that issue does need a fix like we have here.


tonyjongyoonan commented Jun 15, 2023

It seems like this change doesn't fully fix the data races, but reduces them from occurring ~70% of the time to ~9%.

@tonyjongyoonan tonyjongyoonan linked an issue Jun 15, 2023 that may be closed by this pull request

ejona86 commented Jun 15, 2023

That second link (to the 9%) looks wrong.

@tonyjongyoonan

> That second link (to the 9%) looks wrong.

Sorry, it's fixed now.


ejona86 commented Jun 16, 2023

> It seems like this change doesn't fully fix the data races, but reduces them from occurring ~70% of the time to ~9%.

That is worrying, because it is an issue in the SSLSocketImpl.close() flow that we were swapping to. But it appears to be pre-existing and on a different code path; run 5 in the 70% run had the same javax.crypto.Cipher race. With the more frequent problem fixed, we're now able to notice a rarer problem that was previously lost in the noise. We can file an issue for that and track it separately.


@YifeiZhuang YifeiZhuang left a comment


As discussed offline, having a separate issue tracking the new problem sounds fair to me.

private void startIo(SerializingExecutor serializingExecutor) {
try {
bareSocket.setTcpNoDelay(true);
synchronized (lock) {
Member

Do we still need this synchronized? If we do, we're going to need a comment as to why (probably a link to the Java bug).

Contributor Author

Yes, and I just added the comment.

Member

Let's explain a bit more.

The socket implementation is lazily initialized, but had broken thread-safety for that laziness (https://bugs.openjdk.org/browse/JDK-8278326). As a workaround, we lock to synchronize initialization with shutdown().
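
Put together, the workaround might look like the following self-contained sketch. The class, field, and method names are illustrative only, not the actual OkHttpServerTransport code; the point is that startIo() and shutdown() take the same lock, so lazy initialization and close can't race.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;

public class Main {
  private final Object lock = new Object();
  private final Socket bareSocket;

  Main(Socket socket) {
    this.bareSocket = socket;
  }

  void startIo() throws IOException {
    // The socket implementation is lazily initialized, but had broken
    // thread-safety for that laziness (JDK-8278326). As a workaround, we
    // hold the lock so initialization can't run concurrently with shutdown().
    synchronized (lock) {
      bareSocket.setTcpNoDelay(true);
    }
  }

  void shutdown() throws IOException {
    // Taking the same lock pairs the close with the lazy initialization.
    synchronized (lock) {
      bareSocket.close();
    }
  }

  public static void main(String[] args) throws Exception {
    InetAddress loop = InetAddress.getLoopbackAddress();
    try (ServerSocket server = new ServerSocket(0, 1, loop);
         Socket client = new Socket(loop, server.getLocalPort());
         Socket accepted = server.accept()) {
      Main transport = new Main(accepted);
      transport.startIo();
      System.out.println("started=" + accepted.getTcpNoDelay());
      transport.shutdown();
      try {
        transport.startIo();
      } catch (SocketException e) {
        System.out.println("startIo after shutdown: threw");
      }
    }
  }
}
```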


@tonyjongyoonan tonyjongyoonan merged commit 2effae6 into grpc:master Jun 22, 2023
@tonyjongyoonan tonyjongyoonan deleted the okhttpservertransport-socket-data-race branch June 22, 2023 17:46
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023

Development

Successfully merging this pull request may close these issues.

OkHttpServer TlsTest TSAN race

3 participants