okhttp: tsan socket data race bug fix #10279
Looks fair. One question.
```diff
- config.handshakerSocketFactory.handshake(bareSocket, Attributes.EMPTY);
+ Socket socket = result.socket;
+ config.handshakerSocketFactory.handshake(socket, Attributes.EMPTY);
+ socket.setTcpNoDelay(true);
```
Move this back before `handshake()`? Given we talked about this, I assume something didn't like it.
Oh, I see #10278 (comment) now. Let me look.
That TSAN failure looks like a bug in Java's Socket. Glancing at the JDK source, Socket is synchronized now, but blame shows that fix landed only very recently: https://bugs.openjdk.org/browse/JDK-8278326

So that failure looks like it is already fixed upstream; it just needs to sync internally. But it also means all previous versions of Socket were broken, which makes shutdown broken overall. Eww...
Let's move the `setTcpNoDelay(true)` up before `handshake()`, but put it in a `synchronized` block. That will either 1) initialize the socket and guarantee it doesn't run concurrently with `shutdown()`, or 2) throw, because `shutdown()` already closed the socket. And we should have a comment linking to that JDK issue.

Looking back at #10228, the same problem was happening there. Could you run a test with `this.socket = result.socket;` commented out, to see whether the "synchronized setTcpNoDelay" fixes the problem by itself? Commenting out `this.socket = result.socket;` is the equivalent of swapping back to the `final bareSocket`. It'd take some work for me to find the TLS race I saw before, and I wonder if the problem was actually always just this Socket bug.
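To make the suggestion concrete, here is a minimal sketch of that workaround pattern. The class and field names below are hypothetical illustrations, not gRPC's actual code: the point is only that the first operation touching the socket takes the same lock as `shutdown()`, so lazy initialization can't race with `close()`.

```java
import java.io.IOException;
import java.net.Socket;

/** Hypothetical sketch of the suggested workaround; not gRPC's actual class. */
class TransportSketch {
  private final Object lock = new Object();
  private final Socket bareSocket;

  TransportSketch(Socket bareSocket) {
    this.bareSocket = bareSocket;
  }

  void startIo() throws IOException {
    // Workaround for https://bugs.openjdk.org/browse/JDK-8278326: Socket's lazy
    // initialization was not thread-safe, so take the same lock as shutdown().
    // Either this initializes the socket, guaranteed not to run concurrently
    // with shutdown(), or it throws because shutdown() already closed it.
    synchronized (lock) {
      bareSocket.setTcpNoDelay(true);
    }
    // handshake(...) would run after this point.
  }

  void shutdown() throws IOException {
    synchronized (lock) {
      bareSocket.close();
    }
  }
}
```

Under this sketch, a `startIo()` that loses the race simply observes the closed socket and throws, instead of racing on the lazy initialization.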
Yes, keeping the `socket.setTcpNoDelay(true)` before `handshake()` results in other data race errors.
Neither seems to fix the issue. Commenting out `this.socket = result.socket` with a synchronized lock around the `setTcpNoDelay` (moved back up) gives us 112/1000 failures. And keeping that line in, with the synchronized lock around the `setTcpNoDelay` (moved back up), gives us 127/1000 failures.

In both cases, a synchronized lock around `setTcpNoDelay(true)` seems to bring the failures down from ~300/1000 to ~100/1000.

It seems like the original code with the `bareSocket` (as well as my revised one) passes all test cases when the `setTcpNoDelay(true)` is either moved down or removed entirely.
> And keeping that line in with the synchronized lock to the setTcpNoDelay (moved back up) gives us 127/1000 failures.
Ugh. I'm realizing that failure is a bug in the test. The race isn't with shutdown there. The "client" (in the test) calls close and it is considering that a race. In real life that would come from the kernel. What's even worse is PipeSocket doesn't actually use any Socket resources; it is a fake. We'll probably need to replace that with a real socket. That also invalidates some of the testing you've done up to this point.
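Replacing the fake with a real socket could look something like this minimal loopback socket-pair sketch (the `SocketPair` name is illustrative, not the actual test code): it allocates real kernel socket resources, so close/shutdown races exercise the real `Socket` paths that `PipeSocket` skips entirely.

```java
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative sketch: build a pair of real, connected sockets over loopback
// so tests exercise actual Socket resources instead of a fake like PipeSocket.
class SocketPair {
  final Socket client;
  final Socket server;

  SocketPair() throws Exception {
    // Port 0 picks an ephemeral port; the listener is only needed long enough
    // to accept the one connection, so try-with-resources closes it afterward.
    try (ServerSocket listener =
        new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
      client = new Socket(listener.getInetAddress(), listener.getLocalPort());
      server = listener.accept();
    }
  }

  void close() throws Exception {
    client.close();
    server.close();
  }
}
```

Closing the `ServerSocket` does not close the accepted connection, so both ends of the pair stay usable for the test body.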
Try to reproduce a failure using #10281 . Assuming that makes things look better, I can send that out as a PR. See go/grpc-prod/java/importing#cherry-picking-prs for how to copy it to your client.
We first noticed the TSAN failure when I saw a TLS handshake failure. Later, Ivy created the issue she assigned to you with a recent failure, but it was a different failure than what I saw. The failure in that issue may be a test problem, and so assuming my change fixes that test problem you'll want to run some more tests to see if you can trigger other failures. (We can chat about this over GVC)
> It seems like the original code with the bareSocket (as well as my revised one) passes all test cases when the setTcpNoDelay(true) is either moved down or removed entirely.
That's probably because handshake() does very little in these tests.
That second link (to the 9%) looks wrong.

Sorry, it's fixed now.

That is worrying because it is an issue in the SSLSocketImpl.close() flow that we were swapping to. But it appears to be pre-existing and a different code path. Run 5 in the 70% run had the same
As discussed offline, having a separate issue tracking the new problem sounds fair to me.
```java
private void startIo(SerializingExecutor serializingExecutor) {
  try {
    bareSocket.setTcpNoDelay(true);
    synchronized (lock) {
```
Do we still need this synchronized? If we do, we're going to need a comment as to why (probably a link to the Java bug).
Yes, and I just added the comment.
Let's explain a bit more.
The socket implementation is lazily initialized, but thread-safety for that lazy initialization was broken (https://bugs.openjdk.org/browse/JDK-8278326). As a workaround, we lock to synchronize initialization with shutdown().
> Let's explain a bit more.
> The socket implementation is lazily initialized, but had broken thread-safety for that laziness https://bugs.openjdk.org/browse/JDK-8278326 . As a workaround, we lock to synchronize initialization with shutdown().
The original TSAN data race described in #10228 was fixed by rewriting a test case that used a PipeSocket, which triggered a bug in Java's Socket, as described in #10281. Doing so uncovered the "original" data race bug: #10284.