Description
I've hit what I think is a miscompilation bug in clang, where a write is moved in an illegal way that introduces a data race and/or use of uninitialized memory. Here is a test case reduced from my real codebase (Compiler Explorer):
#include <coroutine>
#include <utility>

struct SomeAwaitable {
  // Resume the supplied handle once the awaitable becomes ready,
  // returning a handle that should be resumed now for the sake of symmetric transfer.
  // If the awaitable is already ready, return an empty handle without doing anything.
  //
  // Defined in another translation unit. Note that this may contain
  // code that synchronizes with another thread.
  std::coroutine_handle<> Register(std::coroutine_handle<>);
};

// Defined in another translation unit.
void DidntSuspend();

struct Awaiter {
  SomeAwaitable&& awaitable;
  bool suspended;

  bool await_ready() { return false; }

  std::coroutine_handle<> await_suspend(const std::coroutine_handle<> h) {
    // Assume we will suspend unless proven otherwise below. We must do
    // this *before* calling Register, since we may be destroyed by another
    // thread asynchronously as soon as we have registered.
    suspended = true;

    // Attempt to hand off responsibility for resuming/destroying the coroutine.
    const auto to_resume = awaitable.Register(h);

    if (!to_resume) {
      // The awaitable is already ready. In this case we know that Register didn't
      // hand off responsibility for the coroutine. So record the fact that we didn't
      // actually suspend, and tell the compiler to resume us inline.
      suspended = false;
      return h;
    }

    // Resume whatever Register wants us to resume.
    return to_resume;
  }

  void await_resume() {
    // If we didn't suspend, make note of that fact.
    if (!suspended) {
      DidntSuspend();
    }
  }
};

struct MyTask {
  struct promise_type {
    MyTask get_return_object() { return {}; }
    std::suspend_never initial_suspend() { return {}; }
    std::suspend_always final_suspend() noexcept { return {}; }
    void unhandled_exception();

    auto await_transform(SomeAwaitable&& awaitable) {
      return Awaiter{std::move(awaitable)};
    }
  };
};

MyTask FooBar() {
  co_await SomeAwaitable();
}
The idea is that the awaiter is implemented by calling a Register function in a foreign translation unit that decides what to do:
- If the coroutine should be resumed immediately, it returns a null handle to indicate this.
- If the coroutine will be resumed later, it returns some other handle to resume now, for symmetric transfer. (Maybe std::noop_coroutine().)
Further, when we don't actually wind up suspending we need await_resume to do some follow-up work, in this case represented by calling the DidntSuspend function. So we use a suspended member to track whether we actually suspended. This is written before calling Register, and read after resuming.
The bug I see in my codebase is that the write of true to suspended is delayed until after the call to Register. In the reduced test case, we have something similar. Here is what Compiler Explorer gives me for clang with -std=c++20 -O1 -fno-exceptions:
FooBar(): # @FooBar()
push rbx
mov edi, 32
call operator new(unsigned long)
mov rbx, rax
mov qword ptr [rax], offset FooBar() [clone .resume]
mov qword ptr [rax + 8], offset FooBar() [clone .destroy]
lea rdi, [rax + 18]
mov byte ptr [rax + 17], 0
mov rsi, rax
call SomeAwaitable::Register(std::__n4861::coroutine_handle<void>)
mov qword ptr [rbx + 24], rax
test rax, rax
cmove rax, rbx
mov rdi, rax
call qword ptr [rax]
pop rbx
ret
FooBar() [clone .resume]: # @FooBar() [clone .resume]
push rbx
mov rbx, rdi
cmp qword ptr [rdi + 24], 0
jne .LBB1_2
call DidntSuspend()
.LBB1_2:
mov qword ptr [rbx], 0
pop rbx
ret
FooBar() [clone .destroy]: # @FooBar() [clone .destroy]
push rax
call operator delete(void*)
pop rax
ret
The coroutine frame address is in rbx. After calling Register, the returned handle is stored into the coroutine frame at offset 24 and then resumed (or the original handle resumed if it's empty), and later in [clone .resume] the handle in the frame at offset 24 is compared to zero to synthesize the if (!suspended) condition.
But it's not safe to store the returned handle in the coroutine frame unless it's zero: any other value indicates that Register took responsibility for the coroutine handle, and may have passed it off to another thread. So another thread may have called destroy on the handle by the time we get around to writing into it. Similarly, the other thread may already have resumed the coroutine and see an uninitialized value at offset 24.
I think this is a miscompilation. Consider for example that Register may contain a critical section under a mutex that hands the coroutine handle off to another thread to resume, with a similar critical section in the other thread synchronizing with the first. (This is the situation in my codebase.) So we have:
- The write of suspended in await_suspend is sequenced before the call to Register below it in await_suspend.
- The call to Register synchronizes with the function on the other thread that resumes the coroutine.
- That synchronization is sequenced before resuming the coroutine handle.
- Resuming the coroutine handle is (I believe?) sequenced before the call to await_resume that reads suspended.
- Therefore the write of suspended inter-thread happens before the read of suspended.
So there was no data race before, but clang has introduced one by delaying the write to the coroutine frame.
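The happens-before chain above can be modelled without coroutines. The following is only a sketch under stated assumptions: Frame, Mailbox, Take and RunHandOff are hypothetical names that do not appear in the report, and the mutex-plus-condition-variable hand-off is just one plausible shape for the critical sections described. It shows that a flag written before the hand-off is safely visible to the resuming thread, and that the original thread must not touch the frame afterwards.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a coroutine frame; only the flag matters here.
struct Frame {
  bool suspended = false;
};

// Hypothetical mailbox playing the role of the state shared by Register
// and the thread that later resumes the coroutine.
struct Mailbox {
  std::mutex mu;
  std::condition_variable cv;
  Frame* handed_off = nullptr;

  // Producer side: publish the frame inside a critical section.
  void Register(Frame* f) {
    assert(f != nullptr);
    {
      std::lock_guard<std::mutex> lock(mu);
      handed_off = f;
    }
    cv.notify_one();
  }

  // Consumer side: wait for a frame. Locking the same mutex is what makes
  // this synchronize with Register.
  Frame* Take() {
    std::unique_lock<std::mutex> lock(mu);
    cv.wait(lock, [this] { return handed_off != nullptr; });
    return handed_off;
  }
};

// Returns the value of `suspended` observed by the resuming thread.
bool RunHandOff() {
  Mailbox box;
  bool observed = false;

  std::thread resumer([&] {
    Frame* f = box.Take();    // synchronizes with Register
    observed = f->suspended;  // models the read in await_resume
    delete f;                 // the other thread is free to destroy the frame
  });

  Frame* f = new Frame;
  f->suspended = true;  // written before the hand-off, as in await_suspend
  box.Register(f);      // from here on, this thread must not touch *f

  resumer.join();
  return observed;
}
```

Calling RunHandOff() should return true. A write to the frame placed after box.Register(f) in this model would race with the read and the delete on the other thread, which is the situation tsan reports for the generated code.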
For what it's worth, I spent some time dumping IR after optimization passes with my real codebase, and in that case this seemed to be related to an interaction between SROAPass and CoroSplitPass:
- Until SROAPass the write was a simple store to the coroutine frame, before the call to Register.
- SROAPass eliminated the write altogether, turning it into phi nodes that plumbed the value directly into the branch. The value was plumbed from before the call to Register to after it.
- CoroSplitPass re-introduced a store instruction, after the call to Register.
I am far from an expert here, but I wonder if SROAPass should be forbidden from making optimizations of this sort across an llvm.coro.suspend?
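The ordering problem described in this report can be illustrated at source level with a small single-threaded model. This is only a sketch: Model, AsWritten, AsTransformed, WritesAfterRegister and CheckModels are hypothetical names, and the "transformed" shape is a hand-written approximation of the assembly shown earlier (store the returned handle into the frame unconditionally after Register), not actual compiler output.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: a "frame" whose writes are logged, plus a Register
// stand-in that logs the point where ownership may pass to another thread.
struct Model {
  std::vector<std::string> log;
  bool frame_suspended = false;
  const void* frame_to_resume = nullptr;

  const void* Register(const void* result) {
    log.push_back("Register");
    return result;
  }
};

// Shape of the code as written: the only frame write after Register is on
// the path where Register returned null, i.e. ownership was kept.
void AsWritten(Model& m, const void* register_result) {
  m.frame_suspended = true;
  m.log.push_back("store suspended");
  const void* to_resume = m.Register(register_result);
  if (!to_resume) {
    m.frame_suspended = false;
    m.log.push_back("store suspended");
  }
}

// Approximate shape after the passes described above: the flag is gone and
// the returned handle is stored into the frame unconditionally.
void AsTransformed(Model& m, const void* register_result) {
  const void* to_resume = m.Register(register_result);
  m.frame_to_resume = to_resume;
  m.log.push_back("store to_resume");
}

// True if any frame write is logged after the Register call.
bool WritesAfterRegister(const Model& m) {
  bool seen_register = false;
  for (const std::string& entry : m.log) {
    if (entry == "Register") {
      seen_register = true;
    } else if (seen_register) {
      return true;
    }
  }
  return false;
}

// Usage: compare both shapes when Register returns a non-null handle,
// meaning it took responsibility for the coroutine.
void CheckModels() {
  int other = 0;  // any non-null address stands in for a handle
  Model written;
  AsWritten(written, &other);
  assert(!WritesAfterRegister(written));

  Model transformed;
  AsTransformed(transformed, &other);
  assert(WritesAfterRegister(transformed));
}
```

In this model the as-written shape only writes to the frame after Register when Register returned null, whereas the transformed shape writes on every path, including the one where another thread may already own the frame.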
jacobsa commented on Jun 30, 2022
By the way, I should mention that I discovered this because tsan reports it as a data race. And I think it's correct: clang has introduced a data race by putting a write after the call to Register, by which time another thread could be using the coroutine frame.
aeubanks commented on Jun 30, 2022
@ChuanqiXu9
ChuanqiXu9 commented on Jul 1, 2022
I'm not sure whether this is related to a known bug where coroutines currently can't cache TLS variables correctly.
@jacobsa Could you build clang from source? If yes, could you test it again after applying https://reviews.llvm.org/D125291 and https://reviews.llvm.org/D127383?
jacobsa commented on Jul 1, 2022
@ChuanqiXu9: just saw your comment after writing this. I'll try that shortly, but it may take me some time because I've never done it before. In the meantime here is some information about the IR. Can you tell whether it's related based on that?
Here is an IR dump after each optimization pass made with -mllvm -print-after-all -mllvm -filter-print-funcs=_Z6FooBarv. It was made with a Google-internal build of clang based on db1978b674, and the build settings might be slightly different from the Compiler Explorer link above.
You can see that in the version on line 3669 we still have the correct control flow:
However the SROAPass on line 3877 eliminates the stores, turning them into a phi node to select false or true depending on the result of Register, and then later uses that to decide whether to call DidntSuspend:
The lack of a store is preserved up through the version on line 3068:
But then on line 6111 CoroSplitPass takes this and introduces the incorrect unconditional store after Register returns:
I'd appreciate anybody's thoughts about what could be done to prevent this.
jacobsa commented on Jul 1, 2022
@ChuanqiXu9 okay yes, I can reproduce this at 91ab4d4231e5b7456d012776c5eeb69fa61ab994:
I applied https://reviews.llvm.org/D125291 and https://reviews.llvm.org/D127383 in their current state and rebuilt clang, and still get the same result. I guess that makes sense, since there is no TLS here.
ChuanqiXu9 commented on Jul 1, 2022
Oh, sorry for the misleading suggestion.
I think I get the problem. Long story short, your analysis (and the analysis of tsan) is correct. This is a (potential) miscompile.
Here is the reason:
The key issue here is that:
suspended is escaped too. So we shouldn't sink it.
suspended lives in a structure that the handle refers to.
I think we need to introduce something like CoroutineAA to provide the information. I will try to look at it.
And it won't be done in a few days, so you probably need some workaround. Maybe something like DO_NOT_OPTIMIZE(...)?
This is not an option to me. The key reason Clang/LLVM want to construct coroutine frames is performance. And in fact, there have been many such bugs in coroutines that could be fixed in one shot if we disabled the optimizations. So our strategy is always to fix the actual issues. As a heavy user and developer of coroutines, I believe this is the right choice, since performance is a key reason why we chose C++.
jacobsa commented on Jul 1, 2022
Yeah, I didn't mean disabling optimizations altogether. Just recognizing that this particular optimization shouldn't be performed for objects that span an llvm.coro.suspend.
It's probably more complicated than I realize. Thanks for looking; I look forward to seeing what fix you come up with. :-)
havardpe commented on Nov 29, 2022
I have recently run into the same issue using clang 14.0.6. My conclusion is that the await_suspend function is inlined into the coroutine function, and as a side-effect, (in your case) the to_resume variable is converted from a stack variable to a coroutine state variable, which makes it unsafe to check after you have tried to give the coroutine away. A work-around is to tag the await_suspend function with __attribute__((noinline)). gcc (11.2.1) does not seem to have this issue.
ChuanqiXu9 commented on Nov 30, 2022
GCC has far fewer coroutine bugs than clang, since all the coroutine-related work in GCC is done in the frontend. For clang, the middle end gets involved to optimize coroutines further.
havardpe commented on Nov 30, 2022
I am aware that the support for coroutines is much more limited in gcc. That is why I am experimenting with clang. I love the fact that clang is able to fully inline non-recursive synchronous generators. Here are some code snippets that might help pinpoint the underlying issue (hopefully the same one observed by @jacobsa).
This code does not trigger the issue:
This code triggers the issue if the noinline tag is removed:
The issue (according to TSAN) is that the local task variable ends up in the coroutine frame in the second version (but apparently not in the first). This may be caused by its entanglement with the accepted frame variable. It might get tagged with 'needs to be stored in the state since it might be used after the coroutine is suspended'. But in reality the variable needs to perform a reverse-escape from the coroutine frame into the stack in order to live long enough to be checked after the coroutine state has been destroyed by another thread.
ChuanqiXu9 commented on Sep 18, 2023
Removing this from the LLVM 17.x release milestone since the fix won't be there.
[C++20] [Coroutines] Mark await_suspend as noinline if the awaiter is…