Coroutine miscompilation on x86_64 #62842
Comments
/cc @MatzeB I can put up instructions for the reproducer (quite tedious since it's part of a huge program), or work with you to find a way to localize the problem. |
@llvm/issue-subscribers-coroutines |
I confirmed that main (382b56d) is broken. |
I don't think there is much I can do without a smaller reproducer, sorry. My change should only make choices more conservative and put more things onto the coroutine frame, so most likely it is "just" triggering an unrelated pre-existing bug in your code or in the compiler. Unfortunately, to make progress we need a reduction; looking at full applications is rarely productive when hunting for compiler bugs. |
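To illustrate what "onto the coroutine frame" means here, a minimal self-contained sketch with a hypothetical task type (not Seastar's future, just an illustration): any local that is live across a suspend point must be spilled into the heap-allocated coroutine frame, so a more conservative analysis spills more locals.

#include <coroutine>
#include <cstdio>

// Hypothetical, minimal task type for illustration only; real coroutine
// types (Seastar's future<>, etc.) are much more involved.
struct task {
    struct promise_type {
        task get_return_object() {
            return {std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
    std::coroutine_handle<promise_type> h;
};

task example() {
    int x = 42;                      // live across the suspend point below, so
                                     // the compiler must spill it to the frame
    co_await std::suspend_always{};
    std::printf("%d\n", x);          // reloaded from the frame on resumption
}

int main() {
    task t = example();
    t.h.resume();   // run up to the co_await
    t.h.resume();   // resume past it; prints 42
    t.h.destroy();
}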
Understood. I'll try to reduce (I guess the first step is to determine if it's related to ASAN or not). It happens early during execution, so there's some hope. |
This takes asan/ubsan out of the equation.
Ref: scylladb#13730, llvm/llvm-project#62842
Reproducer:
ninja build/release/test/boost/sstable_datafile_test_g
build/release/test/boost/sstable_datafile_test -t test_validate_checksums -- --smp 1
Update: I reduced it to

future<std::vector<mutation>> my_coroutine(
        uint32_t seed,
        tests::random_schema& random_schema,
        tests::timestamp_generator ts_gen,
        tests::expiry_generator exp_gen) {
    auto engine = std::mt19937(seed);
    const auto schema_has_clustering_columns = random_schema.schema()->clustering_key_size() > 0;
    const auto partition_count = 3;
    std::vector<mutation> muts;
    muts.reserve(partition_count);
    for (size_t pk = 0; pk != partition_count; ++pk) {
        auto mut = random_schema.new_mutation(pk);
        random_schema.set_partition_tombstone(engine, mut, ts_gen, exp_gen);
        random_schema.add_static_row(engine, mut, ts_gen, exp_gen);
        if (!schema_has_clustering_columns) {
            muts.emplace_back(mut.build(random_schema.schema()));
            continue;
        }
        const auto clustering_row_count = 3;
        const auto range_tombstone_count = 3;
        auto ckeys = random_schema.make_ckeys(std::max(clustering_row_count, range_tombstone_count));
        for (uint32_t ck = 0; ck < ckeys.size(); ++ck) {
            random_schema.add_row(engine, mut, ckeys[ck], ts_gen, exp_gen);
            co_await coroutine::maybe_yield();
        }
        muts.emplace_back(mut.build(random_schema.schema()));
    }
    co_return std::move(muts);
}

SEASTAR_TEST_CASE(test_validate_checksums) {
    return test_env::do_with_async([&] (test_env& env) {
        auto random_spec = tests::make_random_schema_specification(
                get_name(),
                std::uniform_int_distribution<size_t>(1, 4),
                std::uniform_int_distribution<size_t>(2, 4),
                std::uniform_int_distribution<size_t>(2, 8),
                std::uniform_int_distribution<size_t>(2, 8));
        auto random_schema = tests::random_schema{tests::random::get_int<uint32_t>(), *random_spec};
        testlog.info("Random schema:\n{}", random_schema.cql());
        const auto muts = my_coroutine(7,
                random_schema,
                tests::default_timestamp_generator(),
                tests::no_expiry_expiry_generator()
        ).get();
    });
}

It's nowhere close to being self-contained; there's a big coroutine runtime and a large library this calls into. But if I replace std::vector<mutation> with std::vector<int> and remove all the related computations, the problem goes away. So it's clear the bug needs some complexity to materialize. |
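For concreteness, a sketch of the non-reproducing variant described above; this is a reconstruction of the std::vector<int> substitution (assumed shape, same Seastar context as the original test implied), not the actual code that was tested.

// Reconstruction of the simplified, non-reproducing variant: mutation
// replaced with int, and all mutation-related computation removed.
future<std::vector<int>> my_coroutine_simplified() {
    constexpr int partition_count = 3;
    constexpr int clustering_row_count = 3;
    std::vector<int> muts;
    muts.reserve(partition_count);
    for (size_t pk = 0; pk != partition_count; ++pk) {
        for (uint32_t ck = 0; ck < clustering_row_count; ++ck) {
            co_await coroutine::maybe_yield();   // same suspend point as before
        }
        muts.emplace_back(int(pk));              // mutation-related work removed
    }
    co_return std::move(muts);
}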
It would be pretty helpful to provide a reproducer which doesn't depend on Seastar itself. |
Another data point: the test that previously failed, built with Seastar alone, does not suffice to reproduce this issue. As Avi put it, it likely requires a little more complexity to trigger. I compiled and ran the Seastar tests with a Clang built at 552ee85, and all the tests passed. |
To work around llvm/llvm-project#62842: per the test, this issue only surfaces when the tree is compiled with llvm/llvm-project@ae7bf2b (included in Clang 16), and the issue disappears when the tree is compiled with -O0.
Signed-off-by: Kefu Chai <[email protected]>
Closes #14391
Still reproduces as of 843ff75. I was hoping the recent coroutine changes would improve things. |
I think we can look at this more closely once we have a minimal reproducer (that is, a self-contained, single-file reproducer that doesn't depend on any other libraries, like folly). |
Well, I have a single-file reproducer that doesn't depend on anything (preprocessed), but it's huge. The failure happens very quickly, though. |
Would you like to reduce it further? The best example would be #61900. |
For sure I'd like to, but that's a very laborious process. |
@avikivity Hi Avi, I verified @michoecho's fix posted at scylladb/scylladb#13730 (comment). As he explained, this is an issue in Seastar. Perhaps we could close this issue and follow it up at scylladb/seastar#1759? |
Closing as invalid. |
ae7bf2b (which fixed a coroutine miscompilation, #51843) causes a coroutine miscompilation for me. Unfortunately the reproducer is huge; I can try to reduce it if there's no other option.
clang 15 works; I bisected llvm-project to this commit.
In this code:
h.promise() returns garbage, after which the program crashes. It's compiled with asan/ubsan, but I have no indication that it's related so far.
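For context, the usual shape of such an access, as a hypothetical minimal sketch (the actual snippet from the report is not included above): an awaiter recovers the promise from the coroutine frame via coroutine_handle::promise(), and a miscompiled frame layout would make that read return garbage.

#include <coroutine>

// Hypothetical shape, for illustration only; not the code from the report.
struct my_task {
    struct promise_type {
        my_task get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
        int state = 0;   // some per-coroutine state stored in the frame
    };
};

struct my_awaiter {
    bool await_ready() const noexcept { return false; }
    bool await_suspend(std::coroutine_handle<my_task::promise_type> h) noexcept {
        // promise() locates the promise inside the coroutine frame; if the
        // frame layout is miscompiled, this read observes garbage.
        int s = h.promise().state;
        return s != 0;   // pretend to make a scheduling decision on it
    }
    void await_resume() noexcept {}
};

my_task demo() {
    co_await my_awaiter{};
}

int main() { demo(); }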