Skip to content

Coroutine miscompilation on x86_64 #62842

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
avikivity opened this issue May 21, 2023 · 16 comments
Closed

Coroutine miscompilation on x86_64 #62842

avikivity opened this issue May 21, 2023 · 16 comments
Labels
coroutines C++20 coroutines invalid Resolved as invalid, i.e. not a bug miscompilation

Comments

@avikivity
Copy link
Contributor

ae7bf2b (which fixed a coroutine miscompilation, #51843) causes a coroutine miscompilation for me. Unfortunately the reproducer is huge, I can try to reduce it if there's no other option.

clang 15 works; I bisected llvm-project to this commit.

In this code:

    template <typename T>
    void await_suspend(std::coroutine_handle<T> h) {
        when_ready = h;
        main_coroutine_task = &h.promise(); // for waiting_task()
        schedule(this);
    }

h.promise() returns garbage, after which the program crashes. It's compiled with asan/ubsan, but I have no indication that it's related so far.

@avikivity
Copy link
Contributor Author

/cc MatzeB

I can put up instructions for the reproducer (quite tedious since it's part of a huge program), or work with you to find a way to localize the problem.

@llvmbot
Copy link
Member

llvmbot commented May 21, 2023

@llvm/issue-subscribers-coroutines

@avikivity
Copy link
Contributor Author

/cc @MatzeB correctly

I can put up instructions for the reproducer (quite tedious since it's part of a huge program), or work with you to find a way to localize the problem.

@avikivity avikivity changed the title Coroutine compilation on x86_64 Coroutine miscompilation on x86_64 May 22, 2023
@avikivity
Copy link
Contributor Author

I confirmed that main (382b56d) is broken.

@MatzeB
Copy link
Contributor

MatzeB commented May 22, 2023

I don't think there is much I can do without a smaller reproducer, sorry. My change should only make choices more conservative and put more things onto the coroutine frame. So most likely it is "just" triggering an unrelated pre-existing bug in your code or the compiler. Unfortunately to make progress we need a reduction, looking at full applications is rarely productive when looking for compiler bugs.

@avikivity
Copy link
Contributor Author

Understood. I'll try to reduce (I guess the first step is to determine if it's related to ASAN or not).

It happens early during execution, so there's some hope.

avikivity added a commit to avikivity/scylladb that referenced this issue May 23, 2023
This takes asan/ubsan out of the equation

Ref: scylladb#13730, llvm/llvm-project#62842

Reproducer:

  ninja build/release/test/boost/sstable_datafile_test_g
  build/release/test/boost/sstable_datafile_test  -t test_validate_checksums -- --smp 1
@avikivity
Copy link
Contributor Author

Update: I reduced it to

future<std::vector<mutation>> my_coroutine(
        uint32_t seed,
        tests::random_schema& random_schema,
        tests::timestamp_generator ts_gen,
        tests::expiry_generator exp_gen) {
    auto engine = std::mt19937(seed);
    const auto schema_has_clustering_columns = random_schema.schema()->clustering_key_size() > 0;
    const auto partition_count = 3;
    std::vector<mutation> muts;
    muts.reserve(partition_count);
    for (size_t pk = 0; pk != partition_count; ++pk) {
        auto mut = random_schema.new_mutation(pk);
        random_schema.set_partition_tombstone(engine, mut, ts_gen, exp_gen);
        random_schema.add_static_row(engine, mut, ts_gen, exp_gen);

        if (!schema_has_clustering_columns) {
            muts.emplace_back(mut.build(random_schema.schema()));
            continue;
        }

        const auto clustering_row_count = 3;
        const auto range_tombstone_count = 3;
        auto ckeys = random_schema.make_ckeys(std::max(clustering_row_count, range_tombstone_count));

        for (uint32_t ck = 0; ck < ckeys.size(); ++ck) {
            random_schema.add_row(engine, mut, ckeys[ck], ts_gen, exp_gen);
            co_await coroutine::maybe_yield();
        }

        muts.emplace_back(mut.build(random_schema.schema()));
    }
    co_return std::move(muts);
}


SEASTAR_TEST_CASE(test_validate_checksums) {
    return test_env::do_with_async([&] (test_env& env) {
        auto random_spec = tests::make_random_schema_specification(
                get_name(),
                std::uniform_int_distribution<size_t>(1, 4),
                std::uniform_int_distribution<size_t>(2, 4),
                std::uniform_int_distribution<size_t>(2, 8),
                std::uniform_int_distribution<size_t>(2, 8));
        auto random_schema = tests::random_schema{tests::random::get_int<uint32_t>(), *random_spec};

        testlog.info("Random schema:\n{}", random_schema.cql());

        const auto muts = my_coroutine(7,
            random_schema,
            tests::default_timestamp_generator(),
            tests::no_expiry_expiry_generator()
        ).get();
    });
}

It's nowhere close to being self-contained, there's a big coroutine runtime and a large library this calls into. But if I replace std::vector<mutation> with std::vector<int>, and remove all the related computations, the problem goes away. So it's clear the bug needs some complexity to materialize.

@ChuanqiXu9
Copy link
Member

It would be pretty helpful to provide a reproducer which doesn't dependent on seastar it self.

@tchaikov
Copy link
Contributor

tchaikov commented Jun 21, 2023

another data point: the test previous failed with -Og passes with -O0.

seastar alone does not suffice to reproduce this issue. as Avi put, i guess it would require a little more complexity to trigger. i compiled and run Seastar tests with Clang built with 552ee85. and all the tests were passed.

tchaikov added a commit to tchaikov/scylladb that referenced this issue Jun 26, 2023
to workaround llvm/llvm-project#62842,
per the test this issue only surfaces when compiling the tree
with
llvm/llvm-project@ae7bf2b
which is included in Clang version 16, and the issue disappears
when the tree is compiled with -O0.

Signed-off-by: Kefu Chai <[email protected]>
tchaikov added a commit to tchaikov/scylladb that referenced this issue Jun 26, 2023
to workaround llvm/llvm-project#62842,
per the test this issue only surfaces when compiling the tree
with
llvm/llvm-project@ae7bf2b
which is included in Clang version 16, and the issue disappears
when the tree is compiled with -O0.

Signed-off-by: Kefu Chai <[email protected]>
avikivity pushed a commit to scylladb/scylladb that referenced this issue Jun 26, 2023
to workaround llvm/llvm-project#62842,
per the test this issue only surfaces when compiling the tree
with
llvm/llvm-project@ae7bf2b
which is included in Clang version 16, and the issue disappears
when the tree is compiled with -O0.

Signed-off-by: Kefu Chai <[email protected]>

Closes #14391
@avikivity
Copy link
Contributor Author

Still reproduces as of 843ff75. I was hoping the recent coroutine changes would improve things.

@ChuanqiXu9
Copy link
Member

I think may be we can look at it closely after we had minimal reproducer (that is self contained a single file reproducer, which didn't depend on any other libraries like folly).

@avikivity
Copy link
Contributor Author

Well I have a single file reproducer that doesn't depend on anything (preprocessed), but it's huge.

The failure happens very quickly though.

@ChuanqiXu9
Copy link
Member

Well I have a single file reproducer that doesn't depend on anything (preprocessed), but it's huge.

The failure happens very quickly though.

Would you like to reduce it further? The best example would be #61900

@avikivity
Copy link
Contributor Author

For sure I'd like to, but that's a very laborious process.

@tchaikov
Copy link
Contributor

@avikivity Hi Avi, i verified @michoecho 's fix posted at scylladb/scylladb#13730 (comment) . as he explained, this is an issue in Seastar. probably we could close this issue, and follow it up at scylladb/seastar#1759 ?

@avikivity
Copy link
Contributor Author

Closing as invalid.

@avikivity avikivity closed this as not planned Won't fix, can't repro, duplicate, stale Jul 27, 2023
@EugeneZelenko EugeneZelenko added the invalid Resolved as invalid, i.e. not a bug label Jul 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
coroutines C++20 coroutines invalid Resolved as invalid, i.e. not a bug miscompilation
Projects
Status: Done
Development

No branches or pull requests

6 participants