
Fix a bug in CUDA pool op #19780


Closed
wants to merge 15 commits into from

Conversation

snnn
Member

@snnn snnn commented Mar 5, 2024

Description

PR #17200 introduced a bug in the CUDA EP. Before the change, a piece of code in pool.cc looked like this:

    // cudnn only takes 4D or 5D input, so pad dimensions if needed
    x_dims_cudnn.push_back(1);
    y_dims_cudnn.push_back(1);
    pads.insert(pads.begin() + kernel_shape.size(), 0);
    pads.insert(pads.end(), 0);
    kernel_shape.push_back(1);
    strides.push_back(1);

After the change, the code becomes:

    if (NHWC) {
      x_dims_cudnn.insert(x_dims_cudnn.begin() + 1, 1);
      y_dims_cudnn.insert(y_dims_cudnn.begin() + 1, 1);
      kernel_shape.insert(kernel_shape.begin() + 1, 1);
      strides.insert(strides.begin() + 1, 1);
    } else {
      x_dims_cudnn.push_back(1);
      y_dims_cudnn.push_back(1);
      kernel_shape.push_back(1);
      strides.push_back(1);
    }
    pads.insert(pads.begin() + kernel_shape.size(), 0);
    pads.insert(pads.end(), 0);

The change moved "pads.insert(pads.begin() + kernel_shape.size(), 0);" to after "kernel_shape.push_back(1);", so the insert position is computed from the already-extended kernel shape. As a result, the "pads.insert" lands at the wrong position and could cause an out-of-bounds write.

Motivation and Context

@@ -182,16 +182,20 @@ Status Pool<T, PoolType, NHWC>::ComputeInternal(OpKernelContext* context) const
    if (NHWC) {
      x_dims_cudnn.insert(x_dims_cudnn.begin() + 1, 1);
      y_dims_cudnn.insert(y_dims_cudnn.begin() + 1, 1);
      ORT_ENFORCE(pads.size() >= kernel_shape.size());
Member Author

TODO: I'm not sure what the correct order is for this NHWC path.

Contributor

@tianleiwu tianleiwu Mar 5, 2024

The NHWC implementation does not seem right: x_dims_cudnn.insert(x_dims_cudnn.begin() + 1, 1) adds a dimension at H (like NWC => NHWC), so the kernel_shape and pads for the spatial axes would have to be padded at the first position instead of the second.

I think a proper implementation would be (NHC => NHWC):

    if (NHWC) {
      x_dims_cudnn.insert(x_dims_cudnn.begin() + 2, 1);
      y_dims_cudnn.insert(y_dims_cudnn.begin() + 2, 1);
      ORT_ENFORCE(pads.size() >= kernel_shape.size());
      pads.insert(pads.begin() + kernel_shape.size(), 0);
      pads.insert(pads.end(), 0);
      kernel_shape.push_back(1);
      strides.push_back(1);
    }

Then we can refactor it so that the last five lines of common code are shared between the NHWC and NCHW paths.
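
A minimal sketch of that refactor (assuming the corrected NHC => NHWC insert position above; only the dimension insert differs per layout):

    if (NHWC) {
      x_dims_cudnn.insert(x_dims_cudnn.begin() + 2, 1);
      y_dims_cudnn.insert(y_dims_cudnn.begin() + 2, 1);
    } else {
      x_dims_cudnn.push_back(1);
      y_dims_cudnn.push_back(1);
    }
    // Shared tail: pads must be extended before kernel_shape grows.
    ORT_ENFORCE(pads.size() >= kernel_shape.size());
    pads.insert(pads.begin() + kernel_shape.size(), 0);
    pads.insert(pads.end(), 0);
    kernel_shape.push_back(1);
    strides.push_back(1);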

BTW, the code for global pooling also needs NHWC support (in a separate PR):

    if (pool_attrs_.global_pooling) {
      kernel_shape.assign(x_dims.begin() + 2, x_dims.end());
      pads.assign(kernel_shape.size(), 0);
      strides.assign(kernel_shape.size(), 1);
    }
    auto out_channel = NHWC ? x_shape[3] : x_shape[1];

I think pads should be pads.assign(kernel_shape.size() * 2, 0); at line 165?
Line 168 does not handle 3-D or 5-D inputs, which could cause issues as well.
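
A hedged sketch of what the NHWC-aware global-pooling setup could look like (the layout branch and doubled pads size follow the suggestions above; this is not the PR's code):

    if (pool_attrs_.global_pooling) {
      if (NHWC) {
        // NHWC: the spatial dims sit between N (first) and C (last).
        kernel_shape.assign(x_dims.begin() + 1, x_dims.end() - 1);
      } else {
        // NCHW: the spatial dims follow N and C.
        kernel_shape.assign(x_dims.begin() + 2, x_dims.end());
      }
      pads.assign(kernel_shape.size() * 2, 0);  // one begin and one end pad per axis
      strides.assign(kernel_shape.size(), 1);
    }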

Member

@hariharans29 hariharans29 Mar 5, 2024

Agree. We have been adding the extra feature dim to the right of the existing feature dim, not to the left.

(1) It seems we need 1-D NHWC Pool tests added (they would have caught the issue Tianlei mentioned).
(2) The existing 1-D NCHW Pool tests need some enhancements. The current test (

    static void MaxPool1D_8_WithIndexTest(int64_t storage_order) {

) seems to only use pads of all zeros, so it cannot catch this bug. If we had a 1-D test with pads like {0,1}, for example, the change in the PR would have produced pads of {0,1,0,0} instead of the correct {0,0,1,0} from the previous logic.
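
For illustration, a hypothetical 1-D MaxPool test with nonzero pads might look like this (test name, opset version, and values are made up, not the test the PR eventually added):

    TEST(PoolTest, MaxPool1D_NonZeroPads) {
      OpTester test("MaxPool", 12);
      test.AddAttribute("kernel_shape", std::vector<int64_t>{2});
      test.AddAttribute("pads", std::vector<int64_t>{0, 1});  // {begin_0, end_0}
      test.AddAttribute("strides", std::vector<int64_t>{1});

      test.AddInput<float>("X", {1, 1, 4}, {1.0f, 2.0f, 3.0f, 4.0f});
      // Windows: [1,2] [2,3] [3,4] [4,pad] -> maxima {2, 3, 4, 4}.
      test.AddOutput<float>("Y", {1, 1, 4}, {2.0f, 3.0f, 4.0f, 4.0f});
      test.Run();
    }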

Member Author

Will update the code shortly.

Member Author

BTW, the code for global pooling also needs NHWC support (in a separate PR):

Right. But how could we already have a unit test for it, and have it passing? See commit 8e1b4b8, which was trying to disable the test.

@snnn snnn linked an issue Mar 5, 2024 that may be closed by this pull request
@snnn snnn added the ep:CUDA issues related to the CUDA execution provider label Mar 5, 2024
@pranavsharma
Contributor

Is there a test case?

@snnn
Member Author

snnn commented Mar 5, 2024

Is there a test case?

It was an out-of-bounds write, which is undefined behavior, which means a test case may often still pass.

@hariharans29
Member

CC: @gedoensmax as FYI

@prathikr
Contributor

prathikr commented Mar 7, 2024

FYI: AzureML Vision team is also encountering this issue. Thank you @snnn @hariharans29 and @pranavsharma for investigating.

Repro:

import torch.nn.functional as F
from torch import nn
from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel
import torch

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28, 10)
        self.fc2 = nn.Linear(10, 10)
        self.pooler = nn.AdaptiveAvgPool1d(1) # BUG

    def forward(self, x):
        x = self.pooler(x) # BUG
        x = x.view(x.shape[0], -1)
        
        x = F.relu(self.fc1(x))
        x = self.fc2(x)

        return x

model = Net()
model = ORTModule(model, DebugOptions(save_onnx=True, onnx_prefix='pooler', log_level=LogLevel.VERBOSE))
model.to("cuda")

images = torch.randn(8, 28, 28).to("cuda")
output = model(images)

@snnn
Member Author

snnn commented Mar 11, 2024

Added a test case. Without the code change in pool.cc, the test fails with:

[chasun@bigdisk Debug]$ ./onnxruntime_test_all --gtest_filter=PoolTest.MaxPool1D_case3
Note: Google Test filter = PoolTest.MaxPool1D_case3
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from PoolTest
[ RUN ] PoolTest.MaxPool1D_case3
/data/bt/src/onnxruntime/onnxruntime/test/providers/checkers.cc:300: Failure
The difference between cur_expected[i] and cur_actual[i] is 3.4028234663852886e+38, which exceeds tolerance, where
cur_expected[i] evaluates to 2,
cur_actual[i] evaluates to -3.4028234663852886e+38, and
tolerance evaluates to 0.004999999888241291.
i:0
Google Test trace:
/data/bt/src/onnxruntime/onnxruntime/test/providers/checkers.cc:489: provider type: CUDAExecutionProvider
/data/bt/src/onnxruntime/onnxruntime/test/providers/base_tester.cc:809: registered execution providers: CUDAExecutionProvider
Stack trace:
0xf809b6: onnxruntime::test::(anonymous namespace)::InternalNumericalCheck<>()
0xf7e5fa: onnxruntime::test::(anonymous namespace)::TensorCheck<>::operator()()
0xf82aa5: onnxruntime::utils::mltype_dispatcher_internal::CallableDispatchableHelper::Invoke<>()
0xf81b61: onnxruntime::utils::MLTypeCallDispatcher<>::InvokeWithLeadingTemplateArgs<>()
0xf80f30: onnxruntime::utils::MLTypeCallDispatcher<>::Invoke<>()
0xf7f760: onnxruntime::test::Check<>()
0xf81014: onnxruntime::test::CheckDispatch<>()
0xf7fcb3: onnxruntime::test::CheckOrtValuesAreEqual()
0xf78015: onnxruntime::test::BaseTester::ExecuteModel<>()
0xf7434b: onnxruntime::test::BaseTester::ExecuteModelForEps()
0xf72c85: onnxruntime::test::BaseTester::RunWithConfig()
0xf71806: onnxruntime::test::BaseTester::Run()
0xf716b1: onnxruntime::test::BaseTester::Run()
0x128913e: onnxruntime::test::PoolTest_MaxPool1D_case3_Test::TestBody()
0x2993ae1: testing::internal::HandleSehExceptionsInMethodIfSupported<>()
0x298d64e: testing::internal::HandleExceptionsInMethodIfSupported<>()
0x2973122: testing::Test::Run()
0x2973b10: testing::TestInfo::Run()
... Google Test internal frames ...

(The same failure, with an identical Google Test trace and stack trace, repeats for the remaining elements: i:1 expects 3, i:2 expects 4, i:3 expects 5, and i:4 expects 5.)

[ FAILED ] PoolTest.MaxPool1D_case3 (147 ms)
[----------] 1 test from PoolTest (147 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (147 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] PoolTest.MaxPool1D_case3

hariharans29
hariharans29 previously approved these changes Mar 11, 2024
@snnn
Member Author

snnn commented Mar 13, 2024

This PR is ready to be merged.

prathikr
prathikr previously approved these changes Mar 13, 2024
@snnn snnn requested a review from tianleiwu March 13, 2024 19:40
@@ -59,6 +59,8 @@ TYPED_TEST(CudaNhwcTypedTest, MaxPoolNhwc) {
      MAKE_PROVIDERS()
    }

    #if 0
Member

@hariharans29 hariharans29 Mar 13, 2024

Don't we want to keep this test? It is 2-D NHWC global pooling (not the 1-D case we disabled support for).

Member Author

I thought I reverted it. Thanks for pointing it out.

Member Author

It looks like the NHWC 2-D global pool is still not right. I am debugging it.

hariharans29
hariharans29 previously approved these changes Mar 13, 2024
    // The last dim of x_dims is channel (C).
    // Put the other part in kernel_shape.
    kernel_shape.assign(x_dims.begin() + 1, x_dims.end() - 1);
    pads.assign(kernel_shape.size(), 0);
Contributor

@tianleiwu tianleiwu Mar 14, 2024

pads needs to be 2x the size of kernel_shape; see line 203.

Member Author

Then how could the old code work?

Member Author

    if (pool_attrs_.global_pooling) {
      kernel_shape.assign(x_dims.begin() + 2, x_dims.end());
      pads.assign(kernel_shape.size(), 0);
      strides.assign(kernel_shape.size(), 1);
    }

Member Author

Do we have a test case for that? How did it pass?

Member

@hariharans29 hariharans29 Mar 14, 2024

AFAIK there are no "pads" for global pooling (https://onnx.ai/onnx/operators/onnx__GlobalMaxPool.html), i.e., pads are irrelevant.

Contributor

@tianleiwu tianleiwu Mar 15, 2024

@snnn, that's discussed in #19889 (comment). I think that PR is almost ready since tests have passed (it needs some minor changes to address build issues and feedback), and it can address this bug as well.
The only concern is that it might depend on other NHWC changes, so it could be a little harder to cherry-pick if we want a patch release soon.

Member

Maybe we can just fix the NCHW regression (for the patch) and leave anything related to NHWC unimplemented, to be filled in by the other PR. In any case, I don't think we have a ton of NHWC users just yet that would be broken by the patch release.

Member Author

I think either is fine. If we give up on this one, we should cherry-pick #19889 to the patch release. That PR is not very big, so I am not worried.

Member Author

After the bugs are fixed, we should refactor the SetPoolingNdDescriptorHelper function a little: change the three array-type parameters to std::span and check that their lengths are all equal. I can take that work after #19889 is merged.
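
A rough sketch of that refactor idea (the signature and error message are assumptions, not the existing helper):

    #include <span>

    Status SetPoolingNdDescriptorHelper(cudnnPoolingDescriptor_t desc,
                                        cudnnPoolingMode_t mode,
                                        std::span<const int64_t> kernel_shape,
                                        std::span<const int64_t> pads,
                                        std::span<const int64_t> strides) {
      // cuDNN takes one (symmetric) pad value per spatial dim, so all three
      // spans must have the same length.
      ORT_RETURN_IF_NOT(kernel_shape.size() == pads.size() &&
                            pads.size() == strides.size(),
                        "kernel_shape, pads, and strides must have equal lengths");
      // ... convert to int arrays and call cudnnSetPoolingNdDescriptor ...
      return Status::OK();
    }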

Member Author

And in the CUDAExecutionProvider::GetCapability function (https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cuda/cuda_execution_provider.cc), we should check that the paddings are symmetric when the op is a pooling op but not a global pooling op.
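
For illustration, such a check could look like this (the helper name is hypothetical; ONNX lays pads out as {begin_0, ..., begin_n-1, end_0, ..., end_n-1}):

    // Returns true when each spatial axis has equal begin and end padding.
    bool PadsAreSymmetric(const std::vector<int64_t>& pads) {
      const size_t rank = pads.size() / 2;
      for (size_t i = 0; i < rank; ++i) {
        if (pads[i] != pads[i + rank]) return false;
      }
      return true;
    }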

@snnn
Member Author

snnn commented Mar 15, 2024

Replaced by #19889

@snnn snnn closed this Mar 15, 2024
@snnn snnn deleted the fix_bug3 branch June 19, 2024 21:58
Labels
ep:CUDA issues related to the CUDA execution provider
Successfully merging this pull request may close these issues.

[CUDA][1.17.1][Regression] Access Violation in an inference session Run