
[OpenCL] Convolution NHWC vs. NCHW mismatch #3815


Closed
pjaaskel opened this issue Nov 23, 2019 · 4 comments

@pjaaskel
Contributor

After fixing issue #3802, inception_v1 now fails with a mismatching-layouts error. Should an automatic layout-conversion operation be inserted in such a case?
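
For reference, a minimal sketch of what such an automatic insertion could look like at the graph level, assuming Glow's `createTranspose` builder (the helper name `toNCHW` is illustrative, not from the Glow sources):

```cpp
#include "glow/Graph/Graph.h"

// Sketch: put an explicit NHWC -> NCHW transpose in front of a node whose
// input arrives in the wrong layout. The shuffle {0, 3, 1, 2} maps output
// dims (N, C, H, W) to input dims (N, H, W, C).
glow::NodeValue toNCHW(glow::Function *F, glow::NodeValue nhwcInput) {
  auto *T = F->createTranspose("nhwc_to_nchw", nhwcInput, {0u, 3u, 1u, 2u});
  return T->getResult();
}
```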

./bin/image-classifier -backend=OpenCL tests/images/imagenet/dog_207.png -expected-labels=207 -image-mode=0to255 -m=../build-llvm-7/inception_v1 -model-input-name=data 
Model: ../build-llvm-7/inception_v1
Running 1 thread(s).



In 'conv1_7x7_s2__6' From '../build-llvm-7/inception_v1'
input 0
Convolution
name : conv1_7x7_s2__6
Input : float<1 x 3 x 224 x 224>
Filter : float<64 x 3 x 7 x 7>
Bias : float<64>
Kernels : [7, 7]
Strides : [2, 2]
Pads : [3, 3, 3, 3]
Group : 1
Dilation : 1
Layout : NCHW
FusedActivation : 
users : 1
Result : float<1 x 64 x 112 x 112>

Mismatching layouts:
Provided layout
Layout: NHWC [name = N : alignment = 1 : index = 0, name = H : alignment = 1 : index = 1, name = W : alignment = 1 : index = 2, name = C : alignment = 1 : index = 3]
Expected layout
Layout: NCHW [name = N : alignment = 1 : index = 0, name = C : alignment = 1 : index = 1, name = H : alignment = 1 : index = 2, name = W : alignment = 1 : index = 3]
From '../build-llvm-7/inception_v1'
Expected correct backend-specific layouts for the graph
For comparison `LHS Equal RHS` with:
LHS: 0
RHS: 1
WARNING: Logging before InitGoogleLogging() is written to STDERR
F1123 11:26:22.983803 15371 Error.cpp:119] exitOnError(Error) got an unexpected ErrorValue: 
Error code: COMPILE_UNSUPPORTED_NODE_AFTER_OPTIMIZE
Error message: Unsupported node(s) found after optimizing Function ../build-llvm-7/inception_v1 for backend OpenCL
Error return stack:
../lib/Optimizer/GraphOptimizer/GraphOptimizer.cpp:3293
../lib/Partitioner/Partitioner.cpp:401
../tools/loader/Loader.cpp:505
*** Check failure stack trace: ***
Aborted (core dumped)

Any pointers? We can take a look if someone points us in the right direction.
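
(For context on the mismatch above: the same element lands at a different linear offset in each layout. A minimal illustration, not Glow code:)

```cpp
#include <cstddef>

// Linear offset of element (n, c, h, w) for a tensor with logical dims
// N x C x H x W, e.g. the float<1 x 3 x 224 x 224> input above.
size_t offsetNCHW(size_t n, size_t c, size_t h, size_t w,
                  size_t C, size_t H, size_t W) {
  return ((n * C + c) * H + h) * W + w;
}
size_t offsetNHWC(size_t n, size_t c, size_t h, size_t w,
                  size_t C, size_t H, size_t W) {
  return ((n * H + h) * W + w) * C + c;
}
```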

@pjaaskel
Contributor Author

This is actually a regression introduced with commit ec46f24 by @shajrawi, and it also affects SqueezeNet. Reverting this commit fixes the issue.

@pjaaskel
Contributor Author

pjaaskel commented Dec 7, 2019

This issue is still there with OpenCL and can be reproduced with SqueezeNet.

@XiZiler

XiZiler commented Dec 26, 2019

The C++ ResNet demo is also affected. Here is what I did:

1. Compile for OpenCL, Release version:
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DGLOW_WITH_CPU=1 -DGLOW_WITH_OPENCL=1 ../glow
ninja all -j 64
2. Check GPU info:
clinfo

Platform Name NVIDIA CUDA
Number of devices 1
Device Name GeForce RTX 2080 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 430.26
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 65:00.0

3. Run the ResNet50 demo:
./bin/resnet-runtime -backend=OpenCL -num-devices=1 -image-layout=NCHW

and get the result like this:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I1226 15:26:25.958235 21561 resnet-runtime.cpp:130] Initializing 1 OpenCL devices on HostManager.
I1226 15:26:26.116268 21561 resnet-runtime.cpp:78] Loading resnet50 model.

In 'gpu_0_conv1__5' From 'resnet500'
input 0
Convolution
name : gpu_0_conv1__5
Input : float<1 x 3 x 224 x 224>
Filter : float<64 x 3 x 7 x 7>
Bias : float<64>
Kernels : [7, 7]
Strides : [2, 2]
Pads : [3, 3, 3, 3]
Group : 1
Dilation : 1
Layout : NCHW
FusedActivation :
users : 1
Result : float<1 x 64 x 112 x 112>

Mismatching layouts:
Provided layout
Layout: NHWC [name = N : alignment = 1 : index = 0, name = H : alignment = 1 : index = 1, name = W : alignment = 1 : index = 2, name = C : alignment = 1 : index = 3]
Expected layout
Layout: NCHW [name = N : alignment = 1 : index = 0, name = C : alignment = 1 : index = 1, name = H : alignment = 1 : index = 2, name = W : alignment = 1 : index = 3]
From 'resnet500'
Expected correct backend-specific layouts for the graph
For comparison LHS Equal RHS with:
LHS: 0
RHS: 1
F1226 15:26:27.040071 21561 Error.cpp:119] exitOnError(Error) got an unexpected ErrorValue:
Error code: COMPILE_UNSUPPORTED_NODE_AFTER_OPTIMIZE
Error message: Unsupported node(s) found after optimizing Function resnet500 for backend OpenCL
Error return stack:
/home/cambricon/workspace_xi/ATC/glow/lib/Optimizer/GraphOptimizer/GraphOptimizer.cpp:3523
/home/cambricon/workspace_xi/ATC/glow/lib/Partitioner/Partitioner.cpp:405
/home/cambricon/workspace_xi/ATC/glow/examples/resnet-runtime.cpp:165
*** Check failure stack trace: ***
./run.sh: line 1: 21561 Aborted (core dumped) ./bin/resnet-runtime -backend=OpenCL -num-devices=1 -image-layout=NCHW

4. Revert Glow to the parent of ec46f24; the parent's commit ID is bd69664:
git reset --hard bd69664e1aae6f96ce84071bdcb9bef9180d6743
5. Build and run:
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DGLOW_WITH_CPU=1 -DGLOW_WITH_OPENCL=1 ../glow
ninja all -j 64

./bin/resnet-runtime -backend=OpenCL -num-devices=1 -image-layout=NCHW

It crashed after it finished classifying the images. So sad.

I1226 15:38:27.883890 23036 resnet-runtime.cpp:78] Loading resnet50 model.
I1226 15:38:28.876335 23036 resnet-runtime.cpp:164] Loading files from ../glow/tests/images/imagenet/
I1226 15:38:28.882439 23036 resnet-runtime.cpp:122] Started run ID: 0
I1226 15:38:28.887506 23036 resnet-runtime.cpp:122] Started run ID: 1
I1226 15:38:28.892138 23036 resnet-runtime.cpp:122] Started run ID: 2
(0) ../glow/tests/images/imagenet/cat_285.png: 281
(1) ../glow/tests/images/imagenet/dog_207.png: 207
(2) ../glow/tests/images/imagenet/zebra_340.png: 340
I1226 15:38:28.935782 23036 resnet-runtime.cpp:215] Finished classifying 3 images.

Thread 2 "resnet-runtime" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5301700 (LWP 23040)]
0x00007fffee574560 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
(gdb) bt
#0 0x00007fffee574560 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#1 0x00007fffee385373 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#2 0x00007fffee3736f5 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#3 0x00007fffee373ae7 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#4 0x00007fffee384720 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#5 0x0000000001fbf8fe in glow::runtime::OpenCLBuffer::~OpenCLBuffer() ()
#6 0x0000000001fc73cd in std::_Sp_counted_deleter<glow::runtime::OpenCLBuffer*, std::__shared_ptr<glow::runtime::OpenCLBuffer, (__gnu_cxx::_Lock_policy)2>::_Deleter<std::allocatorglow::runtime::OpenCLBuffer >, std::allocatorglow::runtime::OpenCLBuffer, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
#7 0x0000000001fc26dc in glow::runtime::OpenCLDeviceManager::evictNetworkImpl(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator >, glow::detail::GlowError)>) ()
#8 0x000000000065ef48 in glow::runtime::QueueBackedDeviceManager::evictNetwork(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator >, glow::detail::GlowError)>)::{lambda()#1}::operator()() const ()
#9 0x000000000065ee2a in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<glow::runtime::QueueBackedDeviceManager::evictNetwork(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator >, glow::detail::GlowError)>)::{lambda()#1}, std::allocator, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&) ()
#10 0x00000000004b7257 in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) ()
#11 0x00007ffff6b00827 in __pthread_once_slow (once_control=0x9d59408, init_routine=0x7ffff602d830 <__once_proxy>) at pthread_once.c:116
#12 0x000000000065ecd1 in std::__future_base::_Task_state<glow::runtime::QueueBackedDeviceManager::evictNetwork(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator >, glow::detail::GlowError)>)::{lambda()#1}, std::allocator, void ()>::_M_run() ()
#13 0x00000000021c81b0 in glow::ThreadExecutor::threadPoolWorkerMain() ()
#14 0x00007ffff602e66f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#15 0x00007ffff6af86db in start_thread (arg=0x7ffff5301700) at pthread_create.c:463
#16 0x00007ffff5a8988f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
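
(A guess at where this points: frame #5 is `OpenCLBuffer`'s destructor, which presumably releases its `cl_mem` via `clReleaseMemObject`. A defensive sketch of that shape, with the member name `buffer_` assumed rather than taken from the source:)

```cpp
#include <CL/cl.h>

class OpenCLBuffer {
  cl_mem buffer_{nullptr};

public:
  ~OpenCLBuffer() {
    // The driver faults around here in the backtrace; a double release, or a
    // release after the owning cl_context is destroyed, would look like this.
    if (buffer_) {
      clReleaseMemObject(buffer_);
      buffer_ = nullptr;
    }
  }
};
```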

@XiZiler

XiZiler commented Dec 28, 2019

[Update] I just removed the layout-verification code; it works and gets the same result as the CPU backend. I don't know if this will lead to other problems.

This is what I removed in the file ./lib/Optimizer/GraphOptimizer/GraphOptimizer.cpp:

Error glow::optimizeFunction(Function *F, const Backend &B,
                             CompilationContext &cctx) {
  ...

  // if (!B.verify(*F)) {
  //   return MAKE_ERR(
  //       ErrorValue::ErrorCode::COMPILE_UNSUPPORTED_NODE_AFTER_OPTIMIZE,
  //       "Unsupported node(s) found after optimizing Function " +
  //           F->getName().str() + " for backend " + B.getBackendName());
  // }
  return Error::success();
}

It seems to have done all the lowering and optimization, and crashed during the very last layout verification. So I removed that final verification and it still works.

vdantu pushed a commit to vdantu/glow that referenced this issue Jul 12, 2020
…osing the input to NCHW (pytorch#3951)

Summary:
See the in-source comment for workaround details. In short: we load a model from the outside with no way of knowing the constant/placeholder input layout. The default assumption for 4-D tensors (images) is NHWC, which is the canonical Glow format, and PNG files are in NHWC format.
Our image loader, when using the `image-layout` flag, transposes the image outside the Glow graph. Since there's no easy way to propagate that information, weaken the OpenCL verifier, not the canonical verifier: for placeholders and constants, assume that the loader knows what it is doing and that they are in the right format.

Fixes pytorch#3815
Pull Request resolved: pytorch#3951

Test Plan: `ninja test`

Differential Revision: D19252774

Pulled By: shajrawi

fbshipit-source-id: f850c504245ee947794446b144b00df635a68497
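
(Roughly, the weakening described above can be sketched as a storage check in the backend-specific layout verifier; `layoutSatisfied` is an illustrative name, but `glow::Storage` is the real common base class of `Constant` and `Placeholder`:)

```cpp
#include "glow/Graph/Nodes.h"

// Sketch: trust loader-provided tensors. If an input comes straight from a
// Constant or Placeholder, skip the strict NHWC/NCHW comparison and assume
// the loader already transposed it to the layout the backend expects.
static bool layoutSatisfied(const glow::NodeValue &in) {
  if (llvm::isa<glow::Storage>(in.getNode())) {
    return true;
  }
  // ... otherwise fall through to the normal layout comparison ...
  return false;
}
```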