the error when I run the example for the imagenet #544

Closed
runzeer opened this issue Apr 14, 2019 · 4 comments

@runzeer

runzeer commented Apr 14, 2019

When I tried to run the imagenet example, I encountered the error below. Could you tell me how to solve it?

python /home/zrz/code/imagenet_dist/examples-master/imagenet/main.py -a resnet18 -/home/zrz/dataset/imagenet/imagenet2012/ILSVRC2012/raw-data/imagenet-data

=> creating model 'resnet18'

Epoch: [0][ 0/320292] Time 3.459 ( 3.459) Data 0.295 ( 0.295) Loss 7.2399e+00 (7.2399e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)

Epoch: [0][ 10/320292] Time 0.043 ( 0.357) Data 0.000 ( 0.027) Loss 9.4861e+00 (1.3169e+01) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)

Epoch: [0][ 20/320292] Time 0.046 ( 0.209) Data 0.000 ( 0.014) Loss 7.3722e+00 (1.0817e+01) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)

Epoch: [0][ 30/320292] Time 0.032 ( 0.154) Data 0.000 ( 0.010) Loss 6.9166e+00 (9.5394e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)

/opt/conda/conda-bld/pytorch_1549630534704/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed.

Traceback (most recent call last):

File "/home/zrz/code/imagenet_dist/examples-master/imagenet/main.py", line 417, in <module>

main()

File "/home/zrz/code/imagenet_dist/examples-master/imagenet/main.py", line 113, in main

main_worker(args.gpu, ngpus_per_node, args)

File "/home/zrz/code/imagenet_dist/examples-master/imagenet/main.py", line 239, in main_worker

train(train_loader, model, criterion, optimizer, epoch, args)

File "/home/zrz/code/imagenet_dist/examples-master/imagenet/main.py", line 286, in train

losses.update(loss.item(), input.size(0))

RuntimeError: CUDA error: device-side assert triggered

terminate called after throwing an instance of 'c10::Error'

what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1549630534704/work/aten/src/THC/THCCachingAllocator.cpp:470)

frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f099a50acf5 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libc10.so)

frame #1: + 0x123b8c0 (0x7f099e7ee8c0 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)

frame #2: at::TensorImpl::release_resources() + 0x50 (0x7f099ac76c30 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libcaffe2.so)

frame #3: + 0x2a836b (0x7f099818b36b in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

frame #4: + 0x30eff0 (0x7f09981f1ff0 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

frame #5: torch::autograd::deleteFunction(torch::autograd::Function*) + 0x2f0 (0x7f099818dd70 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7f09c17f87f5 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #7: torch::autograd::Variable::Impl::release_resources() + 0x4a (0x7f09984001ba in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

frame #8: + 0x12148b (0x7f09c181048b in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #9: + 0x31a49f (0x7f09c1a0949f in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #10: + 0x31a4e1 (0x7f09c1a094e1 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #11: + 0x1993cf (0x5574e4c9a3cf in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #12: + 0xf12b7 (0x5574e4bf22b7 in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #13: + 0xf1147 (0x5574e4bf2147 in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #14: + 0xf115d (0x5574e4bf215d in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #15: + 0xf115d (0x5574e4bf215d in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #16: + 0xf115d (0x5574e4bf215d in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #17: PyDict_SetItem + 0x3da (0x5574e4c37e7a in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #18: PyDict_SetItemString + 0x4f (0x5574e4c4078f in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #19: PyImport_Cleanup + 0x99 (0x5574e4ca4709 in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #20: Py_FinalizeEx + 0x61 (0x5574e4d105f1 in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #21: Py_Main + 0x35e (0x5574e4d1b1fe in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #22: main + 0xee (0x5574e4be402e in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #23: __libc_start_main + 0xf5 (0x7f09d9c2e3d5 in /lib64/libc.so.6)

frame #24: + 0x1c3e0e (0x5574e4cc4e0e in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

@tuji-sjp

Hello, do you know how to download the ImageNet dataset so that this program can use it?
Please let me know. Thank you very much!

@lartpang

lartpang commented May 25, 2019

@runzeer

In the ClassNLLCriterion kernel, the assertion t >= 0 && t < n_classes failed, so I guess one of the elements of target_var is either smaller than 0 or not smaller than the number of classes (the output size). You might want to check the contents of your dataset's labels; one of them looks invalid.

https://discuss.pytorch.org/t/runtimeerror-cuda-runtime-error-59-device-side-assert-triggered-at-pytorch-torch-lib-thc-generic-thcstorage-c-36/17442
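
A minimal sketch of the label check suggested above, in plain Python so it runs without a GPU. The helper name `find_invalid_targets` and the sample labels are hypothetical, not part of the example's main.py; it only mirrors the condition from the assertion, t >= 0 && t < n_classes.

```python
def find_invalid_targets(targets, n_classes):
    """Return (position, value) pairs for labels outside [0, n_classes)."""
    return [(i, t) for i, t in enumerate(targets) if not 0 <= t < n_classes]


# Hypothetical labels for a 1000-class model; 1000 and -1 are out of range.
labels = [0, 999, 1000, 17, -1]
print(find_invalid_targets(labels, n_classes=1000))  # [(2, 1000), (4, -1)]
```

For a batch coming out of a data loader, the target tensor could be passed as `target.tolist()`, assuming it holds integer class indices.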

Maybe the number of classes in your dataset is not 1000. If so, change the output size of the model's final layer to match, like this:

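import torch
import torch.nn as nn
from torchvision.models import resnet50


class TotalModel(nn.Module):
    def __init__(self, num_class=1000):
        super(TotalModel, self).__init__()
        net = resnet50(pretrained=True)
        # Keep every layer except the final fully connected classifier.
        self.div_32 = nn.Sequential(*list(net.children())[:-1])
        # New classifier head; set num_class to the number of classes in your dataset.
        self.other_layers = nn.Linear(2048, num_class)

    def forward(self, in_feat):
        in_feat = self.div_32(in_feat)
        # Flatten (N, 2048, 1, 1) to (N, 2048) before the linear layer.
        in_feat = in_feat.view(in_feat.size(0), -1)
        in_feat = self.other_layers(in_feat)
        return in_feat


if __name__ == '__main__':
    in_data = torch.randn(4, 3, 224, 224)
    net = TotalModel()
    out = net(in_data)
    print(out.size())  # torch.Size([4, 1000])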
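
To pick a value for num_class, it may help to count the classes the dataset actually contains. The sketch below assumes the training data is laid out as one subdirectory per class, which is the layout torchvision's ImageFolder expects; the helper name `count_class_dirs` and the demo folder names are made up for illustration.

```python
import os
import tempfile


def count_class_dirs(root):
    """Count immediate subdirectories of root (one per class in an ImageFolder layout)."""
    return sum(1 for entry in os.scandir(root) if entry.is_dir())


# Demo on a throwaway directory with three hypothetical class folders.
with tempfile.TemporaryDirectory() as root:
    for name in ("cat", "dog", "fish"):
        os.mkdir(os.path.join(root, name))
    print(count_class_dirs(root))  # 3
```

Run against the real train directory, a result other than 1000 would suggest the labels can fall outside the range a stock 1000-class model accepts.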

@chaeeon-lim

I have the same issue. Is there anyone who solved this issue? Please help me.

@msaroufim
Member

@lartpang seems to have the correct suggestion
