the error when I run the example for the imagenet #544

runzeer · 2019-04-14T06:22:38Z

When I tried to run the model from the imagenet example, I encountered the error below. Could you tell me how to solve the problem?

python /home/zrz/code/imagenet_dist/examples-master/imagenet/main.py -a resnet18 -/home/zrz/dataset/imagenet/imagenet2012/ILSVRC2012/raw-data/imagenet-data

=> creating model 'resnet18'

Epoch: [0][ 0/320292] Time 3.459 ( 3.459) Data 0.295 ( 0.295) Loss 7.2399e+00 (7.2399e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)

Epoch: [0][ 10/320292] Time 0.043 ( 0.357) Data 0.000 ( 0.027) Loss 9.4861e+00 (1.3169e+01) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)

Epoch: [0][ 20/320292] Time 0.046 ( 0.209) Data 0.000 ( 0.014) Loss 7.3722e+00 (1.0817e+01) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)

Epoch: [0][ 30/320292] Time 0.032 ( 0.154) Data 0.000 ( 0.010) Loss 6.9166e+00 (9.5394e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)

/opt/conda/conda-bld/pytorch_1549630534704/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed.

Traceback (most recent call last):
  File "/home/zrz/code/imagenet_dist/examples-master/imagenet/main.py", line 417, in <module>
    main()
  File "/home/zrz/code/imagenet_dist/examples-master/imagenet/main.py", line 113, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "/home/zrz/code/imagenet_dist/examples-master/imagenet/main.py", line 239, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/home/zrz/code/imagenet_dist/examples-master/imagenet/main.py", line 286, in train
    losses.update(loss.item(), input.size(0))
RuntimeError: CUDA error: device-side assert triggered

terminate called after throwing an instance of 'c10::Error'

what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1549630534704/work/aten/src/THC/THCCachingAllocator.cpp:470)

frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f099a50acf5 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libc10.so)

frame #1: + 0x123b8c0 (0x7f099e7ee8c0 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)

frame #2: at::TensorImpl::release_resources() + 0x50 (0x7f099ac76c30 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libcaffe2.so)

frame #3: + 0x2a836b (0x7f099818b36b in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

frame #4: + 0x30eff0 (0x7f09981f1ff0 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

frame #5: torch::autograd::deleteFunction(torch::autograd::Function*) + 0x2f0 (0x7f099818dd70 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7f09c17f87f5 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #7: torch::autograd::Variable::Impl::release_resources() + 0x4a (0x7f09984001ba in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

frame #8: + 0x12148b (0x7f09c181048b in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #9: + 0x31a49f (0x7f09c1a0949f in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #10: + 0x31a4e1 (0x7f09c1a094e1 in /home/zrz/miniconda3/envs/runze_env_name/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #11: + 0x1993cf (0x5574e4c9a3cf in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #12: + 0xf12b7 (0x5574e4bf22b7 in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #13: + 0xf1147 (0x5574e4bf2147 in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #14: + 0xf115d (0x5574e4bf215d in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #15: + 0xf115d (0x5574e4bf215d in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #16: + 0xf115d (0x5574e4bf215d in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #17: PyDict_SetItem + 0x3da (0x5574e4c37e7a in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #18: PyDict_SetItemString + 0x4f (0x5574e4c4078f in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #19: PyImport_Cleanup + 0x99 (0x5574e4ca4709 in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #20: Py_FinalizeEx + 0x61 (0x5574e4d105f1 in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #21: Py_Main + 0x35e (0x5574e4d1b1fe in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #22: main + 0xee (0x5574e4be402e in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)

frame #23: __libc_start_main + 0xf5 (0x7f09d9c2e3d5 in /lib64/libc.so.6)

frame #24: + 0x1c3e0e (0x5574e4cc4e0e in /home/zrz/miniconda3/envs/runze_env_name/bin/python3.6)


tuji-sjp · 2019-04-22T16:53:50Z

Hello, do you know how to download the ImageNet dataset for this program?
Please tell me. Thank you very much!

lartpang · 2019-05-25T13:32:28Z

@runzeer

In the ClassNLLCriterion kernel, the assertion t >= 0 && t < n_classes failed, so I guess one of the elements of target_var is either smaller than 0 or not smaller than the number of classes (the output size). You might want to check the labels in your dataset; one of them looks invalid.

https://discuss.pytorch.org/t/runtimeerror-cuda-runtime-error-59-device-side-assert-triggered-at-pytorch-torch-lib-thc-generic-thcstorage-c-36/17442
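The label check suggested above could look roughly like the following. This is only a minimal sketch in plain Python: `find_invalid_labels` and `batch_labels` are hypothetical names, not part of main.py, and a target tensor would first need to be converted with `.tolist()`.

```python
def find_invalid_labels(labels, n_classes):
    """Return (position, label) pairs for labels outside [0, n_classes)."""
    return [(i, t) for i, t in enumerate(labels) if not 0 <= t < n_classes]


# Hypothetical batch of targets for a 1000-class model.
batch_labels = [0, 999, 1000, -1, 42]
bad = find_invalid_labels(batch_labels, n_classes=1000)
print(bad)  # [(2, 1000), (3, -1)]
```

Any label reported here would trip the same assertion inside the loss kernel, so running a check like this over the targets before computing the loss should point to the offending samples.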

Maybe the number of classes in your dataset is not 1000, in which case you should change the output size of the model's final layer, like this:

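import torch
import torch.nn as nn
from torchvision.models import resnet50


class TotalModel(nn.Module):
    def __init__(self, num_class=1000):
        super(TotalModel, self).__init__()
        net = resnet50(pretrained=True)
        # Keep every layer except the original 1000-way fc classifier.
        self.div_32 = nn.Sequential(*list(net.children())[:-1])
        # New classifier head; set num_class to the number of classes in your dataset.
        self.other_layers = nn.Linear(2048, num_class)

    def forward(self, in_feat):
        in_feat = self.div_32(in_feat)
        # Flatten (N, 2048, 1, 1) to (N, 2048) before the linear layer.
        in_feat = in_feat.view(in_feat.size(0), -1)
        in_feat = self.other_layers(in_feat)
        return in_feat


if __name__ == '__main__':
    in_data = torch.randn(4, 3, 224, 224)
    net = TotalModel()
    out = net(in_data)
    print(out.size())  # torch.Size([4, 1000])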
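To find out which value num_class should take, you can count the class folders in the training directory. The sketch below assumes the data is loaded with torchvision's ImageFolder, which treats each immediate subdirectory of the root as one class. `count_class_dirs` is a hypothetical helper, and the temporary directory only stands in for a real train folder.

```python
import tempfile
from pathlib import Path


def count_class_dirs(root):
    """Count the immediate subdirectories of root (one per class)."""
    return sum(1 for p in Path(root).iterdir() if p.is_dir())


# Demo on a throwaway directory with three hypothetical class folders.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("cat", "dog", "fish"):
        (Path(tmp) / name).mkdir()
    n_found = count_class_dirs(tmp)
    print(n_found)  # 3
```

If the count for your train directory is not 1000 (for example because of an extra nesting level or stray folders), the labels will fall outside the range the default model expects.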

chaeeon-lim · 2020-03-09T12:07:03Z

I have the same issue. Is there anyone who solved this issue? Please help me.

msaroufim · 2022-03-10T05:56:43Z

@lartpang seems to have the correct suggestion

msaroufim closed this as completed Mar 10, 2022
