Raise an error if Kinetics400 dataset is empty #2903

Closed
vfdev-5 opened this issue Oct 27, 2020 · 2 comments · Fixed by #3496

vfdev-5 (Collaborator) commented Oct 27, 2020

🚀 Feature

Currently, constructing a dataset on an empty folder succeeds silently; it would be nice to raise an error instead. Querying the length of such a dataset fails:

import torchvision
torchvision.__version__
> '0.8.0a0+e280f61'
dataset = torchvision.datasets.Kinetics400("/tmp/", frames_per_clip=10, step_between_clips=1, frame_rate=15)
len(dataset)
> Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/torchvision/torchvision/datasets/kinetics.py", line 70, in __len__
    return self.video_clips.num_clips()
  File "/torchvision/torchvision/datasets/video_utils.py", line 247, in num_clips
    return self.cumulative_sizes[-1]
IndexError: list index out of range
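The underlying failure can be reproduced in isolation: `num_clips()` returns `cumulative_sizes[-1]`, and with no videos found the list is empty, so the index fails. The sketch below uses a hypothetical stand-in class, not torchvision's actual `VideoClips`:

```python
# Hypothetical stand-in for torchvision's VideoClips, reduced to the part
# that fails: num_clips() indexes into cumulative_sizes, which is empty
# when no videos were found under the dataset root.
class VideoClipsStub:
    def __init__(self, video_paths):
        # With no videos there are no per-video clip counts to accumulate.
        self.cumulative_sizes = []

    def num_clips(self):
        # Raises IndexError on an empty list, mirroring the traceback above.
        return self.cumulative_sizes[-1]

clips = VideoClipsStub([])
try:
    clips.num_clips()
except IndexError as e:
    print(f"IndexError: {e}")  # IndexError: list index out of range
```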

Additional context

Without an explicit error for an empty dataset, the reference video_classification example fails inside the random sampler with a misleading error:

  File "/vision/torchvision/datasets/samplers/clip_sampler.py", line 175, in __iter__
    idxs_ = torch.cat(idxs)
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat. This usually means that this function requires a non-empty list of Tensors. Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

cc @pmeier @bjuncek

pmeier (Collaborator) commented Feb 25, 2021

Internally Kinetics400 uses datasets.folder.make_dataset() to collect the samples:

classes = list(sorted(list_dir(root)))
class_to_idx = {classes[i]: i for i in range(len(classes))}
self.samples = make_dataset(self.root, class_to_idx, extensions, is_valid_file=None)

We already have error handling like you propose in datasets.DatasetFolder:

samples = self.make_dataset(self.root, class_to_idx, extensions, is_valid_file)
if len(samples) == 0:
    msg = "Found 0 files in subfolders of: {}\n".format(self.root)
    if extensions is not None:
        msg += "Supported extensions are: {}".format(",".join(extensions))
    raise RuntimeError(msg)

Since DatasetFolder also relies on make_dataset(), I'm wondering if it would make sense to move this check into make_dataset() instead of implementing it in every dataset that uses it. @fmassa?
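One way the proposed consolidation could look is sketched below. This is not torchvision's actual implementation: the directory walk is simplified, and the function name and parameters only mirror the calls quoted above. The point is that the empty-result check lives inside make_dataset() itself, so every dataset calling it gets the error for free:

```python
import os

def make_dataset(directory, class_to_idx, extensions=None, is_valid_file=None):
    """Simplified sketch of make_dataset() with the empty-dataset check
    moved inside, as proposed. Not the real torchvision implementation."""
    if extensions is not None:
        def is_valid_file(path):
            return path.lower().endswith(tuple(extensions))
    samples = []
    for target_class in sorted(class_to_idx):
        class_dir = os.path.join(directory, target_class)
        if not os.path.isdir(class_dir):
            continue
        for root, _, fnames in sorted(os.walk(class_dir)):
            for fname in sorted(fnames):
                path = os.path.join(root, fname)
                if is_valid_file is None or is_valid_file(path):
                    samples.append((path, class_to_idx[target_class]))
    if len(samples) == 0:
        # The check from DatasetFolder, relocated: callers no longer need
        # to duplicate it.
        msg = "Found 0 files in subfolders of: {}\n".format(directory)
        if extensions is not None:
            msg += "Supported extensions are: {}".format(",".join(extensions))
        raise RuntimeError(msg)
    return samples
```

With this in place, an empty folder raises RuntimeError at construction time instead of surfacing later as an IndexError in num_clips() or a cryptic aten::_cat failure in the sampler.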

fmassa (Member) commented Mar 2, 2021

@pmeier sure, makes sense to move this check to be inside make_dataset
