Improve error handling in make_dataset #3496

Merged
merged 5 commits into from
Mar 24, 2021

Conversation

Collaborator

@pmeier pmeier commented Mar 3, 2021

Closes #3495, fixes #2903.

- with self.assertRaises(RuntimeError):
+ with self.assertRaises(FileNotFoundError):
Collaborator Author

This is BC-breaking if someone relies on DatasetFolder or ImageFolder raising a RuntimeError when no samples are found. Since the error type was never documented anywhere, I think this is fine.
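
The practical effect on callers follows from Python's exception hierarchy: FileNotFoundError derives from OSError, not RuntimeError, so an existing `except RuntimeError` handler no longer fires. Below is a minimal sketch of a handler that works both before and after this change. `load_folder` and `try_load` are hypothetical stand-ins for illustration, not torchvision functions, and the error messages are assumed.

```python
def load_folder(root, legacy=False):
    """Hypothetical stand-in for building a folder dataset from an empty root."""
    # Assumption: the old behaviour raised RuntimeError, the new one FileNotFoundError.
    if legacy:
        raise RuntimeError(f"Found 0 files in subfolders of: {root}")
    raise FileNotFoundError(f"Couldn't find any class folder in {root}.")


def try_load(root, legacy=False):
    """Return a short status string instead of propagating either error type."""
    try:
        load_folder(root, legacy=legacy)
    except (FileNotFoundError, RuntimeError) as error:
        # Catching both types keeps the caller working across the change.
        return f"no samples: {type(error).__name__}"
    return "ok"


# FileNotFoundError is an OSError subclass, so `except RuntimeError` alone misses it.
print(issubclass(FileNotFoundError, RuntimeError))  # False
print(try_load("data/empty", legacy=True))  # no samples: RuntimeError
print(try_load("data/empty"))  # no samples: FileNotFoundError
```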

Contributor

I'd say this is fine but might be worth checking if it breaks something in FBCODE.

if class_to_idx is None:
    _, class_to_idx = find_classes(directory)
elif not class_to_idx:
    raise ValueError("'class_to_index' must have at least one entry to collect any samples.")
Collaborator Author

This is BC-breaking, but in a "right" way. Before, make_dataset silently returned an empty list if class_to_idx was empty. IMO it is a reasonable assumption that no user calls make_dataset without having at least a single class. If you disagree with this assumption, we can get BC back by returning [] here.
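
To make the two branches concrete, here is a simplified, standard-library-only sketch of the behaviour described above. It is not the torchvision implementation: extension filtering and the `is_valid_file` hook are left out, and the FileNotFoundError message is an assumption.

```python
import os
import tempfile


def find_classes(directory):
    """List sub-directories of `directory` as classes, sorted by name."""
    classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
    if not classes:
        raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
    return classes, {name: index for index, name in enumerate(classes)}


def make_dataset(directory, class_to_idx=None):
    """Collect (path, class_index) pairs; simplified, no extension filtering."""
    if class_to_idx is None:
        _, class_to_idx = find_classes(directory)
    elif not class_to_idx:
        # An explicitly passed empty mapping is rejected instead of yielding [].
        raise ValueError("'class_to_index' must have at least one entry to collect any samples.")

    samples = []
    for name in sorted(class_to_idx):
        class_dir = os.path.join(directory, name)
        if not os.path.isdir(class_dir):
            continue
        for root, _, files in sorted(os.walk(class_dir)):
            for file_name in sorted(files):
                samples.append((os.path.join(root, file_name), class_to_idx[name]))
    return samples


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "cat"))
    open(os.path.join(root, "cat", "0.png"), "w").close()
    print(make_dataset(root))  # one (path, 0) pair for the "cat" class
    try:
        make_dataset(root, class_to_idx={})
    except ValueError as error:
        print(error)
```

Returning `[]` in the `elif` branch instead of raising would restore the old behaviour mentioned in the comment.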

Contributor

I think this makes sense, especially for the folder dataset.
Is there any other dataset that inherits this other than videodatasets?

Collaborator Author

> Is there any other dataset that inherits this other than videodatasets?

Nothing built-in.

Contributor

Sounds good to me then

Member

@pmeier DatasetFolder.make_dataset is public and the docstring is:

class_to_idx (Optional[Dict[str, int]]): Dictionary mapping class name to class index. If omitted, is generated by :func:`find_classes`

The problem is that users can override DatasetFolder.find_classes too, so there's a conflict with the find_classes() function: DatasetFolder.make_dataset will rely on the function rather than on the method.

Should we make class_to_idx a mandatory parameter in DatasetFolder.make_dataset to avoid any potential issue?

Member

To better explain myself: imagine a scenario where someone has a class MyCoolNewDataset(DatasetFolder) and they override MyCoolNewDataset.find_classes with a custom class_to_idx logic.

If they call MyCoolNewDataset.make_dataset while passing None to the class_to_idx parameter, what will be used is the class_to_idx logic from the find_classes function, instead of using the logic from the find_classes method - which is different since they overrode it.

Does that make sense?

To avoid such issues, I suggest forcing the user to pass class_to_idx in DatasetFolder.make_dataset, or more accurately, raising an error if None is passed to DatasetFolder.make_dataset.
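
The scenario can be reproduced without torchvision. In the sketch below, `FolderLike` is a hypothetical stand-in for DatasetFolder, `make_dataset` is reduced to returning the mapping it would use, and `make_dataset_strict` with its error message is an illustrative take on the suggestion, not the library's code.

```python
def find_classes(directory):
    """Module-level default: a fixed mapping, hard-coded for illustration."""
    return ["cat", "dog"], {"cat": 0, "dog": 1}


class FolderLike:
    """Hypothetical stand-in for DatasetFolder, reduced to the relevant calls."""

    def find_classes(self, directory):
        return find_classes(directory)

    @staticmethod
    def make_dataset(directory, class_to_idx=None):
        if class_to_idx is None:
            # Pitfall: this resolves to the module-level function, so a
            # subclass override of the find_classes method is never consulted.
            _, class_to_idx = find_classes(directory)
        return class_to_idx

    @staticmethod
    def make_dataset_strict(directory, class_to_idx=None):
        # Suggested alternative: refuse None so the caller must decide.
        if class_to_idx is None:
            raise ValueError("The class_to_idx parameter cannot be None.")
        return class_to_idx


class MyCoolNewDataset(FolderLike):
    def find_classes(self, directory):
        # Custom logic: treat everything as a single class.
        return ["animal"], {"animal": 0}


dataset = MyCoolNewDataset()
print(dataset.make_dataset("root"))  # {'cat': 0, 'dog': 1}, the override is ignored
_, mapping = dataset.find_classes("root")
print(dataset.make_dataset_strict("root", class_to_idx=mapping))  # {'animal': 0}
```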

codecov bot commented Mar 3, 2021

Codecov Report

Merging #3496 (87f151a) into master (19ad0bb) will increase coverage by 0.16%.
The diff coverage is 68.96%.

❗ Current head 87f151a differs from pull request most recent head 9fa4a7a. Consider uploading reports for the commit 9fa4a7a to get more accurate results

@@            Coverage Diff             @@
##           master    #3496      +/-   ##
==========================================
+ Coverage   78.83%   78.99%   +0.16%     
==========================================
  Files         105      105              
  Lines        9816     9793      -23     
  Branches     1581     1573       -8     
==========================================
- Hits         7738     7736       -2     
+ Misses       1588     1573      -15     
+ Partials      490      484       -6     
| Impacted Files | Coverage Δ | |
|---|---|---|
| torchvision/datasets/folder.py | 78.57% <60.86%> (-7.48%) | ⬇️ |
| torchvision/datasets/hmdb51.py | 94.82% <100.00%> (-0.18%) | ⬇️ |
| torchvision/datasets/kinetics.py | 95.65% <100.00%> (-0.35%) | ⬇️ |
| torchvision/datasets/ucf101.py | 93.33% <100.00%> (-0.29%) | ⬇️ |
| torchvision/utils.py | 59.57% <0.00%> (-2.53%) | ⬇️ |
| torchvision/transforms/transforms.py | 84.30% <0.00%> (ø) | |
| torchvision/datasets/voc.py | 94.50% <0.00%> (+0.06%) | ⬆️ |
| torchvision/transforms/functional_tensor.py | 79.84% <0.00%> (+0.56%) | ⬆️ |
| torchvision/models/detection/anchor_utils.py | 94.66% <0.00%> (+1.33%) | ⬆️ |
| torchvision/ops/deform_conv.py | 72.30% <0.00%> (+3.07%) | ⬆️ |

... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 19ad0bb...9fa4a7a.

Contributor

@bjuncek bjuncek left a comment

Looks good to me. Should we add a test for make_dataset with no classes as well, just to be on the safe side?

Also, I've added a comment, but are there any other datasets that rely on the folder dataset?

Contributor

@bjuncek bjuncek left a comment

Most of my questions answered. LGTM

@pmeier pmeier requested a review from fmassa March 16, 2021 13:34
Member

@fmassa fmassa left a comment

Thanks!

Comment on lines +205 to +206
@staticmethod
def _find_classes(dir: str) -> Tuple[List[str], Dict[str, int]]:
Member

@fmassa fmassa Mar 24, 2021

nit: I wouldn't have made this a staticmethod, as other instantiations of this dataset could rely on self for generating a custom set of classes and class ids

But this is not really important, so let's move forward with this PR and get this merged
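
A sketch of what the nit points at, using hypothetical classes rather than torchvision code: a lookup defined as a regular method can read per-instance configuration through self, which a staticmethod has no access to.

```python
class StaticVariant:
    """Hypothetical dataset whose class lookup is a staticmethod."""

    @staticmethod
    def _find_classes(directory):
        # No access to instance state here.
        return ["cat", "dog"], {"cat": 0, "dog": 1}


class InstanceVariant:
    """Hypothetical dataset whose class lookup is a regular method."""

    def __init__(self, skip=()):
        self.skip = set(skip)

    def _find_classes(self, directory):
        # Per-instance configuration can shape the class list.
        classes = [name for name in ["cat", "dog"] if name not in self.skip]
        return classes, {name: index for index, name in enumerate(classes)}


print(StaticVariant._find_classes("root")[1])  # {'cat': 0, 'dog': 1}
print(InstanceVariant(skip=["dog"])._find_classes("root")[1])  # {'cat': 0}
```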

@fmassa fmassa merged commit 0818c68 into pytorch:master Mar 24, 2021
@pmeier pmeier deleted the improve-make-dataset branch March 24, 2021 11:59
facebook-github-bot pushed a commit that referenced this pull request Apr 1, 2021
Summary:
* factor out find_classes

* use find_classes in video datasets

* adapt old tests

Reviewed By: fmassa

Differential Revision: D27433918

fbshipit-source-id: 60d8da2f222a19e0757197f5d38b6a9cce7694a8
Successfully merging this pull request may close these issues.

Improve error handling for empty directories in make_dataset
Raise an error if Kinetics400 dataset is empty