Cleanups for FLAVA datasets #5164

Merged
merged 14 commits into from
Jan 20, 2022

NicolasHug
Member

@NicolasHug NicolasHug commented Jan 5, 2022

Towards the end of #5108

All datasets but two have download=False as the default, so this PR sets the default to False for Food101 and DTD as well, for consistency. It also documents the download parameter for Food101, which was missing from the docstring.

See #5164 (comment) for the complete set of changes.

cc @pmeier

@NicolasHug NicolasHug added the "module: datasets" and "other" (if you have no clue or if you will manually handle the PR in the release notes) labels Jan 5, 2022
@facebook-github-bot

facebook-github-bot commented Jan 5, 2022

💊 CI failures summary and remediations

As of commit 3c70d81 (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build unittest_linux_cpu_py3.7 (1/1)

Step: "Run tests"

/root/project/torchvision/io/video.py:406: Runt...log: [mov,mp4,m4a,3gp,3g2,mj2] moov atom not found
test/test_image.py::test_decode_png[L-ImageReadMode.GRAY-palette_pytorch.png]
test/test_image.py::test_decode_png[RGB-ImageReadMode.RGB-palette_pytorch.png]
  /root/project/env/lib/python3.7/site-packages/PIL/Image.py:946: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
    "Palette images with Transparency expressed in bytes should be "

test/test_io.py::TestVideo::test_probe_video_from_memory
  /root/project/torchvision/io/_video_opt.py:423: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /opt/conda/conda-bld/pytorch_1642552271656/work/torch/csrc/utils/tensor_new.cpp:998.)
    video_data = torch.frombuffer(video_data, dtype=torch.uint8)

test/test_io.py::TestVideo::test_read_video_timestamps_corrupted_file
  /root/project/torchvision/io/video.py:406: RuntimeWarning: Failed to open container for /tmp/tmprbxm6fs4.mp4; Caught error: [Errno 1094995529] Invalid data found when processing input: '/tmp/tmprbxm6fs4.mp4'; last error log: [mov,mp4,m4a,3gp,3g2,mj2] moov atom not found
    warnings.warn(msg, RuntimeWarning)

test/test_models.py::test_memory_efficient_densenet[densenet121]
test/test_models.py::test_memory_efficient_densenet[densenet169]
test/test_models.py::test_memory_efficient_densenet[densenet201]
test/test_models.py::test_memory_efficient_densenet[densenet161]
  /root/project/env/lib/python3.7/site-packages/torch/autocast_mode.py:162: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
    warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')

test/test_models.py::test_inception_v3_eval

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet.


This comment was automatically generated by Dr. CI (expand for details).

@pmeier pmeier mentioned this pull request Jan 5, 2022
@NicolasHug
Member Author

NicolasHug commented Jan 5, 2022

Thanks for the review @pmeier. Following up on your #5130 (comment), let's use this PR to:

  • make sure all datasets have download=False
  • make sure all datasets have download after the transforms parameter.
  • change train parameter into split for consistency across these datasets
  • change name of DTD parameter partition. Update: we kept partition in DTD and instead removed it from SUN397, along with its split parameter, because the train/test splits are only defined relative to a partition. Since each partition contains only a subset of the data, we decided not to include it, at least for now. In addition, our goal is to support the FLAVA implementation, and the original implementation (https://github.com/facebookresearch/vissl/blob/main/extra_scripts/datasets/create_sun397_data_files.py#L92) relies on a custom-made split. Because this split is arbitrary and non-standard, we can't support it directly in torchvision.
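
To make the conventions in the list above concrete, here is a minimal sketch of a constructor that follows them. It is only an illustration: ExampleDataset and _VALID_SPLITS are hypothetical names, not torchvision code, and the exact parameter list is an assumption.

```python
import inspect
from typing import Callable, Optional


class ExampleDataset:
    """Hypothetical dataset used only to illustrate the parameter conventions."""

    _VALID_SPLITS = ("train", "test")  # assumption: the splits this example accepts

    def __init__(
        self,
        root: str,
        split: str = "train",  # a string `split` instead of a boolean `train`
        transform: Optional[Callable] = None,
        target_transform: Optional[Callable] = None,
        download: bool = False,  # defaults to False and comes after the transforms
    ) -> None:
        if split not in self._VALID_SPLITS:
            raise ValueError(f"Unknown split {split!r}, expected one of {self._VALID_SPLITS}")
        self.root = root
        self.split = split
        self.transform = transform
        self.target_transform = target_transform
        self.download = download


# Inspect the signature to confirm the ordering and the default.
params = inspect.signature(ExampleDataset.__init__).parameters
param_names = list(params)
```

With this layout, ExampleDataset("data", split="test") never triggers a download unless download=True is passed explicitly.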

I'll mark it as draft and we can come back to this once the rest of the PRs are merged.

EDIT: "all datasets" == all datasets that haven't been released yet.

@NicolasHug NicolasHug marked this pull request as draft January 5, 2022 15:31
@pmeier
Collaborator

pmeier commented Jan 6, 2022

@NicolasHug You are only talking about the "FLAVA" datasets here, right? Because for other datasets that would be BC breaking and I want to avoid that, since we probably don't have time for a deprecation cycle before the API is deprecated in general.

@NicolasHug
Member Author

Fully agreed @pmeier, sorry for not being clearer in my comment above.

@pmeier
Collaborator

pmeier commented Jan 17, 2022

import h5py # type: ignore[import]

is an anti-pattern. Since h5py has no annotations, it is better to ignore it globally, as in the following entry, rather than locally:

vision/mypy.ini

Lines 117 to 119 in 4946827

[mypy-torchdata.*]
ignore_missing_imports = True
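
As a rough sketch of the global alternative, the snippet below holds a possible mypy.ini entry for h5py in a string and parses it with the standard-library configparser, just to check that it is well-formed. The section name mypy-h5py.* is an assumption modelled on the torchdata entry quoted above, not a quote from the PR.

```python
import configparser

# Local suppression, repeated at every import site (the pattern argued against):
#     import h5py  # type: ignore[import]

# Global alternative: one per-module section in mypy.ini.
# Assumption: the section name `mypy-h5py.*` mirrors the `mypy-torchdata.*` entry.
MYPY_INI_SNIPPET = """
[mypy-h5py.*]
ignore_missing_imports = True
"""

# Parse the snippet only to confirm it is valid INI with the expected key.
parser = configparser.ConfigParser()
parser.read_string(MYPY_INI_SNIPPET)
ignore_h5py = parser.getboolean("mypy-h5py.*", "ignore_missing_imports")
```

The point of the global form is that the import line itself stays free of type-checker comments.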

@NicolasHug NicolasHug changed the title from "Change default of download for Food101 and DTD" to "Cleanups for FLAVA datasets" Jan 18, 2022
@NicolasHug NicolasHug marked this pull request as ready for review January 18, 2022 15:09
@pmeier pmeier self-requested a review January 18, 2022 16:14
Collaborator

@pmeier pmeier left a comment

One minor comment inline. Plus let's also resolve #5220 (comment). Otherwise, LGTM if CI is green! Thanks @NicolasHug

- for p in (self._base_folder / self._split_to_folder[self._split]).glob("**/*.png"):
-     self._labels.append(self.class_to_idx[p.parent.name])
-     self._image_files.append(p)
+ self._samples = make_dataset(str(self._base_folder / self._split_to_folder[self._split]), extensions="png")
Collaborator

Suggested change
- self._samples = make_dataset(str(self._base_folder / self._split_to_folder[self._split]), extensions="png")
+ self._samples = make_dataset(str(self._base_folder / self._split_to_folder[self._split]), extensions=(".png",))

Member Author

I believe it ultimately comes down to endswith which accepts tuples but also just plain strings.
I think the type annotations are incorrect here.
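
The endswith behaviour mentioned here is easy to check in isolation. The sketch below uses a hypothetical helper, has_allowed_extension, which is not torchvision's own function; it only assumes that the extension check reduces to a call to str.endswith.

```python
from typing import Tuple, Union


def has_allowed_extension(filename: str, extensions: Union[str, Tuple[str, ...]]) -> bool:
    """Hypothetical helper: True if `filename` ends with one of `extensions`."""
    # str.endswith accepts a single suffix or a tuple of suffixes.
    return filename.lower().endswith(extensions)


plain = has_allowed_extension("image_001.png", "png")        # plain string suffix
as_tuple = has_allowed_extension("image_001.png", (".png",))  # tuple of suffixes
other = has_allowed_extension("notes.txt", (".png", ".jpg"))  # no suffix matches
```

Both call styles work at runtime, which is why only the annotation would need to change. One practical difference: a bare "png" suffix also matches a name such as "mypng" that has no dot, so the (".png",) form is slightly stricter.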

Collaborator

Let me send a PR to fix that.

@NicolasHug
Member Author

Failure is unrelated, I'll merge. Thanks for the review!

@NicolasHug NicolasHug merged commit e047623 into pytorch:main Jan 20, 2022
facebook-github-bot pushed a commit that referenced this pull request Jan 26, 2022
Summary:
* Change default of download for Food101 and DTD

* Set download default to False and put it at the end

* Keep stuff private

* GTSRB: train -> split. Also use pathlib

* mypy

* Remove split and partition for SUN397

* mypy

* mypy

* move download param for SST2

* Use make_dataset in SST2

* Use a base URL for GTSRB

* Let's make this code more complicated than it needs to be because why not

Reviewed By: jdsgomes, prabhat00155

Differential Revision: D33739381

fbshipit-source-id: a2bcfcdc2296ffe62f8e75c8107ff1d0a87957f1
Labels
ciflow/default, cla signed, module: datasets, other (if you have no clue or if you will manually handle the PR in the release notes)
3 participants