
Questions about prototype builtin datasets using torchdata #7609


Closed
ain-soph opened this issue May 20, 2023 · 1 comment


@ain-soph
Contributor

ain-soph commented May 20, 2023

Hi all, I'm currently exploring builtin datasets with new standards:
https://github.com/pytorch/vision/blob/main/torchvision/prototype/datasets

Let's take Cifar10 as an example. I have several questions:

  1. Why are all datasets constructed as iter-style rather than map-style? When I have an index (e.g., 2331), I can no longer use dataset[2331] as with the old CIFAR10.
    In this case, how do I get an item from the new-format dataset? Do I have to use IterToMapConverter? That would be quite strange, because the raw data is map-style: I would make it iter-style, then traverse it to convert it back to map-style.
  2. What does hint_shuffling do?
    def hint_shuffling(datapipe: IterDataPipe[D]) -> Shuffler[D]:
        return Shuffler(datapipe, buffer_size=INFINITE_BUFFER_SIZE).set_shuffle(False)
    It's used in all prototype datasets. It seems to wrap the datapipe with a Shuffler but then calls set_shuffle(False). Doesn't that do nothing?
  3. When should I use Decompressor, and when should I set resource.preprocess='decompress' or 'extract'?
    What's the difference among Decompressor, resource.preprocess='decompress', resource.preprocess='extract', and using nothing?
    • The Cifar10 resource is cifar-10-python.tar.gz and sets nothing. By default, OnlineResource.load calls _guess_archive_loader to generate a TarArchiveLoader.
    • The MNIST resource is train-images-idx3-ubyte.gz and uses a Decompressor.
    • The cub200 resource is CUB_200_2011.tgz and uses decompress=True.
  4. How do I use a Transform in the new dataset API, such as AutoAugment or RandomCrop? I'm especially interested in ToTensor, or transforms.PILToTensor() followed by transforms.ConvertImageDtype(torch.float), since the prototype datasets return uint8 Tensors. From the Transform V2 Tutorial Page, I assume that transforms are no longer embedded in the Dataset, because it doesn't accept transform or target_transform args. Then how can I fetch augmented data from the DataLoader?
  5. For datasets where each image is stored in an encoded image format (the old ImageFolder type, e.g., ImageNet, GTSRB), the output image format is EncodedImage -> EncodedData -> Datapoint. For datasets stored in binary (e.g., MNIST and CIFAR), the output image format is Image -> Datapoint. Why are they different? I see that most transform V2 APIs operate on Image. Why is EncodedImage used here?
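
To make the trade-off in question 1 concrete, here is a minimal sketch in plain Python. It does not use torchdata or the prototype datasets; `iter_samples`, `get_item_by_scan`, and `to_map` are hypothetical stand-ins. It shows the two generic ways to reach index 2331 in an iterable-only source: scanning forward each time, or materializing the stream once into a key-to-value mapping, which is presumably the idea behind a converter like IterToMapConverter.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Tuple


def iter_samples(n: int) -> Iterator[Tuple[int, str]]:
    """Hypothetical iter-style source yielding (index, sample) pairs."""
    for i in range(n):
        yield i, f"sample-{i}"


def get_item_by_scan(source: Iterable[Tuple[int, str]], index: int) -> str:
    """Reach one item by skipping `index` elements; costs O(index) per lookup."""
    _, sample = next(islice(source, index, None))
    return sample


def to_map(source: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Traverse the whole stream once and keep it in memory for O(1) lookups."""
    return dict(source)


# Option A: rescan the stream for every lookup.
item_by_scan = get_item_by_scan(iter_samples(5000), 2331)

# Option B: pay one full traversal up front, then index freely.
lookup = to_map(iter_samples(5000))
item_by_map = lookup[2331]
```

Either way, the whole prefix (or the whole dataset) has to be traversed at least once, which is the overhead the question is pointing at.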
@NicolasHug
Member

Hi @ain-soph and sorry for the silence... I was kind of waiting for this to be finally announced officially: https://github.com/pytorch/data/#torchdata-see-note-below-on-current-status

I'll try to provide very brief answers to your questions below

  1. It's never been clear to me why map-style datapipes even exist
  2. The shuffling hint makes sure shuffling happens where it needs to happen if users set shuffle=True in the DataLoader. It's pretty awful. But shuffling absolutely needs to happen before sharding (Notes on shuffling, sharding, and batchsize data#302), and this is the only way we found to prevent users from shooting themselves in the foot.
  3. Honestly, IDK. I don't recommend relying on this
  4. How to use Transform in the new dataset API? Don't - we're not gonna release those datasets anytime soon
  5. Why is EncodedImage used here? Not sure honestly, probably relics of past designs that we haven't updated
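
The ordering constraint in answer 2 can be illustrated with a small plain-Python sketch. This is not torchdata's implementation; `shard`, `shard_then_shuffle`, and `shuffle_then_shard` are hypothetical helpers, and round-robin sharding is assumed for simplicity. If each worker shards first and shuffles afterwards, a worker sees the same subset of samples every epoch and only their order changes. Shuffling the full stream first lets the subsets themselves vary from epoch to epoch.

```python
import random
from typing import List


def shard(items: List[int], num_workers: int, worker_id: int) -> List[int]:
    """Round-robin sharding (an assumption): worker w takes every num_workers-th item."""
    return items[worker_id::num_workers]


def shard_then_shuffle(data: List[int], num_workers: int, seed: int) -> List[List[int]]:
    """Wrong order: each worker shuffles only within its fixed shard."""
    parts = []
    for w in range(num_workers):
        part = shard(data, num_workers, w)  # slicing returns a new list
        random.Random(seed + w).shuffle(part)
        parts.append(part)
    return parts


def shuffle_then_shard(data: List[int], num_workers: int, seed: int) -> List[List[int]]:
    """Right order: shuffle globally, then split across workers."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return [shard(shuffled, num_workers, w) for w in range(num_workers)]


data = list(range(8))
seeds = range(20)  # one seed per simulated epoch

# Which samples does worker 0 see in each epoch, ignoring order?
fixed_sets = {frozenset(shard_then_shuffle(data, 2, s)[0]) for s in seeds}
varying_sets = {frozenset(shuffle_then_shard(data, 2, s)[0]) for s in seeds}
```

Here `fixed_sets` contains a single subset, the even-indexed samples, while `varying_sets` contains several. Per the answer above, the hint exists so that when a user passes shuffle=True to the DataLoader, the shuffle lands before sharding rather than after it.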
