
add API for new style datasets #4473


Merged
merged 3 commits into pytorch:main, Sep 24, 2021

Conversation

pmeier
Collaborator

@pmeier pmeier commented Sep 23, 2021

This PR is the second precursor for the new style datasets. The biggest change is that we are moving away from map-style datasets, i.e. datasets implementing __getitem__, towards iterable-style datasets. This has a lot of advantages (and of course also some disadvantages) that I'm not going to list here, but rather in a post when all of this is officially announced.
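To make the map-style vs. iterable-style distinction concrete, here is a minimal sketch with two toy classes; these are illustrations, not code from the PR:

```python
class MapStyleDataset:
    """Random access: samples are addressed by index via __getitem__."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __getitem__(self, idx):
        return self.samples[idx]

    def __len__(self):
        return len(self.samples)


class IterStyleDataset:
    """Sequential access: samples are produced by iteration, which also
    works for streaming sources whose length is not known upfront."""

    def __init__(self, sample_factory):
        # sample_factory is called on each __iter__, so the dataset
        # can be iterated more than once
        self.sample_factory = sample_factory

    def __iter__(self):
        yield from self.sample_factory()


map_ds = MapStyleDataset(["a", "b", "c"])
iter_ds = IterStyleDataset(lambda: iter(["a", "b", "c"]))

assert map_ds[1] == "b"
assert list(iter_ds) == ["a", "b", "c"]
```

Datapipes compose the iterable style: each stage wraps another iterable and transforms the stream of samples.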

How do we use the new style datasets?

For users the biggest change is that datasets are no longer instantiated from specific classes. Instead, the torchvision.prototype.datasets API now has five top-level functions:

  1. home(): In contrast to the current behavior, the new style datasets will handle storing the data automatically. home() returns the root of the tree, which defaults to ~/.cache/torch/datasets/vision. By passing a path to home() the root can be moved to any other location.
  2. list(): Returns a list of all available datasets.
  3. info(name): Returns static info about a dataset like the name, homepage, categories, or the available options to load the dataset.
  4. load(name, **kwargs): Returns the given dataset as a datapipe, in which each sample is a dictionary containing all the information.
  5. register(): This is used to register datasets (more on what this actually is later) with our API. Since datasets.load() is a thin convenience wrapper, we might also not expose this and only use it for the built-in datasets (which are not part of this PR).

So the workflow that currently looks like this

from torchvision.datasets import MNIST

dataset = MNIST(root="/path/to/root", train=False)

will now look like

from torchvision.prototype import datasets

dataset = datasets.load("mnist", split="test")

(note that the example cannot be run yet, because there are no built-in datasets in this PR)
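The five functions can be mimicked with a plain dictionary registry. The sketch below is a toy re-implementation to make the workflow above runnable; the names mirror the PR, but the internals (and the registered "mnist" stand-in) are invented:

```python
import pathlib

_HOME = pathlib.Path("~/.cache/torch/datasets/vision")
_REGISTRY = {}


def home(root=None):
    # with no argument, return the current root; with one, move it
    global _HOME
    if root is not None:
        _HOME = pathlib.Path(root)
    return _HOME


def register(name, info, loader):
    _REGISTRY[name] = (info, loader)


def list_datasets():
    # called `datasets.list()` in the PR; renamed here to avoid
    # shadowing the builtin in a flat script
    return sorted(_REGISTRY)


def info(name):
    return _REGISTRY[name][0]


def load(name, **options):
    _info, loader = _REGISTRY[name]
    return loader(home(), **options)


# Register a stand-in "mnist" so the workflow can be exercised:
register(
    "mnist",
    info={"categories": [str(i) for i in range(10)], "splits": ("train", "test")},
    loader=lambda root, split="train": iter([{"split": split, "root": str(root)}]),
)

sample = next(load("mnist", split="test"))
assert sample["split"] == "test"
```

The real API additionally validates the options against the dataset's DatasetInfo, as shown further down.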

How do we implement the new style datasets?

There are three closely connected objects, all importable from torchvision.prototype.datasets.utils. In most cases, only the first two will be used:

  • DatasetInfo: Static information about the dataset. See datasets.info() above.
  • DatasetConfig: Namespace containing the configuration of the dataset, for example which split to use.

The last object, Dataset, is the one that actually holds the information about how to load the dataset. It has three abstract methods that need to be overridden:

  1. info (property): Returns the DatasetInfo with the static information about the dataset. This is for example used to instantiate the DatasetConfig from the additional parameters passed to datasets.load().
  2. resources: Returns a list of OnlineResource objects that need to be present to load the current config. They will be downloaded and checked automatically.
  3. _make_datapipe: The heart of the implementation. The method receives a list of datapipes corresponding to the output of resources, the current config, and optionally a shuffler and a decoder. The implementation is highly dataset specific, so examples will only be showcased in follow-up PRs that add some fundamental datasets.
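The shape of the three abstract methods can be sketched with a self-contained mock; the real base class lives in torchvision.prototype.datasets.utils, and this mirrors only its structure (FakeDataset and its return values are invented):

```python
import abc


class DatasetInfo:
    """Stand-in for the static-information object."""

    def __init__(self, name, **options):
        self.name = name
        self.options = options


class Dataset(abc.ABC):
    @property
    @abc.abstractmethod
    def info(self) -> DatasetInfo:
        """Static information about the dataset."""

    @abc.abstractmethod
    def resources(self, config):
        """Resources that need to be present to load the given config."""

    @abc.abstractmethod
    def _make_datapipe(self, resource_dps, *, config):
        """Wire the resource datapipes into a datapipe of samples."""


class FakeDataset(Dataset):
    @property
    def info(self):
        return DatasetInfo("fake", splits=("train", "test"))

    def resources(self, config):
        # the real method returns OnlineResource objects; strings suffice here
        return ["fake-archive.tar"]

    def _make_datapipe(self, resource_dps, *, config):
        # each sample is a dictionary, as described above
        return iter({"data": dp} for dp in resource_dps)


ds = FakeDataset()
dp = ds._make_datapipe(ds.resources(None), config=None)
assert next(dp) == {"data": "fake-archive.tar"}
```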

cc @pmeier @mthrok

return f"{prefix}\n{body}\n{postfix}"


class DatasetConfig(Mapping):
pmeier (Collaborator, Author) commented:
This class looks a lot like overkill, but it actually is not:

  1. Since lambdas are "not welcome" in datapipes that take a callable (a warning will be thrown), we often need to use functools.partial(..., config=config). For that, config has to be hashable. Hence, we implement the __hash__ and __eq__ methods and fail in case __setitem__ or __delitem__ are invoked.
  2. For convenience it is much nicer to use a DatasetConfig as a namespace than as a dictionary, so we additionally implement __getattr__, __setattr__, and __delattr__.
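The two properties argued for above (hashable, with attribute access) can be sketched as a small read-only Mapping; this is an illustration of the idea, not the actual DatasetConfig implementation:

```python
import functools
from collections.abc import Mapping


class Config(Mapping):
    def __init__(self, **kwargs):
        # bypass our own __setattr__, which rejects mutation
        object.__setattr__(self, "_data", dict(kwargs))

    # Mapping protocol
    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    # 1. hashable, so the config survives functools.partial in datapipes
    def __hash__(self):
        return hash(tuple(sorted(self._data.items())))

    def __eq__(self, other):
        return isinstance(other, Config) and self._data == other._data

    # 2. attribute access for convenience; mutation is rejected
    def __getattr__(self, name):
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        raise RuntimeError("Config is immutable")


config = Config(split="test")
assert config.split == "test" and config["split"] == "test"

fn = functools.partial(lambda sample, config: (sample, config.split), config=config)
assert fn("x") == ("x", "test")
```

Note that because the class defines __eq__, __hash__ must be defined explicitly, otherwise Python would set it to None and the instances would become unhashable.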

Comment on lines +139 to +158
for name, arg in options.items():
    if name not in self._valid_options:
        raise ValueError(
            add_suggestion(
                f"Unknown option '{name}' of dataset {self.name}.",
                word=name,
                possibilities=sorted(self._valid_options.keys()),
            )
        )

    valid_args = self._valid_options[name]

    if arg not in valid_args:
        raise ValueError(
            add_suggestion(
                f"Invalid argument '{arg}' for option '{name}' of dataset {self.name}.",
                word=arg,
                possibilities=valid_args,
            )
        )
pmeier (Collaborator, Author) commented:
We can move this to a dedicated function, so we have the ability to override the default option checking.
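Factored into a dedicated function, the check could look like the sketch below. It is self-contained, so add_suggestion is reimplemented here as a guess at the helper's behavior using difflib; the real helper may differ:

```python
import difflib


def add_suggestion(msg, *, word, possibilities):
    # append a "did you mean" hint when a close match exists
    matches = difflib.get_close_matches(word, possibilities, n=1)
    if matches:
        return f"{msg} Did you mean '{matches[0]}'?"
    return msg


def check_options(name, valid_options, options):
    """Validate user-supplied options against a dataset's valid options."""
    for option, arg in options.items():
        if option not in valid_options:
            raise ValueError(
                add_suggestion(
                    f"Unknown option '{option}' of dataset {name}.",
                    word=option,
                    possibilities=sorted(valid_options),
                )
            )
        if arg not in valid_options[option]:
            raise ValueError(
                add_suggestion(
                    f"Invalid argument '{arg}' for option '{option}' of dataset {name}.",
                    word=arg,
                    possibilities=valid_options[option],
                )
            )


valid = {"split": ("train", "test")}
check_options("mnist", valid, {"split": "test"})  # passes silently

try:
    check_options("mnist", valid, {"splid": "test"})
except ValueError as exc:
    assert "Did you mean 'split'" in str(exc)
```

A dataset that needs custom validation could then simply override the dedicated function instead of the whole option-handling loop.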

def to_datapipe(self, root: Union[str, pathlib.Path]) -> IterDataPipe:
    path = (pathlib.Path(root) / self.file_name).expanduser().resolve()
    # FIXME
    return FileLoader(IterableWrapper((str(path),)))
pmeier (Collaborator, Author) commented:
This is the greatest weak point right now. Functionality to download files and verify their checksums is present in torchdata, but it is not public yet.
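Until those torchdata helpers become public, checksum verification can be bridged with the standard library. This is a stopgap sketch, not the PR's code:

```python
import hashlib
import pathlib
import tempfile


def verify_sha256(path, expected_hex, chunk_size=1024 * 1024):
    """Stream the file through SHA-256 and compare against the expected digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # read in chunks so large archives do not need to fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_hex:
        raise RuntimeError(f"Checksum mismatch for {path}")
    return path


# Usage with a throwaway file:
tmp = pathlib.Path(tempfile.mkdtemp()) / "resource.bin"
tmp.write_bytes(b"hello")
expected = hashlib.sha256(b"hello").hexdigest()
assert verify_sha256(tmp, expected) == tmp
```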

@fmassa (Member) left a comment:

I made a few comments but they can be addressed in follow-up PRs.

Thanks!

@fmassa fmassa merged commit 9252436 into pytorch:main Sep 24, 2021
@pmeier pmeier deleted the datasets/api branch September 24, 2021 13:34
facebook-github-bot pushed a commit that referenced this pull request Sep 30, 2021
Summary:
* add API for new style datasets

* cleanup

Reviewed By: datumbox

Differential Revision: D31268026

fbshipit-source-id: 6bab56ddc0e3b18e1996300d8f472353daf76821

Co-authored-by: Francisco Massa <[email protected]>
husthyc added a commit that referenced this pull request Oct 22, 2021
* add API for new style datasets

* cleanup

Co-authored-by: Francisco Massa <[email protected]>

[ghstack-poisoned]
husthyc added a commit that referenced this pull request Oct 22, 2021
* add API for new style datasets

* cleanup

Co-authored-by: Francisco Massa <[email protected]>

[ghstack-poisoned]
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021
* add API for new style datasets

* cleanup

Co-authored-by: Francisco Massa <[email protected]>