
add API for new style datasets #4473


Merged
merged 3 commits into pytorch:main, Sep 24, 2021

Conversation

pmeier
Collaborator

@pmeier pmeier commented Sep 23, 2021

This PR is the second precursor for the new style datasets. The biggest change is that we are moving away from map-style datasets, i.e. datasets implementing __getitem__, towards iterable-style datasets. This has a lot of advantages (and of course also some disadvantages) that I'm not going to list here, but rather in a post when all of this is officially announced.
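To make the map-style vs. iterable-style distinction concrete, here is a minimal sketch with two toy classes; these are illustrations, not code from the PR:

```python
class MapStyleDataset:
    """Random access: samples are addressed by index via __getitem__."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __getitem__(self, idx):
        return self.samples[idx]

    def __len__(self):
        return len(self.samples)


class IterStyleDataset:
    """Sequential access: samples are produced by iteration, which also
    works for streaming sources whose length is not known upfront."""

    def __init__(self, sample_factory):
        # sample_factory is called on each __iter__, so the dataset
        # can be iterated more than once
        self.sample_factory = sample_factory

    def __iter__(self):
        yield from self.sample_factory()


map_ds = MapStyleDataset(["a", "b", "c"])
iter_ds = IterStyleDataset(lambda: iter(["a", "b", "c"]))

assert map_ds[1] == "b"
assert list(iter_ds) == ["a", "b", "c"]
```

Datapipes compose the iterable style: each stage wraps another iterable and transforms the stream of samples.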

How do we use the new style datasets?

For users the biggest change is that datasets are no longer instantiated from specific classes. Instead, the torchvision.prototype.datasets API now has five top-level functions:

  1. home(): In contrast to the current behavior, the new style datasets will handle storing the data automatically. home() returns the root of the tree, which defaults to ~/.cache/torch/datasets/vision. By passing a path to home() the root can be moved to any other location.
  2. list(): Returns a list of all available datasets.
  3. info(name): Returns static info about a dataset like the name, homepage, categories, or the available options to load the dataset.
  4. load(name, **kwargs): Returns the given dataset as a datapipe, in which each sample is a dictionary containing all the information.
  5. register(): This is used to register datasets (more on what this actually is later) with our API. Since datasets.load() is a thin convenience wrapper, we might also not expose this and only use it for the built-in datasets (which are not part of this PR).

So the workflow that currently looks like this

from torchvision.datasets import MNIST

dataset = MNIST(root="/path/to/root", train=False)

will now look like

from torchvision.prototype import datasets

dataset = datasets.load("mnist", split="test")

(note that the example cannot be run yet, because there are no built-in datasets in this PR)
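The five functions can be mimicked with a plain dictionary registry. The sketch below is a toy re-implementation to make the workflow above runnable; the names mirror the PR, but the internals (and the registered "mnist" stand-in) are invented:

```python
import pathlib

_HOME = pathlib.Path("~/.cache/torch/datasets/vision")
_REGISTRY = {}


def home(root=None):
    # with no argument, return the current root; with one, move it
    global _HOME
    if root is not None:
        _HOME = pathlib.Path(root)
    return _HOME


def register(name, info, loader):
    _REGISTRY[name] = (info, loader)


def list_datasets():
    # called `datasets.list()` in the PR; renamed here to avoid
    # shadowing the builtin in a flat script
    return sorted(_REGISTRY)


def info(name):
    return _REGISTRY[name][0]


def load(name, **options):
    _info, loader = _REGISTRY[name]
    return loader(home(), **options)


# Register a stand-in "mnist" so the workflow can be exercised:
register(
    "mnist",
    info={"categories": [str(i) for i in range(10)], "splits": ("train", "test")},
    loader=lambda root, split="train": iter([{"split": split, "root": str(root)}]),
)

sample = next(load("mnist", split="test"))
assert sample["split"] == "test"
```

The real API additionally validates the options against the dataset's DatasetInfo, as shown further down.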

How do we implement the new style datasets?

There are three closely connected objects, all importable from torchvision.prototype.datasets.utils. In most cases, only the first two will be used:

  • DatasetInfo: Static information about the dataset. See datasets.info() above.
  • DatasetConfig: Namespace containing the configuration of the dataset, for example which split to use.

The last object, Dataset, is the one that actually holds the information about how to load the dataset. It has three abstract methods that need to be overridden:

  1. info (property): Returns the DatasetInfo with the static information about the dataset. This is for example used to instantiate the DatasetConfig from the additional parameters passed to datasets.load().
  2. resources: Returns a list of OnlineResource objects that need to be present to load the current config. They will be downloaded and checked automatically.
  3. _make_datapipe: The heart of the implementation. The method receives a list of datapipes corresponding to the output of resources, the current config, and optionally a shuffler and a decoder. The implementation is highly dataset specific, so examples will only be showcased in follow-up PRs that add some fundamental datasets.
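The shape of the three abstract methods can be sketched with a self-contained mock; the real base class lives in torchvision.prototype.datasets.utils, and this mirrors only its structure (FakeDataset and its return values are invented):

```python
import abc


class DatasetInfo:
    """Stand-in for the static-information object."""

    def __init__(self, name, **options):
        self.name = name
        self.options = options


class Dataset(abc.ABC):
    @property
    @abc.abstractmethod
    def info(self) -> DatasetInfo:
        """Static information about the dataset."""

    @abc.abstractmethod
    def resources(self, config):
        """Resources that need to be present to load the given config."""

    @abc.abstractmethod
    def _make_datapipe(self, resource_dps, *, config):
        """Wire the resource datapipes into a datapipe of samples."""


class FakeDataset(Dataset):
    @property
    def info(self):
        return DatasetInfo("fake", splits=("train", "test"))

    def resources(self, config):
        # the real method returns OnlineResource objects; strings suffice here
        return ["fake-archive.tar"]

    def _make_datapipe(self, resource_dps, *, config):
        # each sample is a dictionary, as described above
        return iter({"data": dp} for dp in resource_dps)


ds = FakeDataset()
dp = ds._make_datapipe(ds.resources(None), config=None)
assert next(dp) == {"data": "fake-archive.tar"}
```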

cc @pmeier @mthrok

return f"{prefix}\n{body}\n{postfix}"


class DatasetConfig(Mapping):
pmeier (Collaborator, Author) commented:
This class looks a lot like overkill, but it actually is not:

  1. Since lambdas are "not welcome" in datapipes that take a callable (a warning will be thrown), we often need to use functools.partial(..., config=config). For that, config has to be hashable. Hence, we implement the __hash__ and __eq__ methods and fail in case __setitem__ or __delitem__ are invoked.
  2. For convenience it is much nicer to use a DatasetConfig as a namespace than as a dictionary, so we additionally implement __getattr__, __setattr__, and __delattr__.
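The two properties argued for above (hashable, with attribute access) can be sketched as a small read-only Mapping; this is an illustration of the idea, not the actual DatasetConfig implementation:

```python
import functools
from collections.abc import Mapping


class Config(Mapping):
    def __init__(self, **kwargs):
        # bypass our own __setattr__, which rejects mutation
        object.__setattr__(self, "_data", dict(kwargs))

    # Mapping protocol
    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    # 1. hashable, so the config survives functools.partial in datapipes
    def __hash__(self):
        return hash(tuple(sorted(self._data.items())))

    def __eq__(self, other):
        return isinstance(other, Config) and self._data == other._data

    # 2. attribute access for convenience; mutation is rejected
    def __getattr__(self, name):
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        raise RuntimeError("Config is immutable")


config = Config(split="test")
assert config.split == "test" and config["split"] == "test"

fn = functools.partial(lambda sample, config: (sample, config.split), config=config)
assert fn("x") == ("x", "test")
```

Note that because the class defines __eq__, __hash__ must be defined explicitly, otherwise Python would set it to None and the instances would become unhashable.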

Comment on lines +139 to +158
for name, arg in options.items():
    if name not in self._valid_options:
        raise ValueError(
            add_suggestion(
                f"Unknown option '{name}' of dataset {self.name}.",
                word=name,
                possibilities=sorted(self._valid_options.keys()),
            )
        )

    valid_args = self._valid_options[name]

    if arg not in valid_args:
        raise ValueError(
            add_suggestion(
                f"Invalid argument '{arg}' for option '{name}' of dataset {self.name}.",
                word=arg,
                possibilities=valid_args,
            )
        )
pmeier (Collaborator, Author) commented:
We can move this to a dedicated function, so we have the ability to override the default option checking.
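Factored into a dedicated function, the check could look like the sketch below. It is self-contained, so add_suggestion is reimplemented here as a guess at the helper's behavior using difflib; the real helper may differ:

```python
import difflib


def add_suggestion(msg, *, word, possibilities):
    # append a "did you mean" hint when a close match exists
    matches = difflib.get_close_matches(word, possibilities, n=1)
    if matches:
        return f"{msg} Did you mean '{matches[0]}'?"
    return msg


def check_options(name, valid_options, options):
    """Validate user-supplied options against a dataset's valid options."""
    for option, arg in options.items():
        if option not in valid_options:
            raise ValueError(
                add_suggestion(
                    f"Unknown option '{option}' of dataset {name}.",
                    word=option,
                    possibilities=sorted(valid_options),
                )
            )
        if arg not in valid_options[option]:
            raise ValueError(
                add_suggestion(
                    f"Invalid argument '{arg}' for option '{option}' of dataset {name}.",
                    word=arg,
                    possibilities=valid_options[option],
                )
            )


valid = {"split": ("train", "test")}
check_options("mnist", valid, {"split": "test"})  # passes silently

try:
    check_options("mnist", valid, {"splid": "test"})
except ValueError as exc:
    assert "Did you mean 'split'" in str(exc)
```

A dataset that needs custom validation could then simply override the dedicated function instead of the whole option-handling loop.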

def to_datapipe(self, root: Union[str, pathlib.Path]) -> IterDataPipe:
    path = (pathlib.Path(root) / self.file_name).expanduser().resolve()
    # FIXME
    return FileLoader(IterableWrapper((str(path),)))
pmeier (Collaborator, Author) commented:
This is the greatest weak point right now. Functionality to download files and verify their checksums is present in torchdata, but it is not public yet.
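Until those torchdata helpers become public, checksum verification can be bridged with the standard library. This is a stopgap sketch, not the PR's code:

```python
import hashlib
import pathlib
import tempfile


def verify_sha256(path, expected_hex, chunk_size=1024 * 1024):
    """Stream the file through SHA-256 and compare against the expected digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # read in chunks so large archives do not need to fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_hex:
        raise RuntimeError(f"Checksum mismatch for {path}")
    return path


# Usage with a throwaway file:
tmp = pathlib.Path(tempfile.mkdtemp()) / "resource.bin"
tmp.write_bytes(b"hello")
expected = hashlib.sha256(b"hello").hexdigest()
assert verify_sha256(tmp, expected) == tmp
```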

@fmassa (Member) left a comment:

I made a few comments but they can be addressed in follow-up PRs.

Thanks!

@fmassa fmassa merged commit 9252436 into pytorch:main Sep 24, 2021
@pmeier pmeier deleted the datasets/api branch September 24, 2021 13:34
facebook-github-bot pushed a commit that referenced this pull request Sep 30, 2021
Summary:
* add API for new style datasets

* cleanup

Reviewed By: datumbox

Differential Revision: D31268026

fbshipit-source-id: 6bab56ddc0e3b18e1996300d8f472353daf76821

Co-authored-by: Francisco Massa <[email protected]>
husthyc added a commit that referenced this pull request Oct 22, 2021
* add API for new style datasets

* cleanup

Co-authored-by: Francisco Massa <[email protected]>

[ghstack-poisoned]
husthyc added a commit that referenced this pull request Oct 22, 2021
* add API for new style datasets

* cleanup

Co-authored-by: Francisco Massa <[email protected]>

[ghstack-poisoned]
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021
* add API for new style datasets

* cleanup

Co-authored-by: Francisco Massa <[email protected]>