add API for new style datasets #4473
```python
return f"{prefix}\n{body}\n{postfix}"

class DatasetConfig(Mapping):
```
This class looks a lot like overkill, but it is actually not:

- Since `lambda`'s are "not welcome" in datapipes that take a callable (a warning will be thrown), we often need to use `functools.partial(..., config=config)`. For that to work, `config` has to be hashable. Hence, we implement the `__hash__` and `__eq__` methods and fail in case `__setitem__` or `__delitem__` are invoked.
- For convenience, it is much nicer to use a `DatasetConfig` as a namespace than as a dictionary, so we additionally implement `__getattr__`, `__setattr__`, and `__delattr__`.
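A minimal sketch of the idea behind such a class (hypothetical and simplified; the actual `DatasetConfig` implementation differs): an immutable `Mapping` that is hashable and exposes its keys as attributes, so it can be passed through `functools.partial` instead of a lambda.

```python
from collections.abc import Mapping
import functools


class Config(Mapping):
    """Immutable, hashable mapping that also exposes its keys as attributes.

    Hypothetical sketch only, not the real DatasetConfig.
    """

    def __init__(self, **options):
        # Bypass our own __setattr__, which forbids mutation after construction.
        object.__setattr__(self, "_data", dict(options))

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __getattr__(self, name):
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        raise RuntimeError(f"'{type(self).__name__}' is immutable")

    def __delattr__(self, name):
        raise RuntimeError(f"'{type(self).__name__}' is immutable")

    def __hash__(self):
        # Immutability is what makes a stable hash possible at all.
        return hash(tuple(sorted(self._data.items())))

    def __eq__(self, other):
        return isinstance(other, Config) and self._data == other._data


def _get_split(sample, *, config):
    # A plain function instead of a lambda, so the partial stays picklable.
    return config.split


config = Config(split="train", year="2017")
get_split = functools.partial(_get_split, config=config)
```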
```python
for name, arg in options.items():
    if name not in self._valid_options:
        raise ValueError(
            add_suggestion(
                f"Unknown option '{name}' of dataset {self.name}.",
                word=name,
                possibilities=sorted(self._valid_options.keys()),
            )
        )

    valid_args = self._valid_options[name]

    if arg not in valid_args:
        raise ValueError(
            add_suggestion(
                f"Invalid argument '{arg}' for option '{name}' of dataset {self.name}.",
                word=arg,
                possibilities=valid_args,
            )
        )
```
We can move this to a dedicated function, so we have the ability to override the default option checking.
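A hedged sketch of that suggestion (hypothetical names; the real class has more machinery): pull the per-option validation into its own method so subclasses can override the default checking.

```python
class Dataset:
    # Hypothetical example values; in the real class these come from DatasetInfo.
    name = "example"
    _valid_options = {"split": ("train", "test")}

    def _check_option(self, name, arg):
        """Default option checking; subclasses may override this."""
        if name not in self._valid_options:
            raise ValueError(f"Unknown option '{name}' of dataset {self.name}.")
        if arg not in self._valid_options[name]:
            raise ValueError(
                f"Invalid argument '{arg}' for option '{name}' of dataset {self.name}."
            )

    def make_config(self, **options):
        # The loop from the diff above, now delegating to the overridable hook.
        for name, arg in options.items():
            self._check_option(name, arg)
        return options
```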
```python
def to_datapipe(self, root: Union[str, pathlib.Path]) -> IterDataPipe:
    path = (pathlib.Path(root) / self.file_name).expanduser().resolve()
    # FIXME
    return FileLoader(IterableWrapper((str(path),)))
```
This is the greatest weak point right now. Functionality to download files and verify their checksums is present in `torchdata`, but it is not public yet.
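Until that is public, a hedged sketch of the download-and-verify step (this assumes nothing about the eventual `torchdata` API; it only illustrates the behavior the `FIXME` above stands in for):

```python
import hashlib
import pathlib
import urllib.request


def download_and_verify(url: str, path: pathlib.Path, sha256: str) -> pathlib.Path:
    """Download ``url`` to ``path`` (if missing) and verify its SHA-256 checksum.

    Hypothetical helper, not part of torchdata or torchvision.
    """
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != sha256:
        raise RuntimeError(
            f"Checksum mismatch for {path}: expected {sha256}, got {digest}"
        )
    return path
```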
I made a few comments but they can be addressed in follow-up PRs.
Thanks!
Summary: * add API for new style datasets * cleanup Reviewed By: datumbox Differential Revision: D31268026 fbshipit-source-id: 6bab56ddc0e3b18e1996300d8f472353daf76821 Co-authored-by: Francisco Massa <[email protected]>
This PR is the second precursor for the new style datasets. The biggest change is that we are moving away from map-like datasets, i.e. with `__getitem__`, to iter-like datasets. This has a lot of advantages (and of course also some disadvantages) that I'm not going to list here, but rather in a post when all of this is officially announced.

## How do we use the new style datasets?

For users the biggest change is that datasets are no longer instantiated from specific classes. Instead, the `torchvision.prototype.datasets` API now has five top-level functions:

- `home()`: In contrast to the current behavior, the new style datasets will handle storing the data automatically. `home()` returns the root of the tree, which defaults to `~/.cache/torch/datasets/vision`. By passing a path to `home()`, the root can be moved to any other location.
- `list()`: Returns a list of all available datasets.
- `info(name)`: Returns static info about a dataset, like the name, homepage, categories, or the available options to load the dataset.
- `load(name, **kwargs)`: Returns the given dataset as a datapipe, in which each sample is a dictionary containing all the information.
- `register()`: This is used to register datasets (more on what this actually is later) with our API. Since `datasets.load()` is a thin convenience wrapper, we might also not expose this and only use it for builtin datasets (which are not part of this PR).

So the workflow that currently looks like this

will now look like

(note that the example cannot be run yet, because there are no built-in datasets in this PR)
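The before/after code blocks did not survive this copy of the page. As a hedged, self-contained sketch of the *shape* of the five-function API (everything here is a mock stand-in, not the actual `torchvision.prototype.datasets` implementation):

```python
import pathlib

# Mock registry and data root, standing in for the real module state.
_REGISTRY = {}
_HOME = pathlib.Path("~/.cache/torch/datasets/vision")


def home(root=None):
    """Return the data root, optionally moving it first."""
    global _HOME
    if root is not None:
        _HOME = pathlib.Path(root)
    return _HOME.expanduser()


def register(name, loader, **static_info):
    """Register a dataset under a name with its static info and loader."""
    _REGISTRY[name] = {"info": {"name": name, **static_info}, "loader": loader}


def list():  # shadows the builtin, as datasets.list() does by design
    return sorted(_REGISTRY)


def info(name):
    return _REGISTRY[name]["info"]


def load(name, **options):
    # Returns an iterable of sample dictionaries (a datapipe in the real API).
    return _REGISTRY[name]["loader"](**options)


# Toy dataset: each sample is a dictionary.
register(
    "toy",
    lambda split="train": ({"index": i, "split": split} for i in range(3)),
    homepage="https://example.com",
)
```

With that in place, the user-facing flow is `home(...)` to pick a root, `list()`/`info(name)` to discover datasets, and `load(name, split=...)` to iterate sample dictionaries.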
## How do we implement the new style datasets?

There are three closely connected objects that are all importable from `torchvision.prototype.datasets.utils`. In most cases only the first two will be used:

- `DatasetInfo`: Static information about the dataset. See `datasets.info()` above.
- `DatasetConfig`: Namespace containing the configuration of the dataset, for example which split to use.

The last object, `Dataset`, is the one that actually holds the information about how to load the dataset. It has three abstract methods that need to be overwritten:

- `info` (property): Returns the `DatasetInfo` with the static information about the dataset. This is for example used to instantiate the `DatasetConfig` from the additional parameters in `datasets.load()`.
- `resources`: Returns a list of `OnlineResource`s that need to be present to load the current `config`. They will be downloaded and checked automatically.
- `_make_datapipe`: The heart of the implementation. This method receives a list of datapipes, which correspond to the output of `resources`, the current config, as well as optionally a shuffler and decoder. The implementation is highly dataset specific, so examples for this will only be showcased in follow-up PRs that add some fundamental datasets.

cc @pmeier @mthrok
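Putting the three abstract methods together, a hedged skeleton of a dataset implementation (names like `MyDataset` are hypothetical, and the stand-in base class is much simpler than the real `torchvision.prototype.datasets.utils.Dataset`):

```python
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Stand-in for the real abstract base class described above."""

    @property
    @abstractmethod
    def info(self): ...

    @abstractmethod
    def resources(self, config): ...

    @abstractmethod
    def _make_datapipe(self, resource_dps, *, config): ...


class MyDataset(Dataset):
    @property
    def info(self):
        # Static information; stands in for a DatasetInfo instance.
        return {"name": "my-dataset", "valid_options": {"split": ("train", "test")}}

    def resources(self, config):
        # Stands in for a list of OnlineResource objects (URL plus checksum),
        # which would be downloaded and verified automatically.
        return [{"url": "https://example.com/archive.tar", "sha256": "..."}]

    def _make_datapipe(self, resource_dps, *, config):
        # Dataset-specific: turn the raw resource streams into sample dicts.
        return ({"data": item, "split": config["split"]} for item in resource_dps[0])


ds = MyDataset()
samples = [s for s in ds._make_datapipe([[b"a", b"b"]], config={"split": "train"})]
```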