
simplify OnlineResource.load #5990


Merged
12 commits merged into pytorch:main on May 17, 2022

Conversation

@pmeier (Collaborator) commented May 11, 2022

This simplifies the implementation of OnlineResource.load by quite a bit. The only functional difference is that preprocessing is now only executed after the download. In the admittedly niche situation that the user downloads the data manually, they now also need to preprocess it manually.

In addition, this PR adds a bunch of tests.
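For context, the simplified flow roughly corresponds to the sketch below. This is illustrative only, not the actual torchvision code: the `_loader` helper is a hypothetical stand-in, and the real implementation is in the diff of this PR. The point is that a file already present in `root` is used as-is, and the optional `preprocess` step only runs as part of `download()`.

import pathlib

# Rough sketch of the simplified OnlineResource.load (illustrative only).
def load(self, root, *, skip_integrity_check=False):
    root = pathlib.Path(root)
    path = root / self.file_name
    if not path.exists():
        # download() now also applies the optional preprocess step, so a
        # manually downloaded file has to be preprocessed manually as well.
        path = self.download(root, skip_integrity_check=skip_integrity_check)
    return self._loader(path)  # _loader: hypothetical dispatch to file/folder loading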

@pmeier (Collaborator, Author) commented May 11, 2022

> The only functional difference is that we now only execute the preprocessing after the download.

I was mistaken, and that also has some implications for us. The current dataset tests are set up in a way that they inject a mock of the "raw" downloaded file into the root directory of the dataset. Previously, the loading logic of the resources detected this at runtime and performed the preprocessing.

Without this, we get the failures seen in this CI run. The PCAM dataset expects gzip-decompressed resources

GDriveResource(file_name=file_name, id=gdrive_id, sha256=sha256, preprocess="decompress")

but with the implementation in this PR, it will get the compressed ones.

I see two ways out here:

  1. Reinstate the functionality to always perform the preprocessing if it is available and only the raw version of the file is detected.
  2. Change our test suite to provide the "downloaded" and preprocessed data rather than only the "downloaded" one.

@NicolasHug (Member) commented

Can we mock the download logic instead, and replace it with dataset_mock.prepare()?

@pmeier (Collaborator, Author) commented May 11, 2022

Yes, we can mock

def download(self, root: Union[str, pathlib.Path], *, skip_integrity_check: bool = False) -> pathlib.Path:

Still, that will require more changes in the test suite. Currently we use the following idiom:

def test_smoke(self, test_home, dataset_mock, config):
    dataset_mock.prepare(test_home, config)
    dataset = datasets.load(dataset_mock.name, **config)

If we want to patch prepare in, we have two options:

  1. Use the mocker fixture and pass it to prepare together with the test_home. The patch will automatically be reset for every run of the test.

  2. Use unittest.mock. The only way this automatically resets is to patch with a context manager. We could do something like

    with dataset_mock.prepare(test_home, config):
        dataset = datasets.load(dataset_mock.name, **config) 

Another option is to merge the two calls into one, like:

dataset, mock_info = dataset_mock.load(test_home, config)

I've checked: we always use the two calls together.
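A minimal sketch of what that merged helper could look like (illustrative only; it assumes prepare returns the mock info, and that datasets is the prototype datasets module imported in the test file):

from torchvision.prototype import datasets

class DatasetMock:
    def load(self, test_home, config):
        # prepare the mock data and immediately load the dataset built from it
        mock_info = self.prepare(test_home, config)
        dataset = datasets.load(self.name, **config)
        return dataset, mock_info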

Thoughts?

@pmeier (Collaborator, Author) left a comment

@NicolasHug in d627479 I added a PoC for merging the mock data preparation as proposed in #5990 (comment).

@NicolasHug (Member) left a comment

Thanks Philip, I left a few comments, LMK what you think

# `datasets.home()` is patched to a temporary directory through the autouse fixture `test_home` in
# test/test_prototype_builtin_datasets.py
root = pathlib.Path(datasets.home()) / self.name
root.mkdir(exist_ok=True)
mock_data_folder = root / "__mock__"
@NicolasHug (Member)

Nit: I think it's worth mentioning in its name that it's a tmp dir. Here's a suggestion below, but feel free to change it.

Suggested change:
- mock_data_folder = root / "__mock__"
+ tmp_mock_data_folder = root / "__mock__"

@NicolasHug (Member)

as we discussed offline, we could also use a more global home() / cache / folder where each sub-folder would contain the data for a specific dataset and config. Looking at our configs, it looks like they should be hashable fairly easily. But we should definitely leave this for another PR.
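For reference, one way such a per-config cache sub-folder could be derived, assuming the configs are plain dicts of JSON-serializable values (just a sketch, all names made up):

import hashlib
import json
import pathlib

def cache_folder(cache_root: pathlib.Path, dataset_name: str, config: dict) -> pathlib.Path:
    # hash the config into a stable key so every (dataset, config) pair gets its own sub-folder
    key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:16]
    return cache_root / dataset_name / key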

@pmeier (Collaborator, Author)

> as we discussed offline, we could also use a more global home() / cache / folder

We can, but it can't be within home(), since that changes with every call.

@NicolasHug (Member)

I assume we can make the test_home() fixture global for the test run?

@pmeier (Collaborator, Author) commented May 13, 2022

The changes needed in the test suite were factored out into #6010. That will need to be merged before we continue here.

Comment on lines +187 to +206
def test_preprocess_extract(self, tmp_path):
    files = None

    def download_fn(resource, root):
        nonlocal files
        archive, files = self._make_tar(root, name=resource.file_name)
        return archive

    resource = self.DummyResource(file_name="folder.tar", preprocess="extract", download_fn=download_fn)

    dp = resource.load(tmp_path)
    assert files is not None, "`download_fn()` was never called"
    assert isinstance(dp, FileOpener)

    actual = {path: buffer.read().decode() for path, buffer in dp}
    expected = {
        path.replace(resource.file_name, resource.file_name.split(".")[0]): content
        for path, content in files.items()
    }
    assert actual == expected
@NicolasHug (Member)

This is fairly complex TBH. The nonlocal logic, the fact that _make_tar returns a dict of files, etc.

IIUC the goal of this check is to make sure the file gets properly extracted. Surely there are simpler ways to assert that?

@pmeier (Collaborator, Author) commented May 16, 2022

Of course there are other ways, but I'm not sure they are easier. We can't create the archive upfront in tmp_path, because in that case we would never trigger the preprocessing. One option would be to create the data upfront in a temporary directory and move it to tmp_path inside the download function similar to what we are doing with the actual resource loading in our dataset tests (#6010).
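Roughly along these lines (just a sketch mirroring the test above; the fixture usage is illustrative):

import pathlib
import shutil

def test_preprocess_extract(self, tmp_path, tmp_path_factory):
    # create the archive upfront in a scratch directory ...
    scratch = tmp_path_factory.mktemp("scratch")
    archive, files = self._make_tar(scratch, name="folder.tar")

    def download_fn(resource, root):
        # ... and only move it into the dataset root when the resource "downloads" it,
        # so the preprocessing is still triggered
        return pathlib.Path(shutil.move(str(archive), str(root)))

    resource = self.DummyResource(file_name="folder.tar", preprocess="extract", download_fn=download_fn)
    dp = resource.load(tmp_path)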

Given that you were ok with nonlocal in #5990 (comment), I don't think it will be much simpler. You choose.

@NicolasHug (Member) left a comment

Thanks @pmeier , LGTM

@@ -32,7 +32,7 @@ def __init__(
         *,
         file_name: str,
         sha256: Optional[str] = None,
-        preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], pathlib.Path]]] = None,
+        preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], None]]] = None,
@NicolasHug (Member)

For my own education, why did we need to specify None in the annotation here? I assume Optional[] would have been enough - is it just to be more explicit about it?

@pmeier (Collaborator, Author) commented May 17, 2022

It becomes clearer when we let black explode the annotation:

Suggested change:
- preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], None]]] = None,
+ preprocess: Optional[
+     Union[
+         Literal["decompress", "extract"],
+         Callable[[pathlib.Path], None],
+     ]
+ ] = None,

I only changed the return type of the callable from pathlib.Path to None, since we refactored the loading logic so that the return value is no longer needed.

Still, in general you are right. Optional[Foo] is equivalent to Union[None, Foo]. Plus, Optional[Union[Foo, Bar]] can be flattened to Union[None, Foo, Bar]. It is just my personal preference to be explicit about Optional when it actually means an optional value; when None is just another valid value to pass, I prefer to merge it into the Union.
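A quick check of the equivalences mentioned above:

from typing import Optional, Union

# typing normalizes both spellings to the same object
assert Optional[int] == Union[int, None]
assert Optional[Union[int, str]] == Union[int, str, None]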

@pmeier pmeier merged commit b430ba6 into pytorch:main May 17, 2022
@pmeier pmeier deleted the simplify-resource-load branch May 17, 2022 13:45
facebook-github-bot pushed a commit that referenced this pull request Jun 1, 2022
Summary:
* simplify OnlineResource.load

* [PoC] merge mock data preparation and loading

* Revert "cache mock data based on config"

This reverts commit 5ed6eedef74865e0baa746a375d5ec1f0ab1bde7.

* Revert "[PoC] merge mock data preparation and loading"

This reverts commit d627479.

* remove preprocess returning a new path in favor of querying twice

* address test comments

* clarify comment

* mypy

* use builtin decompress utility

Reviewed By: NicolasHug

Differential Revision: D36760923

fbshipit-source-id: 1d3d30be96c3226fc181c4654208b2d3c6fdf7cb