simplify OnlineResource.load #5990
I was mistaken, and that also has some implications for us. The current datasets tests are set up in a way that they inject a mock of the "raw" downloaded file into the root directory of the dataset. Previously, the loading logic of the resources detected this at runtime and performed the preprocessing. Without this we get the failures as seen in this CI run. The PCAM dataset expects gzip-decompressed resources, but with the implementation in this PR, it will get the compressed ones. I see two ways out here:
Can we mock the
Yes, we can mock
Still, that will require more changes in the test suite. Currently we use the following idiom (vision/test/test_prototype_builtin_datasets.py, lines 57 to 60 at 40a0ab7):
If we want to patch in
Another option is to merge the two calls into one, like `dataset, mock_info = dataset_mock.load(test_home, config)`. I've checked: we always use the two calls together. Thoughts?
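A minimal sketch of what merging the two calls could look like. All names here (`DatasetMock`, `mock_data_fn`, `fake_mock_data_fn`) are illustrative stand-ins, not the real test-suite API:

```python
import pathlib
import tempfile


class DatasetMock:
    """Hypothetical merged mock: prepares mock data and loads in one call."""

    def __init__(self, name, mock_data_fn):
        self.name = name
        self.mock_data_fn = mock_data_fn  # injects the raw mock files

    def load(self, root, config):
        # Preparing the mock data and loading the dataset happen in a single
        # call, so the two steps can never be used separately.
        mock_info = self.mock_data_fn(pathlib.Path(root) / self.name, config)
        dataset = (self.name, config)  # stand-in for the real datasets.load()
        return dataset, mock_info


def fake_mock_data_fn(folder, config):
    # Inject a "raw" downloaded file into the dataset's root directory.
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "data.bin").write_bytes(b"\x00")
    return {"num_samples": 1}


test_home = tempfile.mkdtemp()
dataset, mock_info = DatasetMock("pcam", fake_mock_data_fn).load(
    test_home, {"split": "train"}
)
print(mock_info["num_samples"])  # -> 1
```

Because `load()` returns both objects, a test cannot accidentally load the dataset without the mock data having been prepared first.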
@NicolasHug in d627479 I added a PoC for merging the mock data preparation as proposed in #5990 (comment).
Thanks Philip, I left a few comments, LMK what you think
test/builtin_dataset_mocks.py
```python
# `datasets.home()` is patched to a temporary directory through the autouse fixture `test_home` in
# test/test_prototype_builtin_datasets.py
root = pathlib.Path(datasets.home()) / self.name
root.mkdir(exist_ok=True)
mock_data_folder = root / "__mock__"
```
Nit: I think it's worth mentioning that it's a tmp dir in its name. Here's a suggestion below but feel free to change
```diff
- mock_data_folder = root / "__mock__"
+ tmp_mock_data_folder = root / "__mock__"
```
As we discussed offline, we could also use a more global `home() / cache /` folder where each sub-folder would contain the data for a specific dataset and config. Looking at our configs, it looks like they should be hashable fairly easily. But we should definitely leave this for another PR.
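One way the per-config sub-folders could be keyed, assuming configs are plain dicts of JSON-serializable values. This is a sketch for the idea deferred to a later PR; `cache_dir_for` and the folder layout are made up for illustration:

```python
import hashlib
import json
import pathlib
import tempfile


def cache_dir_for(root: pathlib.Path, name: str, config: dict) -> pathlib.Path:
    # A stable JSON dump of the config works as a hash key; hashing keeps the
    # folder name short and filesystem-safe regardless of the config contents.
    key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:16]
    folder = root / name / key
    folder.mkdir(parents=True, exist_ok=True)
    return folder


root = pathlib.Path(tempfile.mkdtemp())
a = cache_dir_for(root, "pcam", {"split": "train"})
b = cache_dir_for(root, "pcam", {"split": "train"})
c = cache_dir_for(root, "pcam", {"split": "val"})
print(a == b, a == c)  # -> True False
```

The same dataset/config pair always maps to the same cache folder, so mock data prepared once could be reused across tests.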
> as we discussed offline, we could also use a more global `home() / cache /` folder

We can, but it can't be within `home()`, since that changes with every call.
I assume we can make the `test_home()` fixture global for the test run?
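A session-scoped version of the fixture might look like the following sketch. It assumes pytest's built-in `tmp_path_factory` fixture; the actual patching of `datasets.home()` is elided, since the real fixture lives in the test suite:

```python
import pytest


@pytest.fixture(scope="session")
def test_home(tmp_path_factory):
    # One temporary home directory for the whole test run instead of a fresh
    # one per test function, so prepared mock data survives between tests.
    home = tmp_path_factory.mktemp("home")
    # ... patch datasets.home() to return `home` here ...
    yield home
```

With `scope="session"`, pytest creates the directory once and injects the same path into every test that requests the fixture.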
This reverts commit 5ed6eedef74865e0baa746a375d5ec1f0ab1bde7.
This reverts commit d627479.
The changes needed for the test suite were factored out in #6010. That will need to be merged before we continue here.
```python
def test_preprocess_extract(self, tmp_path):
    files = None

    def download_fn(resource, root):
        nonlocal files
        archive, files = self._make_tar(root, name=resource.file_name)
        return archive

    resource = self.DummyResource(file_name="folder.tar", preprocess="extract", download_fn=download_fn)

    dp = resource.load(tmp_path)
    assert files is not None, "`download_fn()` was never called"
    assert isinstance(dp, FileOpener)

    actual = {path: buffer.read().decode() for path, buffer in dp}
    expected = {
        path.replace(resource.file_name, resource.file_name.split(".")[0]): content
        for path, content in files.items()
    }
    assert actual == expected
```
This is fairly complex TBH. The `nonlocal` logic, the fact that `_make_tar` returns a dict of files, etc. IIUC the goal of this check is to make sure the file gets properly extracted. Surely there are simpler ways to assert that?
Of course there are other ways, but I'm not sure they are easier. We can't create the archive upfront in `tmp_path`, because in that case we would never trigger the preprocessing. One option would be to create the data upfront in a temporary directory and move it to `tmp_path` inside the download function, similar to what we are doing with the actual resource loading in our dataset tests (#6010). Given that you were ok with `nonlocal` in #5990 (comment), I don't think it will be much simpler. You choose.
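The alternative described above, sketched out with made-up file names. The archive is built upfront in a staging directory and only moved into the test's root from inside the download function, so the root still looks empty when loading starts and the download plus preprocessing path is triggered:

```python
import pathlib
import shutil
import tarfile
import tempfile

# Build the archive upfront in a separate staging directory.
staging = pathlib.Path(tempfile.mkdtemp())
content_file = staging / "file.txt"
content_file.write_text("payload")
archive = staging / "folder.tar"
with tarfile.open(archive, "w") as tar:
    tar.add(content_file, arcname="folder/file.txt")


def download_fn(root: pathlib.Path) -> pathlib.Path:
    # Simulate the download by moving the pre-built archive into the root.
    return pathlib.Path(shutil.move(str(archive), str(root)))


tmp_path = pathlib.Path(tempfile.mkdtemp())
assert not any(tmp_path.iterdir())  # root is empty before "download"
downloaded = download_fn(tmp_path)
print(downloaded.name)  # -> folder.tar
```

This avoids the `nonlocal` bookkeeping, at the cost of a second temporary directory and an explicit move.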
Thanks @pmeier , LGTM
```diff
@@ -32,7 +32,7 @@ def __init__(
     *,
     file_name: str,
     sha256: Optional[str] = None,
-    preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], pathlib.Path]]] = None,
+    preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], None]]] = None,
```
For my own education, why did we need to specify `None` in the annotation here? I assume `Optional[]` would have been enough - is it just to be more explicit about it?
It becomes clearer when we let `black` explode the annotation:

```python
preprocess: Optional[
    Union[
        Literal["decompress", "extract"],
        Callable[[pathlib.Path], None],
    ]
] = None,
```
I only changed the return type of the callable from `pathlib.Path` to `None`, since we refactored the loading logic so that the return value is no longer needed.

Still, in general you are right. `Optional[Foo]` is equivalent to `Union[None, Foo]`. Plus, `Optional[Union[Foo, Bar]]` can be flattened to `Union[None, Foo, Bar]`. It is just my personal preference to be explicit about `Optional` in case it actually means an optional value. In case `None` is just another valid value to pass, I prefer to merge it into a `Union`.
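The flattening described above is something the `typing` module does automatically, which can be checked directly:

```python
from typing import Optional, Union

# Optional[X] is sugar for Union[X, None], and nested Unions are flattened,
# so both spellings below construct the exact same type object.
explicit = Union[int, str, None]
nested = Optional[Union[int, str]]

print(explicit == nested)  # -> True
```

So the choice between the two spellings is purely a readability convention; the type checker sees the same type either way.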
Summary:

* simplify OnlineResource.load
* [PoC] merge mock data preparation and loading
* Revert "cache mock data based on config" (reverts commit 5ed6eedef74865e0baa746a375d5ec1f0ab1bde7)
* Revert "[PoC] merge mock data preparation and loading" (reverts commit d627479)
* remove preprocess returning a new path in favor of querying twice
* address test comments
* clarify comment
* mypy
* use builtin decompress utility

Reviewed By: NicolasHug

Differential Revision: D36760923

fbshipit-source-id: 1d3d30be96c3226fc181c4654208b2d3c6fdf7cb
This simplifies the implementation of `OnlineResource.load` by quite a bit. The only functional difference is that we now only execute the preprocessing after the download. For the admittedly niche situation that the user downloads the data manually, they now also need to preprocess it manually.

In addition, this PR adds a bunch of tests.
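For a gzip-compressed resource, the manual preprocessing step a user would now be responsible for could be as simple as the following sketch (file names are made up; this mirrors what a "decompress" preprocess does, not the library's exact implementation):

```python
import gzip
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp())

# Pretend this is the manually downloaded, still-compressed resource.
compressed = root / "data.txt.gz"
with gzip.open(compressed, "wb") as f:
    f.write(b"mock data")

# Manual counterpart of decompress preprocessing: strip the ".gz" suffix
# and write out the decompressed bytes next to the original file.
decompressed = compressed.with_suffix("")
with gzip.open(compressed, "rb") as src:
    decompressed.write_bytes(src.read())

print(decompressed.read_text())  # -> mock data
```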