allow preprocessed mock data in prototype datasets tests #5706
Conversation
💊 CI failures summary and remediations: as of commit 9d54c15 (more details on the Dr. CI page), 💚 looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI.)
test/builtin_dataset_mocks.py (Outdated)
if extra_file_names:
    for extra in extra_file_names.copy():
        candidate_pattern = re.compile(fr"^{extra.split('.', 1)[0]}([.]\w+)*$")
        try:
            missing = next(missing for missing in missing_file_names if candidate_pattern.match(missing))
        except StopIteration:
            continue

        extra_file_names.remove(extra)
        missing_file_names.remove(missing)
Could you please explain what this does?
I've expanded the comment above the block. LMK if it is sufficient.
Sorry I still don't understand the regex (do we really need a regex?), nor why the raw data ends up in missing_file_names.
The problem we're having seems reasonably simple, yet the code to handle it looks fairly complex. Is there something we can do to simplify the logic here?
> why the raw data ends up in missing_file_names.

The required_file_names include the names of the resources defined on the dataset, while the available_file_names just glob the directory. So if the dataset defines OnlineResource(file_name="foo.tar"), but we only find a folder foo, then foo.tar will be in missing_file_names while foo will be in extra_file_names. This code block finds this correspondence and subsequently removes both from the sets.

> Is there something we can do to simplify the logic here?

Will try.
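To make the regex concrete, here is a minimal editorial sketch (not code from the PR): the pattern built from an extra entry such as a leftover folder foo matches that name with any chain of suffixes appended, which is how foo gets paired with foo.tar or foo.tar.gz.

import re

# Sketch of the matching above: the pattern built from the extra entry "foo"
# matches "foo", "foo.tar", "foo.tar.gz", ... but not "foobar" or "bar.tar".
extra = "foo"
candidate_pattern = re.compile(fr"^{extra.split('.', 1)[0]}([.]\w+)*$")

assert candidate_pattern.match("foo.tar")
assert candidate_pattern.match("foo.tar.gz")
assert not candidate_pattern.match("foobar")
assert not candidate_pattern.match("bar.tar")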
In the example you describe, foo.tar was removed during the "preprocessing" stage? Or just not created at all in the mock because we don't actually need it?
Then, is the special casing worth it?
Not sure what you mean here? Do you want a general solution on the dataset side? Because for the test suite, this PR is a general solution for all datasets that have a similar setup. Nothing is special about SBU here.
> this PR is a general solution for all datasets that have a similar setup

How many datasets have a similar setup? From above it seems like this is the only one. Hence I'm wondering whether having this custom logic, on which you and I have already spent a few hours, is really necessary.
Prototype for loading inside _make_dp is in lezwon#1.
> How many datasets have a similar setup? From above it seems like this is the only one.
Nope, at least Kinetics also has a similar setup:
vision/torchvision/datasets/kinetics.py, lines 181 to 190 in 095cabb:
with open(split_url_filepath) as file:
    list_video_urls = [urllib.parse.quote(line, safe="/,:") for line in file.read().splitlines()]
if self.num_download_workers == 1:
    for line in list_video_urls:
        download_and_extract_archive(line, tar_path, self.split_folder)
else:
    part = partial(_dl_wrap, tar_path, self.split_folder)
    poolproc = Pool(self.num_download_workers)
    poolproc.map(part, list_video_urls)
> Hence I'm wondering whether having this custom logic, on which you and I have already spent a few hours, is really necessary.
What is the alternative? Dropping support for the dataset? We might be able to argue that SBU is not important enough to "break our rules", but Kinetics certainly is.
test/builtin_dataset_mocks.py (Outdated)
for missing in missing_file_names.copy():
    extra_candidate = missing.split(".", 1)[0]
    if extra_candidate in extra_file_names:
        missing_file_names.remove(missing)
        extra_file_names.remove(extra_candidate)
Is this entire thing equivalent to this?
missing = set(missing.split(".")[0] for missing in missing_file_names)
not_actually_missing = missing & extra_file_names
missing -= not_actually_missing
extra_file_names -= not_actually_missing
# I guess we need to change `if missing_file_names` into `if missing` below
The logic behind it is correct, but missing -= not_actually_missing does not work. not_actually_missing contains the file names without suffixes, but missing has them with suffixes. Thus, we wouldn't remove anything here.
> but missing has them with suffixes

No, missing is missing_file_names without the suffixes (the name isn't great, I agree).
True, I missed that. In that case the error message is less expressive, since it will now say "foo is missing" although it should say "foo.tar is missing".
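For illustration only (this is not code from the PR; the names beyond the ones already used in the thread are made up): one way to keep the simplified set logic while still reporting the full file names would be to remember which full entry each stripped name came from.

# Editorial sketch, assuming missing_file_names and extra_file_names are sets of names.
# Map each suffix-stripped name back to its full entry so error messages can still
# report e.g. "foo.tar" instead of "foo" (assumes stripped names are unique).
stripped_to_full = {missing.split(".", 1)[0]: missing for missing in missing_file_names}
not_actually_missing = set(stripped_to_full) & extra_file_names

missing_file_names -= {stripped_to_full[name] for name in not_actually_missing}
extra_file_names -= not_actually_missing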
If what we're doing is really equivalent to the snippet above, then I can't make sense out of it.
A = available
R = required
M = missing = R - A
E = extra = A - R
So the intersection between M and E is null, right?
Correct. And that is why we perform this matching. For example:

A = {"foo", "bar"}
R = {"foo.zip", "spam.zip"}
M = R - A  # {"foo.zip", "spam.zip"}
E = A - R  # {"foo", "bar"}

missing = next(iter(M))  # e.g. "foo.zip"
extra_candidate = missing.split(".")[0]  # "foo"
extra_candidate in E  # True
M.remove(missing)  # M is now {"spam.zip"}
E.remove(extra_candidate)  # E is now {"bar"}
test/builtin_dataset_mocks.py (Outdated)
# This detects these cases and removes the corresponding entries from the sets since the files are neither extra
# nor missing.
This does not really explain where the missing.split(".", 1)[0] logic comes from. Is this check very specific to the SBU dataset?
This is just stripping all suffixes. So, no, this is not specific to SBU.
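As a quick editorial illustration (not from the thread), splitting on the first dot keeps only the stem and therefore drops every suffix at once:

# split(".", 1)[0] keeps everything before the first dot, i.e. strips all suffixes.
assert "foo.tar.gz".split(".", 1)[0] == "foo"
assert "foo".split(".", 1)[0] == "foo"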
This still assumes that the SBUCaptionedPhotoDataset.tar.gz contains the links to download the resources that will be put into the SBUCaptionedPhotoDataset folder, with a matching name, right?
The folder SBUCaptionedPhotoDataset is a superset of SBUCaptionedPhotoDataset.tar.gz. It contains all files the archive contains plus the downloaded images. So, yes, the URL list is also contained in the folder.
Currently, we enforce that the mock data functions for the prototype datasets tests generate exactly the files needed for the given config. Some datasets, however, perform non-standard preprocessing.
For example, the SBU archive (#5683 @lezwon) only provides a list of URLs that we need to download before we can start yielding samples. Since we cannot perform this download in the test suite (it is time consuming and would fail on Meta-internal systems), the test suite needs to also accept the preprocessed version of the file.
This PR adds this capability. It also enforces that no extra files are generated by the mock data function, to avoid problems like #5549.
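For readers who want the gist without digging through the diff, here is a condensed editorial sketch of the check discussed above. The function name and signature are hypothetical; the real logic lives in test/builtin_dataset_mocks.py and may differ in detail. It assumes required_file_names is a set of file names taken from the dataset's resources.

import pathlib

def check_mock_data_files(root, required_file_names):
    # Hypothetical condensed sketch of the check discussed in this PR.
    available_file_names = {path.name for path in pathlib.Path(root).glob("*")}
    missing_file_names = required_file_names - available_file_names
    extra_file_names = available_file_names - required_file_names

    # Pair a missing archive such as "foo.tar" with a preprocessed folder "foo"
    # and treat the pair as neither missing nor extra.
    for missing in missing_file_names.copy():
        extra_candidate = missing.split(".", 1)[0]
        if extra_candidate in extra_file_names:
            missing_file_names.remove(missing)
            extra_file_names.remove(extra_candidate)

    if missing_file_names:
        raise ValueError(f"Mock data is missing the files {sorted(missing_file_names)}")
    if extra_file_names:
        raise ValueError(f"Mock data contains the extra files {sorted(extra_file_names)}")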