allow preprocessed mock data in prototype datasets tests #5706

Open · pmeier wants to merge 8 commits into main
Conversation

@pmeier (Collaborator) commented Mar 30, 2022

Currently, we enforce that the mock data functions for the prototype datasets tests generate exactly the files needed for the given config. Some datasets, however, perform non-standard preprocessing.

For example, the SBU archive (#5683 @lezwon) only provides a list of URLs that we need to download ourselves before we can start yielding samples. Since we cannot perform this download in the test suite (it is time-consuming and would fail on Meta-internal systems), the test suite needs to also accept the preprocessed version of the file.

This PR adds that capability. In addition, it enforces that the mock data function generates no extra files, to avoid problems like #5549.
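
In rough terms, the check compares two sets of file names. A minimal sketch of the idea (not the exact test-suite code; the names required_file_names and available_file_names follow the hunks discussed below, and check_mock_files is a hypothetical helper):

import pathlib

def check_mock_files(root, required_file_names):
    # Names of the files and folders the mock data function actually generated.
    available_file_names = {path.name for path in pathlib.Path(root).iterdir()}

    missing_file_names = required_file_names - available_file_names
    extra_file_names = available_file_names - required_file_names

    # A preprocessed resource shows up as a "missing" archive plus an "extra"
    # folder with the same stem, e.g. required "foo.tar" vs. available "foo".
    # Such pairs are neither missing nor extra, so drop them from both sets.
    for missing in missing_file_names.copy():
        extra_candidate = missing.split(".", 1)[0]
        if extra_candidate in extra_file_names:
            missing_file_names.remove(missing)
            extra_file_names.remove(extra_candidate)

    if missing_file_names:
        raise AssertionError(f"The mock data is missing the files {sorted(missing_file_names)}")
    if extra_file_names:
        raise AssertionError(f"The mock data generated the extra files {sorted(extra_file_names)}")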

@facebook-github-bot commented Mar 30, 2022

💊 CI failures summary and remediations

As of commit 9d54c15 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


@pmeier mentioned this pull request on Mar 31, 2022
@lezwon (Contributor) commented Apr 3, 2022

@pmeier any update on when this can be merged? I'd like to continue on #5683 as soon as it's done.

@NicolasHug (Member) commented:

> the SBU archive (#5683 @lezwon) only provides a list of URLs that we need to download ourselves before we can start yielding samples

Does that mean that most of the downloading happens in __iter__()?

@pmeier (Collaborator, author) commented Apr 4, 2022

> Does that mean that most of the downloading happens in __iter__()?

Nope, the preprocessing happens before the resource is loaded, the same way as with preprocess="decompress". This changes nothing about the paradigm of downloading everything upfront.
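
For illustration, a hedged sketch of what that looks like on the dataset side. The import path and the preprocess parameter follow the prototype API around the time of this PR and may differ; _download_images is a hypothetical helper and the URL is a placeholder:

from torchvision.prototype.datasets.utils import HttpResource

def _download_images(path):
    # Hypothetical preprocess step: read the URL list contained in `path` and
    # download every image. Like "decompress", this runs once after the archive
    # is downloaded and before the resource is loaded, not inside __iter__().
    ...

resource = HttpResource(
    "https://example.com/SBUCaptionedPhotoDataset.tar.gz",  # placeholder URL
    preprocess=_download_images,
)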

Comment on lines 80 to 89
if extra_file_names:
    for extra in extra_file_names.copy():
        candidate_pattern = re.compile(fr"^{extra.split('.', 1)[0]}([.]\w+)*$")
        try:
            missing = next(missing for missing in missing_file_names if candidate_pattern.match(missing))
        except StopIteration:
            continue

        extra_file_names.remove(extra)
        missing_file_names.remove(missing)
@NicolasHug (Member):

Could you please explain what this does?

@pmeier (Collaborator, author):

I've expanded the comment above the block. LMK if it is sufficient.

@NicolasHug (Member) commented Apr 4, 2022:

Sorry, I still don't understand the regex (do we really need a regex?), nor why the raw data ends up in missing_file_names.

The problem we're having seems reasonably simple, yet the code to handle it looks fairly complex. Is there something we can do to simplify the logic here?

@pmeier (Collaborator, author):

> why the raw data ends up in missing_file_names.

required_file_names contains the names of the resources defined on the dataset, while available_file_names is just a glob of the directory. So if the dataset defines OnlineResource(file_name="foo.tar"), but we only find a folder foo, foo.tar will be in missing_file_names while foo will be in extra_file_names. This code block finds such correspondences and removes both entries from the sets (see the worked example further down).

> Is there something we can do to simplify the logic here?

Will try.

@NicolasHug (Member):

In the example you describe, foo.tar was removed during the "preprocessing" stage? Or just not created at all in the mock because we don't actually need it?

@NicolasHug (Member):

Then, is the special casing worth it?

@pmeier (Collaborator, author):

Not sure what you mean here. Do you want a general solution on the dataset side? Because for the test suite, this PR is a general solution for all datasets that have a similar setup. Nothing is special about SBU here.

@NicolasHug (Member):

> this PR is a general solution for all datasets that have a similar setup

How many datasets have a similar setup? From above it seems like this is the only one. Hence I'm wondering whether this custom logic, on which you and I have already spent a few hours, is really necessary.

@pmeier (Collaborator, author):

A prototype for loading inside _make_dp is in lezwon#1.

@pmeier (Collaborator, author):

> How many datasets have a similar setup? From above it seems like this is the only one.

Nope, at least Kinetics also has a similar setup:

with open(split_url_filepath) as file:
    list_video_urls = [urllib.parse.quote(line, safe="/,:") for line in file.read().splitlines()]
if self.num_download_workers == 1:
    for line in list_video_urls:
        download_and_extract_archive(line, tar_path, self.split_folder)
else:
    part = partial(_dl_wrap, tar_path, self.split_folder)
    poolproc = Pool(self.num_download_workers)
    poolproc.map(part, list_video_urls)

> Hence I'm wondering whether this custom logic, on which you and I have already spent a few hours, is really necessary.

What is the alternative? Dropping support for the dataset? We might be able to argue that SBU is not important enough to "break our rules", but Kinetics certainly is.

@pmeier requested a review from @NicolasHug · April 4, 2022 10:10
Comment on lines 81 to 85
for missing in missing_file_names.copy():
    extra_candidate = missing.split(".", 1)[0]
    if extra_candidate in extra_file_names:
        missing_file_names.remove(missing)
        extra_file_names.remove(extra_candidate)
@NicolasHug (Member):

Is this entire thing equivalent to this?

missing = set(missing.split(".")[0] for missing in missing_file_names)
not_actually_missing = missing & extra_file_names

missing -= not_actually_missing
extra_file_names -= not_actually_missing

# I guess we need to change `if missing_file_names` into `if missing` below

@pmeier (Collaborator, author):

The logic behind it is correct, but missing -= not_actually_missing does not work. not_actually_missing contains the file names without suffixes, but missing has them with suffixes. Thus, we wouldn't remove anything here.

@NicolasHug (Member) commented Apr 4, 2022:

> but missing has them with suffixes

No, missing is missing_file_names without the suffixes (the name isn't great, I agree)

@pmeier (Collaborator, author):

True, I missed that. In that case the error message is less expressive, since it will now say "foo" is missing although it should say "foo.tar" is missing.
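
A variant that keeps the set arithmetic but preserves the full names for the error message could map each stem back to its original entry. A minimal sketch, assuming the stems of the missing files are unique:

# Map each missing file's stem back to its full name, e.g. "foo" -> "foo.tar".
stem_to_missing = {missing.split(".", 1)[0]: missing for missing in missing_file_names}
not_actually_missing = stem_to_missing.keys() & extra_file_names

missing_file_names -= {stem_to_missing[stem] for stem in not_actually_missing}
extra_file_names -= not_actually_missing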

@NicolasHug (Member):

If what we're doing is really equivalent to the snippet above, then I can't make sense out of it.

A = available
R = required
M = missing = R - A
E = extra = A - R

So the intersection between M and E is null, right?

@pmeier (Collaborator, author):

Correct. And that is why we perform this matching. For example:

A = {"foo", "bar"}
R = {"foo.zip", "spam.zip"}
M = R - A  # {"foo.zip", "spam.zip"}
E = A - R  # {"foo", "bar"}

missing = next(iter(M))  # e.g. "foo.zip"
extra_candidate = missing.split(".", 1)[0]  # "foo"
extra_candidate in E  # True
M.remove(missing)  # M == {"spam.zip"}
E.remove(extra_candidate)  # E == {"bar"}

Comment on lines 78 to 79
# This detects these cases and removes the corresponding entries from the sets since the files are neither extra
# nor missing.
@NicolasHug (Member):

This does not really explain where the missing.split(".", 1)[0] logic comes from. Is this check very specific to the SBU dataset?

@pmeier (Collaborator, author):

This is just stripping all suffixes. So, no, this is not specific to SBU.
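
For instance, splitting on the first dot reduces any chain of suffixes to the bare stem:

"SBUCaptionedPhotoDataset.tar.gz".split(".", 1)[0]  # "SBUCaptionedPhotoDataset"
"foo.tar".split(".", 1)[0]  # "foo"
"foo".split(".", 1)[0]  # "foo" (a name without suffixes is unchanged)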

@NicolasHug (Member):

This still assumes that SBUCaptionedPhotoDataset.tar.gz contains the links to the resources to download, and that these end up in the SBUCaptionedPhotoDataset folder with a matching name, right?

@pmeier (Collaborator, author):

The folder SBUCaptionedPhotoDataset is a superset of SBUCaptionedPhotoDataset.tar.gz. It contains all files the archive contains plus the downloaded images. So, yes, the URL list is also contained in the folder.
