allow preprocessed mock data in prototype datasets tests #5706

Open · pmeier wants to merge 8 commits into main
Conversation

@pmeier (Collaborator) commented Mar 30, 2022

Currently, we enforce that the mock data functions for the prototype datasets tests generate exactly the files needed for the given config. Some datasets, however, perform non-standard preprocessing.

For example, the SBU archive (#5683 @lezwon) only provides a list of URLs that we need to download ourselves before we can start yielding samples. Since we cannot perform this download in the test suite (it is time-consuming and would fail on Meta-internal systems), the test suite needs to also accept the preprocessed version of the file.

This PR adds that capability. In addition, it enforces that the mock data function generates no extra files, to avoid problems like #5549.
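
In rough terms, the check compares two sets of file names. A minimal sketch of the idea (not the exact test-suite code; the names required_file_names and available_file_names follow the hunks discussed below, and check_mock_files is a hypothetical helper):

import pathlib

def check_mock_files(root, required_file_names):
    # Names of the files and folders the mock data function actually generated.
    available_file_names = {path.name for path in pathlib.Path(root).iterdir()}

    missing_file_names = required_file_names - available_file_names
    extra_file_names = available_file_names - required_file_names

    # A preprocessed resource shows up as a "missing" archive plus an "extra"
    # folder with the same stem, e.g. required "foo.tar" vs. available "foo".
    # Such pairs are neither missing nor extra, so drop them from both sets.
    for missing in missing_file_names.copy():
        extra_candidate = missing.split(".", 1)[0]
        if extra_candidate in extra_file_names:
            missing_file_names.remove(missing)
            extra_file_names.remove(extra_candidate)

    if missing_file_names:
        raise AssertionError(f"The mock data is missing the files {sorted(missing_file_names)}")
    if extra_file_names:
        raise AssertionError(f"The mock data generated the extra files {sorted(extra_file_names)}")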

@facebook-github-bot commented Mar 30, 2022

💊 CI failures summary and remediations

As of commit 9d54c15 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


@pmeier mentioned this pull request on Mar 31, 2022
@lezwon (Contributor) commented Apr 3, 2022

@pmeier any update on when this can be merged? I'd like to continue on #5683 as soon as it's done.

@NicolasHug (Member) commented:

> the SBU archive (#5683 @lezwon) only provides a list of URLs that we need to download ourselves before we can start yielding samples

Does that mean that most of the downloading happens in __iter__()?

@pmeier (Collaborator, author) commented Apr 4, 2022

> Does that mean that most of the downloading happens in __iter__()?

Nope, the preprocessing happens before the resource is loaded, the same way as with preprocess="decompress". This changes nothing about the paradigm of downloading everything upfront.
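
For illustration, a hedged sketch of what that looks like on the dataset side. The import path and the preprocess parameter follow the prototype API around the time of this PR and may differ; _download_images is a hypothetical helper and the URL is a placeholder:

from torchvision.prototype.datasets.utils import HttpResource

def _download_images(path):
    # Hypothetical preprocess step: read the URL list contained in `path` and
    # download every image. Like "decompress", this runs once after the archive
    # is downloaded and before the resource is loaded, not inside __iter__().
    ...

resource = HttpResource(
    "https://example.com/SBUCaptionedPhotoDataset.tar.gz",  # placeholder URL
    preprocess=_download_images,
)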

Comment on lines 80 to 89
if extra_file_names:
    for extra in extra_file_names.copy():
        candidate_pattern = re.compile(fr"^{extra.split('.', 1)[0]}([.]\w+)*$")
        try:
            missing = next(missing for missing in missing_file_names if candidate_pattern.match(missing))
        except StopIteration:
            continue

        extra_file_names.remove(extra)
        missing_file_names.remove(missing)
@NicolasHug (Member):

Could you please explain what this does?

@pmeier (Collaborator, author):

I've expanded the comment above the block. LMK if it is sufficient.

@NicolasHug (Member) commented Apr 4, 2022:

Sorry, I still don't understand the regex (do we really need a regex?), nor why the raw data ends up in missing_file_names.

The problem we're having seems reasonably simple, yet the code to handle it looks fairly complex. Is there something we can do to simplify the logic here?

@pmeier (Collaborator, author):

> why the raw data ends up in missing_file_names.

required_file_names contains the names of the resources defined on the dataset, while available_file_names is just a glob of the directory. So if the dataset defines OnlineResource(file_name="foo.tar"), but we only find a folder foo, foo.tar will be in missing_file_names while foo will be in extra_file_names. This code block finds such correspondences and removes both entries from the sets (see the worked example further down).

> Is there something we can do to simplify the logic here?

Will try.

@NicolasHug (Member):

In the example you describe, foo.tar was removed during the "preprocessing" stage? Or just not created at all in the mock because we don't actually need it?

@NicolasHug (Member):

Then, is the special casing worth it?

@pmeier (Collaborator, author):

Not sure what you mean here. Do you want a general solution on the dataset side? Because for the test suite, this PR is a general solution for all datasets that have a similar setup. Nothing is special about SBU here.

@NicolasHug (Member):

> this PR is a general solution for all datasets that have a similar setup

How many datasets have a similar setup? From above it seems like this is the only one. Hence I'm wondering whether this custom logic, on which you and I have already spent a few hours, is really necessary.

@pmeier (Collaborator, author):

A prototype for loading inside _make_dp is in lezwon#1.

@pmeier (Collaborator, author):

> How many datasets have a similar setup? From above it seems like this is the only one.

Nope, at least Kinetics also has a similar setup:

with open(split_url_filepath) as file:
    list_video_urls = [urllib.parse.quote(line, safe="/,:") for line in file.read().splitlines()]
if self.num_download_workers == 1:
    for line in list_video_urls:
        download_and_extract_archive(line, tar_path, self.split_folder)
else:
    part = partial(_dl_wrap, tar_path, self.split_folder)
    poolproc = Pool(self.num_download_workers)
    poolproc.map(part, list_video_urls)

> Hence I'm wondering whether this custom logic, on which you and I have already spent a few hours, is really necessary.

What is the alternative? Dropping support for the dataset? We might be able to argue that SBU is not important enough to "break our rules", but Kinetics certainly is.

@pmeier requested a review from @NicolasHug · April 4, 2022 10:10
Comment on lines 81 to 85
for missing in missing_file_names.copy():
    extra_candidate = missing.split(".", 1)[0]
    if extra_candidate in extra_file_names:
        missing_file_names.remove(missing)
        extra_file_names.remove(extra_candidate)
@NicolasHug (Member):

Is this entire thing equivalent to this?

missing = set(missing.split(".")[0] for missing in missing_file_names)
not_actually_missing = missing & extra_file_names

missing -= not_actually_missing
extra_file_names -= not_actually_missing

# I guess we need to change `if missing_file_names` into `if missing` below

@pmeier (Collaborator, author):

The logic behind it is correct, but missing -= not_actually_missing does not work. not_actually_missing contains the file names without suffixes, but missing has them with suffixes. Thus, we wouldn't remove anything here.

@NicolasHug (Member) commented Apr 4, 2022:

> but missing has them with suffixes

No, missing is missing_file_names without the suffixes (the name isn't great, I agree)

@pmeier (Collaborator, author):

True, I missed that. In that case the error message is less expressive, since it will now say "foo" is missing although it should say "foo.tar" is missing.
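
A variant that keeps the set arithmetic but preserves the full names for the error message could map each stem back to its original entry. A minimal sketch, assuming the stems of the missing files are unique:

# Map each missing file's stem back to its full name, e.g. "foo" -> "foo.tar".
stem_to_missing = {missing.split(".", 1)[0]: missing for missing in missing_file_names}
not_actually_missing = stem_to_missing.keys() & extra_file_names

missing_file_names -= {stem_to_missing[stem] for stem in not_actually_missing}
extra_file_names -= not_actually_missing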

@NicolasHug (Member):

If what we're doing is really equivalent to the snippet above, then I can't make sense out of it.

A = available
R = required
M = missing = R - A
E = extra = A - R

So the intersection between M and E is null, right?

@pmeier (Collaborator, author):

Correct. And that is why we perform this matching. For example:

A = {"foo", "bar"}
R = {"foo.zip", "spam.zip"}
M = R - A  # {"foo.zip", "spam.zip"}
E = A - R  # {"foo", "bar"}

missing = next(iter(M))  # e.g. "foo.zip"
extra_candidate = missing.split(".", 1)[0]  # "foo"
extra_candidate in E  # True
M.remove(missing)  # M == {"spam.zip"}
E.remove(extra_candidate)  # E == {"bar"}

Comment on lines 78 to 79
# This detects these cases and removes the corresponding entries from the sets since the files are neither extra
# nor missing.
@NicolasHug (Member):

This does not really explain where the missing.split(".", 1)[0] logic comes from. Is this check very specific to the SBU dataset?

@pmeier (Collaborator, author):

This is just stripping all suffixes. So, no, this is not specific to SBU.
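
For instance, splitting on the first dot reduces any chain of suffixes to the bare stem:

"SBUCaptionedPhotoDataset.tar.gz".split(".", 1)[0]  # "SBUCaptionedPhotoDataset"
"foo.tar".split(".", 1)[0]  # "foo"
"foo".split(".", 1)[0]  # "foo" (a name without suffixes is unchanged)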

@NicolasHug (Member):

This still assumes that SBUCaptionedPhotoDataset.tar.gz contains the links to the resources to download, and that these end up in the SBUCaptionedPhotoDataset folder with a matching name, right?

@pmeier (Collaborator, author):

The folder SBUCaptionedPhotoDataset is a superset of SBUCaptionedPhotoDataset.tar.gz. It contains all files the archive contains plus the downloaded images. So, yes, the URL list is also contained in the folder.
