
Port SBU dataset #5683


Status: Open. Wants to merge 12 commits from the 5349_sbu_dataset branch into main.

Conversation

@lezwon (Contributor) commented Mar 26, 2022

fixes #5349

@facebook-github-bot commented Mar 26, 2022

💊 CI failures summary and remediations

As of commit 91bd94f (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build binary_linux_wheel_py3.7_rocm4.3.1 (1/1)

Step: "packaging/build_wheel.sh" (full log | diagnosis details | 🔁 rerun)


     5. Configure the failing repository to be skipped, if it is unavailable.
        Note that yum will try to contact the repo when it runs most commands,
        so it will have to try and fail each time (and thus yum will be much
        slower). If it is a very temporary problem though, this is often a nice
        compromise:

            yum-config-manager --save --setopt=ius.skip_if_unavailable=true

failure: repodata/repomd.xml from ius: [Errno 256] No more mirrors to try.
https://repo.ius.io/7/x86_64/repodata/repomd.xml: [Errno 14] curl#35 - "Peer reports it experienced an internal error."


Exited with code exit status 1


This comment was automatically generated by Dr. CI.

@lezwon (Contributor, Author) commented Mar 26, 2022

@pmeier mind checking this PR and letting me know if I'm headed in the right direction?

@pmeier self-requested a review March 27, 2022 17:31
@pmeier (Collaborator) left a comment

Argh, I knew there was a reason I marked SBU in the tracker issue. When I gave you the go, I just saw that there was something to download and assumed I was mistaken before. That is on me; my bad.

In general your solution looks really good and works, but it has one major downside: we need to re-download every image on every iteration. While we plan to support streaming datasets from the internet, we are not there yet. Thus, we need to download everything.

That could be achieved with an OnDiskCacheHolder, but that would mean we would only download at runtime. All current datasets download everything upfront, and I would keep it that way for now.

My solution is to put a custom preprocess method onto the resource and download everything there.
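
For illustration, a minimal sketch of that idea. The exact resource API (in particular how a preprocess callable is attached to the SBU resource) and the file names are assumptions here, not the code in this PR; the point is that all photos get fetched once, upfront, so iteration only reads from disk.

import pathlib
import urllib.request


def download_sbu_images(root: pathlib.Path) -> None:
    # Download every photo listed in the extracted URL file into root/images,
    # skipping files that already exist so a partial run can be resumed.
    # The URL file name follows the layout of the SBU archive; adjust if it differs.
    image_dir = root / "images"
    image_dir.mkdir(exist_ok=True)
    with open(root / "SBU_captioned_photo_dataset_urls.txt") as fh:
        for line in fh:
            url = line.strip()
            if not url:
                continue
            target = image_dir / url.rsplit("/", 1)[-1]
            if not target.exists():
                urllib.request.urlretrieve(url, target)

A callable like this could serve as the custom preprocess step on the resource, mirroring how the other datasets download everything before the datapipe is built.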

@lezwon marked this pull request as a draft March 29, 2022 02:07
@lezwon marked this pull request as ready for review March 29, 2022 07:44
@pmeier self-requested a review March 29, 2022 08:13
@lezwon changed the title from "[WIP] Port SBU dataset" to "Port SBU dataset" on Mar 29, 2022
@pmeier (Collaborator) left a comment

Looking good, thanks @lezwon! I have one simplification comment and one larger change for the mock data generation.

with open(dataset_folder.joinpath(photo_urls_file), "w") as url_file, open(
    dataset_folder.joinpath(photo_captions_file), "w"
) as caption_file:
    urls = [f"https://via.placeholder.com/{random.randint(100, 1000)}.jpg" for _ in range(num_samples)]
@pmeier (Collaborator) commented Mar 30, 2022

This is a really cool idea and I'm definitely going to use this website for other things in the future 🚀 Unfortunately, we cannot have an actual download during mock data generation for two reasons:

  1. Downloading these images takes quite some time and we want the tests to be fast.
  2. Meta's internal test systems do not have access to the internet and would thus fail here.

I propose I send a patch for the test suite that also allows us to generate only the already preprocessed files. That way, we only add an SBUCaptionedPhotoDataset archive that already includes the test images. I'll ping you on the PR.
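
For illustration, a rough sketch of what such network-free mock data generation could look like. The folder and file names are illustrative, not the final test-suite API; the point is that tiny images are created locally with PIL instead of being downloaded.

import pathlib
import random

from PIL import Image


def make_sbu_mock_data(root: pathlib.Path, num_samples: int = 3) -> None:
    # Lay out a tiny fake SBU folder: a handful of locally generated images
    # plus matching URL and caption files, with no network access involved.
    dataset_folder = root / "dataset"
    image_folder = dataset_folder / "images"
    image_folder.mkdir(parents=True, exist_ok=True)

    file_names = [f"mock_image_{idx}.jpg" for idx in range(num_samples)]
    with open(dataset_folder / "photo_urls.txt", "w") as url_file, open(
        dataset_folder / "photo_captions.txt", "w"
    ) as caption_file:
        for name in file_names:
            # A 1x1 RGB image with a random gray value is enough for dataset tests.
            Image.new("RGB", (1, 1), color=(random.randint(0, 255),) * 3).save(image_folder / name)
            url_file.write(f"https://example.com/{name}\n")
            caption_file.write(f"A mock caption for {name}\n")

This keeps the tests fast and works on machines without internet access, which addresses both concerns above.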

@pmeier (Collaborator) commented:

See #5706.

@lezwon (Contributor, Author) commented:

@pmeier I'll wait for that PR to get merged, right? I can make the necessary changes after it.

@pmeier (Collaborator) commented:

Yes, sorry for the delay. I'll try to get it merged soon.

@lezwon force-pushed the 5349_sbu_dataset branch from 34b9775 to 91bd94f on March 31, 2022 06:35
Successfully merging this pull request may close these issues: SBU. 4 participants.