Port SBU dataset #5683
Conversation
💊 CI failures summary and remediations

As of commit 91bd94f (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
@pmeier mind checking this PR and letting me know if I'm headed in the right direction?
Argh, I knew there was a reason I marked SBU in the tracker issue. When I gave you the go, I just saw that there was something to download and assumed I was mistaken before. That is on me, my bad.
In general your solution looks really good and works, but it has one major downside: we need to re-download every image on every iteration. While we plan to support streaming datasets from the internet, we are not there yet. Thus, we need to download everything. That could be achieved with an OnDiskCacheHolder, but that would mean we would only download at runtime. All current datasets download everything upfront, and I would keep it that way for now.
My solution is to put a custom preprocess method onto the resource and download everything there.
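To illustrate the suggestion above, here is a minimal sketch of a resource whose preprocess step downloads every listed photo up front, so iterating over the dataset never touches the network. The class and file names (`SBUResource`, the URL-list layout) are illustrative assumptions, not torchvision's actual prototype API.

```python
import pathlib
import urllib.request


class SBUResource:
    """Hypothetical resource with a custom ``preprocess`` that downloads
    all photos upfront, mirroring the suggestion in the review comment."""

    def __init__(self, url_file: str) -> None:
        self.url_file = url_file

    def preprocess(self, root: pathlib.Path) -> None:
        images_dir = root / "images"
        images_dir.mkdir(parents=True, exist_ok=True)
        for line in pathlib.Path(self.url_file).read_text().splitlines():
            url = line.strip()
            if not url:
                continue
            target = images_dir / url.rsplit("/", 1)[-1]
            if not target.exists():  # skip images already on disk
                urllib.request.urlretrieve(url, str(target))
```

Because the download happens once in `preprocess`, a rerun is a no-op for files already on disk, which matches the "download everything upfront" behavior of the other datasets.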
Looking good, thanks @lezwon! I have one simplification comment and one larger change for the mock data generation.
with open(dataset_folder.joinpath(photo_urls_file), "w") as url_file, open(
    dataset_folder.joinpath(photo_captions_file), "w"
) as caption_file:
    urls = [f"https://via.placeholder.com/{random.randint(100, 1000)}.jpg" for _ in range(num_samples)]
This is a really cool idea and I'm definitely going to use this website for other things in the future 🚀 Unfortunately, we cannot have an actual download during mock data generation for two reasons:
- Downloading these images takes quite some time, and we want the tests to be fast.
- Meta's internal test systems do not have access to the internet and would thus fail here.
I propose I send a patch for the test suite that allows us to generate only the already-preprocessed files. Thus, we only add an SBUCaptionedPhotoDataset that already includes test images. I'll ping you on the PR.
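The proposal above amounts to writing the "already preprocessed" files straight to disk instead of fetching placeholder images. A dependency-free sketch follows; the file names (`photo_urls.txt`, `photo_captions.txt`) and the fake-JPEG bytes are illustrative assumptions (real torchvision mock data would use PIL to emit valid images), not the actual patch from #5706.

```python
import pathlib


def generate_mock_sbu(root: pathlib.Path, num_samples: int = 3) -> None:
    """Hypothetical mock-data generator: writes images, a URL list, and a
    caption list directly, so no network access is needed during tests."""
    images_dir = root / "images"
    images_dir.mkdir(parents=True, exist_ok=True)
    urls, captions = [], []
    for idx in range(num_samples):
        name = f"{idx:04d}.jpg"
        # Dummy bytes standing in for a real JPEG; enough for file-level tests.
        (images_dir / name).write_bytes(b"\xff\xd8\xff" + bytes([idx]))
        urls.append(f"https://example.com/{name}")  # never actually fetched
        captions.append(f"mock caption {idx}")
    (root / "photo_urls.txt").write_text("\n".join(urls) + "\n")
    (root / "photo_captions.txt").write_text("\n".join(captions) + "\n")
```

Since everything is generated locally and deterministically, the tests stay fast and also run on machines without internet access.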
See #5706.
@pmeier I'll wait for the PR to get merged, right? I can make the necessary changes after it.
Yes, sorry for the delay. I'll try to get it merged soon.
Co-authored-by: Philip Meier <[email protected]>
Force-pushed from 34b9775 to 91bd94f
Co-authored-by: Nicolas Hug <[email protected]>
fixes #5349