
Provide the meta.bin file of the ImageNet dataset together with torchvision? #1647

Open
pmeier opened this issue Dec 7, 2019 · 5 comments

@pmeier
Collaborator

pmeier commented Dec 7, 2019

@fmassa In light of the recent problems with the meta.bin file of the ImageNet dataset (#1645, #1646), I think it is reasonable to ask whether we can provide it together with torchvision. This would be especially helpful now that there are no official download links for the archives: users who switch to torchvision and only have the image archives would not need to download the devkit.

@pmeier changed the title from "Provide the mea.bin file of the ImageNet dataset together with torchvision?" to "Provide the meta.bin file of the ImageNet dataset together with torchvision?" on Dec 7, 2019
@fmassa
Member

fmassa commented Dec 10, 2019

I'm not sure if we can distribute the meta.bin ourselves.

But there might be a way of getting the same information from the imagenet website, without having to download the full dataset.

For example, looking a bit around, I was able to find the synsets in http://www.image-net.org/api/text/imagenet.synset.obtain_synset_list

Maybe having a closer look at http://image-net.org/download-API might give some hints on what to do?

@pmeier
Collaborator Author

pmeier commented Dec 10, 2019

> I'm not sure if we can distribute the meta.bin ourselves.

You mean due to licensing issues or something else? The content of meta.bin is twofold:

  1. Mapping from WordNet IDs to human-readable classes: Maybe we could obtain this information directly. I'll look into the API link you posted. Since this is only for convenience, we could make it optional for users who only want to train on the images rather than debug individual classes.
  2. WordNet IDs of the validation set: We are already providing this information, albeit in the pytorch/examples repository:

https://github.com/raw/soumith/imagenetloader.torch/master/valprep.sh

After creating the directories, each validation image is moved into the respective folder. If we for whatever reason cannot provide the same information here, we could simply parse this file.

> For example, looking a bit around, I was able to find the synsets in http://www.image-net.org/api/text/imagenet.synset.obtain_synset_list

That list has 21841 entries. Without investigating further, I think these are simply all available WordNet classes, whereas the ImageNet dataset has only 1000 classes and 50000 validation images.

@fmassa
Member

fmassa commented Dec 10, 2019

@pmeier yes, this might be a problem due to licensing issues.

Maybe there is a way of getting the meta.bin from the official ImageNet website without having to download the full dataset (and thus skipping the registration)?

@collinmccarthy

@fmassa @pmeier I'm assuming the goal here is simply to be able to instantiate an ImageNet instance without the meta.bin file initially being available? Assuming the user has downloaded the devkit tar.gz into the root dir, can't the ImageNet constructor call parse_devkit_archive() if the meta.bin file doesn't exist? I'm pretty sure that would create it and then everything else would just work, but maybe I'm missing something here (license related or otherwise).
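
The flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not torchvision's implementation: `load_meta` and the `parse_devkit` callable are hypothetical names, and `pickle` stands in for whatever serialization the real meta.bin uses.

```python
import os
import pickle

META_FILE = "meta.bin"  # file name as used in this thread


def load_meta(root, parse_devkit):
    """Return the cached metadata in `root`, creating it first if missing.

    `parse_devkit` is a hypothetical callable standing in for the devkit
    parsing step: it receives `root` and returns the metadata object.
    """
    meta_path = os.path.join(root, META_FILE)
    if not os.path.isfile(meta_path):
        # Cache is missing: parse the devkit once and store the result
        meta = parse_devkit(root)
        with open(meta_path, "wb") as fh:
            pickle.dump(meta, fh)
    # Always read from the cache so both code paths return the same data
    with open(meta_path, "rb") as fh:
        return pickle.load(fh)
```

With this shape, the parsing step only runs on the first call; later calls read the cached file.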

@pmeier
Collaborator Author

pmeier commented Jan 13, 2020

@collinmccarthy

The functionality you describe is already implemented:

def __init__(self, root, split='train', download=None, **kwargs):
    if download is True:
        msg = ("The dataset is no longer publicly accessible. You need to "
               "download the archives externally and place them in the root "
               "directory.")
        raise RuntimeError(msg)
    elif download is False:
        msg = ("The use of the download flag is deprecated, since the dataset "
               "is no longer publicly accessible.")
        warnings.warn(msg, RuntimeWarning)

    root = self.root = os.path.expanduser(root)
    self.split = verify_str_arg(split, "split", ("train", "val"))

    self.parse_archives()

I think these problems arise because users either don't know that there is a devkit and thus don't download it, or know about it but think they don't need it. This was not an issue while we could download it for them, but it has kept popping up since the download links were closed.


@fmassa

I've looked around, but I can't find the information we need online. Maybe you missed it in my previous post, but is there a reason not to use the file

https://github.com/raw/soumith/imagenetloader.torch/master/valprep.sh ?

Licensing should not be a problem, since we are already hosting it and also using it as part of the official ImageNet example. The file is easily parsed and contains enough information to use the ImageNet dataset without downloading the devkit.

import re
import shutil
import tempfile
from contextlib import contextmanager
from os import path

from torchvision.datasets.utils import download_url

# Matches the "mv <validation image> <wnid>/" lines of the script.
# Raw string so that \d and \. reach the regex engine unchanged.
PATTERN = re.compile(r"mv ILSVRC2012_val_000(?P<idx>\d{5})\.JPEG (?P<wnid>n\d{8})/")
URL = "https://github.com/raw/soumith/imagenetloader.torch/master/valprep.sh"


@contextmanager
def get_tmp_dir(**kwargs):
    # Temporary directory that is removed again on exit
    tmp_dir = tempfile.mkdtemp(**kwargs)
    try:
        yield tmp_dir
    finally:
        shutil.rmtree(tmp_dir)


with get_tmp_dir() as tmp_dir:
    download_url(URL, tmp_dir)
    with open(path.join(tmp_dir, path.basename(URL)), "r") as fh:
        lines = fh.readlines()

# Collect (image index, WordNet ID) pairs; skip lines that do not match
data = []
for line in lines:
    match = PATTERN.match(line.strip())
    if match is None:
        continue

    idx = int(match.group("idx"))
    wnid = match.group("wnid")
    data.append((idx, wnid))

# Sort by image index and keep only the WordNet IDs
_, val_wnids = zip(*sorted(data))

The other component of the meta.bin file is just for convenience and not needed if you just want to train / validate on the images.
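
To illustrate why the WordNet IDs alone are enough for training and validation, here is a minimal sketch that turns a `val_wnids` sequence like the one above into integer targets. The helper name `wnids_to_targets` is hypothetical, and indexing classes by sorted WordNet ID is an assumption (it matches what an alphabetically sorted class-folder layout would give). The sample IDs are placeholders, not real parsed output.

```python
def wnids_to_targets(val_wnids):
    """Map per-image WordNet IDs to integer class indices.

    Assumption: classes are indexed by their sorted WordNet ID.
    """
    classes = sorted(set(val_wnids))
    wnid_to_idx = {wnid: idx for idx, wnid in enumerate(classes)}
    targets = [wnid_to_idx[wnid] for wnid in val_wnids]
    return classes, targets


# Tiny stand-in for the parsed `val_wnids` tuple (placeholder values)
sample = ("n01751748", "n09193705", "n01751748")
classes, targets = wnids_to_targets(sample)
```

No human-readable class names are involved at any point; they would only be needed to inspect individual predictions.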


3 participants