
Evaluation code of references is slightly off #4559

@NicolasHug

Description


There is a subtle known bug in the evaluation code of the classification references (and other references as well, but not all):

# FIXME need to take into account that the datasets
# could have been padded in distributed setup

It deserves some attention because it's easy to miss, and yet it can impact our reported results as well as those of research papers.

As the comment above describes, when computing the accuracy of the model on a validation set in a distributed setting, some images will be counted more than once if len(dataset) isn't divisible by batch_size * world_size (see footnote 1).
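
To make the padding concrete, here is a small standalone sketch (separate from the footnote example below, and using different numbers: a hypothetical 10-sample dataset split across 4 processes) showing which indices DistributedSampler hands to each rank when the dataset size isn't divisible by the number of replicas:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Hypothetical toy dataset: 10 samples evaluated across 4 processes.
dataset = TensorDataset(torch.arange(10))

all_indices = []
for rank in range(4):
    # Passing num_replicas/rank explicitly avoids needing an actual process group.
    sampler = DistributedSampler(dataset, num_replicas=4, rank=rank, shuffle=False)
    all_indices.extend(list(sampler))

print(sorted(all_indices))
# [0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Each rank is padded to ceil(10 / 4) = 3 samples, so indices 0 and 1 are
# evaluated twice and skew the reported accuracy.
```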

On top of that, since the test_sampler uses shuffle=True by default, the duplicated images aren't even the same across executions, which means that evaluating the same model on the same dataset can lead to different results every time.

Should we try to fix this, or should we just leave it and wait for the new lightning recipes to handle it? As a follow-up question, is there a built-in way in lightning to mitigate this at all? (I'm not familiar with lightning, so this one may not make sense.)
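
For discussion, one possible mitigation (only a sketch, not what the references currently do, and not necessarily what lightning would do) is to keep the padded sampler but deduplicate by sample index when reducing the metric. This assumes a hypothetical dataset wrapper that returns each sample's index alongside the image and target:

```python
import torch
import torch.distributed as dist

@torch.inference_mode()
def evaluate_dedup(model, data_loader, device):
    # Hypothetical: each batch also carries the dataset indices of its samples
    # (e.g. via a small dataset wrapper), so padded duplicates can be identified.
    model.eval()
    correct_by_index = {}
    for images, targets, indices in data_loader:
        images, targets = images.to(device), targets.to(device)
        preds = model(images).argmax(dim=1)
        for idx, ok in zip(indices.tolist(), (preds == targets).tolist()):
            correct_by_index[idx] = ok

    # Gather every rank's {index: correct} map; duplicated (padded) indices
    # overwrite each other, so each image is counted exactly once.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, correct_by_index)
    merged = {}
    for per_rank in gathered:
        merged.update(per_rank)
    return sum(merged.values()) / len(merged)
```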

cc @datumbox

Footnotes

  1. For example, if we have 10 images and 2 workers with a batch_size of 3, we will get something like:

    worker1: img1, img2, img3
    worker2: img4, img5, img6
    worker1: img7, img8, img9
    worker2: img10, img1, img2
                    ^^^^^^^^^^
    "padding": duplicated images, which will affect the validation accuracy
    
