There can be some variance in the model evaluation results due to several factors (see #4559; the timeline there can be a bit confusing to follow because I was initially relying on incorrect assumptions).
We addressed this in #4609 for the classification reference. We should do the same for the rest of the references (detection, segmentation, similarity, video_classification), roughly as sketched below the list:
- Remove the cudnn auto-benchmarking when `test-only` is True.
- Set `shuffle=False` for the test dataloader.
- Add a `--use-deterministic-algorithms` flag to the scripts.
- Add a warning when the number of processed samples in the validation is different from `len(dataset)` (this one might not be relevant for the detection scripts).
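
A minimal sketch of how these changes could look in a reference script's `main()`, assuming a simplified signature; the dataloader settings and the sample counting are illustrative placeholders, not the actual reference code:

```python
import argparse
import warnings

import torch
from torch.utils.data import DataLoader


def get_args_parser():
    parser = argparse.ArgumentParser(description="Reference evaluation sketch")
    parser.add_argument("--test-only", action="store_true", help="only run evaluation")
    parser.add_argument(
        "--use-deterministic-algorithms",
        action="store_true",
        help="ask PyTorch to use deterministic algorithms where available",
    )
    return parser


def main(args, model, dataset_test):
    # Only enable cudnn auto-tuning while training; it introduces
    # non-determinism that is not useful for a single evaluation pass.
    if not args.test_only:
        torch.backends.cudnn.benchmark = True

    # Opt into deterministic algorithms when requested.
    if args.use_deterministic_algorithms:
        torch.use_deterministic_algorithms(True)

    # Keep the evaluation order fixed.
    data_loader_test = DataLoader(dataset_test, batch_size=1, shuffle=False, num_workers=4)

    num_processed_samples = 0
    model.eval()
    with torch.inference_mode():
        for images, targets in data_loader_test:
            # forward pass and metric updates would go here
            num_processed_samples += len(images)

    # In a distributed run, num_processed_samples would first need to be
    # summed across processes before this check.
    if num_processed_samples != len(dataset_test):
        warnings.warn(
            f"Only {num_processed_samples} of the {len(dataset_test)} samples were "
            "processed; the validation metric may not be comparable across runs."
        )
```

Verifying the change would then amount to running each script twice with `--test-only --use-deterministic-algorithms` on the same GPU and checking that the reported metrics match.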
Tackling this issue requires access to at least 1 GPU to make sure the new evaluation scores are similar to, and more stable than, the previous ones.
cc @datumbox