Change to stable sort in nms implementations #4767

Conversation
💊 CI failures summary and remediations: as of commit 90fdb26 (more details on the Dr. CI page), there are no failures so far. This comment was automatically generated by Dr. CI.
Benchmark results are reassuring: the difference between using the stable sort and not using it seems negligible. Code used for the benchmark:

```python
import torch
from time import time
from torchvision import models, ops


def _create_tensors_with_iou(N, iou_thresh):
    # force last box to have a pre-defined iou with the first box
    # let b0 be [x0, y0, x1, y1], and b1 be [x0, y0, x1 + d, y1],
    # then, in order to satisfy ops.iou(b0, b1) == iou_thresh,
    # we need to have d = (x1 - x0) * (1 - iou_thresh) / iou_thresh
    # Adjust the threshold upward a bit with the intent of creating
    # at least one box that exceeds (barely) the threshold and so
    # should be suppressed.
    boxes = torch.rand(N, 4) * 100
    boxes[:, 2:] += boxes[:, :2]
    boxes[-1, :] = boxes[0, :]
    x0, y0, x1, y1 = boxes[-1].tolist()
    iou_thresh += 1e-5
    boxes[-1, 2] += (x1 - x0) * (1 - iou_thresh) / iou_thresh
    scores = torch.rand(N)
    return boxes, scores


def _create_random_boxes(N):
    boxes = torch.rand(N, 4) * 100
    scores = torch.rand(N)
    return boxes, scores


run_on = "cuda"
for n in range(100, 10001, 100):
    boxes, scores = _create_random_boxes(n)
    boxes = boxes.to(device=run_on)
    scores = scores.to(device=run_on)
    start = time()
    keep = ops.nms(boxes, scores, 0.5)
    end = time()
    time_for_random = end - start

    boxes, scores = _create_tensors_with_iou(n, 0.5)
    boxes = boxes.to(device=run_on)
    scores = scores.to(device=run_on)
    start = time()
    keep = ops.nms(boxes, scores, 0.5)
    end = time()
    time_for_twiou = end - start

    print(f'{n}\t{time_for_random}\t{time_for_twiou}')
```
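As a follow-up sketch (not part of the original benchmark), one way to check that the stable sort removes tie-related disagreement between the CPU and CUDA implementations; the quantized scores below are an assumption made only to force duplicate values:

```python
import torch
from torchvision import ops

# Random well-formed boxes, as in the benchmark above.
boxes = torch.rand(1000, 4) * 100
boxes[:, 2:] += boxes[:, :2]
# Quantize scores so that many exact ties occur (the case an unstable sort
# may order differently on each device).
scores = (torch.rand(1000) * 10).floor() / 10

keep_cpu = ops.nms(boxes, scores, 0.5)
if torch.cuda.is_available():
    keep_gpu = ops.nms(boxes.cuda(), scores.cuda(), 0.5)
    # With a stable sort both implementations should break ties by original
    # index and therefore agree, up to floating-point differences in the IoU
    # computation.
    print(torch.equal(keep_cpu, keep_gpu.cpu()))
```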
Thanks @jdsgomes, this looks great to me!
This might resolve a lot of the flakiness we face on the model tests. Could you create a new issue to investigate whether, once this is merged, we can remove the hacks/workarounds from:
Lines 658 to 672 in f093d08

```python
# Unfortunately detection models are flaky due to the unstable sort
# in NMS. If matching across all outputs fails, use the same approach
# as in NMSTester.test_nms_cuda to see if this is caused by duplicate
# scores.
expected_file = _get_expected_file(model_name)
expected = torch.load(expected_file)
torch.testing.assert_close(
    output[0]["scores"], expected[0]["scores"], rtol=prec, atol=prec, check_device=False, check_dtype=False
)
# Note: Fmassa proposed turning off NMS by adapting the threshold
# and then using the Hungarian algorithm as in DETR to find the
# best match between output and expected boxes and eliminate some
# of the flakiness. Worth exploring.
return False  # Partial validation performed
```
Note that it's likely we will need to recreate the expected values of the tests. This can be part of our test improvement efforts for next half. cc @NicolasHug @pmeier
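The DETR-style Hungarian matching mentioned in the quoted comment could look roughly like the sketch below; using scipy.optimize.linear_sum_assignment together with torchvision.ops.box_iou is an assumption made for illustration, not what the test suite currently does:

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import box_iou

def match_boxes(pred_boxes, expected_boxes):
    # Cost of pairing a predicted box with an expected box: 1 - IoU,
    # so the optimal assignment maximizes total overlap.
    cost = 1.0 - box_iou(pred_boxes, expected_boxes)
    pred_idx, exp_idx = linear_sum_assignment(cost.numpy())
    return pred_idx, exp_idx

# Toy example: the expected boxes are a permutation of the predictions,
# which would fail an element-wise comparison but matches perfectly here.
pred = torch.tensor([[0., 0., 10., 10.], [5., 5., 15., 15.]])
expected = torch.tensor([[5., 5., 15., 15.], [0., 0., 10., 10.]])
print(match_boxes(pred, expected))
```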
Hey @jdsgomes! You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py
I'll add the BC-breaking tag just so that we remember to add a short note in the release notes. This isn't BC-breaking strictly speaking, as the order of ties isn't guaranteed, but it'd be nice to add a short note to let users know that they might get slightly different results in some rare cases. Regarding the benchmark: for GPU benchmarks we should call torch.cuda.synchronize() before measuring the elapsed time.
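A sketch of the timing pattern that comment points at, assuming the benchmark above is rerun on CUDA: synchronize before reading the clock so that asynchronous kernels have actually finished.

```python
import torch
from time import time
from torchvision import ops

# Requires a CUDA-capable device.
boxes = torch.rand(10000, 4, device="cuda") * 100
boxes[:, 2:] += boxes[:, :2]
scores = torch.rand(10000, device="cuda")

torch.cuda.synchronize()  # make sure pending work is done before timing
start = time()
keep = ops.nms(boxes, scores, 0.5)
torch.cuda.synchronize()  # wait for the NMS kernel to finish
end = time()
print(end - start)
```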
Summary: change to stable sort in nms implementations
Reviewed By: NicolasHug
Differential Revision: D32694315
fbshipit-source-id: e2ff4d0ed84ca7a4ef2982f4d9bb3192a88dc9b0
relates to #4491
triggered by #4766
I am trying to replicate the errors and follow up on the work done in the initial PR in order to fix them and introduce the stable sort (given that it is acceptably fast).
This is an investigation PR and pretty much work in progress.
cc @pmeier
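For context, a rough illustration of what a stable versus unstable sort means for tied scores; the real change lives in the C++/CUDA NMS kernels, so this Python snippet is only a hypothetical analogy:

```python
import torch

# Two pairs of exactly tied scores.
scores = torch.tensor([0.9, 0.5, 0.9, 0.5])

# Unstable sort: the relative order of equal values is unspecified.
_, unstable_idx = torch.sort(scores, descending=True)

# Stable sort: equal values keep their original relative order, so the
# indices that NMS iterates over (and hence the kept boxes) are reproducible.
_, stable_idx = torch.sort(scores, descending=True, stable=True)
print(stable_idx)  # tensor([0, 2, 1, 3])
```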