
Fix quantization error on Reference Scripts #4722


Closed
datumbox wants to merge 1 commit

Conversation

datumbox (Contributor) commented on Oct 22, 2021

Running the quantization script after #4609 leads to:

Traceback (most recent call last):
  File "./vision/references/classification/train_quantization.py", line 257, in <module>
    main(args)
  File "./vision/references/classification/train_quantization.py", line 103, in main
    evaluate(model, criterion, data_loader_test, device=device)
  File "./vision/references/classification/train.py", line 92, in evaluate
    metric_logger.synchronize_between_processes()
  File "./vision/references/classification/utils.py", line 95, in synchronize_between_processes
    meter.synchronize_between_processes()
  File "./vision/references/classification/utils.py", line 36, in synchronize_between_processes
    t = t.tolist()
AttributeError: 'list' object has no attribute 'tolist'

This is because the quantization script doesn't run in distributed mode, so reduce_across_processes() returns a list rather than a tensor.
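
For illustration, here is a minimal sketch of the failure mode. This is a simplification rather than the actual torchvision helper; the assumption is that, without an initialized process group, the value comes back as the plain Python object that was passed in:

import torch
import torch.distributed as dist

def reduce_across_processes(val):
    # Assumed simplification: in a single-process run the input comes back unchanged.
    if not (dist.is_available() and dist.is_initialized()):
        return val
    t = torch.tensor(val, device="cuda")
    dist.barrier()
    dist.all_reduce(t)
    return t

t = reduce_across_processes([3, 7.5])  # single-process run: t is still a plain list
t = t.tolist()                         # AttributeError: 'list' object has no attribute 'tolist'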

@datumbox datumbox requested a review from NicolasHug October 22, 2021 11:53
@datumbox datumbox changed the title from "Fix quantization error" to "Fix quantization error on Reference Scripts" on Oct 22, 2021
@@ -407,4 +406,4 @@ def reduce_across_processes(val):
     t = torch.tensor(val, device="cuda")
     dist.barrier()
     dist.all_reduce(t)
-    return t
+    return t.tolist()
Member
Would it work if we kept return t here?

The problem with using t.tolist() here is that we would need to change https://github.com/pytorch/vision/blob/main/references/classification/train.py#L78 from

num_processed_samples = utils.reduce_across_processes(num_processed_samples)

to

num_processed_samples = utils.reduce_across_processes(num_processed_samples)[0]

because otherwise the code after that wouldn't work as expected: comparing an int to a tensor of length 1 works, but we can't compare an int with a list of length 1 in the same way.

But then if we used num_processed_samples = utils.reduce_across_processes(num_processed_samples)[0], we would have a similar problem in the non-distributed setting: we can't index an integer.

I feel like just removing the tolist() call is actually enough?
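
A quick illustration of the comparison issue described above, with hypothetical values rather than the actual train.py code:

import torch

num_samples = 50000                # e.g. len(dataset), a plain int

as_tensor = torch.tensor([50000])  # a 1-element tensor, as returned by the distributed branch
print(num_samples != as_tensor)    # tensor([False]): behaves like a scalar comparison, so the check works

as_list = [50000]                  # what return t.tolist() would produce
print(num_samples != as_list)      # True: an int never equals a list, so the check always fires

print(as_list[0])                  # indexing would fix the list case...
# print(num_samples[0])            # ...but raises TypeError when the helper returns a bare int
                                   # in the non-distributed setting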

Contributor Author

My understanding was that val was supposed to be a list (because of line 32). The issue is that when not in a distributed setting, the return of line 405 will cause the subsequent tolist call to fail.

If val can also be an integer, I think that's an issue. Perhaps adding type annotations for val would make things clearer. Alternatively, the val parameter could be renamed and restricted to a single type (for example, list).
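
For reference, a sketch of what a single-return-type version could look like, assuming the return type is standardized on a tensor as suggested in the review comment above (an illustration, not a committed fix):

from typing import List, Union

import torch
import torch.distributed as dist

def reduce_across_processes(val: Union[int, float, List[float]]) -> torch.Tensor:
    if not (dist.is_available() and dist.is_initialized()):
        # Nothing to reduce in a single-process run, but still wrap the value in a
        # tensor so every caller can rely on the same return type.
        return torch.tensor(val)
    t = torch.tensor(val, device="cuda")
    dist.barrier()
    dist.all_reduce(t)
    return t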

I'm going to close this PR and let you choose the solution you prefer for this bug. Let me know once you have it and I'll help with the review.
