Fix quantization error on Reference Scripts #4722

Closed
3 changes: 1 addition & 2 deletions references/classification/utils.py
@@ -33,7 +33,6 @@ def synchronize_between_processes(self):
         Warning: does not synchronize the deque!
         """
         t = reduce_across_processes([self.count, self.total])
-        t = t.tolist()
         self.count = int(t[0])
         self.total = t[1]

@@ -407,4 +406,4 @@ def reduce_across_processes(val):
     t = torch.tensor(val, device="cuda")
     dist.barrier()
     dist.all_reduce(t)
-    return t
+    return t.tolist()

Member

Would it work if we kept return t here?

The problem with using t.tolist() here is that we would need to change https://github.com/pytorch/vision/blob/main/references/classification/train.py#L78 from

num_processed_samples = utils.reduce_across_processes(num_processed_samples)

to

num_processed_samples = utils.reduce_across_processes(num_processed_samples)[0]

because otherwise the code after that wouldn't work as expected: comparing an int to a tensor of length 1 works, but we can't compare an int with a list of length 1 in the same way.

But then if we used num_processed_samples = utils.reduce_across_processes(num_processed_samples)[0], we would have a similar problem in the non-distributed setting: we can't index an integer.

I feel like just removing the tolist() call is actually enough?
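
The two failure modes described in this comment can be reproduced in plain Python, without torch or a distributed setup. This is only a minimal sketch: `dataset_size`, `as_list`, and `as_int` are hypothetical stand-ins for the values involved, not names from train.py.

```python
# Hypothetical stand-ins for the two return shapes discussed above.
dataset_size = 5

as_list = [5]  # shape of the result if reduce_across_processes returned t.tolist()
as_int = 5     # shape of the result in the non-distributed setting, per the comment

# A one-element list never compares equal to an int.
list_matches = (as_list == dataset_size)

# Indexing the list fixes the comparison in the distributed case...
indexed_matches = (as_list[0] == dataset_size)

# ...but the same indexing raises TypeError on a plain int.
try:
    as_int[0]
    int_index_error = None
except TypeError as exc:
    int_index_error = str(exc)

print(list_matches, indexed_matches, int_index_error)
```

Neither variant works in both settings, which is why keeping `return t` avoids the problem at this call site.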

Contributor Author

My understanding was that val was supposed to be a list (because of line 32). The issue is that when not in a distributed setting, the return of line 405 will cause the subsequent tolist call to fail.

If val can also be an integer, I think that's an issue. Perhaps adding a type annotation for val would make things clearer. Alternatively, the val parameter could be renamed and restricted to a single type (for example, a list).

I'm going to close the PR and let you choose the solution you would like for this bug. Let me know when you have it so I can help with the review.
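
One possible reading of the single-type suggestion above is to normalize val at the function boundary, so callers always receive the same shape. The sketch below is pure Python and leaves out the torch and dist calls entirely. `normalize_val` and the `Number` alias are hypothetical names for illustration, not part of utils.py.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def normalize_val(val: Union[Number, Sequence[Number]]) -> List[Number]:
    """Hypothetical helper: coerce a scalar or a sequence into a list."""
    if isinstance(val, (int, float)):
        # Wrap scalars so callers can always index the result.
        return [val]
    return list(val)


print(normalize_val(5))
print(normalize_val([3, 7.5]))
```

With a helper like this, a caller could index the result with [0] whether it passed in a scalar or a list. Whether that is preferable to always returning a tensor is a design choice left to the maintainers.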