Description
🚀 The feature, motivation and pitch
https://arxiv.org/abs/2401.14112
I think you guys are really going to like this.
The DeepSpeed developers introduce an FP6 datatype for cards without FP8 support, while maintaining full tensor core support using a kernel they created called TC-FPx. Tests were done on an A100, and they achieved 1.69x-2.65x higher inference throughput than the FP16 baseline! And I assume this can be transferred over to training (with the possible exception of the KV cache and the embedding module). This is really exciting; it will breathe new life into the rapidly aging A100, which lacks the H100's FP8 support.
It was merged into DeepSpeed in this commit:
deepspeedai/DeepSpeed@ccfdb84
Getting this pushed into PyTorch as a dtype would be a major win. Per the paper, FP6 provides a much smaller memory footprint than FP16 while preserving model quality noticeably better than 4-bit quantization.
Alternatives
These kernels shouldn't be limited to the A100; they could theoretically work on any card with uint8_t and FP16 support. That said, the kernels were only written for the A100, so without modification they might only work on Ampere cards.
Additional context
The TC-FPx kernel essentially takes four FP16 values and quantizes them to FP6 (with some placeholder bits during the conversion). The four 6-bit values are then packed into an array of three uint8_t (4 × 6 = 24 bits = 3 bytes), as sketched below.
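For illustration, here is a minimal host-side C++ sketch of that packing step. The e3m2 bit layout (1 sign, 3 exponent, 2 mantissa bits) and the simple sequential packing order are my assumptions for clarity; the actual TC-FPx kernels use a bit-level pre-packed, tensor-core-friendly memory layout and perform the conversion on the GPU.

```cpp
#include <cstdint>
#include <cstdio>
#include <cmath>

// Illustrative FP6 quantizer (1 sign, 3 exponent, 2 mantissa bits, bias 3).
// No subnormals or NaN/Inf handling -- just enough to show the 6-bit budget.
uint8_t quantize_fp6(float x) {
    uint8_t sign = (x < 0.0f) ? 1u : 0u;
    float a = std::fabs(x);
    if (a == 0.0f) return (uint8_t)(sign << 5);        // signed zero
    int e;
    float m = std::frexp(a, &e);                       // a = m * 2^e, m in [0.5, 1)
    int biased = e - 1 + 3;                            // exponent with bias 3
    long mant = std::lround((m * 2.0f - 1.0f) * 4.0f); // 2 fraction bits
    if (mant == 4) { mant = 0; ++biased; }             // rounding carried into exponent
    if (biased < 1) { biased = 1; mant = 0; }          // clamp tiny values to smallest normal
    if (biased > 7) { biased = 7; mant = 3; }          // clamp large values to max finite
    return (uint8_t)((sign << 5) | (biased << 2) | mant); // s eee mm -> 6 bits
}

// Pack four 6-bit values (4 * 6 = 24 bits) into three uint8_t.
void pack4(const uint8_t v[4], uint8_t out[3]) {
    out[0] = (uint8_t)((v[0] << 2) | (v[1] >> 4));
    out[1] = (uint8_t)((v[1] << 4) | (v[2] >> 2));
    out[2] = (uint8_t)((v[2] << 6) |  v[3]);
}

int main() {
    float w[4] = {0.75f, -1.5f, 2.0f, -0.25f};
    uint8_t q[4], packed[3];
    for (int i = 0; i < 4; ++i) q[i] = quantize_fp6(w[i]);
    pack4(q, packed);
    printf("packed: %02x %02x %02x\n", packed[0], packed[1], packed[2]);
    return 0;
}
```

The point of the sketch is the storage math: four weights that would occupy 8 bytes in FP16 fit in 3 bytes as FP6, and because 24 bits is byte-aligned, groups of four values can be stored without any wasted padding bits.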
cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel