Improve GemLite Integration #2096

Merged: 24 commits into pytorch:main, Apr 25, 2025

Conversation

@mobicham (Collaborator) commented on Apr 22, 2025

Tasks

  • Faster get_plain() via Triton unpacking instead of a matmul with the identity matrix; this should also make slicing faster.
  • Fix serialization issues.
  • General clean-up.
  • Remove the CUDA device check in .to() so models can be loaded in transformers.
  • Force the CUDA device in from_plain / get_plain for vLLM weight-loader compatibility.
  • Add bfloat16 support.

Note: Slicing still performs an unpack -> repack round trip. If we can restrict the slicing step to self._layout.packing_bitwidth // self._layout.bit_width, we can avoid this and slice the packed data directly (see the sketch below).
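A minimal sketch of that idea, assuming row-major packing with 4-bit values stored 8-per-int32 (the helper name and signature are hypothetical, not part of this PR):

def slice_packed_rows(packed, start, length, packing_bitwidth=32, bit_width=4):
    # Hypothetical: slice the packed rows directly, valid only when the offsets
    # align with the packing granularity (no unpack -> repack round trip needed).
    elems_per_word = packing_bitwidth // bit_width  # e.g. 32 // 4 = 8 values per packed word
    assert start % elems_per_word == 0 and length % elems_per_word == 0
    return packed[start // elems_per_word : (start + length) // elems_per_word]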

Test

import torch, gemlite
from torchao.quantization import GemliteUIntXWeightOnlyConfig, quantize_
device = 'cuda:0'
dtype = torch.float16

layer = torch.nn.Linear(256, 512, bias=False, dtype=dtype, device=device)
weight = layer.weight.data.clone()
orig_shape = weight.shape

group_size = 64
quantize_(layer, GemliteUIntXWeightOnlyConfig(bit_width=4, group_size=group_size))

# Test dot product
#################################################
torch.manual_seed(0)
x = torch.randn((1, layer.in_features), device=device, dtype=dtype) / 10.
y_ref = x @ weight.T
y_gem  = layer(x)
assert (y_ref - y_gem).abs().mean() < 5e-3, "Dot product mismatch"

# Test get_plain()
#################################################
W_q, s, z       = layer.weight.tensor_impl.get_plain()
W_r_getplain = ((W_q.view([-1, group_size]) - z.view(-1, 1)) * s.view(-1, 1)).view(orig_shape)
W_r_dotprod = layer(torch.eye(layer.in_features, device=device, dtype=dtype)).T #SLOW because it autotunes
assert (W_r_getplain - W_r_dotprod).abs().mean() < 1e-4, "get_plain() incorrect results"
assert (y_gem - (x @ W_r_getplain.T)).abs().mean() < 1e-4, "get_plain() incorrect results"

# Test slicing
#################################################
def _dequant(tensor_impl, in_features, orig_shape):
    int_data   = tensor_impl.packed_weight
    scale      = tensor_impl.scale
    zero_point = tensor_impl.zero_point

    W_q = gemlite.bitpack.unpack_over_rows(int_data, W_nbits=4, num_output_rows=in_features, dtype=torch.uint8).T.contiguous()
    s   = scale.t().contiguous()
    z   = zero_point.t().contiguous()
    return ((W_q.view([-1, group_size]) - z.view(-1, 1)) * s.view(-1, 1)).view(orig_shape)

torch.manual_seed(0)
x     = torch.randn((1, layer.in_features), device=device, dtype=dtype) / 10.
y_ref = layer(x)
W_r   = _dequant(layer.weight.tensor_impl, layer.in_features, orig_shape).T
y_2   = x @ W_r
assert (y_ref - y_2).abs().mean() < 1e-4, "Incorrect dequant results"


layer_sliced = layer.weight.narrow(0, 0, 256)
W_slice1   = _dequant(layer_sliced.tensor_impl, layer.in_features, [orig_shape[0]//2, orig_shape[1]]).T
assert (W_r[:, :256] - W_slice1).abs().mean() == 0, "slice1 along axis=0 is incorrect"

layer_sliced = layer.weight.narrow(0, 256, 256)
W_slice2   = _dequant(layer_sliced.tensor_impl, layer.in_features, [orig_shape[0]//2, orig_shape[1]]).T
assert (W_r[:, 256:] - W_slice2).abs().mean() == 0 , "slice2 along axis=0 is incorrect"

layer_sliced = layer.weight.narrow(1, 0, 128)
W_slice1   = _dequant(layer_sliced.tensor_impl, layer.in_features//2, [orig_shape[0], orig_shape[1]//2]).T
assert (W_r[:128, :] - W_slice1).abs().mean() == 0, "slice1 along axis=1 is incorrect"

layer_sliced = layer.weight.narrow(1, 128, 128)
W_slice2   = _dequant(layer_sliced.tensor_impl, layer.in_features//2, [orig_shape[0], orig_shape[1]//2]).T
assert (W_r[128:, :] - W_slice2).abs().mean() == 0 , "slice2 along axis=1 is incorrect"
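
With the new bfloat16 support, the same smoke test should also pass in bfloat16. A minimal variant (assumption: identical API; the tolerance is loosened since bfloat16 has fewer mantissa bits than float16):

# bfloat16 variant (sketch; the 1e-2 tolerance is an assumption, not from the PR)
dtype_bf16 = torch.bfloat16
layer_bf16 = torch.nn.Linear(256, 512, bias=False, dtype=dtype_bf16, device=device)
weight_bf16 = layer_bf16.weight.data.clone()
quantize_(layer_bf16, GemliteUIntXWeightOnlyConfig(bit_width=4, group_size=group_size))
x_bf16 = torch.randn((1, layer_bf16.in_features), device=device, dtype=dtype_bf16) / 10.
assert (x_bf16 @ weight_bf16.T - layer_bf16(x_bf16)).abs().mean() < 1e-2, "bfloat16 dot product mismatch"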

@pytorch-bot commented on Apr 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2096

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0e65350 with merge base 11472c9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Apr 22, 2025
@mobicham changed the title from "Improve GemLite Integration" to "Improve GemLite Integration topic: improvemen" on Apr 22, 2025
@mobicham changed the title from "Improve GemLite Integration topic: improvemen" back to "Improve GemLite Integration" on Apr 22, 2025
@jerryzh168 added the topic: improvement label on Apr 23, 2025
@@ -91,12 +91,30 @@ def wrapper(*args, **kwargs):


 def skip_if_no_cuda():
-    import unittest
+    import pytest
A contributor commented:

we might have to use unittest to make CI happy I think
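
A unittest-based variant of the helper might look like this (a sketch of the suggestion, not the code in this PR):

import unittest
import torch

def skip_if_no_cuda():
    # unittest.skipIf returns a decorator, so tests can use @skip_if_no_cuda()
    return unittest.skipIf(not torch.cuda.is_available(), "CUDA not available")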

@jerryzh168 added and removed the topic: improvement label on Apr 25, 2025
@jerryzh168 merged commit 3da411a into pytorch:main on Apr 25, 2025 (19 of 20 checks passed)
Labels: CLA Signed, topic: improvement