[WIP] Codebook quantization flow #1299
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1299
Note: links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure) As of commit b938a7c with merge base 46b8796. BROKEN TRUNK: the following job failed but was also present on the merge base; rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks for the contribution! Yeah, ">1 hour" seems a bit too slow; any ideas on how to speed it up?
Also, after this is done, it would be useful if you could add codebook quant to generate.py and eval.py:
ao/torchao/_models/llama/generate.py, line 209 in b714026
ao/torchao/_models/llama/eval.py, line 71 in b714026
I think
Changed max_iter from 1000 to 200; added codebook to eval and generate.
Thanks. Why is the perplexity so high? We get around 12/13 when using int4wo-64 on llama2: https://github.com/pytorch/ao/tree/main/torchao/quantization#cuda-backend
I found that it was because I was setting scales to the norm of each scale group instead of the max of each scale group. If you set scales to be the max of each scale group, wikitext perplexity is ~11.6 for
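For reference, a minimal pure-PyTorch sketch of the max-of-scale-group idea described above (the helper name and group layout are illustrative, not the PR's actual code):

import torch

def group_scales_maxabs(weight: torch.Tensor, scale_group_size: int) -> torch.Tensor:
    # one scale per group of `scale_group_size` columns, taken as the max
    # absolute value in the group rather than the group norm
    out_features, in_features = weight.shape
    assert in_features % scale_group_size == 0
    grouped = weight.reshape(out_features, in_features // scale_group_size, scale_group_size)
    return grouped.abs().amax(dim=-1, keepdim=True)

w = torch.randn(8, 16)
scales = group_scales_maxabs(w, scale_group_size=4)  # shape (8, 4, 1)
w_normalized = w.reshape(8, 4, 4) / scales            # values now lie in [-1, 1]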
Thanks. If both performance and accuracy are reasonable, I think the main thing is to add a section for codebook quant: https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#uintx-quantization and show the perplexity and tokens/s results from eval.py and generate.py.
else:
    codebook_size = _DTYPE_TO_QVALUE_BOUNDS[code_dtype][1] + 1

out_block_size, in_block_size = block_size
block_size is a general arg that allows people to do all kinds of granularities:

ao/torchao/quantization/quant_primitives.py, lines 277 to 287 in 53d2486:

Note: How can block_size represent different granularities?
Let's say we have a Tensor of size (3, 3, 10, 10); here is the table showing how block_size represents different granularities:

granularity type                       | block_size
per_tensor                             | (3, 3, 10, 10)
per_axis (axis=0)                      | (1, 3, 10, 10)
per_axis (axis=1)                      | (3, 1, 10, 10)
per_group (groupsize=2)                | (3, 3, 10, 2)
per_group (groupsize=2) for axis = 3   | (3, 3, 2, 10)
ao/torchao/quantization/quant_primitives.py, lines 367 to 375 in 53d2486:

shape_for_reduction, reduction_dims = _get_reduction_params(
    block_size, input.size()
)
original_shape = input.shape
input = input.view(shape_for_reduction)
shape_after_reduction = shape_for_reduction
for i in reduction_dims:
    shape_after_reduction[i] = 1
scale = scale.view(shape_after_reduction)
is there a reason why it's assumed to be 2d here?
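For illustration, a self-contained sketch (independent of torchao's _get_reduction_params; the helper name and shapes are mine) of how an N-d block_size maps to one scale per block via the reshape-and-reduce pattern quoted above:

import torch

def per_block_absmax_scales(x: torch.Tensor, block_size: tuple) -> torch.Tensor:
    # split every dim into (num_blocks, block) and reduce over the block axes,
    # mirroring the shape_for_reduction / reduction_dims logic quoted above
    assert x.dim() == len(block_size)
    shape_for_reduction, reduction_dims = [], []
    for i, (dim, blk) in enumerate(zip(x.shape, block_size)):
        assert dim % blk == 0
        shape_for_reduction += [dim // blk, blk]
        reduction_dims.append(2 * i + 1)
    return x.reshape(shape_for_reduction).abs().amax(dim=tuple(reduction_dims), keepdim=True)

x = torch.randn(3, 3, 10, 10)
print(per_block_absmax_scales(x, (3, 3, 10, 2)).squeeze().shape)   # torch.Size([5]): per_group, groupsize=2
print(per_block_absmax_scales(x, (1, 3, 10, 10)).squeeze().shape)  # torch.Size([3]): per_axis, axis=0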
I guess it's fine to start with 2d for now as well, but it would be good to add an assert and create an issue for further development.
I made it 2d because that is how it was implemented in the AQLM code. The code will be a little more complicated, but it should be possible to generalize to more than 2d.
Benchmarks were run on a single NVIDIA A6000 GPU.
Seems slow; might need custom kernels.
else:
    codes = self.codes
    if codes.dtype == torch.uint8:
        codes = codes.to(torch.int32)  # not sure how to index with uint8
Is this needed? This might be the reason why it's slow. What do you mean by index?
I think so. I do indexing via dequant = codebook[codes] in dequantize_codebook. I got an error when I tried doing codebook[codes] when codes was uint8.
What is the error? It might be easy to support this in PyTorch, I feel.
Only int and long index dtypes are supported right now: https://github.com/pytorch/pytorch/blob/6e203ae6deaceb370e497bd50f2d02e894f5e9cc/aten/src/ATen/Dispatch.h#L801
UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead.
IndexError: The shape of the mask [2] at index 0 does not match the shape of the indexed tensor [3, 3] at index 0.
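For reference, a minimal repro of the uint8-indexing issue and the cast workaround (toy shapes, not the PR's actual tensors):

import torch

codebook = torch.randn(16, 4)                        # 16 centroids of dimension 4
codes = torch.randint(0, 16, (8,), dtype=torch.uint8)

# codebook[codes] with uint8 codes is interpreted as a (deprecated) boolean mask,
# which produces the warning and IndexError quoted above.
# Casting to a supported integer index dtype turns it into a gather:
dequant = codebook[codes.to(torch.int64)]            # shape (8, 4)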
OK, let's add a TODO here to follow up and land this for now. Is the perplexity number expected here? It looks like it's slightly worse than int4wo: https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#cuda-backend
I think the perplexity is expected to be slightly worse than asymmetric int4wo (with scales and zero points) because I only add scales. But it should usually be better than symmetric int4wo (scales only, no zero points), though I initialize the centroids in k-means randomly, so it could be worse if the centroids have a bad initialization.
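As a side note on initialization, a common mitigation for bad random seeds is k-means++-style greedy seeding; here is a sketch (not necessarily identical to the greedy init added later in this PR):

import torch

def kmeanspp_init(points: torch.Tensor, k: int) -> torch.Tensor:
    # pick each new centroid with probability proportional to its squared
    # distance from the closest centroid chosen so far
    centroids = [points[torch.randint(points.shape[0], (1,))].squeeze(0)]
    for _ in range(k - 1):
        d2 = torch.cdist(points, torch.stack(centroids)).pow(2).amin(dim=1)
        idx = torch.multinomial(d2 / d2.sum(), 1)
        centroids.append(points[idx].squeeze(0))
    return torch.stack(centroids)

pts = torch.randn(1024, 8)
init_centroids = kmeanspp_init(pts, k=16)   # (16, 8) initial centroids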
1 - Fix an extraneous skip end that is out of order with a skip begin.
2 - Fix some typos.
PS: This might cause some README tests to fail, as they have not been run in a long time.
I think a model-level benchmark is fine for now; maybe add some unit tests for basic functionality like getting the codebook, quantize, and dequantize before landing?
Also, at the model level, is it possible to repro some accuracy results from AQLM: https://github.com/Vahe1994/AQLM/tree/main ?
I tested whether I could convert an AQLM quantized model to my implementation as a sanity check, and it seems to work: https://gist.github.com/DerekLiu35/c1cb9594c515e92c64762cbc8d087f7a.
Thanks! So the representation can be verified, but we are not sure how to repro the accuracy; in the original AQLM they'd need full-model finetuning to get good accuracy. That seems like a good next step, if you are interested in improving this further.
dequant = cqt.dequantize()

torch.testing.assert_close(dequant, self.input, atol=2, rtol=2)
Looks like a large drop. We could use a larger dtype, e.g. torch.uint8, for the test I think, so we can get something closer.
mse = torch.mean((dequant - self.input) ** 2).item()
self.assertLess(mse, 0.01)
We typically use SQNR:
ao/torchao/quantization/utils.py, line 50 in 46b8796:
def compute_error(x, y):
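For context, an SQNR-style metric in the same spirit (the definition of compute_error in torchao/quantization/utils.py is the source of truth; this is just a sketch):

import torch

def sqnr_db(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # signal-to-quantization-noise ratio in dB; higher means y is closer to x
    return 20 * torch.log10(torch.linalg.norm(x) / torch.linalg.norm(x - y))

# e.g. in the test, instead of an MSE threshold:
# self.assertGreater(sqnr_db(self.input, dequant).item(), 30)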
Yeah, I'd definitely be interested in trying to implement AQLM to improve accuracy. |
dequant = cqt.dequantize()

torch.testing.assert_close(dequant, self.input, atol=0.1, rtol=0.1)
nit: it's fine to just rely on sqnr for error checking btw.
Thanks for the contribution and addressing all the comments!
* Add codebook_ops
* Add codebook_quantized_tensor
* Add __init__.py
* Fix uint8 indexing
* Add codebook_weight_only
* add codebook to eval and generate
* Make scales max of scale group if block_size = (1, 1)
* generalize block_size to more than 2d
* add codebook section to README
* add greedy init to means
* change codes casting condition
* Update __init__.py
* Add tests
* add TODO
* make multiplication inplace
* store codebook and scales in input_tensor.dtype instead of float32
* update tests
* remove torch.allclose check
@implements_torch_function(torch.Tensor.detach)
def function_detach(tensor, *args, **kwargs):
    return tensor.detach()
To overload linear:

@implements_torch_function(nn.functional.linear)
def function_linear(tensor, *args, **kwargs):
    breakpoint()
    # torch.ops.torchao.tinygemm(args)
    return tensor.detach()
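A slightly fuller sketch of a fallback linear overload, assuming the subclass exposes dequantize() as in the tests above and that implements_torch_function (the decorator from this PR) forwards the original (input, weight, bias) arguments; the fused-kernel call is left as a placeholder:

import torch.nn.functional as F

@implements_torch_function(F.linear)
def function_linear(input, weight, bias=None):
    # naive fallback: dequantize the codebook-quantized weight and run a regular
    # linear; a fused kernel (e.g. a tinygemm-style op) could be dispatched here later
    return F.linear(input, weight.dequantize(), bias)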
This PR adds a codebook quantization flow per #1195.
Usage
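A hedged sketch of the intended call shape, based on the codebook_weight_only API added in this PR (the import path and argument names below are assumptions; see the PR's __init__.py and the generate.py changes for the actual ones):

import torch
from torchao.quantization import quantize_
from torchao.prototype.quantization.codebook import codebook_weight_only  # assumed path

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16)
# quantize linear weights with a learned codebook; dtype controls the codebook size
quantize_(model, codebook_weight_only(dtype=torch.uint4))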
ToDo
Make fit_kmeans faster. Right now it takes >1 hour if you try to quantize a 1B model.
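One possible direction for the speedup (a sketch, not profiled against this PR's fit_kmeans): do the assignment step with a matmul-based distance expansion, chunked over points so the full (n, k) distance matrix is never materialized at once:

import torch

@torch.no_grad()
def assign_codes(points: torch.Tensor, centroids: torch.Tensor, chunk: int = 65536) -> torch.Tensor:
    # nearest-centroid assignment using ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2;
    # ||x||^2 is constant per row, so it can be dropped from the argmin
    c_sq = centroids.pow(2).sum(dim=1)                 # (k,)
    codes = torch.empty(points.shape[0], dtype=torch.int64, device=points.device)
    for start in range(0, points.shape[0], chunk):
        x = points[start:start + chunk]
        d = c_sq - 2.0 * (x @ centroids.T)             # (chunk, k), up to a per-row constant
        codes[start:start + chunk] = d.argmin(dim=1)
    return codes

pts = torch.randn(100_000, 8)
cents = torch.randn(256, 8)
codes = assign_codes(pts, cents)   # (100_000,) int64 code indices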