A16W4 axis=1
- Low hanging fruit: we can add axis=1 HQQ to int4wo quant, either behind a flag or by replacing the existing quant method
  - test eval with HQQ axis=1 and compare to the existing version
- if axis=1 alone doesn't yield enough of an accuracy improvement, we could also combine it with equalization
  - test perf/eval with HQQ axis=1 + equalization
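To make the axis choice concrete, here is a minimal numpy sketch of 4-bit group-wise quantization with groups taken along axis=0 (the output dim) vs axis=1 (the input dim, the layout HQQ uses). It uses plain min-max round-to-nearest, not HQQ's half-quadratic solver, and the helper name is made up for illustration:

```python
import numpy as np

def quant_dequant_4bit(w, axis, group_size=32):
    """Affine 4-bit quant/dequant with one scale/zero-point per group.

    axis=0 groups along the output dim; axis=1 groups along the input
    dim (the layout HQQ uses). Min-max round-to-nearest only -- a
    stand-in for HQQ's optimized scale/zero fitting.
    """
    if axis == 0:
        # Grouping along rows is grouping along columns of the transpose.
        return quant_dequant_4bit(w.T, axis=1, group_size=group_size).T
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    lo = g.min(-1, keepdims=True)
    hi = g.max(-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 15.0, 1.0)  # 4 bit -> 16 levels
    q = np.clip(np.round((g - lo) / scale), 0, 15)
    return (q * scale + lo).reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
err0 = np.abs(W - quant_dequant_4bit(W, axis=0)).max()
err1 = np.abs(W - quant_dequant_4bit(W, axis=1)).max()
```

On an isotropic random matrix the two axes behave the same; the axis=1 advantage HQQ reports comes from the input-channel structure of real transformer weights, which this sketch does not model.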
A16W4+ axis=1
- Can quantize certain columns of W to 4 bit and the more sensitive columns to 8 bit
  - it may be faster to do a 4 bit matmul over all of W plus a sparse 8 bit matmul for the corrected columns
  - test perf for int4wo + an int8 matmul over n columns
- HQQ+'s end result is an int4wo matmul plus a lora matmul
  - back of envelope numbers suggest roughly a 1/3 slowdown over int4, which is still better than int8
  - test perf for int4wo + lora
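The "4 bit matmul over all of W plus a sparse 8 bit matmul" idea can be sketched as below (numpy; the column indices are hypothetical, the quantizer is simplified per-column min-max, and the "sparse" matmul is just a thin dense one over the selected columns):

```python
import numpy as np

def affine_quant_dequant(w, bits):
    """Per-column min-max affine quant/dequant (illustrative only)."""
    lo = w.min(axis=0, keepdims=True)
    hi = w.max(axis=0, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((w - lo) / scale), 0, levels)
    return q * scale + lo

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
x = rng.normal(size=(4, 64))

cols = np.array([3, 17, 29])                     # hypothetical "sensitive" columns
W4 = affine_quant_dequant(W, 4)                  # 4 bit everywhere
W8_cols = affine_quant_dequant(W[:, cols], 8)    # 8 bit copy of sensitive columns
residual = W8_cols - W4[:, cols]                 # correction on top of the 4 bit matmul

y = x @ W4                 # dense 4 bit matmul over all of W
y[:, cols] += x @ residual # thin residual matmul upgrades those columns to 8 bit
```

The output for the selected columns matches what a straight 8-bit matmul over those columns would give, which is the equivalence the perf test above would rely on.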
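The int4wo + lora shape that HQQ+ ends up with can be sketched by fitting a low-rank correction to the quantization residual. A truncated SVD stands in for HQQ+'s actual lora training, so this only illustrates the structure of the compute, not the method:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))

# Simplified per-tensor 4-bit min-max quant/dequant (not HQQ's solver).
lo, hi = W.min(), W.max()
scale = (hi - lo) / 15.0
W4 = np.clip(np.round((W - lo) / scale), 0, 15) * scale + lo

# Rank-r correction of the quantization residual, playing the role of
# HQQ+'s lora term (r is an illustrative choice).
r = 8
U, S, Vt = np.linalg.svd(W - W4, full_matrices=False)
A = U[:, :r] * S[:r]   # (64, r)
B = Vt[:r]             # (r, 32)

x = rng.normal(size=(4, 64))
y = x @ W4 + (x @ A) @ B   # int4wo matmul + lora matmul
```

The lora path adds two thin matmuls on top of the int4 one, which is where the back-of-envelope ~1/3 slowdown estimate comes from.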
A8W4 axis=1
- test eval accuracy with HQQ axis=1 and compare to the existing version
A16W3 and A16W5
- existing numbers depend on axis=0; how do these numbers look with axis=1?
  - also relevant is whether these numbers scale to Llama 3, since some quantization difficulty has been reported there
  - get numbers for 3/5 bit quantization with axis=1, ideally for Llama 3
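A quick way to sanity-check 3/4/5 bit behavior with axis=1 before running a full eval is to sweep the bit width in a min-max group-wise quantizer (numpy sketch, not the HQQ solver) and compare reconstruction error:

```python
import numpy as np

def nbit_quant_dequant_axis1(w, bits, group_size=32):
    """Affine n-bit quant/dequant, groups along axis=1 (the input dim)."""
    levels = 2 ** bits - 1
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    lo = g.min(-1, keepdims=True)
    hi = g.max(-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((g - lo) / scale), 0, levels)
    return (q * scale + lo).reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))
# Max reconstruction error per bit width; each extra bit roughly halves it.
errs = {b: np.abs(W - nbit_quant_dequant_axis1(W, b)).max() for b in (3, 4, 5)}
```

This only measures weight reconstruction on random data; the actual question above (eval quality on Llama 3) still needs the end-to-end numbers.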