Description
Following up on #2421, I think we should implement a better way to observe at which point of the inference the results of the classical and quantum models start to deviate significantly.
So I'm thinking of adding a simple tool that takes as input 2 `ggml` exported graphs of the same model - one classical and one quantum. The tool evaluates both graphs on the CPU using `ggml` and prints detailed statistical information about the intermediate F32 results after each graph node. For example, each result node that has been given a name will be compared, and we'll print things like `min`, `max`, `avg`, `var`, etc.
I'm hoping that such a tool will make it possible to detect which nodes in the computation require more precision in order to keep the quantization differences small enough, and that it can eventually become an automated way of deciding which tensors require more bits than others.
cc @slaren I know you had similar ideas - we can discuss here how to add such support.
Currently I think the `ggml` graph export/import will be fairly trivial to utilize and will require almost no intervention in the existing `llama.cpp` implementation. The only thing we might have to take into account is to disable the allocator when exporting the graph, so that all results remain available in memory after the computation.