Description
Following up on #2421, I think we should implement a better way to observe at which point of the inference the results of the classical and quantum models start to deviate significantly.
So I'm thinking of adding a simple tool that takes as input 2 `ggml` exported graphs of the same model - one classical and one quantum. The tool evaluates both graphs on the CPU using `ggml` and prints detailed statistical information about the intermediate F32 results after each graph node. For example, each result node that has been given a name will be compared, and we'll print things like `min`, `max`, `avg`, `var`, etc.
I'm hoping that such a tool will make it possible to detect which nodes in the computation require more precision in order to keep the quantization differences small enough, and that it can eventually become an automated way of deciding which tensors require more bits than others.
cc @slaren I know you had similar ideas - we can discuss here how to add such support.
Currently I think the `ggml` graph export/import will be fairly trivial to utilize and will require almost no intervention in the existing `llama.cpp` implementation. The only thing we might have to take into account is to disable the allocator when exporting the graph, so that all results remain available in memory after the computation.