llama : tool for evaluating quantization results per layer #2783

Open

Description

@ggerganov

Following up on #2421, I think we should implement a better way to observe at which point of the inference the results start to deviate significantly between the classical (full-precision) and quantized models.

So I'm thinking of adding a simple tool that takes as input two exported ggml graphs of the same model - one full-precision and one quantized. The tool evaluates both graphs on the CPU using ggml and prints detailed statistics about the intermediate F32 results after each graph node. For example, each result node that has been given a name will be compared, and we'll print things like min, max, avg, var, etc.
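
To make the per-node report concrete, here is a minimal sketch of the comparison helper, assuming each named F32 result is available as a plain float buffer in both graphs (the `node_stats` struct and the `compare_node()` name are hypothetical, not an agreed design):

```c
// Hypothetical per-node comparison: summary statistics of the quantized
// result plus the maximum absolute difference against the reference result.
#include <float.h>
#include <math.h>
#include <stdint.h>

struct node_stats {
    float min, max, avg, var; // statistics of the quantized result
    float max_diff;           // max |ref[i] - qnt[i]| over all elements
};

static struct node_stats compare_node(const float * ref, const float * qnt, int64_t n) {
    struct node_stats s = { FLT_MAX, -FLT_MAX, 0.0f, 0.0f, 0.0f };

    // first pass: min / max / mean / max abs difference
    double sum = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        s.min = qnt[i] < s.min ? qnt[i] : s.min;
        s.max = qnt[i] > s.max ? qnt[i] : s.max;
        sum  += qnt[i];

        const float d = fabsf(ref[i] - qnt[i]);
        s.max_diff = d > s.max_diff ? d : s.max_diff;
    }
    s.avg = (float)(sum / n);

    // second pass: variance around the mean
    double acc = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        const double d = qnt[i] - s.avg;
        acc += d*d;
    }
    s.var = (float)(acc / n);

    return s;
}
```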

I'm hoping that such a tool will make it possible to detect which nodes in the computation require more precision in order to keep the quantization differences small enough, and that it will eventually become an automated way of deciding which tensors require more bits than others.
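
As a rough illustration of how the automated decision could look, here is a hedged sketch that ranks nodes by an error value collected during the comparison pass and flags the worst offenders as candidates for a higher-bit type (the `node_err` struct, the error metric and the threshold are all assumptions):

```c
// Hypothetical ranking step: sort nodes by error (descending) and print
// the ones above a threshold as candidates for more precise quantization.
#include <stdio.h>
#include <stdlib.h>

struct node_err {
    const char * name; // node name, as set via ggml_set_name()
    float        err;  // e.g. RMSE normalized by the RMS of the reference
};

static int cmp_err_desc(const void * a, const void * b) {
    const float ea = ((const struct node_err *) a)->err;
    const float eb = ((const struct node_err *) b)->err;
    return (ea < eb) - (ea > eb); // descending order
}

static void report_worst_nodes(struct node_err * errs, size_t n, float threshold) {
    qsort(errs, n, sizeof(errs[0]), cmp_err_desc);
    for (size_t i = 0; i < n; ++i) {
        if (errs[i].err < threshold) {
            break;
        }
        printf("%-40s err = %.6f  <-- candidate for more bits\n", errs[i].name, errs[i].err);
    }
}
```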

cc @slaren - I know you had similar ideas, so we can discuss here how to add such support.
Currently, I think the ggml graph export/import will be fairly trivial to utilize and will require almost no intervention in the existing llama.cpp implementation. The only thing we might have to take into account is to disable the allocator when exporting the graph, so that all intermediate results remain available in memory after the computation.
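
For reference, a minimal sketch of what the tool's main loop could look like, assuming the existing `ggml_graph_import()` API and that both graphs have the same node order; the exact compute entry point (`ggml_graph_compute_with_ctx()` here) and the thread count handling depend on the ggml version:

```c
#include "ggml.h"

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char ** argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s graph-f32.ggml graph-qnt.ggml\n", argv[0]);
        return 1;
    }

    struct ggml_context * ctx_data_ref = NULL; struct ggml_context * ctx_eval_ref = NULL;
    struct ggml_context * ctx_data_qnt = NULL; struct ggml_context * ctx_eval_qnt = NULL;

    struct ggml_cgraph gf_ref = ggml_graph_import(argv[1], &ctx_data_ref, &ctx_eval_ref);
    struct ggml_cgraph gf_qnt = ggml_graph_import(argv[2], &ctx_data_qnt, &ctx_eval_qnt);

    // eval both graphs on the CPU (the work buffer is allocated in the eval context)
    ggml_graph_compute_with_ctx(ctx_eval_ref, &gf_ref, 4);
    ggml_graph_compute_with_ctx(ctx_eval_qnt, &gf_qnt, 4);

    // walk the nodes and compare every named F32 result
    for (int i = 0; i < gf_ref.n_nodes; ++i) {
        const struct ggml_tensor * t_ref = gf_ref.nodes[i];
        const struct ggml_tensor * t_qnt = gf_qnt.nodes[i];

        if (t_ref->name[0] == '\0' || t_ref->type != GGML_TYPE_F32) {
            continue; // only named F32 results are compared
        }

        const float * x_ref = (const float *) t_ref->data;
        const float * x_qnt = (const float *) t_qnt->data;

        // max abs difference as a first-order signal; min/max/avg/var as above
        float max_diff = 0.0f;
        const int64_t n = ggml_nelements(t_ref);
        for (int64_t j = 0; j < n; ++j) {
            const float d = fabsf(x_ref[j] - x_qnt[j]);
            max_diff = d > max_diff ? d : max_diff;
        }

        printf("node %3d: %-40s max diff = %.6f\n", i, t_ref->name, max_diff);
    }

    ggml_free(ctx_eval_ref); ggml_free(ctx_data_ref);
    ggml_free(ctx_eval_qnt); ggml_free(ctx_data_qnt);

    return 0;
}
```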
