Unexpected performance issue with longer prompts? #938

@catid

Description

Got pretty far through implementing a llama.cpp-based tool that uses a 65B model to do static code analysis, but ran into a wall. The ggml inference engine gets incredibly slow when the past context is long, which is very different from GPU behavior.

The GPU version of my code only gets about 2x slower with a long prompt, but the ggml CPU version is more like 100x slower. This makes my idea unworkable on CPU, which makes me sad.

I was expecting it to take about 1 second per token, so maybe 4 seconds to generate a score between 0 and 1 for each function in the C++ code, which would have been fine.

Maybe this is a performance bug in llama_eval()? The main reason I suspect this is that with the ./main chat app, time is spent per input token as well as per output token, while the HuggingFace LLaMA library practically doesn't care how long the input is: performance is at most about 2x worse.
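
To make the measurement concrete, here is a minimal timing sketch (assuming the llama_eval() signature of ctx, tokens, n_tokens, n_past, n_threads) that feeds a prompt one token at a time and reports how long each call takes as n_past grows:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

#include "llama.h"

// Feed the prompt one token at a time and report how long each llama_eval()
// call takes as n_past grows.
static void time_prompt_eval(llama_context * ctx,
                             const std::vector<llama_token> & prompt,
                             int n_threads) {
    using clock = std::chrono::steady_clock;
    for (int i = 0; i < (int) prompt.size(); ++i) {
        const auto t0 = clock::now();
        // n_past = i: this call attends over all previously evaluated tokens.
        if (llama_eval(ctx, &prompt[i], 1, i, n_threads) != 0) {
            fprintf(stderr, "llama_eval failed at token %d\n", i);
            return;
        }
        const auto t1 = clock::now();
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("n_past = %4d : %8.1f ms\n", i, ms);
    }
}
```

Evaluating one token per call is not how ./main batches the prompt, but it makes the growth in per-call cost with context length easy to see.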

Here's my branch: https://github.com/catid/llamanal.cpp/tree/main/examples/analysis

Test code is here: https://github.com/catid/llamanal.cpp/blob/d9f666a39c1a2e82a34e1508ba4c6121cae7a932/examples/analysis/oracle.cpp#L52
