Description
Got pretty far through implementing a llama.cpp-based tool that uses the 65B model to do static code analysis, but ran into a wall: ggml inference gets incredibly slow when the past context is long, which is very different from GPU behavior.
The GPU version of my code only gets about 2x slower with a long prompt, but the ggml CPU version is roughly 100x slower. That makes the idea unworkable on CPU, which makes me sad.
I was expecting roughly 1 second per token, so maybe 4 seconds to generate a score between 0 and 1 for each function in the C++ code, which would have been fine.
Maybe this is a performance bug in llama_eval()? The main reason I suspect this is that with the ./main chat app, the time grows with the number of input tokens as well as output tokens, while the HuggingFace LLaMA library hardly cares how long the input is: performance is at most about 2x worse.
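
To make the comparison concrete, here is a minimal timing sketch that separates prompt-ingestion cost from per-token generation cost. It assumes the llama.h C API as it exists in my branch (llama_init_from_file, llama_tokenize, llama_eval, etc.); the harness itself, the thread count, and the placeholder prompt are just illustrative, not my actual analysis code.

```cpp
// Timing sketch: measure prompt ingestion vs. single-token eval with long n_past.
#include "llama.h"
#include <chrono>
#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    const char * model_path = argc > 1 ? argv[1] : "models/65B/ggml-model-q4_0.bin";

    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;
    llama_context * ctx = llama_init_from_file(model_path, params);
    if (!ctx) return 1;

    // Tokenize a deliberately long prompt (e.g. a whole C++ function body).
    std::vector<llama_token> tokens(params.n_ctx);
    const char * prompt = "/* ... long function body here ... */";
    const int n_prompt = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);
    if (n_prompt < 0) return 1;

    const int n_threads = 8;

    // Time prompt ingestion: one llama_eval over all prompt tokens with n_past = 0.
    auto t0 = std::chrono::high_resolution_clock::now();
    llama_eval(ctx, tokens.data(), n_prompt, 0, n_threads);
    auto t1 = std::chrono::high_resolution_clock::now();

    // Time a few single-token evals where the past context is already long.
    llama_token tok = llama_token_bos();
    for (int i = 0; i < 4; ++i) {
        llama_eval(ctx, &tok, 1, n_prompt + i, n_threads);
    }
    auto t2 = std::chrono::high_resolution_clock::now();

    std::printf("prompt: %d tokens in %.2f s, then %.2f s per generated token\n",
        n_prompt,
        std::chrono::duration<double>(t1 - t0).count(),
        std::chrono::duration<double>(t2 - t1).count() / 4.0);

    llama_free(ctx);
    return 0;
}
```

On GPU with HuggingFace, the first number barely moves as the prompt grows; with ggml on CPU, it blows up, which is the behavior I'm reporting.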
Here's my branch: https://github.com/catid/llamanal.cpp/tree/main/examples/analysis
Test code is here: https://github.com/catid/llamanal.cpp/blob/d9f666a39c1a2e82a34e1508ba4c6121cae7a932/examples/analysis/oracle.cpp#L52