Description
Got pretty far through implementing a llama.cpp-based tool that uses the 65B model to do static code analysis, but ran into a wall: ggml inference gets incredibly slow when the past context is long, which is very different from GPU behavior.
The GPU version of my code only gets about 2x slower with a long prompt, but the ggml CPU version is roughly 100x slower. That makes the idea unworkable on CPU, which makes me sad.
I was expecting roughly 1 second per token, so maybe 4 seconds to generate a score between 0 and 1 for each function in the C++ code, which would have been fine.
Maybe this is a performance bug in llama_eval()? The main reason I suspect this is that with the ./main chat app, the time grows with the number of input tokens as well as output tokens, while the HuggingFace LLaMA library hardly cares how long the input is: performance is at most about 2x worse.
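
To make the comparison concrete, here is a minimal timing sketch that separates prompt-ingestion cost from per-token generation cost. It assumes the llama.h C API as it exists in my branch (llama_init_from_file, llama_tokenize, llama_eval, etc.); the harness itself, the thread count, and the placeholder prompt are just illustrative, not my actual analysis code.

```cpp
// Timing sketch: measure prompt ingestion vs. single-token eval with long n_past.
#include "llama.h"
#include <chrono>
#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    const char * model_path = argc > 1 ? argv[1] : "models/65B/ggml-model-q4_0.bin";

    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;
    llama_context * ctx = llama_init_from_file(model_path, params);
    if (!ctx) return 1;

    // Tokenize a deliberately long prompt (e.g. a whole C++ function body).
    std::vector<llama_token> tokens(params.n_ctx);
    const char * prompt = "/* ... long function body here ... */";
    const int n_prompt = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);
    if (n_prompt < 0) return 1;

    const int n_threads = 8;

    // Time prompt ingestion: one llama_eval over all prompt tokens with n_past = 0.
    auto t0 = std::chrono::high_resolution_clock::now();
    llama_eval(ctx, tokens.data(), n_prompt, 0, n_threads);
    auto t1 = std::chrono::high_resolution_clock::now();

    // Time a few single-token evals where the past context is already long.
    llama_token tok = llama_token_bos();
    for (int i = 0; i < 4; ++i) {
        llama_eval(ctx, &tok, 1, n_prompt + i, n_threads);
    }
    auto t2 = std::chrono::high_resolution_clock::now();

    std::printf("prompt: %d tokens in %.2f s, then %.2f s per generated token\n",
        n_prompt,
        std::chrono::duration<double>(t1 - t0).count(),
        std::chrono::duration<double>(t2 - t1).count() / 4.0);

    llama_free(ctx);
    return 0;
}
```

On GPU with HuggingFace, the first number barely moves as the prompt grows; with ggml on CPU, it blows up, which is the behavior I'm reporting.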
Here's my branch: https://github.com/catid/llamanal.cpp/tree/main/examples/analysis
Test code is here: https://github.com/catid/llamanal.cpp/blob/d9f666a39c1a2e82a34e1508ba4c6121cae7a932/examples/analysis/oracle.cpp#L52