I'm filing an issue for this to make sure it isn't forgotten. I've been able to work around it, but it seems like a bug to me.
ref #5631 (comment)
Steps to Reproduce
- Download the safetensors model from https://huggingface.co/google/gemma-7b
- Check out llama.cpp commit 15499eb (master should reproduce this as well)
- Convert the model and build the perplexity binary:

```
./convert-hf-to-gguf.py gemma-7b --outfile gemma-7b.f16.gguf --outtype f16
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLAMA_CUBLAS=ON
make -C build perplexity
```

- Run perplexity on a Tesla P40 with `-ngl 2` or above:
```
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99
<snip>
perplexity: tokenizing the input ..
perplexity: tokenization took 974.102 ms
perplexity: calculating perplexity over 142 chunks, batch_size=512
perplexity: 6.52 seconds per pass - ETA 15.43 minutes
[1]nan,
```
And there's no point in running it longer than that: once one chunk's value is NaN, the running average stays NaN.
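For what it's worth, that part is just IEEE 754 NaN propagation through the running mean that perplexity prints. A minimal standalone sketch with made-up per-chunk NLL values (not actual perplexity.cpp code):

```cpp
// Once any chunk's negative log-likelihood is NaN, it propagates through
// every later sum and division, so the running perplexity exp(sum/count)
// stays NaN for the rest of the run.
#include <cmath>
#include <cstdio>

int main() {
    const double chunk_nll[] = { 2.31, NAN, 2.27, 2.40 }; // hypothetical per-chunk values
    double sum = 0.0;
    for (int i = 0; i < 4; ++i) {
        sum += chunk_nll[i];
        printf("[%d]%g,", i + 1, std::exp(sum / (i + 1)));
    }
    printf("\n"); // prints: [1]10.0744,[2]nan,[3]nan,[4]nan,
    return 0;
}
```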
This also occurs with a model quantized to pure F16 from the official GGUF provided by Google.
BUT these NaNs do not occur with `-ngl 1` or with `--no-kv-offload`, so it seems to be related to offloading of the KV cache.
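In case it helps narrow down where the first bad values appear, a small helper along these lines could be dropped into perplexity.cpp to flag the first batch whose logits contain a NaN (the helper name and the demo buffer are mine; in perplexity.cpp you would scan the buffer returned by llama_get_logits()):

```cpp
// Sketch of a debugging helper (not part of llama.cpp): scan a float
// buffer for the first NaN so the failing batch/position can be located.
#include <cmath>
#include <cstdio>

// returns the index of the first NaN in buf, or -1 if there is none
static int find_first_nan(const float * buf, int n) {
    for (int i = 0; i < n; ++i) {
        if (std::isnan(buf[i])) {
            return i;
        }
    }
    return -1;
}

int main() {
    // stand-in for a logits buffer; with llama.cpp this would be the
    // pointer from llama_get_logits() and n_vocab * n_tokens floats
    const float logits[] = { 0.1f, -3.2f, NAN, 0.7f };
    const int idx = find_first_nan(logits, 4);
    if (idx >= 0) {
        fprintf(stderr, "first NaN at index %d\n", idx);
    }
    return 0;
}
```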
cc @JohannesGaessler in case you haven't seen this yet.