Discussed in #210
Originally posted by EdgarasSk May 15, 2023

Edit:
After some investigation I've identified the problem.
When sampling, the top_k value is not being evaluated before being passed into the function:
https://github.com/abetlen/llama-cpp-python/blob/1a13d76c487df1c8560132d10bda62d6e2f4fa93/llama_cpp/llama.py#LL367C1-L367C1
The value is passed through as-is and is not changed to n_vocab when top_k=0.
Why is that a problem?
In the llama.cpp source code we can see that when k=0 and min_keep=1 the sampler always clamps down to a single candidate, so we only ever receive the candidate with the highest logit:
```cpp
void llama_sample_top_k(struct llama_context * ctx, llama_token_data_array * candidates, int k, size_t min_keep) {
    const int64_t t_start_sample_us = ggml_time_us();

    k = std::max(k, (int) min_keep);
    k = std::min(k, (int) candidates->size);

    // Sort scores in descending order
    if (!candidates->sorted) {
        auto comp = [](const llama_token_data & a, const llama_token_data & b) {
            return a.logit > b.logit;
        };
        if (k == (int) candidates->size) {
            std::sort(candidates->data, candidates->data + candidates->size, comp);
        } else {
            std::partial_sort(candidates->data, candidates->data + k, candidates->data + candidates->size, comp);
        }
        candidates->sorted = true;
    }
    candidates->size = k;

    if (ctx) {
        ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}
```
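To make the effect concrete, here is a small Python trace of that clamping. It is illustrative only, not llama.cpp code; the constants are taken from min_keep=1 above and the n_vocab value in the model load logs below.

```python
# Illustrative trace of the clamping in llama_sample_top_k (not actual llama.cpp code).
n_vocab = 32000    # candidates->size for this model (see n_vocab in the load log below)
top_k = 0          # value the bindings pass through unchanged
min_keep = 1       # value used by the sampler in this scenario

k = max(top_k, min_keep)   # -> 1
k = min(k, n_vocab)        # -> 1
print(k)                   # 1: only the single highest-logit token survives,
                           # so top_p and temp have nothing left to act on
```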
This is not the expected behaviour, because a value of k=0 is meant to mark that top_k sampling is disabled, according to the llama.cpp source code.
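A minimal sketch of how the bindings could handle this, assuming the fix is simply to expand a non-positive top_k to the vocabulary size before calling the sampler (the helper name here is hypothetical, not the actual llama-cpp-python implementation):

```python
def resolve_top_k(top_k: int, n_vocab: int) -> int:
    """Hypothetical helper: treat top_k <= 0 as 'disabled' by expanding it to the full vocabulary."""
    return n_vocab if top_k <= 0 else top_k

assert resolve_top_k(0, 32000) == 32000   # disabled -> keep every candidate
assert resolve_top_k(40, 32000) == 40     # ordinary top-k is unchanged
```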
Hello. I've noticed a strange occurrence when trying to generate output. For a given context, the bindings API will always return the same output. Additionally, it seems that the top_p and temp values are being completely ignored.
This is not the case when running llama.cpp itself.
I am using the latest version (v0.1.50) of llama-cpp-python. I've installed it with cuBLAS support via pip and also tried compiling it myself; both produce the same results.
My example script:
```python
from llama_cpp import Llama

llm = Llama(model_path="models/ggml-vic13b-uncensored-q5_1.bin", n_gpu_layers=40)

tokens = llm.tokenize(b"I am driving down a busy street and notice a plane crashing down. What can I do?")

output = b""
count = 0
for token in llm.generate(tokens, top_k=0, top_p=0.73, temp=0.72, repeat_penalty=1.1):
    text = llm.detokenize([token])
    output += text
    count += 1
    if count >= 200 or (token == llm.token_eos()):
        break
print(output.decode())
```
Output example (always the same, regardless of top_p and temp):
$ python test.py
llama.cpp: loading model from models/ggml-vic13b-uncensored-q5_1.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 90.75 KB
llama_model_load_internal: mem required = 11359.05 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9075 MB
llama_init_from_file: kv self size = 400.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
I am in a car, driving down a busy street when I see a plane flying low overhead. Suddenly, it starts to wobble and lose altitude before plummeting towards the ground. I realize that if I don't do something quickly, the plane is going to crash into the side of a building just ahead of me.
I slam on my brakes and swerve my car into an empty parking lot. As I come to a stop, I see the plane hurtling towards the ground, but it looks like it's going to miss the building by just a few feet.
What can I do? Is there anything I can do to help prevent this crash or minimize its impact?
Now, using llama.cpp I always get a different result:
$ ./build/bin/main -m ../models/ggml-vic13b-uncensored-q5_1.bin --top-k 0 --top-p 0.73 --temp 0.72 --repeat-penalty 1.1 -p "I am driving down a busy street and notice a plane crashing down. What can I do?" --gpu-layers 40
main: build = 553 (63d2046)
main: seed = 1684140386
llama.cpp: loading model from ../models/ggml-vic13b-uncensored-q5_1.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 90.75 KB
llama_model_load_internal: mem required = 11359.05 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9075 MB
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 0.730000, typical_p = 1.000000, temp = 0.720000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
I am driving down a busy street and notice a plane crashing down. What can I do?
I have been trained in CPR, but not in emergency vehicle operation or emergency response. I have my phone with me and I know the location of the nearest hospital. What should I do? [end of text]
llama_print_timings: load time = 2668.27 ms
llama_print_timings: sample time = 72.33 ms / 44 runs ( 1.64 ms per token)
llama_print_timings: prompt eval time = 266.67 ms / 21 tokens ( 12.70 ms per token)
llama_print_timings: eval time = 3027.72 ms / 43 runs ( 70.41 ms per token)
llama_print_timings: total time = 5772.50 ms
$ ./build/bin/main -m ../models/ggml-vic13b-uncensored-q5_1.bin --top-k 0 --top-p 0.73 --temp 0.72 --repeat-penalty 1.1 -p "I am driving down a busy street and notice a plane crashing down. What can I do?" --gpu-layers 40
main: build = 553 (63d2046)
main: seed = 1684140427
llama.cpp: loading model from ../models/ggml-vic13b-uncensored-q5_1.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 90.75 KB
llama_model_load_internal: mem required = 11359.05 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9075 MB
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 0.730000, typical_p = 1.000000, temp = 0.720000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
I am driving down a busy street and notice a plane crashing down. What can I do?
This is a scenario that you may have seen in movies or read about in books, but it's not something that happens every day in real life. If you were to find yourself in this situation, what would you do? Here are some steps you can take to help yourself and others:
1. Stop your car as quickly and safely as possible. Do not try to swerve or brake suddenly, as this could cause a collision with other vehicles on the road. Instead, carefully pull over to the side of the road and turn off the engine.
2. Call 911 immediately. Tell the operator that you have witnessed a plane crash and provide your location. Do not hang up until the operator tells you to do so.
3. Look for anyone who may have been on the plane or anyone who has been injured as a result of the crash. If there are any survivors, try to assist them by providing first aid or comforting them until help arrives.
4. Avoid touching or moving anything that may be hazardous, such as debris from the crash or fuel leaks. Do not try to remove anyone from the wreckage unless they are in imminent danger of being further injured.
5. Stay away from the crash site and do not attempt to take any photos or videos. Your first priority should be assisting those who have been affected by the crash.
6. If you have a camera or phone with you, take pictures of the crash scene from a safe distance. This can help emergency responders and investigators piece together what happened.
7. If you are able to, try to remember as much information as possible about the plane crash, such as the location, time, weather conditions, and any other details that may be relevant.
8. After the incident, contact your loved ones to let them know that you are safe. If you were involved in the crash or witnessed it, seek medical attention if necessary.
Remember, in a situation like this, it's essential to stay calm and focused on helping those who have been affected by the plane crash. Your quick thinking and actions could make a difference in saving lives. [end of text]
llama_print_timings: load time = 2636.17 ms
llama_print_timings: sample time = 756.80 ms / 464 runs ( 1.63 ms per token)
llama_print_timings: prompt eval time = 280.92 ms / 21 tokens ( 13.38 ms per token)
llama_print_timings: eval time = 34963.14 ms / 463 runs ( 75.51 ms per token)
llama_print_timings: total time = 38397.73 ms
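If the diagnosis in the edit above is correct, a possible workaround in the example script is to pass the vocabulary size explicitly instead of 0, so top-k is effectively disabled. This is a sketch only and not verified; it modifies the generate call from the script above.

```python
# Workaround sketch (not verified): disable top-k by passing the full vocabulary size instead of 0.
n_vocab = 32000  # from "llama_model_load_internal: n_vocab = 32000" in the logs above
for token in llm.generate(tokens, top_k=n_vocab, top_p=0.73, temp=0.72, repeat_penalty=1.1):
    ...  # same loop body as in the example script
```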
Sorry if this is the wrong place to post something like this; this is my first time posting.