
Setting top_k to 0 does not disable top_k sampling and instead forces it to return only the single highest-logit candidate #220


Closed
EdgarasSk opened this issue May 17, 2023 Discussed in #210 · 1 comment
Labels
bug Something isn't working quality Quality of model output

Comments

@EdgarasSk

EdgarasSk commented May 17, 2023

Discussed in #210

Originally posted by EdgarasSk May 15, 2023
Edit:

After some investigation I've identified the problem.

When sampling, the top_k value is not evaluated before being passed into the sampling function:

https://github.com/abetlen/llama-cpp-python/blob/1a13d76c487df1c8560132d10bda62d6e2f4fa93/llama_cpp/llama.py#LL367C1-L367C1

The value is passed as-is and is not replaced with n_vocab when top_k=0.

Why is that a problem?

In the llama.cpp source we can see that when k=0 and min_keep=1, k is clamped up to min_keep, so the candidate list always collapses to a single entry and we only ever receive the candidate with the highest logit:

void llama_sample_top_k(struct llama_context * ctx, llama_token_data_array * candidates, int k, size_t min_keep) {
    const int64_t t_start_sample_us = ggml_time_us();

    k = std::max(k, (int) min_keep);
    k = std::min(k, (int) candidates->size);

    // Sort scores in descending order
    if (!candidates->sorted) {
        auto comp = [](const llama_token_data & a, const llama_token_data & b) {
            return a.logit > b.logit;
        };
        if (k == (int) candidates->size) {
            std::sort(candidates->data, candidates->data + candidates->size, comp);
        } else {
            std::partial_sort(candidates->data, candidates->data + k, candidates->data + candidates->size, comp);
        }
        candidates->sorted = true;
    }
    candidates->size = k;

    if (ctx) {
        ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}
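
To make the effect concrete, here is a minimal Python sketch of the clamping above; the function and values are illustrative and not code from either project:

def top_k_truncate(logits, k, min_keep=1):
    k = max(k, min_keep)           # k=0 is bumped up to min_keep, i.e. 1
    k = min(k, len(logits))        # never keep more candidates than exist
    return sorted(logits, reverse=True)[:k]

print(top_k_truncate([1.2, 3.4, 0.5, 2.1], k=0))  # [3.4] -- only the highest logit survives
print(top_k_truncate([1.2, 3.4, 0.5, 2.1], k=3))  # [3.4, 2.1, 1.2]

With the candidate list reduced to a single entry, top_p and temperature have nothing left to choose between, which is why the output never varies.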

This is not the expected behaviour, because a value of k=0 is meant to indicate that top_k sampling is disabled, as the llama.cpp source shows:

    fprintf(stderr, "  --top-k N             top-k sampling (default: %d, 0 = disabled)\n", params.top_k);
    ...
    const int32_t top_k           = params.top_k <= 0 ? llama_n_vocab(ctx) : params.top_k;
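
The Python bindings would therefore need to apply the same guard before handing top_k to the top-k sampler. A minimal sketch of that guard, using illustrative names rather than the actual code in llama.py:

# Hypothetical helper mirroring the main.cpp line above:
# treat top_k <= 0 as "keep the whole vocabulary", i.e. top-k disabled.
def normalize_top_k(top_k: int, n_vocab: int) -> int:
    return n_vocab if top_k <= 0 else top_k

print(normalize_top_k(0, 32000))   # 32000 -- the whole vocabulary survives the top-k step
print(normalize_top_k(40, 32000))  # 40 -- an explicit top_k is left untouched

In the meantime, a workaround is to pass top_k explicitly as the vocabulary size (32000 for this model) instead of 0, so that top_p and temp take effect again.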

Hello.

I've noticed a strange occurrence when trying to generate output. For a given context, the bindings API always returns the same output. Additionally, it seems that the top_p and temp values are completely ignored.

This is not the case when running llama.cpp itself.

I am using the latest version (v0.1.50) of llama-cpp-python. I've installed it with cuBLAS support via pip and have also tried compiling it myself; both produce the same results.

My example script:

from llama_cpp import Llama
llm = Llama(model_path="models/ggml-vic13b-uncensored-q5_1.bin", n_gpu_layers=40)
tokens = llm.tokenize(b"I am driving down a busy street and notice a plane crashing down. What can I do?")

output = b""
count = 0
for token in llm.generate(tokens, top_k=0, top_p=0.73, temp=0.72, repeat_penalty=1.1):
    text = llm.detokenize([token])
    output += text

    count += 1
    if count >= 200 or (token == llm.token_eos()):
        break

print(output.decode())

Output example (always the same, regardless of top_p and temp):

$ python test.py
llama.cpp: loading model from models/ggml-vic13b-uncensored-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 11359.05 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9075 MB
llama_init_from_file: kv self size  =  400.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |


I am in a car, driving down a busy street when I see a plane flying low overhead. Suddenly, it starts to wobble and lose altitude before plummeting towards the ground. I realize that if I don't do something quickly, the plane is going to crash into the side of a building just ahead of me.

I slam on my brakes and swerve my car into an empty parking lot. As I come to a stop, I see the plane hurtling towards the ground, but it looks like it's going to miss the building by just a few feet.

What can I do? Is there anything I can do to help prevent this crash or minimize its impact?

Now, using llama.cpp I always get a different result:

$ ./build/bin/main -m ../models/ggml-vic13b-uncensored-q5_1.bin --top-k 0 --top-p 0.73 --temp 0.72 --repeat-penalty 1.1 -p "I am driving down a busy street and notice a plane crashing down. What can I do?" --gpu-layers 40
main: build = 553 (63d2046)
main: seed  = 1684140386
llama.cpp: loading model from ../models/ggml-vic13b-uncensored-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 11359.05 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9075 MB
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 0.730000, typical_p = 1.000000, temp = 0.720000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 I am driving down a busy street and notice a plane crashing down. What can I do?

I have been trained in CPR, but not in emergency vehicle operation or emergency response. I have my phone with me and I know the location of the nearest hospital. What should I do? [end of text]

llama_print_timings:        load time =  2668.27 ms
llama_print_timings:      sample time =    72.33 ms /    44 runs   (    1.64 ms per token)
llama_print_timings: prompt eval time =   266.67 ms /    21 tokens (   12.70 ms per token)
llama_print_timings:        eval time =  3027.72 ms /    43 runs   (   70.41 ms per token)
llama_print_timings:       total time =  5772.50 ms
$ ./build/bin/main -m ../models/ggml-vic13b-uncensored-q5_1.bin --top-k 0 --top-p 0.73 --temp 0.72 --repeat-penalty 1.1 -p "I am driving down a busy street and notice a plane crashing down. What can I do?" --gpu-layers 40
main: build = 553 (63d2046)
main: seed  = 1684140427
llama.cpp: loading model from ../models/ggml-vic13b-uncensored-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 11359.05 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9075 MB
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 0.730000, typical_p = 1.000000, temp = 0.720000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 I am driving down a busy street and notice a plane crashing down. What can I do?

This is a scenario that you may have seen in movies or read about in books, but it's not something that happens every day in real life. If you were to find yourself in this situation, what would you do? Here are some steps you can take to help yourself and others:

1. Stop your car as quickly and safely as possible. Do not try to swerve or brake suddenly, as this could cause a collision with other vehicles on the road. Instead, carefully pull over to the side of the road and turn off the engine.
2. Call 911 immediately. Tell the operator that you have witnessed a plane crash and provide your location. Do not hang up until the operator tells you to do so.
3. Look for anyone who may have been on the plane or anyone who has been injured as a result of the crash. If there are any survivors, try to assist them by providing first aid or comforting them until help arrives.
4. Avoid touching or moving anything that may be hazardous, such as debris from the crash or fuel leaks. Do not try to remove anyone from the wreckage unless they are in imminent danger of being further injured.
5. Stay away from the crash site and do not attempt to take any photos or videos. Your first priority should be assisting those who have been affected by the crash.
6. If you have a camera or phone with you, take pictures of the crash scene from a safe distance. This can help emergency responders and investigators piece together what happened.
7. If you are able to, try to remember as much information as possible about the plane crash, such as the location, time, weather conditions, and any other details that may be relevant.
8. After the incident, contact your loved ones to let them know that you are safe. If you were involved in the crash or witnessed it, seek medical attention if necessary.

Remember, in a situation like this, it's essential to stay calm and focused on helping those who have been affected by the plane crash. Your quick thinking and actions could make a difference in saving lives. [end of text]

llama_print_timings:        load time =  2636.17 ms
llama_print_timings:      sample time =   756.80 ms /   464 runs   (    1.63 ms per token)
llama_print_timings: prompt eval time =   280.92 ms /    21 tokens (   13.38 ms per token)
llama_print_timings:        eval time = 34963.14 ms /   463 runs   (   75.51 ms per token)
llama_print_timings:       total time = 38397.73 ms

Sorry if this is the wrong place to post something like this; it's my first time posting.

@abetlen
Owner

abetlen commented May 17, 2023

@EdgarasSk thanks for reporting this and for the thorough explanation. Yes, it looks like I missed this line: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#LL383C99-L383C99

Should have this fixed in the next release.

@gjmulder gjmulder added bug Something isn't working quality Quality of model output labels May 17, 2023
carmonajca added a commit to carmonajca/llama-cpp-python that referenced this issue May 17, 2023
* Bugfix: Ensure logs are printed when streaming

* Update llama.cpp

* Update llama.cpp

* Add missing tfs_z paramter

* Bump version

* Fix docker command

* Revert "llama_cpp server: prompt is a string". Closes abetlen#187

This reverts commit b9098b0.

* Only support generating one prompt at a time.

* Allow model to tokenize strings longer than context length and set add_bos. Closes abetlen#92

* Update llama.cpp

* Bump version

* Update llama.cpp

* Fix obscure Wndows DLL issue. Closes abetlen#208

* chore: add note for Mac m1 installation

* Add winmode arg only on windows if python version supports it

* Bump mkdocs-material from 9.1.11 to 9.1.12


* Update README.md

Fix typo.

* Fix CMakeLists.txt

* Add sampling defaults for generate

* Update llama.cpp

* Add model_alias option to override model_path in completions. Closes abetlen#39

* Update variable name

* Update llama.cpp

* Fix top_k value. Closes abetlen#220

* Fix last_n_tokens_size

* Implement penalize_nl

* Format

* Update token checks

* Move docs link up

* Fixd CUBLAS dll load issue in Windows

* Check for CUDA_PATH before adding
