
Setting top_k to 0 does not disable top_k sampling and instead forces it to return only the single highest-logit candidate #220


Closed
EdgarasSk opened this issue May 17, 2023 Discussed in #210 · 1 comment
Labels
bug Something isn't working quality Quality of model output

Comments

@EdgarasSk

EdgarasSk commented May 17, 2023

Discussed in #210

Originally posted by EdgarasSk May 15, 2023
Edit:

After some investigation I've identified the problem.

When sampling, the top_k value is not evaluated before being passed into the sampling function:

https://github.com/abetlen/llama-cpp-python/blob/1a13d76c487df1c8560132d10bda62d6e2f4fa93/llama_cpp/llama.py#LL367C1-L367C1

The value is passed as-is and is not replaced with n_vocab when top_k=0.

Why is that a problem?

In the llama.cpp source we can see that when k=0 and min_keep=1, k is clamped up to min_keep, so the candidate list always collapses to a single entry and we only ever receive the candidate with the highest logit:

void llama_sample_top_k(struct llama_context * ctx, llama_token_data_array * candidates, int k, size_t min_keep) {
    const int64_t t_start_sample_us = ggml_time_us();

    k = std::max(k, (int) min_keep);
    k = std::min(k, (int) candidates->size);

    // Sort scores in descending order
    if (!candidates->sorted) {
        auto comp = [](const llama_token_data & a, const llama_token_data & b) {
            return a.logit > b.logit;
        };
        if (k == (int) candidates->size) {
            std::sort(candidates->data, candidates->data + candidates->size, comp);
        } else {
            std::partial_sort(candidates->data, candidates->data + k, candidates->data + candidates->size, comp);
        }
        candidates->sorted = true;
    }
    candidates->size = k;

    if (ctx) {
        ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}
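
To make the effect concrete, here is a minimal Python sketch of the clamping above; the function and values are illustrative and not code from either project:

def top_k_truncate(logits, k, min_keep=1):
    k = max(k, min_keep)           # k=0 is bumped up to min_keep, i.e. 1
    k = min(k, len(logits))        # never keep more candidates than exist
    return sorted(logits, reverse=True)[:k]

print(top_k_truncate([1.2, 3.4, 0.5, 2.1], k=0))  # [3.4] -- only the highest logit survives
print(top_k_truncate([1.2, 3.4, 0.5, 2.1], k=3))  # [3.4, 2.1, 1.2]

With the candidate list reduced to a single entry, top_p and temperature have nothing left to choose between, which is why the output never varies.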

This is not the expected behaviour, because a value of k=0 is meant to indicate that top_k sampling is disabled, as the llama.cpp source shows:

    fprintf(stderr, "  --top-k N             top-k sampling (default: %d, 0 = disabled)\n", params.top_k);
    ...
    const int32_t top_k           = params.top_k <= 0 ? llama_n_vocab(ctx) : params.top_k;
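
The Python bindings would therefore need to apply the same guard before handing top_k to the top-k sampler. A minimal sketch of that guard, using illustrative names rather than the actual code in llama.py:

# Hypothetical helper mirroring the main.cpp line above:
# treat top_k <= 0 as "keep the whole vocabulary", i.e. top-k disabled.
def normalize_top_k(top_k: int, n_vocab: int) -> int:
    return n_vocab if top_k <= 0 else top_k

print(normalize_top_k(0, 32000))   # 32000 -- the whole vocabulary survives the top-k step
print(normalize_top_k(40, 32000))  # 40 -- an explicit top_k is left untouched

In the meantime, a workaround is to pass top_k explicitly as the vocabulary size (32000 for this model) instead of 0, so that top_p and temp take effect again.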

Hello.

I've noticed a strange occurrence when trying to generate output. For a given context, the bindings API always returns the same output. Additionally, it seems that the top_p and temp values are completely ignored.

This is not the case when running llama.cpp itself.

I am using the latest version (v0.1.50) of llama-cpp-python. I've installed it with cuBLAS support via pip and have also tried compiling it myself; both produce the same results.

My example script:

from llama_cpp import Llama
llm = Llama(model_path="models/ggml-vic13b-uncensored-q5_1.bin", n_gpu_layers=40)
tokens = llm.tokenize(b"I am driving down a busy street and notice a plane crashing down. What can I do?")

output = b""
count = 0
for token in llm.generate(tokens, top_k=0, top_p=0.73, temp=0.72, repeat_penalty=1.1):
    text = llm.detokenize([token])
    output += text

    count += 1
    if count >= 200 or (token == llm.token_eos()):
        break

print(output.decode())

Output example (always the same, regardless of top_p and temp):

$ python test.py
llama.cpp: loading model from models/ggml-vic13b-uncensored-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 11359.05 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9075 MB
llama_init_from_file: kv self size  =  400.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |


I am in a car, driving down a busy street when I see a plane flying low overhead. Suddenly, it starts to wobble and lose altitude before plummeting towards the ground. I realize that if I don't do something quickly, the plane is going to crash into the side of a building just ahead of me.

I slam on my brakes and swerve my car into an empty parking lot. As I come to a stop, I see the plane hurtling towards the ground, but it looks like it's going to miss the building by just a few feet.

What can I do? Is there anything I can do to help prevent this crash or minimize its impact?

Now, using llama.cpp I always get a different result:

$ ./build/bin/main -m ../models/ggml-vic13b-uncensored-q5_1.bin --top-k 0 --top-p 0.73 --temp 0.72 --repeat-penalty 1.1 -p "I am driving down a busy street and notice a plane crashing down. What can I do?" --gpu-layers 40
main: build = 553 (63d2046)
main: seed  = 1684140386
llama.cpp: loading model from ../models/ggml-vic13b-uncensored-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 11359.05 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9075 MB
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 0.730000, typical_p = 1.000000, temp = 0.720000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 I am driving down a busy street and notice a plane crashing down. What can I do?

I have been trained in CPR, but not in emergency vehicle operation or emergency response. I have my phone with me and I know the location of the nearest hospital. What should I do? [end of text]

llama_print_timings:        load time =  2668.27 ms
llama_print_timings:      sample time =    72.33 ms /    44 runs   (    1.64 ms per token)
llama_print_timings: prompt eval time =   266.67 ms /    21 tokens (   12.70 ms per token)
llama_print_timings:        eval time =  3027.72 ms /    43 runs   (   70.41 ms per token)
llama_print_timings:       total time =  5772.50 ms
$ ./build/bin/main -m ../models/ggml-vic13b-uncensored-q5_1.bin --top-k 0 --top-p 0.73 --temp 0.72 --repeat-penalty 1.1 -p "I am driving down a busy street and notice a plane crashing down. What can I do?" --gpu-layers 40
main: build = 553 (63d2046)
main: seed  = 1684140427
llama.cpp: loading model from ../models/ggml-vic13b-uncensored-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 11359.05 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9075 MB
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 0.730000, typical_p = 1.000000, temp = 0.720000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 I am driving down a busy street and notice a plane crashing down. What can I do?

This is a scenario that you may have seen in movies or read about in books, but it's not something that happens every day in real life. If you were to find yourself in this situation, what would you do? Here are some steps you can take to help yourself and others:

1. Stop your car as quickly and safely as possible. Do not try to swerve or brake suddenly, as this could cause a collision with other vehicles on the road. Instead, carefully pull over to the side of the road and turn off the engine.
2. Call 911 immediately. Tell the operator that you have witnessed a plane crash and provide your location. Do not hang up until the operator tells you to do so.
3. Look for anyone who may have been on the plane or anyone who has been injured as a result of the crash. If there are any survivors, try to assist them by providing first aid or comforting them until help arrives.
4. Avoid touching or moving anything that may be hazardous, such as debris from the crash or fuel leaks. Do not try to remove anyone from the wreckage unless they are in imminent danger of being further injured.
5. Stay away from the crash site and do not attempt to take any photos or videos. Your first priority should be assisting those who have been affected by the crash.
6. If you have a camera or phone with you, take pictures of the crash scene from a safe distance. This can help emergency responders and investigators piece together what happened.
7. If you are able to, try to remember as much information as possible about the plane crash, such as the location, time, weather conditions, and any other details that may be relevant.
8. After the incident, contact your loved ones to let them know that you are safe. If you were involved in the crash or witnessed it, seek medical attention if necessary.

Remember, in a situation like this, it's essential to stay calm and focused on helping those who have been affected by the plane crash. Your quick thinking and actions could make a difference in saving lives. [end of text]

llama_print_timings:        load time =  2636.17 ms
llama_print_timings:      sample time =   756.80 ms /   464 runs   (    1.63 ms per token)
llama_print_timings: prompt eval time =   280.92 ms /    21 tokens (   13.38 ms per token)
llama_print_timings:        eval time = 34963.14 ms /   463 runs   (   75.51 ms per token)
llama_print_timings:       total time = 38397.73 ms

Sorry if this is the wrong place to post something like this; it's my first time posting.

@abetlen
Owner

abetlen commented May 17, 2023

@EdgarasSk thanks for reporting this and for the thorough explanation. Yes, it looks like I missed this line: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#LL383C99-L383C99

Should have this fixed in the next release.

@gjmulder gjmulder added bug Something isn't working quality Quality of model output labels May 17, 2023
carmonajca added a commit to carmonajca/llama-cpp-python that referenced this issue May 17, 2023
* Bugfix: Ensure logs are printed when streaming

* Update llama.cpp

* Update llama.cpp

* Add missing tfs_z paramter

* Bump version

* Fix docker command

* Revert "llama_cpp server: prompt is a string". Closes abetlen#187

This reverts commit b9098b0.

* Only support generating one prompt at a time.

* Allow model to tokenize strings longer than context length and set add_bos. Closes abetlen#92

* Update llama.cpp

* Bump version

* Update llama.cpp

* Fix obscure Wndows DLL issue. Closes abetlen#208

* chore: add note for Mac m1 installation

* Add winmode arg only on windows if python version supports it

* Bump mkdocs-material from 9.1.11 to 9.1.12


* Update README.md

Fix typo.

* Fix CMakeLists.txt

* Add sampling defaults for generate

* Update llama.cpp

* Add model_alias option to override model_path in completions. Closes abetlen#39

* Update variable name

* Update llama.cpp

* Fix top_k value. Closes abetlen#220

* Fix last_n_tokens_size

* Implement penalize_nl

* Format

* Update token checks

* Move docs link up

* Fixd CUBLAS dll load issue in Windows

* Check for CUDA_PATH before adding
