
Prompt cache output mismatch #2479

Closed

Description


With temperature 0, the output differs between a regular run and a run with the prompt cache option enabled.

➜  llama.cpp git:(master) ✗ cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -G Ninja && cmake --build build && ./build/bin/main -m ~/models/llama-2-7b.ggmlv3.q4_0.bin -e -p 'Who are you' --temp 0 --repeat_last_n 0 --prompt-cache prompt-cache --seed 282
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /p/i/llama.cpp/build
ninja: no work to do.
main: build = 937 (86aeb27)
main: seed  = 282
llama.cpp: loading model from /home/i/models/llama-2-7b.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 3615.73 MB (+  256.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size =   71.84 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from 'prompt-cache'
main: session file does not exist, will create
sampling: repeat_last_n = 0, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 Who are you? What is your name?
 Unterscheidung der Buchstaben.

### 1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1
➜  llama.cpp git:(master) ✗ cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -G Ninja && cmake --build build && ./build/bin/main -m ~/models/llama-2-7b.ggmlv3.q4_0.bin -e -p 'Who are you' --temp 0 --repeat_last_n 0 --prompt-cache prompt-cache --seed 282
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /p/i/llama.cpp/build
ninja: no work to do.
main: build = 937 (86aeb27)
main: seed  = 282
llama.cpp: loading model from /home/i/models/llama-2-7b.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 3615.73 MB (+  256.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size =   71.84 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from 'prompt-cache'
main: loaded a session with prompt size of 4 tokens
main: session file has exact match for prompt!
sampling: repeat_last_n = 0, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 Who are you hopefully?
 nobody knows.
Who are you?
nobody knows.
Who are you?
nobody knows.
Who are you?

Activity

ivanstepanovftw (Collaborator, Author) commented on Aug 1, 2023

Same issue as in #1257. Again, the hotfix is the same:

- llama_save_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());
+ llama_save_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size() - 1);
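
For reference, here is a sketch of how the patched call reads in examples/main/main.cpp (the guard around it is elided here, and the comment is only my reading of why saving one token short re-aligns the outputs):

// Save the prompt cache one token short. On the next run the loaded session
// then matches all but the last prompt token, so main still evaluates that
// final token and samples from freshly computed logits, the same state a run
// without --prompt-cache would be in (same workaround as in #1257).
llama_save_session_file(ctx, path_session.c_str(),
                        session_tokens.data(), session_tokens.size() - 1);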
ivanstepanovftw (Collaborator, Author) commented on Aug 1, 2023

This issue was caused by #1824.

github-actions (Contributor) commented on Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.


Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), stale

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Participants

@Green-Sky, @ivanstepanovftw
