
Compress saved llama state #1247


Closed

Conversation

ivanstepanovftw (Collaborator) commented Apr 30, 2023

Compress the saved llama state using RLE compression. I have also tried lz4 and lzma, but there appears to be a lot of entropy in the file besides the zeroes, so RLE is the best choice for small to mid-size prompts. The uncompressed size is 259M, while the RLE-compressed state is 4.4M.
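
A minimal sketch of the zero-run RLE idea (an illustrative encoding, not the exact format used by this PR): a zero byte is stored as a {0x00, run-length} pair, and any other byte is copied verbatim.

#include <cstddef>
#include <cstdint>
#include <vector>

// Encode: collapse runs of zero bytes into {0x00, run} pairs (run = 1..255),
// copy non-zero bytes through unchanged.
static std::vector<uint8_t> rle_zero_encode(const uint8_t * data, size_t n) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < n; ) {
        if (data[i] == 0) {
            size_t run = 1;
            while (i + run < n && data[i + run] == 0 && run < 255) {
                run++;
            }
            out.push_back(0x00);
            out.push_back((uint8_t) run);
            i += run;
        } else {
            out.push_back(data[i++]);
        }
    }
    return out;
}

// Decode: expand each {0x00, run} pair back into `run` zero bytes.
static std::vector<uint8_t> rle_zero_decode(const uint8_t * data, size_t n) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < n; ) {
        if (data[i] == 0) {
            if (i + 1 >= n) break;              // truncated input
            out.insert(out.end(), data[i + 1], 0);
            i += 2;
        } else {
            out.push_back(data[i++]);
        }
    }
    return out;
}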

Some parts of the code are generated and need to be refactored.

If someone is working on implementing a prompt cache for examples/main, please let me know.

For our largest prompt, reason-act, the compression results are as follows:

-rw-r--r--.  1 i i 259M Apr 30 03:35 dump_state.bin
-rw-r--r--.  1 i i 116M Apr 30 03:35 dump_state.bin.9.lz4
-rw-r--r--.  1 i i 116M Apr 30 03:35 dump_state.bin.lz4
-rw-r--r--.  1 i i 103M Apr 30 03:36 dump_state.bin.lzma2.9.7z
-rw-r--r--.  1 i i 119M Apr 30 03:35 dump_state.bin.rle

Runtime results:


/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/save-load-state -m models/7B/ggml-model-q4_0_0.bin
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

You run in a loop of Thought, Action, Observation.
At the end of the loop either Answer or restate your Thought and Action.
Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of these actions available to you:
- calculate[python math expression]
Observation will be the result of running those actions


Question: What is 4 * 7 / 3?
Thought: Do I need to use an action? Yes, I use calculate to do math
Action: calculate[4 * 7 / 3]
Observation: 9.3333333333
Thought: Do I need to use an action? No, have the result
Answer: The calculate tool says it is 9.3333333333
Question: What is capital of france?
Thought: Do I need to use an action? No, I know the answer
Answer: Paris is the capital of France
Question: Why


llama_print_timings:        load time = 15344.15 ms
llama_print_timings:      sample time =     2.17 ms /     1 runs   (    2.17 ms per run)
llama_print_timings: prompt eval time = 15065.48 ms /   228 tokens (   66.08 ms per token)
llama_print_timings:        eval time =   263.82 ms /     1 runs   (  263.82 ms per run)
llama_print_timings:       total time = 17581.37 ms
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB
 Why


llama_print_timings:        load time =  5089.16 ms
llama_print_timings:      sample time =     2.08 ms /     1 runs   (    2.08 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =   245.59 ms /     1 runs   (  245.59 ms per run)
llama_print_timings:       total time =  5089.18 ms

ivanstepanovftw changed the title from "Compress llama state" to "Compress saved llama state" on Apr 30, 2023
ejones (Collaborator) commented Apr 30, 2023

Hi! I recently landed prompt save and restore for main in #1169. That includes a session file format and API, and RLE is definitely a logical next step for that! I was also exploring delta encoding, since only a small portion of the kv cache is modified on each eval (which I think corresponds to the per state memory).
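
A minimal sketch of the delta idea (names are illustrative, not the #1169 session API), assuming two equally sized snapshots of the kv cache taken before and after an eval; since most bytes are unchanged, the XOR delta is mostly zeroes and compresses well with something like zero-run RLE.

#include <cstddef>
#include <cstdint>
#include <vector>

// XOR two kv cache snapshots; applying the same XOR to `prev` restores `curr`.
static std::vector<uint8_t> kv_delta(const std::vector<uint8_t> & prev,
                                     const std::vector<uint8_t> & curr) {
    std::vector<uint8_t> delta(curr.size());
    for (size_t i = 0; i < curr.size(); i++) {
        delta[i] = curr[i] ^ (i < prev.size() ? prev[i] : 0);
    }
    return delta;
}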

ggerganov (Member) commented

It's better to just save the computed KV data (i.e. up to n_past) and ignore the zeroes altogether.
This will be smaller and faster than any compression algorithm.
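
A minimal sketch of that idea, under the simplifying assumption of a contiguous per-layer K buffer laid out as n_ctx columns of n_embd floats (the actual ggml tensor layout may differ; this only illustrates writing the first n_past columns and skipping the untouched tail).

#include <cstdio>

// Write only the first n_past of n_ctx columns of one layer's K buffer;
// the remaining columns have never been written to and stay zero.
static void save_k_layer(FILE * f, const float * k_layer, int n_embd, int n_past) {
    const size_t used = (size_t) n_past * n_embd * sizeof(float);
    fwrite(k_layer, 1, used, f);
}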

ivanstepanovftw (Collaborator, Author) commented

But the dumped kv_self is not that uniform; it has some islands of non-zero data:

[screenshot: hex view of the dumped state showing islands of non-zero data]

ImHex pattern:

//#pragma MIME image/bmp
//#pragma endian little

#define LLAMA_MAX_RNG_STATE (64*1024)

struct StateMem {
    u64 rng_size;                        // RNG state size
    u8 rng_buf[LLAMA_MAX_RNG_STATE];     // RNG state buffer

    u64 logits_cap;                      // logits capacity
    u64 logits_size;                     // logits actually used
    float logits[logits_cap];

    u64 embedding_size;
    float embedding[embedding_size];

    u64 kv_size;                         // kv cache size in bytes
    u32 kv_ntok;                         // number of tokens in the kv cache
    u64 kv_self[kv_size/8];              // kv cache dump viewed as 64-bit words
};

struct State {
    u64 state_size;
    StateMem state_mem;
    u64 n_past;
    u64 last_n_tokens_size;
    u32 last_n_tokens[last_n_tokens_size];
};

State state @ 0x00;

Green-Sky (Collaborator) commented Apr 30, 2023

> It's better to just save the computed KV data (i.e. up to n_past) and ignore the zeroes altogether.
> This will be smaller and faster than any compression algorithm.

this

> But the dumped kv_self is not that uniform; it has some islands of non-zero data:

Yes, k and v are not interleaved, so one of them has to start in the middle.
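
A rough illustration of where such islands would sit, assuming (hypothetically) that k fills the first half of the kv_self buffer and v the second, with each half split into per-layer slices that are only written up to n_past tokens; this is not the actual ggml tensor layout, only a sketch of why the dump is zero-heavy but not one contiguous block.

#include <cstdio>

int main() {
    const long long kv_size = 256LL * 1024 * 1024;  // the 256 MB "kv self size" from the log above
    const int n_layer = 32;
    const int n_ctx   = 512;
    const int n_past  = 228;                        // tokens evaluated in the run above

    const long long half  = kv_size / 2;            // assumed: k half, then v half
    const long long slice = half / n_layer;         // assumed: one slice per layer
    const long long used  = slice * n_past / n_ctx; // filled prefix of each slice

    for (int h = 0; h < 2; h++) {
        for (int il = 0; il < n_layer; il++) {
            const long long off = h * half + il * slice;
            printf("%c layer %2d: non-zero island at [%lld, %lld)\n",
                   h == 0 ? 'k' : 'v', il, off, off + used);
        }
    }
    return 0;
}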
