
Compress saved llama state #1247


Closed

Conversation

ivanstepanovftw (Collaborator) commented Apr 30, 2023

Compress the saved llama state using RLE compression. I have also tried lz4 and lzma, but there appears to be a lot of entropy in the file besides the zeroes, so RLE is the best choice for small to mid-size prompts. The uncompressed size is 259M, while the RLE-compressed state is 4.4M.
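
A minimal sketch of the zero-run RLE idea (an illustrative encoding, not the exact format used by this PR): a zero byte is stored as a {0x00, run-length} pair, and any other byte is copied verbatim.

#include <cstddef>
#include <cstdint>
#include <vector>

// Encode: collapse runs of zero bytes into {0x00, run} pairs (run = 1..255),
// copy non-zero bytes through unchanged.
static std::vector<uint8_t> rle_zero_encode(const uint8_t * data, size_t n) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < n; ) {
        if (data[i] == 0) {
            size_t run = 1;
            while (i + run < n && data[i + run] == 0 && run < 255) {
                run++;
            }
            out.push_back(0x00);
            out.push_back((uint8_t) run);
            i += run;
        } else {
            out.push_back(data[i++]);
        }
    }
    return out;
}

// Decode: expand each {0x00, run} pair back into `run` zero bytes.
static std::vector<uint8_t> rle_zero_decode(const uint8_t * data, size_t n) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < n; ) {
        if (data[i] == 0) {
            if (i + 1 >= n) break;              // truncated input
            out.insert(out.end(), data[i + 1], 0);
            i += 2;
        } else {
            out.push_back(data[i++]);
        }
    }
    return out;
}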

Some parts of the code are generated and need to be refactored.

If someone is working on implementing a prompt cache for examples/main, please let me know.

For our largest prompt, reason-act, the compression results are as follows:

-rw-r--r--.  1 i i 259M Apr 30 03:35 dump_state.bin
-rw-r--r--.  1 i i 116M Apr 30 03:35 dump_state.bin.9.lz4
-rw-r--r--.  1 i i 116M Apr 30 03:35 dump_state.bin.lz4
-rw-r--r--.  1 i i 103M Apr 30 03:36 dump_state.bin.lzma2.9.7z
-rw-r--r--.  1 i i 119M Apr 30 03:35 dump_state.bin.rle

Runtime results:


/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/save-load-state -m models/7B/ggml-model-q4_0_0.bin
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

You run in a loop of Thought, Action, Observation.
At the end of the loop either Answer or restate your Thought and Action.
Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of these actions available to you:
- calculate[python math expression]
Observation will be the result of running those actions


Question: What is 4 * 7 / 3?
Thought: Do I need to use an action? Yes, I use calculate to do math
Action: calculate[4 * 7 / 3]
Observation: 9.3333333333
Thought: Do I need to use an action? No, have the result
Answer: The calculate tool says it is 9.3333333333
Question: What is capital of france?
Thought: Do I need to use an action? No, I know the answer
Answer: Paris is the capital of France
Question: Why


llama_print_timings:        load time = 15344.15 ms
llama_print_timings:      sample time =     2.17 ms /     1 runs   (    2.17 ms per run)
llama_print_timings: prompt eval time = 15065.48 ms /   228 tokens (   66.08 ms per token)
llama_print_timings:        eval time =   263.82 ms /     1 runs   (  263.82 ms per run)
llama_print_timings:       total time = 17581.37 ms
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB
 Why


llama_print_timings:        load time =  5089.16 ms
llama_print_timings:      sample time =     2.08 ms /     1 runs   (    2.08 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =   245.59 ms /     1 runs   (  245.59 ms per run)
llama_print_timings:       total time =  5089.18 ms

ivanstepanovftw changed the title from "Compress llama state" to "Compress saved llama state" on Apr 30, 2023
ejones (Collaborator) commented Apr 30, 2023

Hi! I recently landed prompt save and restore for main in #1169. That includes a session file format and API, and RLE is definitely a logical next step for that! I was also exploring delta encoding, since only a small portion of the kv cache is modified on each eval (which I think corresponds to the per state memory).
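
A minimal sketch of the delta idea (names are illustrative, not the #1169 session API), assuming two equally sized snapshots of the kv cache taken before and after an eval; since most bytes are unchanged, the XOR delta is mostly zeroes and compresses well with something like zero-run RLE.

#include <cstddef>
#include <cstdint>
#include <vector>

// XOR two kv cache snapshots; applying the same XOR to `prev` restores `curr`.
static std::vector<uint8_t> kv_delta(const std::vector<uint8_t> & prev,
                                     const std::vector<uint8_t> & curr) {
    std::vector<uint8_t> delta(curr.size());
    for (size_t i = 0; i < curr.size(); i++) {
        delta[i] = curr[i] ^ (i < prev.size() ? prev[i] : 0);
    }
    return delta;
}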

ggerganov (Member) commented

It's better to just save the computed KV data (i.e. up to n_past) and ignore the zeroes altogether.
This will be smaller and faster than any compression algorithm.
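
A minimal sketch of that idea, under the simplifying assumption of a contiguous per-layer K buffer laid out as n_ctx columns of n_embd floats (the actual ggml tensor layout may differ; this only illustrates writing the first n_past columns and skipping the untouched tail).

#include <cstdio>

// Write only the first n_past of n_ctx columns of one layer's K buffer;
// the remaining columns have never been written to and stay zero.
static void save_k_layer(FILE * f, const float * k_layer, int n_embd, int n_past) {
    const size_t used = (size_t) n_past * n_embd * sizeof(float);
    fwrite(k_layer, 1, used, f);
}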

ivanstepanovftw (Collaborator, Author) commented

But the dumped kv_self is not that uniform; it has some islands of non-zero data:

[screenshot: hex view of the dumped state showing islands of non-zero data]

ImHex pattern:

//#pragma MIME image/bmp
//#pragma endian little

#define LLAMA_MAX_RNG_STATE (64*1024)

struct StateMem {
    u64 rng_size;                        // RNG state size
    u8 rng_buf[LLAMA_MAX_RNG_STATE];     // RNG state buffer

    u64 logits_cap;                      // logits capacity
    u64 logits_size;                     // logits actually used
    float logits[logits_cap];

    u64 embedding_size;
    float embedding[embedding_size];

    u64 kv_size;                         // kv cache size in bytes
    u32 kv_ntok;                         // number of tokens in the kv cache
    u64 kv_self[kv_size/8];              // kv cache dump viewed as 64-bit words
};

struct State {
    u64 state_size;
    StateMem state_mem;
    u64 n_past;
    u64 last_n_tokens_size;
    u32 last_n_tokens[last_n_tokens_size];
};

State state @ 0x00;

Green-Sky (Collaborator) commented Apr 30, 2023

> It's better to just save the computed KV data (i.e. up to n_past) and ignore the zeroes altogether.
> This will be smaller and faster than any compression algorithm.

this

> But the dumped kv_self is not that uniform; it has some islands of non-zero data:

Yes, k and v are not interleaved, so one of them has to start in the middle.
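
A rough illustration of where such islands would sit, assuming (hypothetically) that k fills the first half of the kv_self buffer and v the second, with each half split into per-layer slices that are only written up to n_past tokens; this is not the actual ggml tensor layout, only a sketch of why the dump is zero-heavy but not one contiguous block.

#include <cstdio>

int main() {
    const long long kv_size = 256LL * 1024 * 1024;  // the 256 MB "kv self size" from the log above
    const int n_layer = 32;
    const int n_ctx   = 512;
    const int n_past  = 228;                        // tokens evaluated in the run above

    const long long half  = kv_size / 2;            // assumed: k half, then v half
    const long long slice = half / n_layer;         // assumed: one slice per layer
    const long long used  = slice * n_past / n_ctx; // filled prefix of each slice

    for (int h = 0; h < 2; h++) {
        for (int il = 0; il < n_layer; il++) {
            const long long off = h * half + il * slice;
            printf("%c layer %2d: non-zero island at [%lld, %lld)\n",
                   h == 0 ? 'k' : 'v', il, off, off + used);
        }
    }
    return 0;
}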
