
System freeze when compiled with cuBLAS #1231

Closed
klosax opened this issue Apr 29, 2023 · 10 comments · Fixed by #1233

Comments

@klosax
Contributor

klosax commented Apr 29, 2023

When running main compiled with cuBLAS in the newest release (305eb5a), everything works fine until right before returning to the command prompt. The timing info pops up, then my system completely freezes for about 20 seconds.

Release b1ee8f5 is working fine.

@slaren
Member

slaren commented Apr 29, 2023

I cannot reproduce this on my system. Can you share more details?

@klosax
Contributor Author

klosax commented Apr 29, 2023

Using a 7B model the freeze is about 5 seconds; with the 30B model it is about 20 seconds.
I tried using --no-mmap with the 30B model and the system froze for 5 minutes(!) right before displaying the system_info line...

I think the problem is the update that added cuBLAS pinned host memory, as the freeze seems to appear when initializing or freeing memory somehow.
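(For background, and purely as an illustration rather than code from llama.cpp: pinned, i.e. page-locked, host memory is allocated with the CUDA runtime roughly as in the sketch below. Because pinned pages cannot be swapped out, allocating or freeing a very large pinned buffer on a system that is already short on RAM can stall the whole machine.)

// Illustrative sketch only, not llama.cpp code: allocate and free a large pinned
// (page-locked) host buffer with the CUDA runtime. Unlike ordinary malloc'd memory,
// pinned memory cannot be swapped out, so the OS must back all of it with real RAM.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t size = 8ull * 1024 * 1024 * 1024; // 8 GB, chosen arbitrarily for the example

    void * buf = nullptr;
    cudaError_t err = cudaMallocHost(&buf, size);  // page-locked allocation
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... the buffer would be used for fast host<->device transfers ...

    // Unlocking and releasing a very large pinned buffer is where a stall like the
    // one described above could plausibly appear on a memory-starved system.
    cudaFreeHost(buf);
    return 0;
}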

@klosax
Contributor Author

klosax commented Apr 29, 2023

The prompt eval time is also about 2.5 times slower:

Release 305eb5a output:

./main -m ../llama-33b-supercot-ggml-q5_1.bin -c 2048 -p "Hiking is" -n 16 -t 6
main: seed = 1682775656
llama.cpp: loading model from ../llama-33b-supercot-ggml-q5_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 110.30 KB
llama_model_load_internal: mem required  = 25573.12 MB (+ 3124.00 MB per state)
llama_init_from_file: kv self size  = 3120.00 MB

(system freeze for 5 min with --no-mmap)

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0


 Hiking is one of the best ways to explore and experience a destination. From leisure
llama_print_timings:        load time =  7378.95 ms
llama_print_timings:      sample time =    10.95 ms /    16 runs   (    0.68 ms per run)
llama_print_timings: prompt eval time =  5005.25 ms /     5 tokens ( 1001.05 ms per token)
llama_print_timings:        eval time =  9425.01 ms /    15 runs   (  628.33 ms per run)
llama_print_timings:       total time = 16818.56 ms

(system freeze for 20 sec with mmap)

Release b1ee8f5 output:

./main -m ../llama-33b-supercot-ggml-q5_1.bin -c 2048 -p "Hiking is" -n 16 -t 6
main: seed = 1682775744
llama.cpp: loading model from ../llama-33b-supercot-ggml-q5_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 110.30 KB
llama_model_load_internal: mem required  = 25573.12 MB (+ 3124.00 MB per state)
llama_init_from_file: kv self size  = 3120.00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0


 Hiking is one of the best ways to experience nature. There’s nothing quite like tre
llama_print_timings:        load time =  3253.56 ms
llama_print_timings:      sample time =     9.46 ms /    16 runs   (    0.59 ms per run)
llama_print_timings: prompt eval time =  1972.92 ms /     5 tokens (  394.58 ms per token)
llama_print_timings:        eval time =  9242.09 ms /    15 runs   (  616.14 ms per run)
llama_print_timings:       total time = 12505.32 ms

@slaren
Member

slaren commented Apr 29, 2023

I can't really do much about it if I am not able to reproduce it. Some searching suggests that others hitting a similar freeze were able to solve it by "reinstalling everything from scratch".

@klosax
Contributor Author

klosax commented Apr 29, 2023

Thanks. So it seems to be related to Ubuntu and/or AMD CPUs. I'm running Ubuntu 20.04 and have an AMD Ryzen 5 CPU.

@klosax
Contributor Author

klosax commented Apr 29, 2023

I found out what the problem is: the model did not fit into RAM. With the b1ee8f5 release it works even if the model doesn't fit in RAM, but with the new 305eb5a release and cuBLAS pinned host memory my system completely freezes. I suggest implementing a memory check at startup to determine whether this new mode should be enabled or not.

@slaren
Member

slaren commented Apr 29, 2023

The model must fit into RAM to use pinned memory at all, as pinned memory cannot be swapped. I can see this happening if you have just barely enough memory to fit the model and everything else is forced into swap; but if that were the case, I would expect the slow operation to be the alloc, not the free. Maybe it is just slowly bringing the shell's memory back from swap after the program ends.

I am not convinced that we should do anything about this either way, or even that we can do anything about it. If you try to use a program that requires more memory than your system has, it is not unexpected that things will fail or run very slowly.

In any case, I am open to suggestions about how to handle this; just checking whether there is enough memory is not nearly as easy as you are implying here.
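(As an illustration only, not code from llama.cpp: a naive check of currently available physical memory on Linux could look like the sketch below. The helper name and the threshold are hypothetical, and, as noted above, such a check is unreliable: available memory is a moving target and the check ignores swap, overcommit, and other processes.)

// Hypothetical sketch, not llama.cpp code: a naive "is there enough free RAM?" check on Linux.
#include <unistd.h>
#include <cstdio>

// Returns true if the currently available physical memory appears to cover bytes_needed.
// This is only a point-in-time guess and says nothing about what happens a second later.
static bool naive_enough_free_ram(size_t bytes_needed) {
    long pages     = sysconf(_SC_AVPHYS_PAGES); // available physical pages right now
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages < 0 || page_size < 0) {
        return true; // cannot tell, fall back to just trying
    }
    return (size_t) pages * (size_t) page_size >= bytes_needed;
}

int main() {
    // ~25 GB, roughly the "mem required" figure for the 30B q5_1 model in the logs above.
    const size_t model_bytes = 25573ull * 1024 * 1024;
    printf("naive check: model %s fit in currently free RAM\n",
           naive_enough_free_ram(model_bytes) ? "would" : "would NOT");
    return 0;
}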

@klosax
Contributor Author

klosax commented Apr 29, 2023

Maybe add a parameter to disable pinned memory, since the previous version did work fine with swapped memory.

@slaren
Member

slaren commented Apr 29, 2023

In PR #1233 I have added an environment variable, GGML_CUDA_NO_PINNED, that you can set to disable pinned memory.
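For example (assuming main is built with cuBLAS from a version that includes PR #1233), the variable can be set for a single run like this:

GGML_CUDA_NO_PINNED=1 ./main -m ../llama-33b-supercot-ggml-q5_1.bin -c 2048 -p "Hiking is" -n 16 -t 6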

@klosax
Contributor Author

klosax commented Apr 29, 2023

Great! :)

klosax closed this as completed Apr 29, 2023