Misc. bug: Something recently has broken the -ot option to override model tensor buffers - causes CUDA crash #12798

Closed
jukofyork opened this issue Apr 7, 2025 · 8 comments · Fixed by #12891

Comments

@jukofyork
Collaborator

jukofyork commented Apr 7, 2025

Name and Version

> llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX A2000, compute capability 8.6, VMM: yes
version: 5064 (bd3f59f8)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli, llama-server

Command line

llama-cli -m deepseek-v2-lite-Q8_0.gguf -ot exp=CPU -ngl 99

Problem description & steps to reproduce

Something recently seems to have broken the option to override model tensor buffers added in #11397:

> git clone https://github.com/ggerganov/llama.cpp
> cd llama.cpp
> cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON
> cmake --build build --config Release -- -j 44
> llama-cli -m deepseek-v2-lite-Q8_0.gguf -ot exp=CPU -ngl 99

It successfully processes the prompt, seems to write a single token and then crashes with this:

CUDA error: unspecified launch failure
  current device: 0, in function ggml_backend_cuda_synchronize at /home/juk/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2437
  cudaStreamSynchronize(cuda_ctx->stream())
- I have also tested with a full BF16 version of deepseek-v2-lite and it hits the same problem.
- I have also tested with a Q8_0 of deepseek-r1 and it hits the same problem.

First Bad Commit

Unsure, but recent.

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX A2000, compute capability 8.6, VMM: yes
build: 5064 (bd3f59f8) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX 5000 Ada Generation) - 31921 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA RTX A2000) - 5719 MiB free
llama_model_loader: loaded meta data with 47 key-value pairs and 377 tensors from /home/juk/models/gguf/deepseek-v2-lite-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V2 Lite Chat
llama_model_loader: - kv   3:                           general.finetune str              = Chat
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek-V2-Lite
llama_model_loader: - kv   5:                         general.size_label str              = 64x1.5B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = deepseek
llama_model_loader: - kv   8:                       general.license.link str              = https://github.com/deepseek-ai/DeepSe...
llama_model_loader: - kv   9:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv  10:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv  11:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv  12:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv  13:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv  14:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv  15:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  16: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  17:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  18:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  19:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  20:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  21:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  22:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  23:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  24:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  25:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  26:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  27:              deepseek2.expert_weights_norm bool             = false
llama_model_loader: - kv  28:               deepseek2.expert_gating_func u32              = 1
llama_model_loader: - kv  29:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  30:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  31:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  32: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  33: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type q8_0:  269 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 15.55 GiB (8.51 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 2
load: token to piece cache size = 0.6408 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 163840
print_info: n_embd           = 2048
print_info: n_layer          = 27
print_info: n_head           = 16
print_info: n_head_kv        = 16
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 192
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 3072
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 10944
print_info: n_expert         = 64
print_info: n_expert_used    = 6
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = yarn
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 16B
print_info: model params     = 15.71 B
print_info: general.name     = DeepSeek V2 Lite Chat
print_info: n_layer_dense_lead   = 1
print_info: n_lora_q             = 0
print_info: n_lora_kv            = 512
print_info: n_ff_exp             = 1408
print_info: n_expert_shared      = 2
print_info: expert_weights_scale = 1.0
print_info: expert_weights_norm  = 0
print_info: expert_gating_func   = softmax
print_info: rope_yarn_log_mul    = 0.0707
print_info: vocab type       = BPE
print_info: n_vocab          = 102400
print_info: n_merges         = 99757
print_info: BOS token        = 100000 '<|begin▁of▁sentence|>'
print_info: EOS token        = 100001 '<|end▁of▁sentence|>'
print_info: EOT token        = 100001 '<|end▁of▁sentence|>'
print_info: PAD token        = 100001 '<|end▁of▁sentence|>'
print_info: LF token         = 185 'Ċ'
print_info: EOG token        = 100001 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 27 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 28/28 layers to GPU
load_tensors:        CUDA0 model buffer size =   414.74 MiB
load_tensors:        CUDA1 model buffer size =   255.90 MiB
load_tensors:   CPU_Mapped model buffer size = 15712.44 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 0.025
llama_context: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.39 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 27, can_shift = 0
init:      CUDA0 KV buffer size =   960.00 MiB
init:      CUDA1 KV buffer size =   120.00 MiB
llama_context: KV self size  = 1080.00 MiB, K (f16):  648.00 MiB, V (f16):  432.00 MiB
llama_context:      CUDA0 compute buffer size =   406.75 MiB
llama_context:      CUDA1 compute buffer size =   204.00 MiB
llama_context:  CUDA_Host compute buffer size =    12.01 MiB
llama_context: graph nodes  = 1951
llama_context: graph splits = 162 (with bs=512), 54 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 36
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
You are a helpful assistant

User: Hello

Assistant: Hi there<|end▁of▁sentence|>User: How are you?

Assistant:

system_info: n_threads = 36 (n_threads_batch = 36) / 72 | CUDA : ARCHS = 860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 2289461659
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> tell me a joke about pandas
/home/juk/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /home/juk/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2437
  cudaStreamSynchronize(cuda_ctx->stream())
CUDA error

Using:

/usr/local/cuda-12.6/bin/compute-sanitizer llama-cli -m ~/models/gguf/deepseek-v2-lite-Q8_0.gguf -ot exp=CPU -ngl 99

gives this as the last few sections:

========= Invalid __global__ read of size 8 bytes
=========     at void cpy_f32_f16<&cpy_1_f32_f32>(const char *, char *, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, char **, int)+0x160
=========     by thread (33,0,0) in block (1,0,0)
=========     Address 0x7f3933c20420 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f3933c20400 of size 32 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x33ec2d]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x227dc]
=========                in /usr/local/cuda/lib64/libcudart.so.12
=========     Host Frame:cudaGraphLaunch [0x776fe]
=========                in /usr/local/cuda/lib64/libcudart.so.12
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x999a2]
=========                in /home/juk/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x28cf2]
=========                in /home/juk/llama.cpp/build/bin/libggml-base.so
=========     Host Frame:llama_context::graph_compute(ggml_cgraph*, bool) [0x6cf08]
=========                in /home/juk/llama.cpp/build/bin/libllama.so
=========     Host Frame:llama_context::decode(llama_batch&) [0x6fd51]
=========                in /home/juk/llama.cpp/build/bin/libllama.so
=========     Host Frame:llama_decode [0x70fda]
=========                in /home/juk/llama.cpp/build/bin/libllama.so
=========     Host Frame:main [0x377ec]
=========                in /home/juk/llama.cpp/build/bin/llama-cli
=========     Host Frame: [0x27249]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main [0x27304]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x3bb80]
=========                in /home/juk/llama.cpp/build/bin/llama-cli
========= 
========= Invalid __global__ read of size 8 bytes
=========     at void cpy_f32_f16<&cpy_1_f32_f32>(const char *, char *, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, char **, int)+0x160
=========     by thread (34,0,0) in block (1,0,0)
=========     Address 0x7f3933c20420 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f3933c20400 of size 32 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x33ec2d]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x227dc]
=========                in /usr/local/cuda/lib64/libcudart.so.12
=========     Host Frame:cudaGraphLaunch [0x776fe]
=========                in /usr/local/cuda/lib64/libcudart.so.12
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x999a2]
=========                in /home/juk/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x28cf2]
=========                in /home/juk/llama.cpp/build/bin/libggml-base.so
=========     Host Frame:llama_context::graph_compute(ggml_cgraph*, bool) [0x6cf08]
=========                in /home/juk/llama.cpp/build/bin/libllama.so
=========     Host Frame:llama_context::decode(llama_batch&) [0x6fd51]
=========                in /home/juk/llama.cpp/build/bin/libllama.so
=========     Host Frame:llama_decode [0x70fda]
=========                in /home/juk/llama.cpp/build/bin/libllama.so
=========     Host Frame:main [0x377ec]
=========                in /home/juk/llama.cpp/build/bin/llama-cli
=========     Host Frame: [0x27249]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main [0x27304]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x3bb80]
=========                in /home/juk/llama.cpp/build/bin/llama-cli
========= 
========= Invalid __global__ read of size 8 bytes
=========     at void cpy_f32_f16<&cpy_1_f32_f32>(const char *, char *, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, char **, int)+0x160
=========     by thread (35,0,0) in block (1,0,0)
=========     Address 0x7f3933c20420 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f3933c20400 of size 32 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x33ec2d]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x227dc]
=========                in /usr/local/cuda/lib64/libcudart.so.12
=========     Host Frame:cudaGraphLaunch [0x776fe]
=========                in /usr/local/cuda/lib64/libcudart.so.12
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x999a2]
=========                in /home/juk/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x28cf2]
=========                in /home/juk/llama.cpp/build/bin/libggml-base.so
=========     Host Frame:llama_context::graph_compute(ggml_cgraph*, bool) [0x6cf08]
=========                in /home/juk/llama.cpp/build/bin/libllama.so
=========     Host Frame:llama_context::decode(llama_batch&) [0x6fd51]
=========                in /home/juk/llama.cpp/build/bin/libllama.so
=========     Host Frame:llama_decode [0x70fda]
=========                in /home/juk/llama.cpp/build/bin/libllama.so
=========     Host Frame:main [0x377ec]
=========                in /home/juk/llama.cpp/build/bin/llama-cli
=========     Host Frame: [0x27249]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main [0x27304]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x3bb80]
=========                in /home/juk/llama.cpp/build/bin/llama-cli
========= 
/home/juk/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: unspecified launch failure
  current device: 0, in function ggml_backend_cuda_synchronize at /home/juk/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2437
  cudaStreamSynchronize(cuda_ctx->stream())
@slaren
Member

slaren commented Apr 7, 2025

Does it work if you build with cuda graphs disabled (GGML_CUDA_GRAPHS=OFF)?
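For reference, rebuilding with that flag would look roughly like this (assuming the same build setup as in the reproduction steps above):

> cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_GRAPHS=OFF
> cmake --build build --config Release -- -j 44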

@jukofyork
Collaborator Author

Does it work if you build with cuda graphs disabled (GGML_CUDA_GRAPHS=OFF)?

Yeah, this fixes it - thanks!

Tested on both deepseek-v2-lite and deepseek-r1.

@slaren
Member

slaren commented Apr 7, 2025

@agray3 any ideas? I think some pointers are not being updated correctly. I can reproduce this reliably with deepseek-v2-lite when running with compute-sanitizer.

This case should disable CUDA graphs completely, since there are multiple different graphs.

@agray3
Contributor

agray3 commented Apr 7, 2025 via email

@nicoboss
Contributor

nicoboss commented Apr 9, 2025

imatrix computation without any special arguments seems to be affected by this issue as well. Compiling llama.cpp with GGML_CUDA_GRAPHS=OFF fixed it for me.

Local imatrix computation:

compute_imatrix: computing over 314 chunks with batch_size 512
/llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:2480
  cudaStreamSynchronize(cuda_ctx->stream())
[New LWP 3765900]
[New LWP 3765901]
[New LWP 3765902]
[New LWP 3765903]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007c1c69012c17 in __GI___wait4 (pid=3766092, stat_loc=0x7fff5c24162c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007c1c69012c17 in __GI___wait4 (pid=3766092, stat_loc=0x7fff5c24162c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007c1c695ef0b4 in ggml_print_backtrace () at /llmjob/llama.cpp-cuda512/ggml/src/ggml.c:156
156             waitpid(pid, &wstatus, 0);
#2  ggml_abort (file=0x7c1c620e0f40 "/llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu", line=75, fmt=0x7c1c62106e88 "CUDA error") at /llmjob/llama.cpp-cuda512/ggml/src/ggml.c:183
183         ggml_print_backtrace();
#3  0x00007c1c61e8d033 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /llmjob/llama.cpp/build/bin/libggml-cuda.so
#4  0x00007c1c61e8e59a in ggml_backend_cuda_synchronize(ggml_backend*) () from /llmjob/llama.cpp/build/bin/libggml-cuda.so
#5  0x00007c1c6960434c in ggml_backend_sched_compute_splits (sched=0x60e771a7ca00) at /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp:1427
1427    /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp: No such file or directory.
#6  ggml_backend_sched_graph_compute_async (sched=0x60e771a7ca00, graph=<optimized out>) at /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp:1590
1590    in /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp
#7  0x00007c1c697226d9 in llama_context::graph_compute (this=this@entry=0x60e771b9a500, gf=gf@entry=0x7c1c226fb030, batched=<optimized out>) at /usr/include/c++/12/bits/unique_ptr.h:191
191           pointer    _M_ptr() const noexcept { return std::get<0>(_M_t); }
#8  0x00007c1c69725522 in llama_context::decode (this=0x60e771b9a500, inp_batch=...) at /llmjob/llama.cpp-cuda512/src/llama-context.cpp:1329
1329    /llmjob/llama.cpp-cuda512/src/llama-context.cpp: No such file or directory.
#9  0x00007c1c697267ab in llama_decode (ctx=<optimized out>, batch=...) at /llmjob/llama.cpp-cuda512/src/llama-context.cpp:2792
2792    in /llmjob/llama.cpp-cuda512/src/llama-context.cpp
#10 0x000060e7644af309 in compute_imatrix (params=..., ctx=0x60e771b9a500) at /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:554
554     /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp: No such file or directory.
#11 main (argc=<optimized out>, argv=<optimized out>) at /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:686
686     in /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp
[Inferior 1 (process 3765898) detached]

imatrix computation using an RPC server:

root@RPC-GPU:~# GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./run.sh

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('192.168.200.201') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
Starting RPC server
  endpoint       : 192.168.200.201:7201
  local cache    : n/a
  backend memory : 35 MB
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /root/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2465
  cudaStreamSynchronize(cuda_ctx->stream())
/root/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
./run.sh: line 8:  5960 Aborted                 ./rpc-server -H 192.168.200.201 -p 7201 -m 35.5

@agray3
Contributor

agray3 commented Apr 11, 2025

The issue is that the new ggml_cuda_cpy indirection mechanism does not (yet?) properly support the presence of GGML_OP_DUP or GGML_OP_CONT node types, which call ggml_cuda_cpy() indirectly via ggml_cuda_dup(). A fix is at #12891, which disables CUDA graphs in the presence of these node types.

@slaren this case was previously using CUDA graphs, but only for the first few tokens, and then the disable_due_to_too_many_updates mechanism was kicking in. Please let me know if you think we should support CUDA graphs for the above node types (including an example case if possible), in which case some further tweaks will be required to enable them, or if you think the current case should instead be disabled via some other mechanism.
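For context, a minimal sketch of the kind of pre-capture check described above, written against the public ggml graph accessors; the helper name is made up here, and the actual change is the one in #12891:

// Sketch only, not the literal patch from #12891: scan the graph before
// capture and fall back to the regular stream path whenever an op that
// reaches ggml_cuda_cpy() indirectly (GGML_OP_DUP / GGML_OP_CONT via
// ggml_cuda_dup()) is present.
#include "ggml.h"

static bool cuda_graph_capture_is_safe(struct ggml_cgraph * cgraph) {
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); i++) {
        const struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        if (node->op == GGML_OP_DUP || node->op == GGML_OP_CONT) {
            // The copy-pointer indirection used for CUDA graph updates is
            // not refreshed for these ops, so capturing them is unsafe.
            return false;
        }
    }
    return true;
}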

@slaren
Member

slaren commented Apr 13, 2025

Thanks @agray3. It definitely would be better to support these nodes since some models use ggml_cont, and it should be a fairly simple change, but it is not critical.
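To illustrate where such nodes come from, a minimal example of how a GGML_OP_CONT node typically enters a graph (illustrative only; the shapes and permutation here are arbitrary, not taken from any specific model):

#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * x = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 64, 8, 4);

    // Permuting only changes the view (strides); ggml_cont materializes a
    // contiguous copy, which becomes a GGML_OP_CONT node and runs through
    // ggml_cuda_dup() -> ggml_cuda_cpy() on the CUDA backend.
    struct ggml_tensor * y = ggml_cont(ctx, ggml_permute(ctx, x, 0, 2, 1, 3));

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);

    ggml_free(ctx);
    return 0;
}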

@agray3
Contributor

agray3 commented Apr 16, 2025

Thanks @agray3. It definitely would be better to support these nodes since some models use ggml_cont, and it should be a fairly simple change, but it is not critical.

@slaren we have already had reports of regressions from users due to this - I've now made the tweaks to re-enable CUDA graphs for these node types at #12970.
