llama : add thread safety test #14035

Merged: 16 commits merged into master from sl/thread-safety-test on Jun 16, 2025

Conversation

@slaren (Member) commented Jun 5, 2025

Basic thread safety test that loads a copy of the model on each GPU and on the CPU, and runs inference with multiple contexts in different threads.

llama : ignore main_gpu <= 0 if there are no GPUs

ggml-ci
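
For orientation, the structure of the test is roughly the following. This is a simplified sketch, not the actual tests/test-thread-safety.cpp: it assumes the llama.h C API, creates several contexts for a single model, and omits the per-device model copies, tokenization, sampling, and error handling that the real test performs.

// One thread per context, all contexts sharing one llama_model.
#include <thread>
#include <vector>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    // "model.gguf" is a placeholder path
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);

    const int n_contexts = 4; // e.g. the value passed via -np
    std::vector<std::thread> threads;

    for (int c = 0; c < n_contexts; ++c) {
        threads.emplace_back([model]() {
            llama_context_params cparams = llama_context_default_params();
            llama_context * ctx = llama_init_from_model(model, cparams);

            // ... tokenize the prompt and call llama_decode() in a loop,
            //     sampling one token at a time (omitted) ...

            llama_free(ctx);
        });
    }

    for (auto & t : threads) {
        t.join();
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
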
slaren requested a review from ggerganov as a code owner (June 5, 2025 17:03)
github-actions bot added the testing and devops labels (Jun 5, 2025)
@ggerganov (Member):

Maybe we can use an even smaller model for this test:

https://huggingface.co/ggml-org/models/tree/main/tinyllamas

@slaren (Member, Author) commented Jun 6, 2025

The SYCL ggml-ci does not seem to have libcurl installed yet.

@ggerganov (Member):

Should be installed now.

slaren force-pushed the sl/thread-safety-test branch from 2c5874e to a2a0289 (June 6, 2025 11:18)
ggml-ci
slaren force-pushed the sl/thread-safety-test branch from a2a0289 to b046f0c (June 6, 2025 12:13)
@slaren (Member, Author) commented Jun 6, 2025

There is some issue with this model (stories15M-q4_0.gguf) on the CPU, but I don't think it is a threading issue. It only seems to happen on CPUs with AVX512.

test-thread-safety: /home/ggml/work/llama.cpp/ggml/src/ggml-cpu/ops.cpp:2934: void ggml_compute_forward_silu_f32(const ggml_compute_params*, ggml_tensor*): Assertion `!isnan(x)' failed.

@ggerganov (Member):

I looked into it a bit and it does not seem to happen if OpenMP is disabled. I think it is something related to the repacking, but I didn't confirm. I'll take an extra look now.

@ggerganov (Member):

Pretty sure this is a data race, because the chunk counter is shared by all contexts:

template <int RM, int RN, int BM>
NOINLINE void gemm(int64_t m, int64_t n, int64_t BN) {
    static std::atomic<int64_t> current_chunk;

If I disable GGML_LLAMAFILE on ggml-2 the test works correctly even with OpenMP enabled.

@Djip007 Could you take a look and propose a fix?
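
For context, a standalone illustration of the failure mode (this is not the tinyBLAS code, just a minimal sketch of the same pattern): a function-local static atomic is a single object shared by every caller, so two contexts that each try to partition their own work through it interfere with each other.

#include <atomic>
#include <cstdio>
#include <thread>

// Each caller expects to walk chunks 0..n_chunks-1 on its own, but the
// function-local static is shared process-wide, so concurrent callers
// reset and consume the same counter and silently skip chunks.
static void process_chunks(const char * tag, int n_chunks) {
    static std::atomic<int> current_chunk;  // shared by ALL callers
    current_chunk.store(0);                 // each caller resets it
    int chunk;
    while ((chunk = current_chunk.fetch_add(1)) < n_chunks) {
        std::printf("%s processed chunk %d\n", tag, chunk);
    }
}

int main() {
    std::thread a(process_chunks, "context A", 8);
    std::thread b(process_chunks, "context B", 8);
    a.join();
    b.join();
    // Expected: 8 chunks per context. Observed: usually fewer in total,
    // because both threads share (and reset) the same counter.
    return 0;
}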

github-actions bot added the ggml label (Jun 6, 2025)
@slaren (Member, Author) commented Jun 6, 2025

19: 0.00.068.420 E common_download_file_single: invalid http status code received: 429

429 is "too many requests". @ngxson do you know if it is a temporary issue with huggingface, or are we being throttled?

@ngxson (Collaborator) commented Jun 6, 2025

The HF backend currently has a problem; the team is investigating, and it should be back very soon.

@slaren (Member, Author) commented Jun 6, 2025

@0cc4m @jeffbolznv The Vulkan backend is crashing on this test. It happens even with a single context per model (-np 1), which is not great because it would prevent, for example, evaluating a draft model simultaneously with the main model. I can hold off on merging this if you think it could be fixed in the near future; otherwise it might be better to disable the Vulkan CI tests for now.

@0cc4m (Collaborator) commented Jun 6, 2025

It is known that the Vulkan backend is not thread-safe yet, yes.

@jeffbolznv (Collaborator):

Are you planning to disable all Vulkan CI coverage due to this one failing test?

@slaren (Member, Author) commented Jun 9, 2025

I don't think that disabling the tests is the best option, but if I don't do that, people are going to complain that the CI is failing on every PR. I guess I could disable just this test on the Vulkan CI, but that would just make it easier to ignore this bug.

@jeffbolznv (Collaborator):

@0cc4m as a short term fix would it be crazy to just grab a mutex in most/all the ggml backend entry points? I tried that and it sort of works... I still get corruption sometimes in the output. But then again, I also get corruption sometimes using the CUDA backend so I'm not sure if this is the fault of ggml-vulkan.

With the CUDA backend I sometimes get errors like:

C:\github\jeffbolznv\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error
ggml_cuda_compute_forward: MUL_MAT failed

or

ggml_cuda_compute_forward: MUL_MAT failed
C:\github\jeffbolznv\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error
CUDA error: operation failed due to a previous error during capture
  current device: 0, in function ggml_cuda_compute_forward at C:\github\jeffbolznv\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2366

My command line is test-thread-safety.exe -m c:\models\llama-2-7b.Q4_0.gguf -ngl 99 -p "The meaning of life is" -n 128 -c 256 -ub 32 -np 4.

@slaren (Member, Author) commented Jun 9, 2025

> With the CUDA backend I sometimes get errors

The CUDA errors that I could reproduce should have been fixed in #14033. There may be other issues still, but none that I could reproduce on my system.

@jeffbolznv (Collaborator):

Are those fixes included in this branch? I just fetched pull/14035/head. If not, can you rebase?

@slaren (Member, Author) commented Jun 9, 2025

It should be merged now; it wasn't before.

@jeffbolznv (Collaborator):

Strange, I rebuilt and I'm still seeing the same failures at about the same rate (maybe 1 in 4 attempts). Which operation it says fails looks random.

@0cc4m (Collaborator) commented Jun 10, 2025

> @0cc4m as a short term fix would it be crazy to just grab a mutex in most/all the ggml backend entry points? I tried that and it sort of works... I still get corruption sometimes in the output. But then again, I also get corruption sometimes using the CUDA backend so I'm not sure if this is the fault of ggml-vulkan.

If that helps, we could do that, but the problem is that not all relevant resources of the backend are stored in relation to the backend context yet, so multiple contexts can use the same descriptors, for example. It's annoying to shift around these resources in a way that enables this, but maybe it is time to look at it.
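
For reference, the coarse-grained stopgap being discussed would look roughly like this. It is only a sketch: backend_compute below stands in for a real backend entry point and is not an actual ggml or ggml-vulkan symbol.

#include <mutex>
#include <thread>
#include <vector>

// Placeholder for a backend entry point that touches shared per-device
// state (descriptor pools, command buffers, ...) and is not thread-safe.
static void backend_compute(int graph_id) {
    (void) graph_id;
}

// Stopgap: serialize every caller behind one global mutex. This trades
// concurrency for correctness; contexts on the same backend can no longer
// overlap their host-side work, but they stop corrupting shared state.
static std::mutex g_backend_mutex;

static void backend_compute_locked(int graph_id) {
    std::lock_guard<std::mutex> lock(g_backend_mutex);
    backend_compute(graph_id);
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back(backend_compute_locked, i);
    }
    for (auto & t : threads) {
        t.join();
    }
    return 0;
}

The longer-term fix described above is to move those per-device resources into the per-context structures so that the global lock becomes unnecessary.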

@slaren (Member, Author) commented Jun 10, 2025

I was able to reproduce the CUDA issue. It only happens with the additional instance of the model that is (intended to be) run on the CPU only. I had done most of my testing before adding that instance, and I didn't expect it to cause issues with CUDA since the goal was mainly to test llama.cpp, so I didn't catch it before.

I tried a few things, but even with CUDA_LAUNCH_BLOCKING=1 and a global mutex on every ggml-backend function, it still crashes in the same way, so at this point I am out of ideas. It seems likely that it is some issue in CUDA related to graph capture when using multiple GPUs in the same thread. @agray3 mentioned that he already passed the issue to the CUDA graphs team, and in the meantime there is a workaround of building with CUDA graphs disabled.

@danilabagroff:

> Strange, I rebuilt and I'm still seeing the same failures at about the same rate (maybe 1 in 4 attempts). Which operation it says fails looks random.

After a hundred test runs, I can put in my two penny worth: I have commented out two lines to reduce the number of simultaneously running threads and avoid thread starvation:

    /// @brief Focus just on GPU
    const int num_models = gpu_dev_count;
...
    for (int m = 0; m < num_models; ++m) {
        /// @brief Let's tune this via args (like -ngl)
        // mparams.split_mode = LLAMA_SPLIT_MODE_NONE;
        // mparams.main_gpu   = m < gpu_dev_count ? m : -1;

Running test-thread-safety on:

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A2, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA A2)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel Xeon Processor (Icelake) 15 cores)
version: 5612 (29020e6b)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

...with args:

./test-thread-safety -m ~/models/SmolLM2-360M-Instruct-BF16.gguf -np 12 -p "Hello, my name is" -n 100 -ngl 99

... to finally abort:

Starting program: /home/ubuntu/builds/llama.cpp/debug/install/bin/test-thread-safety -m ~/models/SmolLM2-360M-Instruct-BF16.gguf -np 12 -p "Hello, my name is" -n 100 -ngl 99
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffb785b000 (LWP 45686)]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A2, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA A2)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel Xeon Processor (Icelake))
[New Thread 0x7fffb5bdf000 (LWP 45698)]
build: 5612 (29020e6b) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu (debug)
system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 
[New Thread 0x7fffb53de000 (LWP 45699)]
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA A2) - 14913 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 290 tensors from /home/ubuntu/models/SmolLM2-360M-Instruct-BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = SmolLM2 360M Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = SmolLM2
llama_model_loader: - kv   5:                         general.size_label str              = 360M
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,4]       = ["safetensors", "onnx", "transformers...
llama_model_loader: - kv   8:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 8192
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 960
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 2560
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 15
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 5
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 32
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 49152
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = smollm
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,49152]   = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,49152]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,48900]   = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  30:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type bf16:  225 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = BF16
print_info: file size   = 690.24 MiB (16.00 BPW) 
load: special tokens cache size = 17
load: token to piece cache size = 0.3170 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 960
print_info: n_layer          = 32
print_info: n_head           = 15
print_info: n_head_kv        = 5
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 320
print_info: n_embd_v_gqa     = 320
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2560
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 100000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 361.82 M
print_info: general.name     = SmolLM2 360M Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 49152
print_info: n_merges         = 48900
print_info: BOS token        = 1 '<|im_start|>'
print_info: EOS token        = 2 '<|im_end|>'
print_info: EOT token        = 0 '<|endoftext|>'
print_info: UNK token        = 0 '<|endoftext|>'
print_info: PAD token        = 2 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM REP token    = 4 '<reponame>'
print_info: EOG token        = 0 '<|endoftext|>'
print_info: EOG token        = 2 '<|im_end|>'
print_info: EOG token        = 4 '<reponame>'
print_info: max token length = 162
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:   CPU_Mapped model buffer size =    90.00 MiB
load_tensors:        CUDA0 model buffer size =   690.24 MiB
...............................................................................
[New Thread 0x7fffa8dde000 (LWP 45700)]
Creating context 1/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
[New Thread 0x7fffa1fff000 (LWP 45701)]
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
Creating context 2/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
[New Thread 0x7fffa17fe000 (LWP 45702)]
Creating context 3/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
[New Thread 0x7fffa0ffd000 (LWP 45703)]
Creating context 4/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
[New Thread 0x7fff99ff0000 (LWP 45704)]
Creating context 5/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
[New Thread 0x7fff997ef000 (LWP 45705)]
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
Creating context 6/12 for model 1/1
[New Thread 0x7fff98fee000 (LWP 45706)]
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
Creating context 7/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
[New Thread 0x7fff79fff000 (LWP 45707)]
Creating context 8/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
[New Thread 0x7fff797fe000 (LWP 45708)]
Creating context 9/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
[New Thread 0x7fff78ffd000 (LWP 45709)]
Creating context 10/12 for model 1/1
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
[New Thread 0x7fff75fff000 (LWP 45710)]
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
Creating context 11/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
[New Thread 0x7fff757fe000 (LWP 45711)]
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
Creating context 12/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
Model 1/1, Context 1/12: Result: 'Hello, my name is [Your Name] and I will be your new [Your Position].'
Model 1/1, Context 3/12: Result: 'Hello, my name is [Your Name]. How can I help you today?'
[Thread 0x7fffa8dde000 (LWP 45700) exited]
[Thread 0x7fffa17fe000 (LWP 45702) exited]
Model 1/1, Context 11/12: Result: 'Hello, my name is [Name]. I'm here to help you with your programming needs. What programming language are you using?'
/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
ggml_cuda_compute_forward: MUL_MAT failed
CUDA error: operation failed due to a previous error during capture
  current device: 0, in function ggml_cuda_compute_forward at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2366
  err

Thread 8 "test-thread-saf" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffa0ffd000 (LWP 45703)]
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
warning: 44	./nptl/pthread_kill.c: No such file or directory
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff5a4527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff5a288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff610f836 in ggml_abort (file=0x7ffff6813728 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=75, fmt=0x7ffff681371d "CUDA error") at /home/ubuntu/sources/llama.cpp/ggml/src/ggml.c:221
#6  0x00007ffff62e7e37 in ggml_cuda_error (stmt=0x7ffff681573c "err", func=0x7ffff6815713 "ggml_cuda_compute_forward", file=0x7ffff6813728 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=2366, msg=0x7ffff4e97910 "operation failed due to a previous error during capture")
    at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75
#7  0x00007ffff62f1fbd in ggml_cuda_compute_forward (ctx=..., dst=0x7ffeb0902f60) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2366
#8  0x00007ffff62f33df in evaluate_and_capture_cuda_graph (cuda_ctx=0x7fff84000e20, cgraph=0x7fff841df250, graph_evaluated_or_captured=@0x7fffa0fda08b: false, use_cuda_graph=@0x7fffa0fda089: true, cuda_graph_update_required=@0x7fffa0fda08a: true) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2673
#9  0x00007ffff62f3a88 in ggml_backend_cuda_graph_compute (backend=0x7fff84001360, cgraph=0x7fff841df250) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2780
#10 0x00007ffff6127a02 in ggml_backend_graph_compute_async (backend=0x7fff84001360, cgraph=0x7fff841df250) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:334
#11 0x00007ffff612bb6d in ggml_backend_sched_compute_splits (sched=0x7fff8401cc90) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1404
#12 0x00007ffff612c809 in ggml_backend_sched_graph_compute_async (sched=0x7fff8401cc90, graph=0x7ffeb06fb030) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1596
#13 0x00007ffff7b8ad95 in llama_context::graph_compute (this=0x7fff84000b70, gf=0x7ffeb06fb030, batched=false) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1412
#14 0x00007ffff7b875c0 in llama_context::process_ubatch (this=0x7fff84000b70, ubatch=..., gtype=LLM_GRAPH_TYPE_DECODER, mstate=0x7fff8424d770, ret=@0x7fffa0fda2fc: 32767) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:710
#15 0x00007ffff7b88fdd in llama_context::decode (this=0x7fff84000b70, inp_batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1051
#16 0x00007ffff7b8fbd9 in llama_decode (ctx=0x7fff84000b70, batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:2812
#17 0x00005555555d5b9f in operator() (__closure=0x555556c500f8) at /home/ubuntu/sources/llama.cpp/tests/test-thread-safety.cpp:120
#18 0x00005555555d6c98 in std::__invoke_impl<void, main(int, char**)::<lambda()> >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#19 0x00005555555d6c5b in std::__invoke<main(int, char**)::<lambda()> >(struct {...} &&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:96
#20 0x00005555555d6c08 in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x555556c500f8) at /usr/include/c++/13/bits/std_thread.h:292
#21 0x00005555555d6bdc in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::operator()(void) (this=0x555556c500f8) at /usr/include/c++/13/bits/std_thread.h:299
#22 0x00005555555d6bc0 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > > >::_M_run(void) (this=0x555556c500f0) at /usr/include/c++/13/bits/std_thread.h:244
#23 0x00007ffff5eecdb4 in std::execute_native_thread_routine (__p=0x555556c500f0) at ../../../../../src/libstdc++-v3/src/c++11/thread.cc:104
#24 0x00007ffff5a9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#25 0x00007ffff5b29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb) bt full
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {0}}
        ret = <optimized out>
        pd = <optimized out>
        old_mask = <optimized out>
        ret = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
        resultvar = <optimized out>
        resultvar = <optimized out>
        __arg3 = <optimized out>
        __arg2 = <optimized out>
        __arg1 = <optimized out>
        _a3 = <optimized out>
        _a2 = <optimized out>
        _a1 = <optimized out>
        __futex = <optimized out>
        resultvar = <optimized out>
        __arg3 = <optimized out>
        __arg2 = <optimized out>
        __arg1 = <optimized out>
        _a3 = <optimized out>
        _a2 = <optimized out>
        _a1 = <optimized out>
        __futex = <optimized out>
        __private = <optimized out>
        __oldval = <optimized out>
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
No locals.
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
No locals.
#3  0x00007ffff5a4527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#4  0x00007ffff5a288ff in __GI_abort () at ./stdlib/abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x20, sa_sigaction = 0x20}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
#5  0x00007ffff610f836 in ggml_abort (file=0x7ffff6813728 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=75, fmt=0x7ffff681371d "CUDA error") at /home/ubuntu/sources/llama.cpp/ggml/src/ggml.c:221
        args = {{gp_offset = 24, fp_offset = 48, overflow_arg_area = 0x7fffa0fd9f70, reg_save_area = 0x7fffa0fd9eb0}}
#6  0x00007ffff62e7e37 in ggml_cuda_error (stmt=0x7ffff681573c "err", func=0x7ffff6815713 "ggml_cuda_compute_forward", file=0x7ffff6813728 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=2366, msg=0x7ffff4e97910 "operation failed due to a previous error during capture")
    at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75
        id = 0
#7  0x00007ffff62f1fbd in ggml_cuda_compute_forward (ctx=..., dst=0x7ffeb0902f60) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2366
        err_ = cudaErrorStreamCaptureInvalidated
        err = cudaErrorStreamCaptureInvalidated
        __func__ = "ggml_cuda_compute_forward"
#8  0x00007ffff62f33df in evaluate_and_capture_cuda_graph (cuda_ctx=0x7fff84000e20, cgraph=0x7fff841df250, graph_evaluated_or_captured=@0x7fffa0fda08b: false, use_cuda_graph=@0x7fffa0fda089: true, cuda_graph_update_required=@0x7fffa0fda08a: true) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2673
        node = 0x7ffeb0902f60
        ok = true
        i = 41
        integrated = false
        __PRETTY_FUNCTION__ = "void evaluate_and_capture_cuda_graph(ggml_backend_cuda_context*, ggml_cgraph*, bool&, bool&, bool&)"
        __func__ = "evaluate_and_capture_cuda_graph"
#9  0x00007ffff62f3a88 in ggml_backend_cuda_graph_compute (backend=0x7fff84001360, cgraph=0x7fff841df250) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2780
        cuda_ctx = 0x7fff84000e20
        disable_cuda_graphs_due_to_env = false
        use_cuda_graph = true
        cuda_graph_update_required = true
        __func__ = "ggml_backend_cuda_graph_compute"
        graph_evaluated_or_captured = false
#10 0x00007ffff6127a02 in ggml_backend_graph_compute_async (backend=0x7fff84001360, cgraph=0x7fff841df250) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:334
No locals.
#11 0x00007ffff612bb6d in ggml_backend_sched_compute_splits (sched=0x7fff8401cc90) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1404
        ec = GGML_STATUS_SUCCESS
        split = 0x7fff841df1e8
        split_backend_id = 0
        split_backend = 0x7fff84001360
        i = 1
        splits = 0x7fff841df130
#12 0x00007ffff612c809 in ggml_backend_sched_graph_compute_async (sched=0x7fff8401cc90, graph=0x7ffeb06fb030) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1596
No locals.
#13 0x00007ffff7b8ad95 in llama_context::graph_compute (this=0x7fff84000b70, gf=0x7ffeb06fb030, batched=false) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1412
        n_threads = 16
        tp = 0x0
        status = 32767
        __func__ = "graph_compute"
#14 0x00007ffff7b875c0 in llama_context::process_ubatch (this=0x7fff84000b70, ubatch=..., gtype=LLM_GRAPH_TYPE_DECODER, mstate=0x7fff8424d770, ret=@0x7fffa0fda2fc: 32767) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:710
        __func__ = "process_ubatch"
        gf = 0x7ffeb06fb030
--Type <RET> for more, q to quit, c to continue without paging--
        res = std::unique_ptr<llm_graph_result_i> = {get() = 0x7fff8424cfd0}
        status = GGML_STATUS_SUCCESS
#15 0x00007ffff7b88fdd in llama_context::decode (this=0x7fff84000b70, inp_batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1051
        status = 32767
        ubatch = @0x7fff841e2020: {equal_seqs = false, n_tokens = 1, n_seq_tokens = 1, n_seqs = 1, token = 0x7fffa0fda6e0, embd = 0x0, pos = 0x7fff8424d3e0, n_seq_id = 0x7fff8424d320, seq_id = 0x7fff84564b40, output = 0x7fff8424cf90 "\001\222\334{\370\177"}
        res = std::unique_ptr<llm_graph_result_i> = {get() = 0x7fff841e2020}
        t_logits = 0x7ffff7ce7204 <std::_Vector_base<double, std::allocator<double> >::~_Vector_base()+78>
        t_embd = 0x0
        __func__ = "decode"
        batch_allocr = {batch = {n_tokens = 1, token = 0x7fffa0fda6e0, embd = 0x0, pos = 0x7fff8424d3e0, n_seq_id = 0x7fff8424d320, seq_id = 0x7fff84564b40, logits = 0x7fff84920140 "\001\231\334{\370\177"}, seq_id_0 = {_M_elems = {0}}, pos = std::vector of length 1, capacity 1 = {32}, n_seq_id = std::vector of length 1, capacity 1 = {1}, 
          seq_id = std::vector of length 2, capacity 2 = {0x7fffa0fda448, 0x0}, logits = std::vector of length 1, capacity 1 = {1 '\001'}}
        batch = @0x7fffa0fda410: {n_tokens = 1, token = 0x7fffa0fda6e0, embd = 0x0, pos = 0x7fff8424d3e0, n_seq_id = 0x7fff8424d320, seq_id = 0x7fff84564b40, logits = 0x7fff84920140 "\001\231\334{\370\177"}
        vocab = @0x555555fa2458: {pimpl = std::unique_ptr<llama_vocab::impl> = {get() = 0x555555fa25e0}}
        hparams = @0x555555fa0928: {vocab_only = false, rope_finetuned = false, use_par_res = false, swin_norm = false, n_ctx_train = 8192, n_embd = 960, n_embd_features = 0, n_layer = 32, n_rot = 64, n_embd_head_k = 64, n_embd_head_v = 64, n_expert = 0, n_expert_used = 0, n_rel_attn_bkts = 0, n_embd_head_k_mla = 0, n_embd_head_v_mla = 0, posnet = {
            n_embd = 0, n_layer = 0}, convnext = {n_embd = 0, n_layer = 0}, n_head_arr = {_M_elems = {15 <repeats 32 times>, 0 <repeats 480 times>}}, n_head_kv_arr = {_M_elems = {5 <repeats 32 times>, 0 <repeats 480 times>}}, n_ff_arr = {_M_elems = {2560 <repeats 32 times>, 0 <repeats 480 times>}}, n_layer_dense_lead = 0, n_lora_q = 0, n_lora_kv = 0, 
          n_ff_exp = 0, n_ff_shexp = 0, n_expert_shared = 0, n_norm_groups = 0, expert_weights_scale = 0, expert_weights_norm = false, expert_gating_func = 0, moe_every_n_layers = 0, f_norm_eps = 0, f_norm_rms_eps = 9.99999975e-06, f_norm_group_eps = 0, f_attn_logit_softcapping = 50, f_final_logit_softcapping = 30, rescale_every_n_layers = 0, 
          time_mix_extra_dim = 0, time_decay_extra_dim = 0, wkv_head_size = 0, token_shift_count = 2, n_lora_decay = 0, n_lora_iclr = 0, n_lora_value_res_mix = 0, n_lora_gate = 0, rope_attn_factor = 1, rope_freq_base_train = 100000, rope_freq_base_train_swa = 100000, rope_freq_scale_train = 1, rope_freq_scale_train_swa = 1, n_ctx_orig_yarn = 8192, 
          rope_yarn_log_mul = 0, rope_sections = {_M_elems = {0, 0, 0, 0}}, swa_type = LLAMA_SWA_TYPE_NONE, n_swa = 0, swa_layers = {_M_elems = {false <repeats 512 times>}}, ssm_d_conv = 0, ssm_d_inner = 0, ssm_d_state = 0, ssm_dt_rank = 0, ssm_dt_b_c_rms = false, f_clamp_kqv = 0, f_max_alibi_bias = 0, f_logit_scale = 0, f_residual_scale = 0, 
          f_embedding_scale = 0, f_attention_scale = 0, causal_attn = true, use_alibi = false, attn_soft_cap = false, use_kq_norm = true, n_cls_out = 1, n_moe_layer_step = 0, n_no_rope_layer_step = 4, n_attn_temp_floor_scale = 8192, f_attn_temp_scale = 0.100000001, dec_start_token_id = -1, pooling_type = LLAMA_POOLING_TYPE_NONE, 
          rope_type = LLAMA_ROPE_TYPE_NORM, rope_scaling_type_train = LLAMA_ROPE_SCALING_TYPE_LINEAR}
        n_vocab = 49152
        n_tokens_all = 1
        n_embd = 960
        embd_pooled = false
        n_outputs_all = 1
        did_optimize = false
        mstate = std::unique_ptr<llama_memory_state_i> = {get() = 0x7fff8424d770}
        n_outputs_prev = 0
#16 0x00007ffff7b8fbd9 in llama_decode (ctx=0x7fff84000b70, batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:2812
        ret = 32767
        __func__ = "llama_decode"
#17 0x00005555555d5b9f in operator() (__closure=0x555556c500f8) at /home/ubuntu/sources/llama.cpp/tests/test-thread-safety.cpp:120
        token = 198
        i = 27
        ctx = std::unique_ptr<llama_context> = {get() = 0x7fff84000b70}
        vocab = 0x555555fa2458
        sampler = std::unique_ptr<common_sampler> = {get() = 0x555556c52110}
        batch = {n_tokens = 1, token = 0x7fffa0fda6e0, embd = 0x0, pos = 0x0, n_seq_id = 0x0, seq_id = 0x0, logits = 0x0}
        result = "Hello, my name is [Your Name]. I'm a [Job Title] and I'll be starting [Date] as a [Employer's Name].\n"
        model = 0x555555fa0900
        c = 3
        m = 0
        num_contexts = @0x7fffffffcb4c: 12
        num_models = @0x7fffffffcb48: 1
        cparams = @0x7fffffffcc60: {n_ctx = 4096, n_batch = 2048, n_ubatch = 512, n_seq_max = 12, n_threads = 16, n_threads_batch = 16, rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED, pooling_type = LLAMA_POOLING_TYPE_UNSPECIFIED, attention_type = LLAMA_ATTENTION_TYPE_UNSPECIFIED, rope_freq_base = 0, rope_freq_scale = 0, yarn_ext_factor = -1, 
          yarn_attn_factor = 1, yarn_beta_fast = 32, yarn_beta_slow = 1, yarn_orig_ctx = 0, defrag_thold = 0.100000001, cb_eval = 0x0, cb_eval_user_data = 0x0, type_k = GGML_TYPE_F16, type_v = GGML_TYPE_F16, abort_callback = 0x0, abort_callback_data = 0x0, embeddings = false, offload_kqv = true, flash_attn = false, no_perf = false, op_offload = true, 
          swa_full = false}
        failed = std::atomic<bool> = { false }
        params = @0x7fffffffcd00: {n_predict = 100, n_ctx = 4096, n_batch = 2048, n_ubatch = 512, n_keep = 0, n_chunks = -1, n_parallel = 12, n_sequences = 1, grp_attn_n = 1, grp_attn_w = 512, n_print = -1, rope_freq_base = 0, rope_freq_scale = 0, yarn_ext_factor = -1, yarn_attn_factor = 1, yarn_beta_fast = 32, yarn_beta_slow = 1, yarn_orig_ctx = 0, 
          defrag_thold = 0.100000001, devices = std::vector of length 0, capacity 0, n_gpu_layers = 99, main_gpu = 0, tensor_split = {0 <repeats 128 times>}, split_mode = LLAMA_SPLIT_MODE_LAYER, cpuparams = {n_threads = 16, cpumask = {false <repeats 512 times>}, mask_valid = false, priority = GGML_SCHED_PRIO_NORMAL, strict_cpu = false, poll = 50}, 
          cpuparams_batch = {n_threads = 16, cpumask = {false <repeats 512 times>}, mask_valid = false, priority = GGML_SCHED_PRIO_NORMAL, strict_cpu = false, poll = 50}, cb_eval = 0x0, cb_eval_user_data = 0x0, numa = GGML_NUMA_STRATEGY_DISABLED, rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED, pooling_type = LLAMA_POOLING_TYPE_UNSPECIFIED, 
          attention_type = LLAMA_ATTENTION_TYPE_UNSPECIFIED, sampling = {seed = 4294967295, n_prev = 64, n_probs = 0, min_keep = 0, top_k = 40, top_p = 0.949999988, min_p = 0.0500000007, xtc_probability = 0, xtc_threshold = 0.100000001, typ_p = 1, temp = 0.800000012, dynatemp_range = 0, dynatemp_exponent = 1, penalty_last_n = 64, penalty_repeat = 1, 
            penalty_freq = 0, penalty_present = 0, dry_multiplier = 0, dry_base = 1.75, dry_allowed_length = 2, dry_penalty_last_n = -1, mirostat = 0, top_n_sigma = -1, mirostat_tau = 5, mirostat_eta = 0.100000001, ignore_eos = false, no_perf = false, timing_per_token = false, dry_sequence_breakers = std::vector of length 4, capacity 4 = {"\n", ":", 
              "\"", "*"}, samplers = std::vector of length 9, capacity 9 = {COMMON_SAMPLER_TYPE_PENALTIES, COMMON_SAMPLER_TYPE_DRY, COMMON_SAMPLER_TYPE_TOP_N_SIGMA, COMMON_SAMPLER_TYPE_TOP_K, COMMON_SAMPLER_TYPE_TYPICAL_P, COMMON_SAMPLER_TYPE_TOP_P, COMMON_SAMPLER_TYPE_MIN_P, COMMON_SAMPLER_TYPE_XTC, COMMON_SAMPLER_TYPE_TEMPERATURE}, grammar = "", 
            grammar_lazy = false, grammar_triggers = std::vector of length 0, capacity 0, preserved_tokens = std::set with 0 elements, logit_bias = std::vector of length 0, capacity 0}, speculative = {devices = std::vector of length 0, capacity 0, n_ctx = 0, n_max = 16, n_min = 0, n_gpu_layers = -1, p_split = 0.100000001, p_min = 0.75, cpuparams = {
              n_threads = 16, cpumask = {false <repeats 512 times>}, mask_valid = false, priority = GGML_SCHED_PRIO_NORMAL, strict_cpu = false, poll = 50}, cpuparams_batch = {n_threads = 16, cpumask = {false <repeats 512 times>}, mask_valid = false, priority = GGML_SCHED_PRIO_NORMAL, strict_cpu = false, poll = 50}, model = {path = "", url = "", 
              hf_repo = "", hf_file = ""}}, vocoder = {model = {path = "", url = "", hf_repo = "", hf_file = ""}, speaker_file = "", use_guide_tokens = false}, model = {path = "/home/ubuntu/models/SmolLM2-360M-Instruct-BF16.gguf", url = "", hf_repo = "", hf_file = ""}, model_alias = "", hf_token = "", prompt = "Hello, my name is", system_prompt = "", 
          prompt_file = "", path_prompt_cache = "", input_prefix = "", input_suffix = "", lookup_cache_static = "", lookup_cache_dynamic = "", logits_file = "", in_files = std::vector of length 0, capacity 0, antiprompt = std::vector of length 0, capacity 0, kv_overrides = std::vector of length 0, capacity 0, 
          tensor_buft_overrides = std::vector of length 0, capacity 0, lora_init_without_apply = false, lora_adapters = std::vector of length 0, capacity 0, control_vectors = std::vector of length 0, capacity 0, verbosity = 0, control_vector_layer_start = -1, control_vector_layer_end = -1, offline = false, ppl_stride = 0, ppl_output_type = 0, 
          hellaswag = false, hellaswag_tasks = 400, winogrande = false, winogrande_tasks = 0, multiple_choice = false, multiple_choice_tasks = 0, kl_divergence = false, usage = false, completion = false, use_color = false, special = false, interactive = false, interactive_first = false, prompt_cache_all = false, prompt_cache_ro = false, escape = true, 
          multiline_input = false, simple_io = false, cont_batching = true, flash_attn = false, no_perf = false, ctx_shift = true, swa_full = false, input_prefix_bos = false, use_mmap = true, use_mlock = false, verbose_prompt = false, display_prompt = true, no_kv_offload = false, warmup = true, check_tensors = false, no_op_offload = false, 
          single_turn = false, cache_type_k = GGML_TYPE_F16, cache_type_v = GGML_TYPE_F16, conversation_mode = COMMON_CONVERSATION_MODE_AUTO, mmproj = {path = "", url = "", hf_repo = "", hf_file = ""}, mmproj_use_gpu = true, no_mmproj = false, image = std::vector of length 0, capacity 0, embedding = false, embd_normalize = 2, embd_out = "", 
          embd_sep = "\n", reranking = false, port = 8080, timeout_read = 600, timeout_write = 600, n_threads_http = -1, n_cache_reuse = 0, hostname = "127.0.0.1", public_path = "", chat_template = "", use_jinja = false, enable_chat_template = true, reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK, reasoning_budget = -1, prefill_assistant = true, 
          api_keys = std::vector of length 0, capacity 0, ssl_file_key = "", ssl_file_cert = "", webui = true, endpoint_slots = false, endpoint_props = false, endpoint_metrics = false, log_json = false, slot_save_path = "", slot_prompt_similarity = 0.5, is_pp_shared = false, n_pp = std::vector of length 0, capacity 0, 
          n_tg = std::vector of length 0, capacity 0, n_pl = std::vector of length 0, capacity 0, context_files = std::vector of length 0, capacity 0, chunk_size = 64, chunk_separator = "\n", n_junk = 250, i_pos = -1, n_out_freq = 10, n_save_freq = 0, i_chunk = 0, process_output = false, compute_ppl = true, parse_special = false, n_pca_batch = 100, 
          n_pca_iterations = 1000, cvector_dimre_method = DIMRE_METHOD_PCA, cvector_positive_file = "tools/cvector-generator/positive.txt", cvector_negative_file = "tools/cvector-generator/negative.txt", spm_infill = false, batched_bench_output_jsonl = false, out_file = "", load_progress_callback = 0x0, load_progress_callback_user_data = 0x0}
#18 0x00005555555d6c98 in std::__invoke_impl<void, main(int, char**)::<lambda()> >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
No locals.
#19 0x00005555555d6c5b in std::__invoke<main(int, char**)::<lambda()> >(struct {...} &&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:96
No locals.
#20 0x00005555555d6c08 in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x555556c500f8) at /usr/include/c++/13/bits/std_thread.h:292
No locals.
#21 0x00005555555d6bdc in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::operator()(void) (this=0x555556c500f8) at /usr/include/c++/13/bits/std_thread.h:299
No locals.
#22 0x00005555555d6bc0 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > > >::_M_run(void) (this=0x555556c500f0) at /usr/include/c++/13/bits/std_thread.h:244
No locals.
#23 0x00007ffff5eecdb4 in std::execute_native_thread_routine (__p=0x555556c500f0) at ../../../../../src/libstdc++-v3/src/c++11/thread.cc:104
        __t = <optimized out>
#24 0x00007ffff5a9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
        ret = <optimized out>
        pd = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140735894507520, 7532769655212744952, 140735894507520, -160, 2, 140737488341088, 7532769655225327864, 7532671421702873336}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#25 0x00007ffff5b29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
No locals.
(gdb) info threads
  Id   Target Id                                           Frame 
  1    Thread 0x7ffff4c84000 (LWP 45685) "test-thread-saf" 0x00007ffff5a98d71 in __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=45701, futex_word=0x7fffa1fff2d0) at ./nptl/futex-internal.c:57
  2    Thread 0x7fffb785b000 (LWP 45686) "cuda00001400006" 0x00007ffff5b1b4cd in __GI___poll (fds=0x555555f25fb0, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  3    Thread 0x7fffb5bdf000 (LWP 45698) "test-thread-saf" 0x00007ffff5a98d71 in __futex_abstimed_wait_common64 (private=32767, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5555558f6cbc <common_log_main()::log+92>) at ./nptl/futex-internal.c:57
  4    Thread 0x7fffb53de000 (LWP 45699) "cuda-EvtHandlr"  0x00007ffff5b1b4cd in __GI___poll (fds=0x7fff9c000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  6    Thread 0x7fffa1fff000 (LWP 45701) "test-thread-saf" 0x00007fffea21a44a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
* 8    Thread 0x7fffa0ffd000 (LWP 45703) "test-thread-saf" __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
  9    Thread 0x7fff99ff0000 (LWP 45704) "test-thread-saf" 0x00007ffff7fc3e36 in ?? ()
  10   Thread 0x7fff997ef000 (LWP 45705) "test-thread-saf" 0x00007fffea44f83e in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  11   Thread 0x7fff98fee000 (LWP 45706) "test-thread-saf" 0x00007fffea44f841 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  12   Thread 0x7fff79fff000 (LWP 45707) "test-thread-saf" 0x00007ffff7fc3e36 in ?? ()
  13   Thread 0x7fff797fe000 (LWP 45708) "test-thread-saf" 0x00007ffff7fc3e36 in ?? ()
  14   Thread 0x7fff78ffd000 (LWP 45709) "test-thread-saf" 0x00007fffea44f80d in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  15   Thread 0x7fff75fff000 (LWP 45710) "test-thread-saf" 0x00007fffea44f841 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  16   Thread 0x7fff757fe000 (LWP 45711) "test-thread-saf" 0x00007fffea44f8f0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1

@jeffbolznv (Collaborator):

> not all relevant resources of the backend are stored in relation to the backend context yet, so multiple contexts can use the same descriptors, for example.

OK, I'll take a look at this, as a start, and see how far it gets us.

@0cc4m (Collaborator) commented Jun 16, 2025

It seems to be working now with Vulkan, in my tests.

slaren merged commit 6adc3c3 into master on Jun 16, 2025 (54 of 55 checks passed)
slaren deleted the sl/thread-safety-test branch (June 16, 2025 15:11)
Labels: devops, ggml, testing