llama : add thread safety test #14035

Merged: 16 commits merged into master from sl/thread-safety-test on Jun 16, 2025

Conversation

@slaren (Member) commented Jun 5, 2025

Basic thread safety test that loads a copy of the model on each GPU and on the CPU, and runs inference with multiple contexts in different threads.

llama : ignore main_gpu <= 0 if there are no GPUs

ggml-ci
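
For orientation, the structure of the test is roughly the following. This is a simplified sketch, not the actual tests/test-thread-safety.cpp: it assumes the llama.h C API, creates several contexts for a single model, and omits the per-device model copies, tokenization, sampling, and error handling that the real test performs.

// One thread per context, all contexts sharing one llama_model.
#include <thread>
#include <vector>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    // "model.gguf" is a placeholder path
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);

    const int n_contexts = 4; // e.g. the value passed via -np
    std::vector<std::thread> threads;

    for (int c = 0; c < n_contexts; ++c) {
        threads.emplace_back([model]() {
            llama_context_params cparams = llama_context_default_params();
            llama_context * ctx = llama_init_from_model(model, cparams);

            // ... tokenize the prompt and call llama_decode() in a loop,
            //     sampling one token at a time (omitted) ...

            llama_free(ctx);
        });
    }

    for (auto & t : threads) {
        t.join();
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
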
slaren requested a review from ggerganov as a code owner (June 5, 2025 17:03)
github-actions bot added the testing and devops labels (Jun 5, 2025)
@ggerganov (Member):

Maybe we can use an even smaller model for this test:

https://huggingface.co/ggml-org/models/tree/main/tinyllamas

@slaren (Member, Author) commented Jun 6, 2025

The SYCL ggml-ci does not seem to have libcurl installed yet.

@ggerganov (Member):

Should be installed now.

slaren force-pushed the sl/thread-safety-test branch from 2c5874e to a2a0289 (June 6, 2025 11:18)
ggml-ci
slaren force-pushed the sl/thread-safety-test branch from a2a0289 to b046f0c (June 6, 2025 12:13)
@slaren (Member, Author) commented Jun 6, 2025

There is some issue with this model (stories15M-q4_0.gguf) on the CPU, but I don't think it is a threading issue. It only seems to happen on CPUs with AVX512.

test-thread-safety: /home/ggml/work/llama.cpp/ggml/src/ggml-cpu/ops.cpp:2934: void ggml_compute_forward_silu_f32(const ggml_compute_params*, ggml_tensor*): Assertion `!isnan(x)' failed.

@ggerganov (Member):

I looked into it a bit and it does not seem to happen if OpenMP is disabled. I think it is something related to the repacking, but I didn't confirm. I'll take an extra look now.

@ggerganov (Member):

Pretty sure this is a data race, because the chunk counter is shared by all contexts:

template <int RM, int RN, int BM>
NOINLINE void gemm(int64_t m, int64_t n, int64_t BN) {
    static std::atomic<int64_t> current_chunk;

If I disable GGML_LLAMAFILE on ggml-2 the test works correctly even with OpenMP enabled.

@Djip007 Could you take a look and propose a fix?
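
For context, a standalone illustration of the failure mode (this is not the tinyBLAS code, just a minimal sketch of the same pattern): a function-local static atomic is a single object shared by every caller, so two contexts that each try to partition their own work through it interfere with each other.

#include <atomic>
#include <cstdio>
#include <thread>

// Each caller expects to walk chunks 0..n_chunks-1 on its own, but the
// function-local static is shared process-wide, so concurrent callers
// reset and consume the same counter and silently skip chunks.
static void process_chunks(const char * tag, int n_chunks) {
    static std::atomic<int> current_chunk;  // shared by ALL callers
    current_chunk.store(0);                 // each caller resets it
    int chunk;
    while ((chunk = current_chunk.fetch_add(1)) < n_chunks) {
        std::printf("%s processed chunk %d\n", tag, chunk);
    }
}

int main() {
    std::thread a(process_chunks, "context A", 8);
    std::thread b(process_chunks, "context B", 8);
    a.join();
    b.join();
    // Expected: 8 chunks per context. Observed: usually fewer in total,
    // because both threads share (and reset) the same counter.
    return 0;
}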

github-actions bot added the ggml label (Jun 6, 2025)
@slaren (Member, Author) commented Jun 6, 2025

19: 0.00.068.420 E common_download_file_single: invalid http status code received: 429

429 is "too many requests". @ngxson do you know if it is a temporary issue with huggingface, or are we being throttled?

@ngxson (Collaborator) commented Jun 6, 2025

The HF backend currently has a problem; the team is investigating, and it should be back very soon.

@slaren (Member, Author) commented Jun 6, 2025

@0cc4m @jeffbolznv The Vulkan backend is crashing on this test. It happens even with a single context per model (-np 1), which is not great because it would prevent, for example, evaluating a draft model simultaneously with the main model. I can hold off on merging this if you think it could be fixed in the near future; otherwise it might be better to disable the Vulkan CI tests for now.

@0cc4m (Collaborator) commented Jun 6, 2025

It is known that the Vulkan backend is not thread-safe yet, yes.

@jeffbolznv (Collaborator):

Are you planning to disable all Vulkan CI coverage due to this one failing test?

@slaren (Member, Author) commented Jun 9, 2025

I don't think that disabling the tests is the best option, but if I don't do that, people are going to complain that the CI is failing on every PR. I guess I could disable just this test on the Vulkan CI, but that would just make it easier to ignore this bug.

@jeffbolznv (Collaborator):

@0cc4m as a short term fix would it be crazy to just grab a mutex in most/all the ggml backend entry points? I tried that and it sort of works... I still get corruption sometimes in the output. But then again, I also get corruption sometimes using the CUDA backend so I'm not sure if this is the fault of ggml-vulkan.

With the CUDA backend I sometimes get errors like:

C:\github\jeffbolznv\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error
ggml_cuda_compute_forward: MUL_MAT failed

or

ggml_cuda_compute_forward: MUL_MAT failed
C:\github\jeffbolznv\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error
CUDA error: operation failed due to a previous error during capture
  current device: 0, in function ggml_cuda_compute_forward at C:\github\jeffbolznv\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2366

My command line is test-thread-safety.exe -m c:\models\llama-2-7b.Q4_0.gguf -ngl 99 -p "The meaning of life is" -n 128 -c 256 -ub 32 -np 4.

@slaren (Member, Author) commented Jun 9, 2025

> With the CUDA backend I sometimes get errors

The CUDA errors that I could reproduce should have been fixed in #14033. There may be other issues still, but none that I could reproduce on my system.

@jeffbolznv (Collaborator):

Are those fixes included in this branch? I just fetched pull/14035/head. If not, can you rebase?

@slaren (Member, Author) commented Jun 9, 2025

It should be merged now; it wasn't before.

@jeffbolznv (Collaborator):

Strange, I rebuilt and I'm still seeing the same failures at about the same rate (maybe 1 in 4 attempts). Which operation it says fails looks random.

@0cc4m (Collaborator) commented Jun 10, 2025

> @0cc4m as a short term fix would it be crazy to just grab a mutex in most/all the ggml backend entry points? I tried that and it sort of works... I still get corruption sometimes in the output. But then again, I also get corruption sometimes using the CUDA backend so I'm not sure if this is the fault of ggml-vulkan.

If that helps, we could do that, but the problem is that not all relevant resources of the backend are stored in relation to the backend context yet, so multiple contexts can use the same descriptors, for example. It's annoying to shift around these resources in a way that enables this, but maybe it is time to look at it.
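
For reference, the coarse-grained stopgap being discussed would look roughly like this. It is only a sketch: backend_compute below stands in for a real backend entry point and is not an actual ggml or ggml-vulkan symbol.

#include <mutex>
#include <thread>
#include <vector>

// Placeholder for a backend entry point that touches shared per-device
// state (descriptor pools, command buffers, ...) and is not thread-safe.
static void backend_compute(int graph_id) {
    (void) graph_id;
}

// Stopgap: serialize every caller behind one global mutex. This trades
// concurrency for correctness; contexts on the same backend can no longer
// overlap their host-side work, but they stop corrupting shared state.
static std::mutex g_backend_mutex;

static void backend_compute_locked(int graph_id) {
    std::lock_guard<std::mutex> lock(g_backend_mutex);
    backend_compute(graph_id);
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back(backend_compute_locked, i);
    }
    for (auto & t : threads) {
        t.join();
    }
    return 0;
}

The longer-term fix described above is to move those per-device resources into the per-context structures so that the global lock becomes unnecessary.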

@slaren (Member, Author) commented Jun 10, 2025

I was able to reproduce the CUDA issue. It only happens with the additional instance of the model that is (intended to be) run on the CPU only. I had done most of my testing before adding that instance, and I didn't expect it to cause issues with CUDA since the goal was mainly to test llama.cpp, so I didn't catch it before.

I tried a few things, but even with CUDA_LAUNCH_BLOCKING=1 and a global mutex on every ggml-backend function, it still crashes in the same way, so at this point I am out of ideas. It seems likely that it is some issue in CUDA related to graph capture when using multiple GPUs in the same thread. @agray3 mentioned that he already passed the issue to the CUDA graphs team, and in the meantime there is a workaround of building with CUDA graphs disabled.

@danilabagroff:

> Strange, I rebuilt and I'm still seeing the same failures at about the same rate (maybe 1 in 4 attempts). Which operation it says fails looks random.

After a hundred test runs, I can put in my two penny worth: I have commented out two lines to reduce the number of simultaneously running threads and avoid thread starvation:

    /// @brief Focus just on GPU
    const int num_models = gpu_dev_count;
...
    for (int m = 0; m < num_models; ++m) {
        /// @brief Let's tune this via args (like -ngl)
        // mparams.split_mode = LLAMA_SPLIT_MODE_NONE;
        // mparams.main_gpu   = m < gpu_dev_count ? m : -1;

Running test-thread-safety on:

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A2, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA A2)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel Xeon Processor (Icelake) 15 cores)
version: 5612 (29020e6b)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

...with args:

./test-thread-safety -m ~/models/SmolLM2-360M-Instruct-BF16.gguf -np 12 -p "Hello, my name is" -n 100 -ngl 99

... to finally abort:

Starting program: /home/ubuntu/builds/llama.cpp/debug/install/bin/test-thread-safety -m ~/models/SmolLM2-360M-Instruct-BF16.gguf -np 12 -p "Hello, my name is" -n 100 -ngl 99
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffb785b000 (LWP 45686)]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A2, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA A2)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel Xeon Processor (Icelake))
[New Thread 0x7fffb5bdf000 (LWP 45698)]
build: 5612 (29020e6b) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu (debug)
system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 
[New Thread 0x7fffb53de000 (LWP 45699)]
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA A2) - 14913 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 290 tensors from /home/ubuntu/models/SmolLM2-360M-Instruct-BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = SmolLM2 360M Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = SmolLM2
llama_model_loader: - kv   5:                         general.size_label str              = 360M
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,4]       = ["safetensors", "onnx", "transformers...
llama_model_loader: - kv   8:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 8192
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 960
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 2560
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 15
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 5
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 32
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 49152
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = smollm
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,49152]   = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,49152]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,48900]   = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  30:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type bf16:  225 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = BF16
print_info: file size   = 690.24 MiB (16.00 BPW) 
load: special tokens cache size = 17
load: token to piece cache size = 0.3170 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 960
print_info: n_layer          = 32
print_info: n_head           = 15
print_info: n_head_kv        = 5
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 320
print_info: n_embd_v_gqa     = 320
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2560
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 100000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 361.82 M
print_info: general.name     = SmolLM2 360M Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 49152
print_info: n_merges         = 48900
print_info: BOS token        = 1 '<|im_start|>'
print_info: EOS token        = 2 '<|im_end|>'
print_info: EOT token        = 0 '<|endoftext|>'
print_info: UNK token        = 0 '<|endoftext|>'
print_info: PAD token        = 2 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM REP token    = 4 '<reponame>'
print_info: EOG token        = 0 '<|endoftext|>'
print_info: EOG token        = 2 '<|im_end|>'
print_info: EOG token        = 4 '<reponame>'
print_info: max token length = 162
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:   CPU_Mapped model buffer size =    90.00 MiB
load_tensors:        CUDA0 model buffer size =   690.24 MiB
...............................................................................
[New Thread 0x7fffa8dde000 (LWP 45700)]
Creating context 1/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
[New Thread 0x7fffa1fff000 (LWP 45701)]
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
Creating context 2/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
[New Thread 0x7fffa17fe000 (LWP 45702)]
Creating context 3/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
[New Thread 0x7fffa0ffd000 (LWP 45703)]
Creating context 4/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
[New Thread 0x7fff99ff0000 (LWP 45704)]
Creating context 5/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
[New Thread 0x7fff997ef000 (LWP 45705)]
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
Creating context 6/12 for model 1/1
[New Thread 0x7fff98fee000 (LWP 45706)]
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
Creating context 7/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
[New Thread 0x7fff79fff000 (LWP 45707)]
Creating context 8/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
[New Thread 0x7fff797fe000 (LWP 45708)]
Creating context 9/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
[New Thread 0x7fff78ffd000 (LWP 45709)]
Creating context 10/12 for model 1/1
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
[New Thread 0x7fff75fff000 (LWP 45710)]
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
Creating context 11/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
[New Thread 0x7fff757fe000 (LWP 45711)]
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
Creating context 12/12 for model 1/1
llama_context: constructing llama_context
llama_context: n_seq_max     = 12
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 341
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (341) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_context:  CUDA_Host  output buffer size =     2.25 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers, 12 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
llama_context:      CUDA0 compute buffer size =   133.51 MiB
llama_context:  CUDA_Host compute buffer size =     9.85 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
Model 1/1, Context 1/12: Result: 'Hello, my name is [Your Name] and I will be your new [Your Position].'
Model 1/1, Context 3/12: Result: 'Hello, my name is [Your Name]. How can I help you today?'
[Thread 0x7fffa8dde000 (LWP 45700) exited]
[Thread 0x7fffa17fe000 (LWP 45702) exited]
Model 1/1, Context 11/12: Result: 'Hello, my name is [Name]. I'm here to help you with your programming needs. What programming language are you using?'
/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
ggml_cuda_compute_forward: MUL_MAT failed
CUDA error: operation failed due to a previous error during capture
  current device: 0, in function ggml_cuda_compute_forward at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2366
  err

Thread 8 "test-thread-saf" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffa0ffd000 (LWP 45703)]
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
warning: 44	./nptl/pthread_kill.c: No such file or directory
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff5a4527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff5a288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff610f836 in ggml_abort (file=0x7ffff6813728 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=75, fmt=0x7ffff681371d "CUDA error") at /home/ubuntu/sources/llama.cpp/ggml/src/ggml.c:221
#6  0x00007ffff62e7e37 in ggml_cuda_error (stmt=0x7ffff681573c "err", func=0x7ffff6815713 "ggml_cuda_compute_forward", file=0x7ffff6813728 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=2366, msg=0x7ffff4e97910 "operation failed due to a previous error during capture")
    at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75
#7  0x00007ffff62f1fbd in ggml_cuda_compute_forward (ctx=..., dst=0x7ffeb0902f60) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2366
#8  0x00007ffff62f33df in evaluate_and_capture_cuda_graph (cuda_ctx=0x7fff84000e20, cgraph=0x7fff841df250, graph_evaluated_or_captured=@0x7fffa0fda08b: false, use_cuda_graph=@0x7fffa0fda089: true, cuda_graph_update_required=@0x7fffa0fda08a: true) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2673
#9  0x00007ffff62f3a88 in ggml_backend_cuda_graph_compute (backend=0x7fff84001360, cgraph=0x7fff841df250) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2780
#10 0x00007ffff6127a02 in ggml_backend_graph_compute_async (backend=0x7fff84001360, cgraph=0x7fff841df250) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:334
#11 0x00007ffff612bb6d in ggml_backend_sched_compute_splits (sched=0x7fff8401cc90) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1404
#12 0x00007ffff612c809 in ggml_backend_sched_graph_compute_async (sched=0x7fff8401cc90, graph=0x7ffeb06fb030) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1596
#13 0x00007ffff7b8ad95 in llama_context::graph_compute (this=0x7fff84000b70, gf=0x7ffeb06fb030, batched=false) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1412
#14 0x00007ffff7b875c0 in llama_context::process_ubatch (this=0x7fff84000b70, ubatch=..., gtype=LLM_GRAPH_TYPE_DECODER, mstate=0x7fff8424d770, ret=@0x7fffa0fda2fc: 32767) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:710
#15 0x00007ffff7b88fdd in llama_context::decode (this=0x7fff84000b70, inp_batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1051
#16 0x00007ffff7b8fbd9 in llama_decode (ctx=0x7fff84000b70, batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:2812
#17 0x00005555555d5b9f in operator() (__closure=0x555556c500f8) at /home/ubuntu/sources/llama.cpp/tests/test-thread-safety.cpp:120
#18 0x00005555555d6c98 in std::__invoke_impl<void, main(int, char**)::<lambda()> >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#19 0x00005555555d6c5b in std::__invoke<main(int, char**)::<lambda()> >(struct {...} &&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:96
#20 0x00005555555d6c08 in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x555556c500f8) at /usr/include/c++/13/bits/std_thread.h:292
#21 0x00005555555d6bdc in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::operator()(void) (this=0x555556c500f8) at /usr/include/c++/13/bits/std_thread.h:299
#22 0x00005555555d6bc0 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > > >::_M_run(void) (this=0x555556c500f0) at /usr/include/c++/13/bits/std_thread.h:244
#23 0x00007ffff5eecdb4 in std::execute_native_thread_routine (__p=0x555556c500f0) at ../../../../../src/libstdc++-v3/src/c++11/thread.cc:104
#24 0x00007ffff5a9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#25 0x00007ffff5b29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb) bt full
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {0}}
        ret = <optimized out>
        pd = <optimized out>
        old_mask = <optimized out>
        ret = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
        resultvar = <optimized out>
        resultvar = <optimized out>
        __arg3 = <optimized out>
        __arg2 = <optimized out>
        __arg1 = <optimized out>
        _a3 = <optimized out>
        _a2 = <optimized out>
        _a1 = <optimized out>
        __futex = <optimized out>
        resultvar = <optimized out>
        __arg3 = <optimized out>
        __arg2 = <optimized out>
        __arg1 = <optimized out>
        _a3 = <optimized out>
        _a2 = <optimized out>
        _a1 = <optimized out>
        __futex = <optimized out>
        __private = <optimized out>
        __oldval = <optimized out>
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
No locals.
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
No locals.
#3  0x00007ffff5a4527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#4  0x00007ffff5a288ff in __GI_abort () at ./stdlib/abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x20, sa_sigaction = 0x20}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
#5  0x00007ffff610f836 in ggml_abort (file=0x7ffff6813728 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=75, fmt=0x7ffff681371d "CUDA error") at /home/ubuntu/sources/llama.cpp/ggml/src/ggml.c:221
        args = {{gp_offset = 24, fp_offset = 48, overflow_arg_area = 0x7fffa0fd9f70, reg_save_area = 0x7fffa0fd9eb0}}
#6  0x00007ffff62e7e37 in ggml_cuda_error (stmt=0x7ffff681573c "err", func=0x7ffff6815713 "ggml_cuda_compute_forward", file=0x7ffff6813728 "/home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu", line=2366, msg=0x7ffff4e97910 "operation failed due to a previous error during capture")
    at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75
        id = 0
#7  0x00007ffff62f1fbd in ggml_cuda_compute_forward (ctx=..., dst=0x7ffeb0902f60) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2366
        err_ = cudaErrorStreamCaptureInvalidated
        err = cudaErrorStreamCaptureInvalidated
        __func__ = "ggml_cuda_compute_forward"
#8  0x00007ffff62f33df in evaluate_and_capture_cuda_graph (cuda_ctx=0x7fff84000e20, cgraph=0x7fff841df250, graph_evaluated_or_captured=@0x7fffa0fda08b: false, use_cuda_graph=@0x7fffa0fda089: true, cuda_graph_update_required=@0x7fffa0fda08a: true) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2673
        node = 0x7ffeb0902f60
        ok = true
        i = 41
        integrated = false
        __PRETTY_FUNCTION__ = "void evaluate_and_capture_cuda_graph(ggml_backend_cuda_context*, ggml_cgraph*, bool&, bool&, bool&)"
        __func__ = "evaluate_and_capture_cuda_graph"
#9  0x00007ffff62f3a88 in ggml_backend_cuda_graph_compute (backend=0x7fff84001360, cgraph=0x7fff841df250) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2780
        cuda_ctx = 0x7fff84000e20
        disable_cuda_graphs_due_to_env = false
        use_cuda_graph = true
        cuda_graph_update_required = true
        __func__ = "ggml_backend_cuda_graph_compute"
        graph_evaluated_or_captured = false
#10 0x00007ffff6127a02 in ggml_backend_graph_compute_async (backend=0x7fff84001360, cgraph=0x7fff841df250) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:334
No locals.
#11 0x00007ffff612bb6d in ggml_backend_sched_compute_splits (sched=0x7fff8401cc90) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1404
        ec = GGML_STATUS_SUCCESS
        split = 0x7fff841df1e8
        split_backend_id = 0
        split_backend = 0x7fff84001360
        i = 1
        splits = 0x7fff841df130
#12 0x00007ffff612c809 in ggml_backend_sched_graph_compute_async (sched=0x7fff8401cc90, graph=0x7ffeb06fb030) at /home/ubuntu/sources/llama.cpp/ggml/src/ggml-backend.cpp:1596
No locals.
#13 0x00007ffff7b8ad95 in llama_context::graph_compute (this=0x7fff84000b70, gf=0x7ffeb06fb030, batched=false) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1412
        n_threads = 16
        tp = 0x0
        status = 32767
        __func__ = "graph_compute"
#14 0x00007ffff7b875c0 in llama_context::process_ubatch (this=0x7fff84000b70, ubatch=..., gtype=LLM_GRAPH_TYPE_DECODER, mstate=0x7fff8424d770, ret=@0x7fffa0fda2fc: 32767) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:710
        __func__ = "process_ubatch"
        gf = 0x7ffeb06fb030
--Type <RET> for more, q to quit, c to continue without paging--
        res = std::unique_ptr<llm_graph_result_i> = {get() = 0x7fff8424cfd0}
        status = GGML_STATUS_SUCCESS
#15 0x00007ffff7b88fdd in llama_context::decode (this=0x7fff84000b70, inp_batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:1051
        status = 32767
        ubatch = @0x7fff841e2020: {equal_seqs = false, n_tokens = 1, n_seq_tokens = 1, n_seqs = 1, token = 0x7fffa0fda6e0, embd = 0x0, pos = 0x7fff8424d3e0, n_seq_id = 0x7fff8424d320, seq_id = 0x7fff84564b40, output = 0x7fff8424cf90 "\001\222\334{\370\177"}
        res = std::unique_ptr<llm_graph_result_i> = {get() = 0x7fff841e2020}
        t_logits = 0x7ffff7ce7204 <std::_Vector_base<double, std::allocator<double> >::~_Vector_base()+78>
        t_embd = 0x0
        __func__ = "decode"
        batch_allocr = {batch = {n_tokens = 1, token = 0x7fffa0fda6e0, embd = 0x0, pos = 0x7fff8424d3e0, n_seq_id = 0x7fff8424d320, seq_id = 0x7fff84564b40, logits = 0x7fff84920140 "\001\231\334{\370\177"}, seq_id_0 = {_M_elems = {0}}, pos = std::vector of length 1, capacity 1 = {32}, n_seq_id = std::vector of length 1, capacity 1 = {1}, 
          seq_id = std::vector of length 2, capacity 2 = {0x7fffa0fda448, 0x0}, logits = std::vector of length 1, capacity 1 = {1 '\001'}}
        batch = @0x7fffa0fda410: {n_tokens = 1, token = 0x7fffa0fda6e0, embd = 0x0, pos = 0x7fff8424d3e0, n_seq_id = 0x7fff8424d320, seq_id = 0x7fff84564b40, logits = 0x7fff84920140 "\001\231\334{\370\177"}
        vocab = @0x555555fa2458: {pimpl = std::unique_ptr<llama_vocab::impl> = {get() = 0x555555fa25e0}}
        hparams = @0x555555fa0928: {vocab_only = false, rope_finetuned = false, use_par_res = false, swin_norm = false, n_ctx_train = 8192, n_embd = 960, n_embd_features = 0, n_layer = 32, n_rot = 64, n_embd_head_k = 64, n_embd_head_v = 64, n_expert = 0, n_expert_used = 0, n_rel_attn_bkts = 0, n_embd_head_k_mla = 0, n_embd_head_v_mla = 0, posnet = {
            n_embd = 0, n_layer = 0}, convnext = {n_embd = 0, n_layer = 0}, n_head_arr = {_M_elems = {15 <repeats 32 times>, 0 <repeats 480 times>}}, n_head_kv_arr = {_M_elems = {5 <repeats 32 times>, 0 <repeats 480 times>}}, n_ff_arr = {_M_elems = {2560 <repeats 32 times>, 0 <repeats 480 times>}}, n_layer_dense_lead = 0, n_lora_q = 0, n_lora_kv = 0, 
          n_ff_exp = 0, n_ff_shexp = 0, n_expert_shared = 0, n_norm_groups = 0, expert_weights_scale = 0, expert_weights_norm = false, expert_gating_func = 0, moe_every_n_layers = 0, f_norm_eps = 0, f_norm_rms_eps = 9.99999975e-06, f_norm_group_eps = 0, f_attn_logit_softcapping = 50, f_final_logit_softcapping = 30, rescale_every_n_layers = 0, 
          time_mix_extra_dim = 0, time_decay_extra_dim = 0, wkv_head_size = 0, token_shift_count = 2, n_lora_decay = 0, n_lora_iclr = 0, n_lora_value_res_mix = 0, n_lora_gate = 0, rope_attn_factor = 1, rope_freq_base_train = 100000, rope_freq_base_train_swa = 100000, rope_freq_scale_train = 1, rope_freq_scale_train_swa = 1, n_ctx_orig_yarn = 8192, 
          rope_yarn_log_mul = 0, rope_sections = {_M_elems = {0, 0, 0, 0}}, swa_type = LLAMA_SWA_TYPE_NONE, n_swa = 0, swa_layers = {_M_elems = {false <repeats 512 times>}}, ssm_d_conv = 0, ssm_d_inner = 0, ssm_d_state = 0, ssm_dt_rank = 0, ssm_dt_b_c_rms = false, f_clamp_kqv = 0, f_max_alibi_bias = 0, f_logit_scale = 0, f_residual_scale = 0, 
          f_embedding_scale = 0, f_attention_scale = 0, causal_attn = true, use_alibi = false, attn_soft_cap = false, use_kq_norm = true, n_cls_out = 1, n_moe_layer_step = 0, n_no_rope_layer_step = 4, n_attn_temp_floor_scale = 8192, f_attn_temp_scale = 0.100000001, dec_start_token_id = -1, pooling_type = LLAMA_POOLING_TYPE_NONE, 
          rope_type = LLAMA_ROPE_TYPE_NORM, rope_scaling_type_train = LLAMA_ROPE_SCALING_TYPE_LINEAR}
        n_vocab = 49152
        n_tokens_all = 1
        n_embd = 960
        embd_pooled = false
        n_outputs_all = 1
        did_optimize = false
        mstate = std::unique_ptr<llama_memory_state_i> = {get() = 0x7fff8424d770}
        n_outputs_prev = 0
#16 0x00007ffff7b8fbd9 in llama_decode (ctx=0x7fff84000b70, batch=...) at /home/ubuntu/sources/llama.cpp/src/llama-context.cpp:2812
        ret = 32767
        __func__ = "llama_decode"
#17 0x00005555555d5b9f in operator() (__closure=0x555556c500f8) at /home/ubuntu/sources/llama.cpp/tests/test-thread-safety.cpp:120
        token = 198
        i = 27
        ctx = std::unique_ptr<llama_context> = {get() = 0x7fff84000b70}
        vocab = 0x555555fa2458
        sampler = std::unique_ptr<common_sampler> = {get() = 0x555556c52110}
        batch = {n_tokens = 1, token = 0x7fffa0fda6e0, embd = 0x0, pos = 0x0, n_seq_id = 0x0, seq_id = 0x0, logits = 0x0}
        result = "Hello, my name is [Your Name]. I'm a [Job Title] and I'll be starting [Date] as a [Employer's Name].\n"
        model = 0x555555fa0900
        c = 3
        m = 0
        num_contexts = @0x7fffffffcb4c: 12
        num_models = @0x7fffffffcb48: 1
        cparams = @0x7fffffffcc60: {n_ctx = 4096, n_batch = 2048, n_ubatch = 512, n_seq_max = 12, n_threads = 16, n_threads_batch = 16, rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED, pooling_type = LLAMA_POOLING_TYPE_UNSPECIFIED, attention_type = LLAMA_ATTENTION_TYPE_UNSPECIFIED, rope_freq_base = 0, rope_freq_scale = 0, yarn_ext_factor = -1, 
          yarn_attn_factor = 1, yarn_beta_fast = 32, yarn_beta_slow = 1, yarn_orig_ctx = 0, defrag_thold = 0.100000001, cb_eval = 0x0, cb_eval_user_data = 0x0, type_k = GGML_TYPE_F16, type_v = GGML_TYPE_F16, abort_callback = 0x0, abort_callback_data = 0x0, embeddings = false, offload_kqv = true, flash_attn = false, no_perf = false, op_offload = true, 
          swa_full = false}
        failed = std::atomic<bool> = { false }
        params = @0x7fffffffcd00: {n_predict = 100, n_ctx = 4096, n_batch = 2048, n_ubatch = 512, n_keep = 0, n_chunks = -1, n_parallel = 12, n_sequences = 1, grp_attn_n = 1, grp_attn_w = 512, n_print = -1, rope_freq_base = 0, rope_freq_scale = 0, yarn_ext_factor = -1, yarn_attn_factor = 1, yarn_beta_fast = 32, yarn_beta_slow = 1, yarn_orig_ctx = 0, 
          defrag_thold = 0.100000001, devices = std::vector of length 0, capacity 0, n_gpu_layers = 99, main_gpu = 0, tensor_split = {0 <repeats 128 times>}, split_mode = LLAMA_SPLIT_MODE_LAYER, cpuparams = {n_threads = 16, cpumask = {false <repeats 512 times>}, mask_valid = false, priority = GGML_SCHED_PRIO_NORMAL, strict_cpu = false, poll = 50}, 
          cpuparams_batch = {n_threads = 16, cpumask = {false <repeats 512 times>}, mask_valid = false, priority = GGML_SCHED_PRIO_NORMAL, strict_cpu = false, poll = 50}, cb_eval = 0x0, cb_eval_user_data = 0x0, numa = GGML_NUMA_STRATEGY_DISABLED, rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED, pooling_type = LLAMA_POOLING_TYPE_UNSPECIFIED, 
          attention_type = LLAMA_ATTENTION_TYPE_UNSPECIFIED, sampling = {seed = 4294967295, n_prev = 64, n_probs = 0, min_keep = 0, top_k = 40, top_p = 0.949999988, min_p = 0.0500000007, xtc_probability = 0, xtc_threshold = 0.100000001, typ_p = 1, temp = 0.800000012, dynatemp_range = 0, dynatemp_exponent = 1, penalty_last_n = 64, penalty_repeat = 1, 
            penalty_freq = 0, penalty_present = 0, dry_multiplier = 0, dry_base = 1.75, dry_allowed_length = 2, dry_penalty_last_n = -1, mirostat = 0, top_n_sigma = -1, mirostat_tau = 5, mirostat_eta = 0.100000001, ignore_eos = false, no_perf = false, timing_per_token = false, dry_sequence_breakers = std::vector of length 4, capacity 4 = {"\n", ":", 
              "\"", "*"}, samplers = std::vector of length 9, capacity 9 = {COMMON_SAMPLER_TYPE_PENALTIES, COMMON_SAMPLER_TYPE_DRY, COMMON_SAMPLER_TYPE_TOP_N_SIGMA, COMMON_SAMPLER_TYPE_TOP_K, COMMON_SAMPLER_TYPE_TYPICAL_P, COMMON_SAMPLER_TYPE_TOP_P, COMMON_SAMPLER_TYPE_MIN_P, COMMON_SAMPLER_TYPE_XTC, COMMON_SAMPLER_TYPE_TEMPERATURE}, grammar = "", 
            grammar_lazy = false, grammar_triggers = std::vector of length 0, capacity 0, preserved_tokens = std::set with 0 elements, logit_bias = std::vector of length 0, capacity 0}, speculative = {devices = std::vector of length 0, capacity 0, n_ctx = 0, n_max = 16, n_min = 0, n_gpu_layers = -1, p_split = 0.100000001, p_min = 0.75, cpuparams = {
              n_threads = 16, cpumask = {false <repeats 512 times>}, mask_valid = false, priority = GGML_SCHED_PRIO_NORMAL, strict_cpu = false, poll = 50}, cpuparams_batch = {n_threads = 16, cpumask = {false <repeats 512 times>}, mask_valid = false, priority = GGML_SCHED_PRIO_NORMAL, strict_cpu = false, poll = 50}, model = {path = "", url = "", 
              hf_repo = "", hf_file = ""}}, vocoder = {model = {path = "", url = "", hf_repo = "", hf_file = ""}, speaker_file = "", use_guide_tokens = false}, model = {path = "/home/ubuntu/models/SmolLM2-360M-Instruct-BF16.gguf", url = "", hf_repo = "", hf_file = ""}, model_alias = "", hf_token = "", prompt = "Hello, my name is", system_prompt = "", 
          prompt_file = "", path_prompt_cache = "", input_prefix = "", input_suffix = "", lookup_cache_static = "", lookup_cache_dynamic = "", logits_file = "", in_files = std::vector of length 0, capacity 0, antiprompt = std::vector of length 0, capacity 0, kv_overrides = std::vector of length 0, capacity 0, 
          tensor_buft_overrides = std::vector of length 0, capacity 0, lora_init_without_apply = false, lora_adapters = std::vector of length 0, capacity 0, control_vectors = std::vector of length 0, capacity 0, verbosity = 0, control_vector_layer_start = -1, control_vector_layer_end = -1, offline = false, ppl_stride = 0, ppl_output_type = 0, 
          hellaswag = false, hellaswag_tasks = 400, winogrande = false, winogrande_tasks = 0, multiple_choice = false, multiple_choice_tasks = 0, kl_divergence = false, usage = false, completion = false, use_color = false, special = false, interactive = false, interactive_first = false, prompt_cache_all = false, prompt_cache_ro = false, escape = true, 
          multiline_input = false, simple_io = false, cont_batching = true, flash_attn = false, no_perf = false, ctx_shift = true, swa_full = false, input_prefix_bos = false, use_mmap = true, use_mlock = false, verbose_prompt = false, display_prompt = true, no_kv_offload = false, warmup = true, check_tensors = false, no_op_offload = false, 
          single_turn = false, cache_type_k = GGML_TYPE_F16, cache_type_v = GGML_TYPE_F16, conversation_mode = COMMON_CONVERSATION_MODE_AUTO, mmproj = {path = "", url = "", hf_repo = "", hf_file = ""}, mmproj_use_gpu = true, no_mmproj = false, image = std::vector of length 0, capacity 0, embedding = false, embd_normalize = 2, embd_out = "", 
          embd_sep = "\n", reranking = false, port = 8080, timeout_read = 600, timeout_write = 600, n_threads_http = -1, n_cache_reuse = 0, hostname = "127.0.0.1", public_path = "", chat_template = "", use_jinja = false, enable_chat_template = true, reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK, reasoning_budget = -1, prefill_assistant = true, 
          api_keys = std::vector of length 0, capacity 0, ssl_file_key = "", ssl_file_cert = "", webui = true, endpoint_slots = false, endpoint_props = false, endpoint_metrics = false, log_json = false, slot_save_path = "", slot_prompt_similarity = 0.5, is_pp_shared = false, n_pp = std::vector of length 0, capacity 0, 
          n_tg = std::vector of length 0, capacity 0, n_pl = std::vector of length 0, capacity 0, context_files = std::vector of length 0, capacity 0, chunk_size = 64, chunk_separator = "\n", n_junk = 250, i_pos = -1, n_out_freq = 10, n_save_freq = 0, i_chunk = 0, process_output = false, compute_ppl = true, parse_special = false, n_pca_batch = 100, 
          n_pca_iterations = 1000, cvector_dimre_method = DIMRE_METHOD_PCA, cvector_positive_file = "tools/cvector-generator/positive.txt", cvector_negative_file = "tools/cvector-generator/negative.txt", spm_infill = false, batched_bench_output_jsonl = false, out_file = "", load_progress_callback = 0x0, load_progress_callback_user_data = 0x0}
#18 0x00005555555d6c98 in std::__invoke_impl<void, main(int, char**)::<lambda()> >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
No locals.
#19 0x00005555555d6c5b in std::__invoke<main(int, char**)::<lambda()> >(struct {...} &&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:96
No locals.
#20 0x00005555555d6c08 in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x555556c500f8) at /usr/include/c++/13/bits/std_thread.h:292
No locals.
#21 0x00005555555d6bdc in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::operator()(void) (this=0x555556c500f8) at /usr/include/c++/13/bits/std_thread.h:299
No locals.
#22 0x00005555555d6bc0 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > > >::_M_run(void) (this=0x555556c500f0) at /usr/include/c++/13/bits/std_thread.h:244
No locals.
#23 0x00007ffff5eecdb4 in std::execute_native_thread_routine (__p=0x555556c500f0) at ../../../../../src/libstdc++-v3/src/c++11/thread.cc:104
        __t = <optimized out>
#24 0x00007ffff5a9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
        ret = <optimized out>
        pd = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140735894507520, 7532769655212744952, 140735894507520, -160, 2, 140737488341088, 7532769655225327864, 7532671421702873336}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#25 0x00007ffff5b29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
No locals.
(gdb) info threads
  Id   Target Id                                           Frame 
  1    Thread 0x7ffff4c84000 (LWP 45685) "test-thread-saf" 0x00007ffff5a98d71 in __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=45701, futex_word=0x7fffa1fff2d0) at ./nptl/futex-internal.c:57
  2    Thread 0x7fffb785b000 (LWP 45686) "cuda00001400006" 0x00007ffff5b1b4cd in __GI___poll (fds=0x555555f25fb0, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  3    Thread 0x7fffb5bdf000 (LWP 45698) "test-thread-saf" 0x00007ffff5a98d71 in __futex_abstimed_wait_common64 (private=32767, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5555558f6cbc <common_log_main()::log+92>) at ./nptl/futex-internal.c:57
  4    Thread 0x7fffb53de000 (LWP 45699) "cuda-EvtHandlr"  0x00007ffff5b1b4cd in __GI___poll (fds=0x7fff9c000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  6    Thread 0x7fffa1fff000 (LWP 45701) "test-thread-saf" 0x00007fffea21a44a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
* 8    Thread 0x7fffa0ffd000 (LWP 45703) "test-thread-saf" __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
  9    Thread 0x7fff99ff0000 (LWP 45704) "test-thread-saf" 0x00007ffff7fc3e36 in ?? ()
  10   Thread 0x7fff997ef000 (LWP 45705) "test-thread-saf" 0x00007fffea44f83e in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  11   Thread 0x7fff98fee000 (LWP 45706) "test-thread-saf" 0x00007fffea44f841 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  12   Thread 0x7fff79fff000 (LWP 45707) "test-thread-saf" 0x00007ffff7fc3e36 in ?? ()
  13   Thread 0x7fff797fe000 (LWP 45708) "test-thread-saf" 0x00007ffff7fc3e36 in ?? ()
  14   Thread 0x7fff78ffd000 (LWP 45709) "test-thread-saf" 0x00007fffea44f80d in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  15   Thread 0x7fff75fff000 (LWP 45710) "test-thread-saf" 0x00007fffea44f841 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  16   Thread 0x7fff757fe000 (LWP 45711) "test-thread-saf" 0x00007fffea44f8f0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1

@jeffbolznv (Collaborator):

> not all relevant resources of the backend are stored in relation to the backend context yet, so multiple contexts can use the same descriptors, for example.

OK, I'll take a look at this, as a start, and see how far it gets us.

@0cc4m (Collaborator) commented Jun 16, 2025

It seems to be working now with Vulkan, in my tests.

slaren merged commit 6adc3c3 into master on Jun 16, 2025 (54 of 55 checks passed)
slaren deleted the sl/thread-safety-test branch (June 16, 2025 15:11)
Labels: devops, ggml, testing