Describe the bug
The portable nightly build of Llama.cpp fails to initialise when the context size is set above 22528. This is with -sm layer across 3 Arc GPUs, and there is more than enough VRAM available.
How to reproduce
Steps to reproduce the error:
- Download a copy of Qwen3-30B-A3B-Q4_K_L.gguf.
- Download the latest nightly build of Llama.cpp from the releases section.
- Start the server with:
  ```
  ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2 ZES_ENABLE_SYSMAN=1 SYCL_CACHE_PERSISTENT=1 SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ./llama-server -c 22528 -ngl 999 -m /home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf --host 0.0.0.0 --port 8001 -sm layer --jinja
  ```
- If the context is set above 22528 (23552 in the log below), the engine crashes with the following error:
```
llama_kv_cache_init: kv_size = 23552, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 782.00 MiB
llama_kv_cache_init: SYCL1 KV buffer size = 736.00 MiB
llama_kv_cache_init: SYCL2 KV buffer size = 690.00 MiB
llama_init_from_model: KV self size = 2208.00 MiB, K (f16): 1104.00 MiB, V (f16): 1104.00 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 4334944256 Bytes of memory on device
ggml_gallocr_reserve_n: failed to allocate SYCL2 buffer of size 4334944256
ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 6065967104 Bytes of memory on device
ggml_gallocr_reserve_n: failed to allocate SYCL0 buffer of size 6065967104
ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 5626382848 Bytes of memory on device
ggml_gallocr_reserve_n: failed to allocate SYCL1 buffer of size 5626382848
llama_init_from_model: failed to allocate compute buffers
common_init_from_params: failed to create context with model '/home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf'
terminate called without an active exception
./llama-server: line 2: 142366 Aborted (core dumped) LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(cd "$(dirname "$0")";pwd) $(cd "$(dirname "$0")";pwd)/llama-server-bin "$@"
```
By comparison, with the context at or below 22528, the following KV cache log is generated:
```
llama_kv_cache_init: kv_size = 22528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 748.00 MiB
llama_kv_cache_init: SYCL1 KV buffer size = 704.00 MiB
llama_kv_cache_init: SYCL2 KV buffer size = 660.00 MiB
llama_init_from_model: KV self size = 2112.00 MiB, K (f16): 1056.00 MiB, V (f16): 1056.00 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model: SYCL0 compute buffer size = 2016.06 MiB
llama_init_from_model: SYCL1 compute buffer size = 2016.06 MiB
llama_init_from_model: SYCL2 compute buffer size = 4070.12 MiB
llama_init_from_model: SYCL_Host compute buffer size = 1440.19 MiB
llama_init_from_model: graph nodes = 3270 (with bs=4096), 2646 (with bs=1)
llama_init_from_model: graph splits = 4
```
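As a sanity check (not from the original report): the logged KV sizes are internally consistent if each token uses 2048 bytes of f16 K+V per layer, a figure derived from the logs themselves rather than the model card, and the failed allocation byte counts convert to roughly 4.1-5.8 GiB:

```
# Reproduce the logged "KV self size" values.
# Assumes 2048 bytes of f16 K+V per token per layer (derived from the logs above).
for kv in 22528 23552; do
  bytes=$(( kv * 48 * 2048 ))
  echo "kv_size=$kv -> KV self size = $(( bytes / 1048576 )) MiB"
done
# kv_size=22528 -> KV self size = 2112 MiB   (matches the working run)
# kv_size=23552 -> KV self size = 2208 MiB   (matches the failing run)

# Convert the failed compute-buffer allocations to MiB for comparison.
for b in 6065967104 5626382848 4334944256; do
  echo "$b bytes = $(( b / 1048576 )) MiB"
done
# 6065967104 bytes = 5784 MiB   (SYCL0)
# 5626382848 bytes = 5365 MiB   (SYCL1)
# 4334944256 bytes = 4134 MiB   (SYCL2)
```

So the KV cache itself grows by only 96 MiB between the two runs; it is the compute buffer allocation that fails.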
As the numbers above show, the increase in compute buffer size between a 22528 context and 24576 should be well within the available VRAM, yet the engine still fails to initialise.
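If it helps triage, here is a rough, untested bisection sketch to find the exact context cutoff, reusing the environment and paths from the reproduction command above. The timeout value, the 256-step granularity, and the grep pattern on the error text are my assumptions:

```
# Bisect -c between the last known-good (22528) and first known-bad (23552) values.
MODEL=/home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf
lo=22528; hi=23552
while [ $((hi - lo)) -gt 256 ]; do
  mid=$(( (lo + hi) / 2 / 256 * 256 ))   # keep -c a multiple of 256
  if ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2 ZES_ENABLE_SYSMAN=1 \
     SYCL_CACHE_PERSISTENT=1 SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 \
     timeout 120 ./llama-server -c "$mid" -ngl 999 -m "$MODEL" -sm layer --jinja \
       --host 127.0.0.1 --port 8001 2>&1 | grep -q "failed to allocate"; then
    hi=$mid   # allocation failed: cutoff is at or below mid
  else
    lo=$mid   # started (and ran until the timeout): cutoff is above mid
  fi
done
echo "largest working context is around $lo"
```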