
Llama.cpp portable fails to initialise with context sizes above 22528 (22 x 1024). #13130


Description

@HumerousGorgon

Describe the bug
The portable nightly build of Llama.cpp fails to initialise when the context size is set above 22528. This is with -sm layer splitting across 3 Arc GPUs; there is more than enough VRAM available.

How to reproduce
Steps to reproduce the error:

  1. Download a copy of Qwen3-30B-A3B-Q4_K_L.gguf.
  2. Download the latest nightly build of Llama.cpp from the releases section.
  3. Start the server with:
    ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2 ZES_ENABLE_SYSMAN=1 SYCL_CACHE_PERSISTENT=1 SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ./llama-server -c 22528 -ngl 999 -m /home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf --host 0.0.0.0 --port 8001 -sm layer --jinja
  4. If the context is set above 22528, the engine crashes with the following error (a probe sketch for narrowing the exact threshold follows the log):
    llama_kv_cache_init: kv_size = 23552, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
    llama_kv_cache_init: SYCL0 KV buffer size = 782.00 MiB
    llama_kv_cache_init: SYCL1 KV buffer size = 736.00 MiB
    llama_kv_cache_init: SYCL2 KV buffer size = 690.00 MiB
    llama_init_from_model: KV self size = 2208.00 MiB, K (f16): 1104.00 MiB, V (f16): 1104.00 MiB
    llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
    llama_init_from_model: pipeline parallelism enabled (n_copies=4)
    ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 4334944256 Bytes of memory on device
    ggml_gallocr_reserve_n: failed to allocate SYCL2 buffer of size 4334944256
    ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 6065967104 Bytes of memory on device
    ggml_gallocr_reserve_n: failed to allocate SYCL0 buffer of size 6065967104
    ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 5626382848 Bytes of memory on device
    ggml_gallocr_reserve_n: failed to allocate SYCL1 buffer of size 5626382848
    llama_init_from_model: failed to allocate compute buffers
    common_init_from_params: failed to create context with model '/home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf'
    terminate called without an active exception
    ./llama-server: line 2: 142366 Aborted (core dumped) LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(cd "$(dirname "$0")";pwd) $(cd "$(dirname "$0")";pwd)/llama-server-bin "$@"
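To narrow down where the failure begins, a small probe script can bisect the range between the known-good 22528 and the failing 24576. This is only a sketch: it reuses the environment variables and model path from step 3, and it assumes the portable build also ships llama-cli, which is used here purely as a cheap load test generating one token per context size.

    # Sketch: probe context sizes between 22528 (works) and 24576 (fails).
    MODEL=/home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf
    for c in 22528 23040 23552 24064 24576; do
      if ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2 ZES_ENABLE_SYSMAN=1 \
         SYCL_CACHE_PERSISTENT=1 SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 \
         ./llama-cli -m "$MODEL" -c "$c" -ngl 999 -sm layer -n 1 -p "hi" >/dev/null 2>&1; then
        echo "-c $c: OK"
      else
        echo "-c $c: FAILED"
      fi
    done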

By comparison, with the context set at or below 22528, the following KV cache log is produced:
llama_kv_cache_init: kv_size = 22528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 748.00 MiB
llama_kv_cache_init: SYCL1 KV buffer size = 704.00 MiB
llama_kv_cache_init: SYCL2 KV buffer size = 660.00 MiB
llama_init_from_model: KV self size = 2112.00 MiB, K (f16): 1056.00 MiB, V (f16): 1056.00 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model: SYCL0 compute buffer size = 2016.06 MiB
llama_init_from_model: SYCL1 compute buffer size = 2016.06 MiB
llama_init_from_model: SYCL2 compute buffer size = 4070.12 MiB
llama_init_from_model: SYCL_Host compute buffer size = 1440.19 MiB
llama_init_from_model: graph nodes = 3270 (with bs=4096), 2646 (with bs=1)
llama_init_from_model: graph splits = 4
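For what it's worth, the reported KV cache size lines up with the expected arithmetic. A quick sanity check, assuming Qwen3-30B-A3B uses 4 KV heads with a head dimension of 128 (those two values are not in the log, so treat them as assumptions):

    # Sketch: recompute the KV self size from the working run's parameters.
    # kv_size and n_layer come from the log above; n_kv_heads and head_dim
    # are assumed values for Qwen3-30B-A3B.
    kv_size=22528; n_layer=48; n_kv_heads=4; head_dim=128; bytes_f16=2
    kv_bytes=$(( 2 * n_layer * kv_size * n_kv_heads * head_dim * bytes_f16 ))  # factor 2 = K and V
    echo "KV self size: $(( kv_bytes / 1024 / 1024 )) MiB"  # prints 2112, matching the log

So the KV cache itself behaves as expected; the failure comes from the compute buffer allocations, not from the KV cache.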

As you can see, going from 22528 to 24576 context should change the compute buffer sizes only marginally, yet the requested allocations on SYCL0 and SYCL1 nearly triple and initialisation fails.
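Converting the failed allocation requests from the 24576-token run into MiB makes the jump explicit (plain arithmetic on the byte counts in the error log above):

    # Sketch: the failing run's compute buffer requests, in MiB.
    for bytes in 6065967104 5626382848 4334944256; do
      echo "$(( bytes / 1024 / 1024 )) MiB"
    done
    # → 5784 MiB (SYCL0), 5365 MiB (SYCL1), 4134 MiB (SYCL2),
    #   versus 2016.06 / 2016.06 / 4070.12 MiB in the working -c 22528 run.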
