[Bug]: EXAONE 4.0 with VSWA trtllm-serve failure #7741

@lkm2835

Description

System Info

NVIDIA A100-40G

[2025-09-15 15:20:00] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-15 15:20:00] INFO config.py:66: Polars version 1.25.2 available.
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc5

TensorRT-LLM commit: https://github.com/NVIDIA/TensorRT-LLM/tree/89fc1369727f338aa653f089308e214ee1721655

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

trtllm-serve command

CUDA_VISIBLE_DEVICES=0,1,2,3 trtllm-serve EXAONE-4.0-32B --backend pytorch --tp_size 4 --extra_llm_api_options config.yml

config.yml

kv_cache_config:
  enable_block_reuse: false
  max_attention_window: [4096,4096,4096,131072]
enable_chunked_prefill: true
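
For reference, a minimal sketch of the same settings expressed through the Python LLM API (assuming the extra_llm_api_options YAML keys map one-to-one to the llmapi KvCacheConfig/LLM arguments; the model path and window sizes are simply the values from this report):

# Hedged sketch: same VSWA settings via the LLM API, assuming
# tensorrt_llm.llmapi.KvCacheConfig exposes enable_block_reuse and
# max_attention_window exactly as the YAML keys above do.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=False,
    max_attention_window=[4096, 4096, 4096, 131072],  # VSWA: per-layer-group window sizes
)

llm = LLM(
    model="EXAONE-4.0-32B",        # local checkpoint directory from this report
    tensor_parallel_size=4,
    kv_cache_config=kv_cache_config,
    enable_chunked_prefill=True,
)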

Expected behavior

The model with VSWA should be served successfully through trtllm-serve.

Actual behavior

[2025-09-15 15:20:00] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-15 15:20:00] INFO config.py:66: Polars version 1.25.2 available.
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.55.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
2025-09-15 15:20:04,849 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc5
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  warnings.warn(
[09/15/2025-15:20:07] [TRT-LLM] [I] Using LLM with PyTorch backend
[09/15/2025-15:20:07] [TRT-LLM] [I] Set nccl_plugin to None.
[09/15/2025-15:20:07] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[09/15/2025-15:20:07] [TRT-LLM] [I] start MpiSession with 4 workers
/app/disks/EXAONE-4.0-32B
rank 0 using MpiPoolSession to spawn MPI processes
[09/15/2025-15:20:07] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[09/15/2025-15:20:07] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[09/15/2025-15:20:07] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[09/15/2025-15:20:07] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[09/15/2025-15:20:07] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
[2025-09-15 15:20:21] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-15 15:20:21] INFO config.py:66: Polars version 1.25.2 available.
[2025-09-15 15:20:21] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-15 15:20:21] INFO config.py:66: Polars version 1.25.2 available.
[2025-09-15 15:20:21] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-15 15:20:21] INFO config.py:66: Polars version 1.25.2 available.
[2025-09-15 15:20:21] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-15 15:20:21] INFO config.py:66: Polars version 1.25.2 available.
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.55.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.55.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.55.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.55.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
2025-09-15 15:20:28,533 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-09-15 15:20:28,642 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-09-15 15:20:28,816 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-09-15 15:20:28,816 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc5
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc5
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc5
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc5
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[09/15/2025-15:20:32] [TRT-LLM] [RANK 0] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=False, moe_max_num_tokens=None, moe_load_balancer=None, attention_dp_enable_balance=False, attention_dp_time_out_iters=50, attention_dp_batching_wait_iters=10, batch_wait_timeout_ms=0, attn_backend='TRTLLM', moe_backend='CUTLASS', moe_disable_finalize_fusion=False, enable_mixed_sampler=False, sampler_type=<SamplerType.auto: 'auto'>, kv_cache_dtype='auto', mamba_ssm_cache_dtype='auto', enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_piecewise_cuda_graph_num_tokens=None, torch_compile_enable_userbuffers=True, torch_compile_max_num_streams=1, enable_autotuner=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>, enable_min_latency=False, allreduce_strategy='AUTO', stream_interval=1, force_dynamic_quantization=False, mm_encoder_only=False, _limit_torch_cuda_mem_fraction=True)
[09/15/2025-15:20:32] [TRT-LLM] [RANK 0] [I] ATTENTION RUNTIME FEATURES:  AttentionRuntimeFeatures(chunked_prefill=True, cache_reuse=False, has_speculative_draft_tokens=False, chunk_size=8192)
EXAONE-4.0-32B
EXAONE-4.0-32B
EXAONE-4.0-32B
EXAONE-4.0-32B
[09/15/2025-15:20:32] [TRT-LLM] [RANK 0] [I] Validating KV Cache config against kv_cache_dtype="auto"
[09/15/2025-15:20:32] [TRT-LLM] [RANK 0] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[09/15/2025-15:20:33] [TRT-LLM] [RANK 0] [I] Use 14.90 GB for model weights.
[09/15/2025-15:20:33] [TRT-LLM] [RANK 0] [I] Prefetching 59.61GB checkpoint files.
[09/15/2025-15:20:33] [TRT-LLM] [RANK 0] [I] Prefetching EXAONE-4.0-32B/model-00001-of-00014.safetensors to memory...
[09/15/2025-15:20:33] [TRT-LLM] [RANK 0] [I] Prefetching EXAONE-4.0-32B/model-00013-of-00014.safetensors to memory...
[09/15/2025-15:20:33] [TRT-LLM] [RANK 0] [I] Prefetching EXAONE-4.0-32B/model-00011-of-00014.safetensors to memory...
[09/15/2025-15:20:33] [TRT-LLM] [RANK 0] [I] Prefetching EXAONE-4.0-32B/model-00003-of-00014.safetensors to memory...
[09/15/2025-15:20:38] [TRT-LLM] [RANK 0] [I] Finished prefetching EXAONE-4.0-32B/model-00013-of-00014.safetensors.
[09/15/2025-15:20:38] [TRT-LLM] [RANK 0] [I] Finished prefetching EXAONE-4.0-32B/model-00011-of-00014.safetensors.
[09/15/2025-15:20:38] [TRT-LLM] [RANK 0] [I] Finished prefetching EXAONE-4.0-32B/model-00001-of-00014.safetensors.
[09/15/2025-15:20:38] [TRT-LLM] [RANK 0] [I] Finished prefetching EXAONE-4.0-32B/model-00003-of-00014.safetensors.
Loading safetensors weights in parallel: 100%|██████████| 14/14 [00:00<00:00, 169.26it/s]
Loading safetensors weights in parallel: 100%|██████████| 14/14 [00:00<00:00, 163.02it/s]
Loading safetensors weights in parallel: 100%|██████████| 14/14 [00:00<00:00, 160.85it/s]
Loading safetensors weights in parallel: 100%|██████████| 14/14 [00:00<00:00, 158.65it/s]
Loading weights concurrently: 100%|██████████| 1353/1353 [00:06<00:00, 220.90it/s]
Model init total -- 12.51s
Loading weights concurrently: 100%|██████████| 1353/1353 [00:07<00:00, 192.98it/s]
Model init total -- 13.71s
Loading weights concurrently: 100%|██████████| 1353/1353 [00:07<00:00, 191.62it/s]
Model init total -- 13.72s
Loading weights concurrently: 100%|██████████| 1353/1353 [00:07<00:00, 190.14it/s]
Model init total -- 13.72s
[09/15/2025-15:20:46] [TRT-LLM] [RANK 0] [I] max_seq_len is not specified, using inferred value 131072
[09/15/2025-15:20:46] [TRT-LLM] [RANK 0] [I] ChunkUnitSize is set to 256 as sliding window attention is used.
[09/15/2025-15:20:46] [TRT-LLM] [RANK 0] [I] Using Sampler: TorchSampler
[TensorRT-LLM][WARNING] Setting maxTokens when using Variable Sliding Window Attention is a strange concept, as it limits the number of max tokens *per window size* [limiting the sum of all window sizes is even stranger]. Anticipating the effects of this requires quite a complex calculation, and it probably isn't the configuration you meant to use.
[TensorRT-LLM][WARNING] Setting maxTokens when using Variable Sliding Window Attention is a strange concept, as it limits the number of max tokens *per window size* [limiting the sum of all window sizes is even stranger]. Anticipating the effects of this requires quite a complex calculation, and it probably isn't the configuration you meant to use.
[TensorRT-LLM][WARNING] Setting maxTokens when using Variable Sliding Window Attention is a strange concept, as it limits the number of max tokens *per window size* [limiting the sum of all window sizes is even stranger]. Anticipating the effects of this requires quite a complex calculation, and it probably isn't the configuration you meant to use.
[TensorRT-LLM][WARNING] Setting maxTokens when using Variable Sliding Window Attention is a strange concept, as it limits the number of max tokens *per window size* [limiting the sum of all window sizes is even stranger]. Anticipating the effects of this requires quite a complex calculation, and it probably isn't the configuration you meant to use.
[TensorRT-LLM][INFO] Blocks per window size:
[TensorRT-LLM][INFO] Blocks per window size:
[TensorRT-LLM][INFO] [windowSize=4096] {.primaryBlocks=257, .secondayBlocks=0}
[TensorRT-LLM][INFO] [windowSize=4096] {.primaryBlocks=257, .secondayBlocks=0}
[TensorRT-LLM][INFO] [windowSize=131072] {.primaryBlocks=257, .secondayBlocks=0}
[TensorRT-LLM][INFO] [windowSize=131072] {.primaryBlocks=257, .secondayBlocks=0}
[TensorRT-LLM][INFO] Blocks per window size:
[TensorRT-LLM][INFO] [windowSize=4096] {.primaryBlocks=257, .secondayBlocks=0}
[TensorRT-LLM][INFO] [windowSize=131072] {.primaryBlocks=257, .secondayBlocks=0}
[TensorRT-LLM][INFO] Blocks per window size:
[TensorRT-LLM][INFO] [windowSize=4096] {.primaryBlocks=257, .secondayBlocks=0}
[TensorRT-LLM][INFO] [windowSize=131072] {.primaryBlocks=257, .secondayBlocks=0}
[09/15/2025-15:20:47] [TRT-LLM] [RANK 0] [W] Attention window size 131072 exceeds upper bound 8224 for available blocks. Reducing to 8224.
[09/15/2025-15:20:47] [TRT-LLM] [RANK 0] [W] Adjusted max_attention_window_vec to [4096, 4096, 4096, 8224]
[09/15/2025-15:20:47] [TRT-LLM] [RANK 0] [W] Adjusted window size 131072 to 8224 in blocks_per_window
[09/15/2025-15:20:47] [TRT-LLM] [RANK 0] [W] Adjusted max_seq_len to 8224
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.50 GiB for max tokens in paged KV cache (16448).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.50 GiB for max tokens in paged KV cache (16448).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.50 GiB for max tokens in paged KV cache (16448).
[09/15/2025-15:20:47] [TRT-LLM] [RANK 0] [I] max_seq_len=8224, max_num_requests=2048, max_num_tokens=8192, max_batch_size=2048
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.50 GiB for max tokens in paged KV cache (16448).
[09/15/2025-15:20:47] [TRT-LLM] [RANK 0] [I] cache_transceiver is disabled
[09/15/2025-15:20:47] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:20:47] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:20:47] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:20:47] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
2025-09-15 15:20:50,443 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-09-15 15:20:50,443 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-09-15 15:20:50,447 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-09-15 15:20:50,455 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-09-15 15:21:34,467 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-09-15 15:21:34,499 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-09-15 15:21:34,533 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-09-15 15:21:34,571 - INFO - flashinfer.jit: Finished loading JIT ops: norm
[09/15/2025-15:21:40] [TRT-LLM] [RANK 0] [I] [Autotuner] Cache size after warmup is 0
[09/15/2025-15:21:40] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:40] [TRT-LLM] [RANK 0] [I] Creating CUDA graph instances for 34 batch sizes.
[09/15/2025-15:21:40] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:40] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 130, 8224: 130}
[09/15/2025-15:21:40] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=128, draft_len=0
[09/15/2025-15:21:40] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:40] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:41] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:41] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 194, 8224: 194}
[09/15/2025-15:21:41] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=64, draft_len=0
[09/15/2025-15:21:41] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:41] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:42] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:42] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 226, 8224: 226}
[09/15/2025-15:21:42] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=32, draft_len=0
[09/15/2025-15:21:42] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:42] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:43] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:43] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 227, 8224: 227}
[09/15/2025-15:21:43] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=31, draft_len=0
[09/15/2025-15:21:43] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:43] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:44] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:44] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 228, 8224: 228}
[09/15/2025-15:21:44] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=30, draft_len=0
[09/15/2025-15:21:44] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:44] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:45] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:45] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 229, 8224: 229}
[09/15/2025-15:21:45] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=29, draft_len=0
[09/15/2025-15:21:45] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:45] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:46] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:46] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 230, 8224: 230}
[09/15/2025-15:21:46] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=28, draft_len=0
[09/15/2025-15:21:46] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:46] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:47] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:47] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 231, 8224: 231}
[09/15/2025-15:21:47] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=27, draft_len=0
[09/15/2025-15:21:47] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:47] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:48] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:48] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 232, 8224: 232}
[09/15/2025-15:21:48] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=26, draft_len=0
[09/15/2025-15:21:48] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:48] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:49] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:49] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 233, 8224: 233}
[09/15/2025-15:21:49] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=25, draft_len=0
[09/15/2025-15:21:49] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:49] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:50] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:50] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 234, 8224: 234}
[09/15/2025-15:21:50] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=24, draft_len=0
[09/15/2025-15:21:50] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:50] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:51] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:51] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 235, 8224: 235}
[09/15/2025-15:21:51] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=23, draft_len=0
[09/15/2025-15:21:51] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:51] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:52] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:52] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 236, 8224: 236}
[09/15/2025-15:21:52] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=22, draft_len=0
[09/15/2025-15:21:52] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:52] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:53] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:53] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 237, 8224: 237}
[09/15/2025-15:21:53] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=21, draft_len=0
[09/15/2025-15:21:53] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:53] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:54] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:54] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 238, 8224: 238}
[09/15/2025-15:21:54] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=20, draft_len=0
[09/15/2025-15:21:54] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:54] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:55] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:55] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 239, 8224: 239}
[09/15/2025-15:21:55] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=19, draft_len=0
[09/15/2025-15:21:55] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:55] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:56] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:56] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 240, 8224: 240}
[09/15/2025-15:21:56] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=18, draft_len=0
[09/15/2025-15:21:56] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:56] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:57] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:57] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 241, 8224: 241}
[09/15/2025-15:21:57] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=17, draft_len=0
[09/15/2025-15:21:57] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:57] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:58] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:58] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 242, 8224: 242}
[09/15/2025-15:21:58] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=16, draft_len=0
[09/15/2025-15:21:58] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:58] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:21:59] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:21:59] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 243, 8224: 243}
[09/15/2025-15:21:59] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=15, draft_len=0
[09/15/2025-15:21:59] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:21:59] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:00] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:00] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 244, 8224: 244}
[09/15/2025-15:22:00] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=14, draft_len=0
[09/15/2025-15:22:00] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:00] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:01] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:01] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 245, 8224: 245}
[09/15/2025-15:22:01] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=13, draft_len=0
[09/15/2025-15:22:01] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:01] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:02] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:02] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 246, 8224: 246}
[09/15/2025-15:22:02] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=12, draft_len=0
[09/15/2025-15:22:02] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:02] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:03] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:03] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 247, 8224: 247}
[09/15/2025-15:22:03] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=11, draft_len=0
[09/15/2025-15:22:03] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:03] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:04] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:04] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 248, 8224: 248}
[09/15/2025-15:22:04] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=10, draft_len=0
[09/15/2025-15:22:04] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:04] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:05] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:05] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 249, 8224: 249}
[09/15/2025-15:22:05] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=9, draft_len=0
[09/15/2025-15:22:05] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:05] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:06] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:06] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 250, 8224: 250}
[09/15/2025-15:22:06] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=8, draft_len=0
[09/15/2025-15:22:06] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:06] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:07] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:07] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 251, 8224: 251}
[09/15/2025-15:22:07] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=7, draft_len=0
[09/15/2025-15:22:07] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:07] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:08] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:08] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 252, 8224: 252}
[09/15/2025-15:22:08] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=6, draft_len=0
[09/15/2025-15:22:08] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:08] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:09] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:09] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 253, 8224: 253}
[09/15/2025-15:22:09] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=5, draft_len=0
[09/15/2025-15:22:09] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:09] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:10] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:10] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 254, 8224: 254}
[09/15/2025-15:22:10] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=4, draft_len=0
[09/15/2025-15:22:10] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:10] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:11] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:11] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 255, 8224: 255}
[09/15/2025-15:22:11] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=3, draft_len=0
[09/15/2025-15:22:11] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:11] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:12] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:12] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 256, 8224: 256}
[09/15/2025-15:22:12] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=2, draft_len=0
[09/15/2025-15:22:12] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:12] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:13] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:13] [TRT-LLM] [RANK 0] [I] For VSWA case, we return the minimum of the number of free blocks for each window size: {4096: 257, 8224: 257}
[09/15/2025-15:22:13] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=1, draft_len=0
[09/15/2025-15:22:13] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[09/15/2025-15:22:13] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[09/15/2025-15:22:14] [TRT-LLM] [RANK 0] [I] Memory used after loading model weights (inside torch) in memory usage profiling: 15.30 GiB
[09/15/2025-15:22:14] [TRT-LLM] [RANK 0] [I] Memory used after loading model weights (outside torch) in memory usage profiling: 4.71 GiB
[09/15/2025-15:22:15] [TRT-LLM] [RANK 0] [I] Memory dynamically allocated during inference (inside torch) in memory usage profiling: 0.71 GiB
[09/15/2025-15:22:15] [TRT-LLM] [RANK 0] [I] Memory used outside torch (e.g., NCCL and CUDA graphs) in memory usage profiling: 4.71 GiB
[09/15/2025-15:22:15] [TRT-LLM] [RANK 0] [I] Peak memory during memory usage profiling (torch + non-torch): 20.73 GiB, available KV cache memory when calculating max tokens: 17.24 GiB, fraction is set 0.9, kv size is 65536. device total memory 39.38 GiB, , tmp kv_mem 0.50 GiB
[09/15/2025-15:22:15] [TRT-LLM] [RANK 0] [E] Failed to initialize executor on rank 0: (): incompatible function arguments. The following argument types are supported:
    1. (self, arg: int, /) -> None

Invoked with types: tensorrt_llm.bindings.executor.KvCacheConfig, NoneType
[09/15/2025-15:22:15] [TRT-LLM] [RANK 2] [E] Failed to initialize executor on rank 2: (): incompatible function arguments. The following argument types are supported:
    1. (self, arg: int, /) -> None

Invoked with types: tensorrt_llm.bindings.executor.KvCacheConfig, NoneType
[09/15/2025-15:22:15] [TRT-LLM] [RANK 1] [E] Failed to initialize executor on rank 1: (): incompatible function arguments. The following argument types are supported:
    1. (self, arg: int, /) -> None

Invoked with types: tensorrt_llm.bindings.executor.KvCacheConfig, NoneType
[09/15/2025-15:22:15] [TRT-LLM] [RANK 0] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 849, in worker_main
    worker: GenerationExecutorWorker = worker_cls(
                                       ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
    self.engine = _create_py_executor(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
    _executor = create_executor(**args)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 572, in create_py_executor
    kv_cache_creator.configure_kv_cache_capacity(py_executor)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 333, in configure_kv_cache_capacity
    self._kv_cache_config.max_tokens = None
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: (): incompatible function arguments. The following argument types are supported:
    1. (self, arg: int, /) -> None

Invoked with types: tensorrt_llm.bindings.executor.KvCacheConfig, NoneType

[09/15/2025-15:22:15] [TRT-LLM] [RANK 2] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 849, in worker_main
    worker: GenerationExecutorWorker = worker_cls(
                                       ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
    self.engine = _create_py_executor(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
    _executor = create_executor(**args)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 572, in create_py_executor
    kv_cache_creator.configure_kv_cache_capacity(py_executor)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 333, in configure_kv_cache_capacity
    self._kv_cache_config.max_tokens = None
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: (): incompatible function arguments. The following argument types are supported:
    1. (self, arg: int, /) -> None

Invoked with types: tensorrt_llm.bindings.executor.KvCacheConfig, NoneType

[09/15/2025-15:22:15] [TRT-LLM] [RANK 3] [E] Failed to initialize executor on rank 3: (): incompatible function arguments. The following argument types are supported:
    1. (self, arg: int, /) -> None

Invoked with types: tensorrt_llm.bindings.executor.KvCacheConfig, NoneType
[09/15/2025-15:22:15] [TRT-LLM] [RANK 1] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 849, in worker_main
    worker: GenerationExecutorWorker = worker_cls(
                                       ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
    self.engine = _create_py_executor(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
    _executor = create_executor(**args)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 572, in create_py_executor
    kv_cache_creator.configure_kv_cache_capacity(py_executor)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 333, in configure_kv_cache_capacity
    self._kv_cache_config.max_tokens = None
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: (): incompatible function arguments. The following argument types are supported:
    1. (self, arg: int, /) -> None

Invoked with types: tensorrt_llm.bindings.executor.KvCacheConfig, NoneType

[09/15/2025-15:22:15] [TRT-LLM] [RANK 3] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 849, in worker_main
    worker: GenerationExecutorWorker = worker_cls(
                                       ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
    self.engine = _create_py_executor(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
    _executor = create_executor(**args)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 572, in create_py_executor
    kv_cache_creator.configure_kv_cache_capacity(py_executor)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 333, in configure_kv_cache_capacity
    self._kv_cache_config.max_tokens = None
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: (): incompatible function arguments. The following argument types are supported:
    1. (self, arg: int, /) -> None

Invoked with types: tensorrt_llm.bindings.executor.KvCacheConfig, NoneType

[09/15/2025-15:22:15] [TRT-LLM] [E] Executor worker initialization error: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 849, in worker_main
    worker: GenerationExecutorWorker = worker_cls(
                                       ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
    self.engine = _create_py_executor(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
    _executor = create_executor(**args)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 572, in create_py_executor
    kv_cache_creator.configure_kv_cache_capacity(py_executor)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 333, in configure_kv_cache_capacity
    self._kv_cache_config.max_tokens = None
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: (): incompatible function arguments. The following argument types are supported:
    1. (self, arg: int, /) -> None

Invoked with types: tensorrt_llm.bindings.executor.KvCacheConfig, NoneType

TypeError: (): incompatible function arguments. The following argument types are supported:
    1. (self, arg: int, /) -> None

Invoked with types: tensorrt_llm.bindings.executor.KvCacheConfig, NoneType

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1830, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 358, in serve
    launch_server(host, port, llm_args, metadata_server_cfg, server_role)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 164, in launch_server
    llm = PyTorchLLM(**llm_args)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1031, in __init__
    super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 946, in __init__
    super().__init__(model,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 216, in __init__
    self._build_model()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 975, in _build_model
    self._executor = self._executor_cls.create(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/executor.py", line 406, in create
    return GenerationExecutorProxy(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 107, in __init__
    self._start_executor_workers(worker_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 332, in _start_executor_workers
    raise RuntimeError(
RuntimeError: Executor worker returned error

Additional notes

I can run any experiments needed to help analyze and resolve this issue; a minimal repro sketch of the failing assignment is below.
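
The traceback indicates that _util.py:333 assigns max_tokens = None on the pybind tensorrt_llm.bindings.executor.KvCacheConfig, whose setter only accepts an int. A minimal sketch that should isolate that assignment (assuming the default-constructed binding object, per the import path shown in the traceback):

# Hedged repro sketch: exercise the same assignment the traceback fails on,
# i.e. setting max_tokens to None on the C++ binding KvCacheConfig.
from tensorrt_llm.bindings.executor import KvCacheConfig

cfg = KvCacheConfig()
try:
    cfg.max_tokens = None  # mirrors configure_kv_cache_capacity in _util.py:333
except TypeError as e:
    print(f"Setter rejected None, matching the reported error: {e}")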

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • Customized kernels <NV>: Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
  • Pytorch <NV>: Pytorch backend related issues
  • bug: Something isn't working
