Your current environment
The output of `python collect_env.py`
(vllm-cpu) root@demo:~/vllm_source# python collect_env.py
[W511 11:57:36.699690500 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: AutocastCPU
previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
INFO 05-11 11:57:37 [__init__.py:248] Automatically detected platform cpu.
WARNING 05-11 11:57:38 [_logger.py:72] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
Collecting environment information...
PyTorch version: 2.7.0+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A
OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-17ubuntu1) 12.3.0
Clang version: Could not collect
CMake version: version 4.0.2
Libc version: glibc-2.39
Python version: 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-57-generic-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA A10
Nvidia driver version: 565.57.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
BIOS Vendor ID: Alibaba Cloud
Model name: Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz
BIOS Model name: pc-i440fx-2.1 CPU @ 0.0GHz
BIOS CPU family: 1
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 6
BogoMIPS: 5799.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 768 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 20 MiB (16 instances)
L3 cache: 48 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.7.0
[pip3] numpy==2.2.5
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0+cpu
[pip3] torchaudio==2.7.0+cpu
[pip3] torchvision==0.22.0+cpu
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.dev577+gca66a1674 (git sha: ca66a1674)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-31 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
🐛 Describe the bug
I installed CPU vLLM following this guide and ran `vllm serve` with the `--enable-prefix-caching` parameter:
vllm serve Qwen/Qwen2.5-1.5B-Instruct --enable-prefix-caching
Then I sent inference requests like this:
for i in {1..100}; do
echo "Request: $i"
curl -sS http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [
{"role": "user", "content": "Tell me a joke"}
]
}'
echo ""
done
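While the loop above is running, the cache counters can also be read directly from the server's Prometheus endpoint (the `/metrics` route shown in the startup log). A minimal check; the exact prefix-cache metric names may differ between vLLM versions, so it just filters on "prefix":

```bash
# Dump prefix-cache related counters from the running server.
# (Metric names vary between vLLM versions, so filter broadly.)
curl -s http://localhost:8000/metrics | grep -i "prefix"
```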
According to the logs, prefix caching does not appear to work: the metrics line stays at `Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%` even though every request uses an identical prompt.
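For reference, vLLM's automatic prefix caching works at KV-cache block granularity, so below is a sketch of the same experiment with a much longer, fully shared system prompt; the prefix text, request count, and `max_tokens` value are arbitrary and only meant to make the shared prefix span several blocks.

```bash
# Same experiment with a long, fully shared system prompt so the common
# prefix spans several KV-cache blocks (prefix text chosen arbitrarily).
PREFIX="You are a meticulous assistant. Always answer concisely, keep a neutral tone, and do not cite sources. Context: the user is benchmarking automatic prefix caching on the vLLM CPU backend and sends this same long preamble with every request."
for i in {1..20}; do
  echo "Request: $i"
  curl -sS http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"Qwen/Qwen2.5-1.5B-Instruct\",
      \"messages\": [
        {\"role\": \"system\", \"content\": \"$PREFIX\"},
        {\"role\": \"user\", \"content\": \"Tell me a joke\"}
      ],
      \"max_tokens\": 64
    }"
  echo ""
done
```

The full server log from the original run is below.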
(vllm-cpu) root@demo:~/vllm_source# vllm serve Qwen/Qwen2.5-1.5B-Instruct --enable-prefix-caching
[W511 12:03:43.958623328 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: AutocastCPU
previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
INFO 05-11 12:03:44 [__init__.py:248] Automatically detected platform cpu.
INFO 05-11 12:03:46 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 05-11 12:03:46 [logger.py:64] max_model_len was is not set. Defaulting to arbitrary value of 8192.
WARNING 05-11 12:03:46 [logger.py:64] max_num_seqs was is not set. Defaulting to arbitrary value of 128.
INFO 05-11 12:03:47 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-11 12:03:47 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-11 12:03:47 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-11 12:03:47 [api_server.py:1044] vLLM API server version 0.8.5.dev577+gca66a1674
INFO 05-11 12:03:48 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-11 12:03:48 [cli_args.py:297] non-default args: {'model': 'Qwen/Qwen2.5-1.5B-Instruct', 'enable_prefix_caching': True}
INFO 05-11 12:03:57 [config.py:760] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
WARNING 05-11 12:03:57 [_logger.py:72] device type=cpu is not supported by the V1 Engine. Falling back to V0.
INFO 05-11 12:03:57 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 05-11 12:03:57 [_logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 05-11 12:03:57 [_logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 05-11 12:03:57 [api_server.py:247] Started engine process with PID 58947
[W511 12:04:02.667756203 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: AutocastCPU
previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
INFO 05-11 12:04:03 [__init__.py:248] Automatically detected platform cpu.
INFO 05-11 12:04:04 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.dev577+gca66a1674) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=True,
INFO 05-11 12:04:06 [cpu.py:57] Using Torch SDPA backend.
INFO 05-11 12:04:06 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-11 12:04:06 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 05-11 12:04:07 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 6.53it/s]
INFO 05-11 12:04:07 [default_loader.py:278] Loading weights took 0.21 seconds
INFO 05-11 12:04:07 [executor_base.py:112] # cpu blocks: 1170, # CPU blocks: 0
INFO 05-11 12:04:07 [executor_base.py:117] Maximum concurrency for 32768 tokens per request: 4.57x
INFO 05-11 12:04:07 [llm_engine.py:435] init engine (profile, create kv cache, warmup model) took 0.15 seconds
WARNING 05-11 12:04:08 [logger.py:64] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-11 12:04:08 [serving_chat.py:116] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-11 12:04:08 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-11 12:04:08 [api_server.py:1091] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-11 12:04:08 [launcher.py:28] Available routes are:
INFO 05-11 12:04:08 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 05-11 12:04:08 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 05-11 12:04:08 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-11 12:04:08 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 05-11 12:04:08 [launcher.py:36] Route: /health, Methods: GET
INFO 05-11 12:04:08 [launcher.py:36] Route: /load, Methods: GET
INFO 05-11 12:04:08 [launcher.py:36] Route: /ping, Methods: POST, GET
INFO 05-11 12:04:08 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-11 12:04:08 [launcher.py:36] Route: /version, Methods: GET
INFO 05-11 12:04:08 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /score, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-11 12:04:08 [launcher.py:36] Route: /metrics, Methods: GET
INFO: Started server process [58701]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO 05-11 12:04:13 [chat_utils.py:412] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 05-11 12:04:13 [logger.py:39] Received request chatcmpl-6d16ecc37f0a4732afefeebf5ef6bd8c: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nTell me a joke<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32735, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-11 12:04:13 [engine.py:310] Added request chatcmpl-6d16ecc37f0a4732afefeebf5ef6bd8c.
WARNING 05-11 12:04:14 [_logger.py:72] Pin memory is not supported on CPU.
INFO 05-11 12:04:14 [metrics.py:486] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 05-11 12:04:14 [metrics.py:502] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%
INFO: 127.0.0.1:43826 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 05-11 12:04:15 [logger.py:39] Received request chatcmpl-1b54257b6d9541feb29a1edb6fe538d3: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nTell me a joke<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32735, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-11 12:04:15 [engine.py:310] Added request chatcmpl-1b54257b6d9541feb29a1edb6fe538d3.
INFO: 127.0.0.1:49282 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 05-11 12:04:17 [logger.py:39] Received request chatcmpl-b73b4b2df7fd461ca116a2d6cc7a1316: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nTell me a joke<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32735, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-11 12:04:17 [engine.py:310] Added request chatcmpl-b73b4b2df7fd461ca116a2d6cc7a1316.
INFO 05-11 12:04:19 [metrics.py:486] Avg prompt throughput: 13.2 tokens/s, Avg generation throughput: 13.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 05-11 12:04:19 [metrics.py:502] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%
INFO: 127.0.0.1:49294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 05-11 12:04:19 [logger.py:39] Received request chatcmpl-c602e1aa3bb8461bbaf36502d964e2a7: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nTell me a joke<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32735, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-11 12:04:19 [engine.py:310] Added request chatcmpl-c602e1aa3bb8461bbaf36502d964e2a7.
INFO: 127.0.0.1:49300 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 05-11 12:04:21 [logger.py:39] Received request chatcmpl-1726737a1c6a40af8a59c225b980541b: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nTell me a joke<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32735, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-11 12:04:21 [engine.py:310] Added request chatcmpl-1726737a1c6a40af8a59c225b980541b.
INFO: 127.0.0.1:49316 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 05-11 12:04:23 [logger.py:39] Received request chatcmpl-a6964a16e2eb447c9560210b45d15e2e: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nTell me a joke<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32735, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-11 12:04:23 [engine.py:310] Added request chatcmpl-a6964a16e2eb447c9560210b45d15e2e.
INFO 05-11 12:04:24 [metrics.py:486] Avg prompt throughput: 19.6 tokens/s, Avg generation throughput: 11.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 05-11 12:04:24 [metrics.py:502] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%