
[Bug]: prefix-caching: inconsistent completions #5543

@hibukipanim


Your current environment

vLLM version 0.5.0.post1

🐛 Describe the bug

Hi,

There seems to be a dirty-cache issue with --enable-prefix-caching. We noticed it when internal eval scores degraded significantly when running with --enable-prefix-caching, and here I'll show how to reproduce it with a short snippet.

Run two vLLM servers, one without prefix caching:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8001

and another with prefix caching:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching

Then run this snippet:

import random
import string

import openai

vllms = {
    "no-prefix-caching": "http://localhost:8001/v1",
    "with-prefix-caching": "http://localhost:8002/v1",
}

# 16 random 512-character prompts; the fixed seed means both servers
# and both runs see exactly the same inputs.
random.seed(0)
prompts = [
    ''.join(random.choices(string.ascii_lowercase + string.digits, k=512))
    for _ in range(16)
]

runs = []
for run in range(2):
    print(f"\n🏃 run #{run+1}")

    completions = {k: [] for k in vllms}
    runs.append(completions)
    for name, endpoint in vllms.items():
        print(f"vLLM {name=}, {endpoint=}")
        client = openai.OpenAI(base_url=endpoint, api_key="foo")

        # Greedy decoding (temperature=0), so outputs should be deterministic.
        for prompt in prompts:
            response = client.completions.create(
                model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                prompt=prompt,
                temperature=0,
                max_tokens=4,
            )
            completions[name].append(response.choices[0].text)

        print(f"completions: {completions[name]}")

        # Same server, same prompts: the second run should match the first.
        if run > 0 and runs[run][name] != runs[run - 1][name]:
            print(f"❌ completions for vLLM {name=} differs from previous run!")

    # The two servers should also agree with each other.
    if completions["with-prefix-caching"] != completions["no-prefix-caching"]:
        print("🛑 completions differ between with & without prefix")

prints:

🏃 run #1
vLLM name='no-prefix-caching', endpoint='http://localhost:8001/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
vLLM name='with-prefix-caching', endpoint='http://localhost:8002/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']

🏃 run #2
vLLM name='no-prefix-caching', endpoint='http://localhost:8001/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
vLLM name='with-prefix-caching', endpoint='http://localhost:8002/v1'
completions: ['6x2w', 'zwma71', '37wk', 'hu5qw', 'jg0m', '1tzkb', '4h7a', '5zq7', 'zxqj', '7k4n', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
❌ completions for vLLM name='with-prefix-caching' differs from previous run!
🛑 completions differ between with & without prefix

Note that in the first run both servers agree; only the second run on the prefix-caching server diverges, which is consistent with cached blocks from the first run being reused incorrectly. This also happens with 0.4.3. With 0.4.2, this snippet crashes the server when prefix caching is enabled.
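For reference, the same check can be run without HTTP servers through vLLM's offline LLM API. This is a minimal sketch, assuming the enable_prefix_caching engine argument behaves the same way as the server's --enable-prefix-caching flag:

import random
import string

from vllm import LLM, SamplingParams

random.seed(0)
prompts = [
    ''.join(random.choices(string.ascii_lowercase + string.digits, k=512))
    for _ in range(16)
]

# enable_prefix_caching is assumed to mirror --enable-prefix-caching.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", enable_prefix_caching=True)
params = SamplingParams(temperature=0, max_tokens=4)

# Greedy-decode the same prompts twice; with a correct cache both
# passes should produce identical texts.
first = [out.outputs[0].text for out in llm.generate(prompts, params)]
second = [out.outputs[0].text for out in llm.generate(prompts, params)]
print("runs match:", first == second)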

Hopefully one of these PRs resolves the issue 🤞 :
