Your current environment
vLLM version 0.5.0.post1
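(To double-check which version a serving environment is actually running, a quick hedged check; the vllm package exposes __version__:)

import vllm
print(vllm.__version__)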
🐛 Describe the bug
Hi,
There seems to be a dirty-cache issue with --enable-prefix-caching. We noticed it when our internal eval scores degraded significantly while running with --enable-prefix-caching; below I show how to reproduce it with a short snippet.
Run two vLLM servers, one without prefix caching:
python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8001
and another with prefix caching:
python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching
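(Optional, my addition rather than part of the original report: poll each server's OpenAI-compatible /v1/models endpoint until both are up before running the snippet, assuming the requests package is installed.)

import time

import requests

# Wait until both servers respond before sending completions requests.
for port in (8001, 8002):
    while True:
        try:
            if requests.get(f"http://localhost:{port}/v1/models", timeout=2).ok:
                break
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(1)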
Then run this snippet:
import string
import random

import openai

vllms = {
    "no-prefix-caching": "http://localhost:8001/v1",
    "with-prefix-caching": "http://localhost:8002/v1",
}

# 16 random 512-character prompts; the fixed seed means both servers
# receive identical inputs.
random.seed(0)
prompts = []
for i in range(16):
    prompts.append(''.join(random.choices(string.ascii_lowercase + string.digits, k=512)))

runs = []
for run in range(2):
    print(f"\n🏃 run #{run+1}")
    completions = {k: [] for k in vllms.keys()}
    runs.append(completions)
    for name, endpoint in vllms.items():
        print(f"vLLM {name=}, {endpoint=}")
        client = openai.OpenAI(
            base_url=endpoint,
            api_key="foo"
        )
        for prompt in prompts:
            # temperature=0 (greedy decoding), so completions should be
            # identical across runs and across the two servers.
            response = client.completions.create(
                model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                prompt=prompt,
                temperature=0,
                max_tokens=4,
            )
            completion = response.choices[0].text
            completions[name].append(completion)
        print(f"completions: {completions[name]}")
        if run > 0 and runs[run][name] != runs[run-1][name]:
            print(f"❌ completions for vLLM {name=} differs from previous run!")
    if completions["with-prefix-caching"] != completions["no-prefix-caching"]:
        print("🛑 completions differ between with & without prefix")
This prints:
🏃 run #1
vLLM name='no-prefix-caching', endpoint='http://localhost:8001/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
vLLM name='with-prefix-caching', endpoint='http://localhost:8002/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
🏃 run #2
vLLM name='no-prefix-caching', endpoint='http://localhost:8001/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
vLLM name='with-prefix-caching', endpoint='http://localhost:8002/v1'
completions: ['6x2w', 'zwma71', '37wk', 'hu5qw', 'jg0m', '1tzkb', '4h7a', '5zq7', 'zxqj', '7k4n', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
❌ completions for vLLM name='with-prefix-caching' differs from previous run!
🛑 completions differ between with & without prefix
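(A small follow-up helper, my addition rather than part of the original repro, reusing the runs list from the snippet above to show exactly which prompts diverge:)

# List the prompt indices whose second-run completions differ between servers.
for i, (a, b) in enumerate(zip(runs[1]["no-prefix-caching"],
                               runs[1]["with-prefix-caching"])):
    if a != b:
        print(f"prompt #{i}: no-prefix={a!r} vs with-prefix={b!r}")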
This also happens with 0.4.3. With 0.4.2, this snippet crashes the server when prefix caching is enabled.
Hopefully one of these PRs resolves the issue 🤞:
- [Core][Prefix Caching] Fix hashing logic for non-full blocks #5188
- [Core][Bugfix]: fix prefix caching for blockv2 #5364
(I will only be able to build these branches and try to reproduce in a few days; hopefully tagging the PRs helps until then.)
Edit: I built and tried both PRs; neither resolves the issue.
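For intuition only, here is a hypothetical toy sketch of the failure mode described above; it is not vLLM's implementation, and every name in it is invented. If cached blocks are keyed in a way that does not uniquely identify their full token contents (e.g. non-full blocks hashed incompletely, which is what #5188 touches on), a later request can be handed state that was computed for different tokens:

# Hypothetical toy model of a stale ("dirty") prefix cache. All names are
# invented for illustration and do not correspond to vLLM internals.
BLOCK_SIZE = 4
cache = {}  # cache key -> tokens whose state was actually computed

def buggy_key(tokens):
    # The bug: a non-full block is keyed only by its first token, so two
    # different partial blocks that share it collide on one cache entry.
    return tokens[0] if len(tokens) < BLOCK_SIZE else tuple(tokens)

def compute_or_reuse(tokens):
    key = buggy_key(tokens)
    if key not in cache:
        cache[key] = list(tokens)  # stand-in for computing real KV state
    return cache[key]

print(compute_or_reuse([7, 1, 2]))  # fills the cache: key 7 -> [7, 1, 2]
print(compute_or_reuse([7, 9]))     # stale hit: returns [7, 1, 2], not [7, 9]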