[Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser #17917

wukaixingxp · 2025-05-09T18:34:30Z

Change the llama4 pythonic template and small fix on the edge case where llama4 model may output <|python_start|> unexpectedly.
BFCL test result:

Name	reported	Base_vllm	Pythonic_vllm
Model	Llama-4-Scout-17B-16E-Instruct (FC)	Llama-4-Scout-17B-16E-Instruct (FC)	Llama-4-Scout-17B-16E-Instruct (FC)
Overall Acc	45.41%	47.78%	55.97%
Non-Live AST Acc	83.48%	83.67%	80.04%
Non-Live Simple AST	79.42%	79.17%	78.67%
Non-Live Multiple AST	95%	94.00%	92.00%
Non-Live Parallel AST	81.5%	80.50%	77.00%
Non-Live Parallel Multiple AST	78%	81.00%	72.50%
Live Acc	57.97%	58.69%	74.06%
Live Simple AST	77.91%	82.17%	80.62%
Live Multiple AST	74.36%	74.07%	73.22%
Live Parallel AST	68.75%	75.00%	75.00%
Live Parallel Multiple AST	62.5%	70.83%	66.67%
Multi Turn Acc	1.88%	6.38%	13.00%
Multi Turn Base	2%	8.50%	15.50%
Multi Turn Miss Func	2.5%	5.00%	14.00%
Multi Turn Miss Param	1%	4.00%	11.00%
Multi Turn Long Context	2%	8.00%	11.50%
Relevance Detection	100%	94.44%	77.78%
Irrelevance Detection	39.66%	44.38%	78.70%

NOTE: Since BFCL has a default tool-call system prompt, we need to manually modified pythonic and json system prompt from here
For jinja template please see this example:
Given this test data:

{
    "bos_token": "<|begin_of_text|>",
    "add_generation_prompt": true,
 "custom_tools": [
    {
        "name": "get_weather",
        "description": "Get weather info for places",
        "parameters": {
            "type": "dict",
            "required": ["city"],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city to get the weather for"
                },
                "metric": {
                    "type": "string",
                    "description": "The metric for weather. Options are: celsius, fahrenheit",
                    "default": "celsius"
                }
            }
        }
    }
],
    "messages": [
#{"role": "system", "content": "you are helpful assistant"},

{"role": "user", "content": "Who are you?"},
{"role": "assistant", "content": "I am a Llama 4 model",tool_calls:[]},
      {"role": "user", "content": "What is the weather in SF and Seattle?"},
{"role": "assistant", "content": "[get_weather(city=\"San Francisco\"), get_weather(city=\"Seattle\")]",tool_calls:[]},
{
    "role": "ipython",
    "content": "[        {            \"response\": \"Sunny 75\"        },        {            \"response\": \"Rainy 65\"        },    ]"
},{"role": "assistant", "content": "SF is Sunny and Seattle is Rainy"},
{"role": "user", "content": "What is the weather in NYC and SF"},
{"role": "assistant", "content": "",'tool_calls': [
            {
                'name': 'get_weather',
                'arguments': {
                    'city': 'NYC',
                   'metric': 'fahrenheit',
                }
            },
{
                'name': 'get_weather',
                'arguments': {
                    'city': 'SF',
                   'metric': 'fahrenheit',
                }
            },
        ],},
    ]
  }

The jinja template will render the output like this:

<|begin_of_text|><|header_start|>system<|header_end|>

You are a helpful assistant and an expert in function composition. You can answer general questions using your internal knowledge OR invoke functions when necessary. Follow these strict guidelines:

1. FUNCTION CALLS:
- ONLY use functions that are EXPLICITLY listed in the function list below
- If NO functions are listed (empty function list []), respond ONLY with internal knowledge or "I don't have access to [Unavailable service] information"
- If a function is not in the list, respond ONLY with internal knowledge or "I don't have access to [Unavailable service] information"
- If ALL required parameters are present AND the query EXACTLY matches a listed function's purpose: output ONLY the function call(s)
- Use exact format: [func_name1(param1=value1, param2=value2), func_name2(...)]
Examples:
CORRECT: [get_weather(location="Vancouver"), calculate_route(start="Boston", end="New York")] <- Only if get_weather and calculate_route are in function list
INCORRECT: get_weather(location="New York")
INCORRECT: Let me check the weather: [get_weather(location="New York")]
INCORRECT: [get_events(location="Singapore")] <- If function not in list

2. RESPONSE RULES:
- For pure function requests matching a listed function: ONLY output the function call(s)
- For knowledge questions: ONLY output text
- For missing parameters: ONLY request the specific missing parameters
- For unavailable services (not in function list): output ONLY with internal knowledge or "I don't have access to [Unavailable service] information". Do NOT execute a function call.
- If the query asks for information beyond what a listed function provides: output ONLY with internal knowledge about your limitations
- NEVER combine text and function calls in the same response
- NEVER suggest alternative functions when the requested service is unavailable
- NEVER create or invent new functions not listed below

3. STRICT BOUNDARIES:
- ONLY use functions from the list below - no exceptions
- NEVER use a function as an alternative to unavailable information
- NEVER call functions not present in the function list
- NEVER add explanatory text to function calls
- NEVER respond with empty brackets
- Use proper Python/JSON syntax for function calls
- Check the function list carefully before responding

4. TOOL RESPONSE HANDLING:
- When receiving tool responses: provide concise, natural language responses
- Don't repeat tool response verbatim
- Don't add supplementary information

Here is a list of functions in JSON format that you can invoke:
[
    {
        "description": "Get weather info for places",
        "name": "get_weather",
        "parameters": {
            "properties": {
                "city": {
                    "description": "The name of the city to get the weather for",
                    "type": "string"
                },
                "metric": {
                    "default": "celsius",
                    "description": "The metric for weather. Options are: celsius, fahrenheit",
                    "type": "string"
                }
            },
            "required": [
                "city"
            ],
            "type": "dict"
        }
    }
]
<|eot|><|header_start|>user<|header_end|>

Who are you?<|eot|><|header_start|>assistant<|header_end|>

I am a Llama 4 model<|eot|><|header_start|>user<|header_end|>

What is the weather in SF and Seattle?<|eot|><|header_start|>assistant<|header_end|>

[get_weather(city="San Francisco"), get_weather(city="Seattle")]<|eot|><|header_start|>ipython<|header_end|>

"[        {            \"response\": \"Sunny 75\"        },        {            \"response\": \"Rainy 65\"        },    ]"<|eot|><|header_start|>assistant<|header_end|>

SF is Sunny and Seattle is Rainy<|eot|><|header_start|>user<|header_end|>

What is the weather in NYC and SF<|eot|><|header_start|>assistant<|header_end|>

[get_weather(city="NYC", metric="fahrenheit"), get_weather(city="SF", metric="fahrenheit")]<|eot|><|header_start|>assistant<|header_end|>

github-actions · 2025-05-09T18:34:39Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

yeqcharlotte

great thank you! please fix the long lines.

and do could you get the results from llama-stack eval?
please also double check unit tests on vllm side pytest -s -vv tests/tool_use --models llama4 --extended

examples/tool_chat_template_llama4_pythonic.jinja

vllm/entrypoints/openai/tool_parsers/pythonic_tool_parser.py

This fixes an issue in how we used the tool_call_buf from streaming tool calls in the remote-vllm provider where it would end up concatenating parameters from multiple different tool call results instead of aggregating the results from each tool call separately. It also fixes an issue found while digging into that where we were accidentally mixing the json string form of tool call parameters with the string representation of the python form, which mean we'd end up with single quotes in what should be double-quoted json strings. The following tests are now passing 100% for the remote-vllm provider, where some of the test_text_inference were failing before this change: ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_text_inference.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_vision_inference.py --vision-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" ``` Many of the agent tests are passing, although some are failing due to bugs in vLLM's pythonic tool parser for Llama models. See the PR at vllm-project/vllm#17917 and a gist at https://gist.github.com/bbrowning/b5007709015cb2aabd85e0bd08e6d60f for changes needed there, which will have to get made upstream in vLLM. Agent tests: ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/agents/test_agents.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" ```` Signed-off-by: Ben Browning <[email protected]>

yeqcharlotte · 2025-05-14T22:56:00Z

thanks for sharing the new template that reaches BFCL parity! mind also update the test summary to make it readable?

This fixes an issue in how we used the tool_call_buf from streaming tool calls in the remote-vllm provider where it would end up concatenating parameters from multiple different tool call results instead of aggregating the results from each tool call separately. It also fixes an issue found while digging into that where we were accidentally mixing the json string form of tool call parameters with the string representation of the python form, which mean we'd end up with single quotes in what should be double-quoted json strings. The following tests are now passing 100% for the remote-vllm provider, where some of the test_text_inference were failing before this change: ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_text_inference.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_vision_inference.py --vision-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" ``` Many of the agent tests are passing, although some are failing due to bugs in vLLM's pythonic tool parser for Llama models. See the PR at vllm-project/vllm#17917 and a gist at https://gist.github.com/bbrowning/b5007709015cb2aabd85e0bd08e6d60f for changes needed there, which will have to get made upstream in vLLM. Agent tests: ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/agents/test_agents.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" ```` Signed-off-by: Ben Browning <[email protected]>

FWao · 2025-05-15T15:58:20Z

The new Jinja template behaves differently from the old one when a user provides a system prompt.

Problem:
In the new template, if the user sets a system message, the list of tools is added—but the instructions for using Python-style tool calls are missing. Because of this, the model makes JSON-style tool calls instead.

Expected (Old Template):
In the old template, tool instructions were always included, even if the user provided a custom system prompt. The model worked as expected.

Current (New Template):
Now, the user has to manually add tool call instructions to the system prompt. This was not needed before.

wukaixingxp · 2025-05-15T18:13:03Z

The new Jinja template behaves differently from the old one when a user provides a system prompt.

Problem: In the new template, if the user sets a system message, the list of tools is added—but the instructions for using Python-style tool calls are missing. Because of this, the model makes JSON-style tool calls instead.

Expected (Old Template): In the old template, tool instructions were always included, even if the user provided a custom system prompt. The model worked as expected.

Current (New Template): Now, the user has to manually add tool call instructions to the system prompt. This was not needed before.

Thank you so much for this feedback.. I will write more test to make sure it works for all other cases..

# What does this PR do? This fixes an issue in how we used the tool_call_buf from streaming tool calls in the remote-vllm provider where it would end up concatenating parameters from multiple different tool call results instead of aggregating the results from each tool call separately. It also fixes an issue found while digging into that where we were accidentally mixing the json string form of tool call parameters with the string representation of the python form, which mean we'd end up with single quotes in what should be double-quoted json strings. Closes #1120 ## Test Plan The following tests are now passing 100% for the remote-vllm provider, where some of the test_text_inference were failing before this change: ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_text_inference.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_vision_inference.py --vision-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" ``` All but one of the agent tests are passing (including the multi-tool one). See the PR at vllm-project/vllm#17917 and a gist at https://gist.github.com/bbrowning/4734240ce96b4264340caa9584e47c9e for changes needed there, which will have to get made upstream in vLLM. Agent tests: ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/agents/test_agents.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" ```` --------- Signed-off-by: Ben Browning <[email protected]>

wukaixingxp · 2025-05-15T22:12:32Z

Hi! Thanks for your feedback. The minimum tool-definition string and tool-output expectation will be appended to the user-provided system prompt now. For tool-call, the idea is that for basic users, we recommend not to set any system prompt so that our default comprehensive system prompt will be used, whereas for advanced users, who want to use their own customized system prompt, we will provide minimum tool-definition string and tool-output expectation at the end. Please check the following test examples:
No user-provided system prompt, use default
User-provided system prompt, append tool definition at the end
User-provided system prompt, but no-tool provided
No system prompt and no tool

The new Jinja template behaves differently from the old one when a user provides a system prompt.

Problem: In the new template, if the user sets a system message, the list of tools is added—but the instructions for using Python-style tool calls are missing. Because of this, the model makes JSON-style tool calls instead.

Expected (Old Template): In the old template, tool instructions were always included, even if the user provided a custom system prompt. The model worked as expected.

Current (New Template): Now, the user has to manually add tool call instructions to the system prompt. This was not needed before.

bbrowning · 2025-05-16T15:58:32Z

Attemping to run these changes locally with the Berkeley function calling leaderboard and my own vLLM, it appears to only be using the completions endpoint (instead of chat completions) when testing the Llama 4 Scout model. For completeness, here's how I'm running bfcl against my local vLLM serving Llama 4 Scout:

bfcl generate --model meta-llama/Llama-4-Scout-17B-16E-Instruct-FC --skip-server-setup

The configuration for meta-llama/Llama-4-Scout-17B-16E-Instruct-FC in bfcl sends all requests to the completions endpoint (instead of chat/completions), so the jinja template and tool call parser are not used at all in that flow. Are there changes used to update bfcl to use chat/completions when testing Llama 4 Scout?

wukaixingxp · 2025-05-16T16:11:09Z

Attemping to run these changes locally with the Berkeley function calling leaderboard and my own vLLM, it appears to only be using the completions endpoint (instead of chat completions) when testing the Llama 4 Scout model. For completeness, here's how I'm running bfcl against my local vLLM serving Llama 4 Scout:
bfcl generate --model meta-llama/Llama-4-Scout-17B-16E-Instruct-FC --skip-server-setup
The configuration for meta-llama/Llama-4-Scout-17B-16E-Instruct-FC in bfcl sends all requests to the completions endpoint (instead of chat/completions), so the jinja template and tool call parser are not used at all in that flow. Are there changes used to update bfcl to use chat/completions when testing Llama 4 Scout?

Change this default system prompt to

You are a helpful assistant and an expert in function composition. You can answer general questions using your internal knowledge OR invoke functions when necessary. Follow these strict guidelines:

1. FUNCTION CALLS:
- ONLY use functions that are EXPLICITLY listed in the function list below
- If NO functions are listed (empty function list []), respond ONLY with internal knowledge or "I don't have access to [Unavailable service] information"
- If a function is not in the list, respond ONLY with internal knowledge or "I don't have access to [Unavailable service] information"
- If ALL required parameters are present AND the query EXACTLY matches a listed function's purpose: output ONLY the function call(s)
- Use exact format: [func_name1(param1=value1, param2=value2), func_name2(...)]
Examples:
CORRECT: [get_weather(location="Vancouver"), calculate_route(start="Boston", end="New York")] <- Only if get_weather and calculate_route are in function list
INCORRECT: get_weather(location="New York")
INCORRECT: Let me check the weather: [get_weather(location="New York")]
INCORRECT: [get_events(location="Singapore")] <- If function not in list

2. RESPONSE RULES:
- For pure function requests matching a listed function: ONLY output the function call(s)
- For knowledge questions: ONLY output text
- For missing parameters: ONLY request the specific missing parameters
- For unavailable services (not in function list): output ONLY with internal knowledge or "I don't have access to [Unavailable service] information". Do NOT execute a function call.
- If the query asks for information beyond what a listed function provides: output ONLY with internal knowledge about your limitations
- NEVER combine text and function calls in the same response
- NEVER suggest alternative functions when the requested service is unavailable
- NEVER create or invent new functions not listed below

3. STRICT BOUNDARIES:
- ONLY use functions from the list below - no exceptions
- NEVER use a function as an alternative to unavailable information
- NEVER call functions not present in the function list
- NEVER add explanatory text to function calls
- NEVER respond with empty brackets
- Use proper Python/JSON syntax for function calls
- Check the function list carefully before responding

4. TOOL RESPONSE HANDLING:
- When receiving tool responses: provide concise, natural language responses
- Don't repeat tool response verbatim
- Don't add supplementary information

bbrowning · 2025-05-16T16:16:32Z

Ahh, I see. I was trying to run the function calling leaderboard via vLLM in a way that exercised the actual vLLM jinja template and tool parser. But, I see what you're doing is not doing that and instead just copying the same prompt into the bfcl code but still using the completions endpoint for testing. Your way is a reasonable way to test the prompt, although that won't end up actually exercising this tool call parser change at all I don't think?

wukaixingxp · 2025-05-16T16:22:59Z

Yeah.. I think the function definition are prepared from BFCL to the system prompt to bypass our jinja template/ parser. I mentioned BFCL just to show this jinja template can give a better result. I will run pytest -s -vv tests/tool_use --models llama4 --extended and llama-stack-evals to test the jinja_template + parser in my PRs.

wukaixingxp · 2025-05-16T17:02:27Z

Ahh, I see. I was trying to run the function calling leaderboard via vLLM in a way that exercised the actual vLLM jinja template and tool parser. But, I see what you're doing is not doing that and instead just copying the same prompt into the bfcl code but still using the completions endpoint for testing. Your way is a reasonable way to test the prompt, although that won't end up actually exercising this tool call parser change at all I don't think?

BTW, llama-stack-evals also support BFCL now, I think it will just take a openai compatible server and rely on the jinja template/parser from vllm.

wukaixingxp · 2025-05-16T19:56:44Z

Test with llama-stack-eval

create ./llama4_pythonic.jinja based on the jinja template
install current commit vllm uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly and modified the site_package pythonic_tool_parser.py in /home/.conda/envs/vllm/lib/python3.10/sitepackages/vllm/entrypoints/openai/tool_parsers/pythonic_tool_parser.py
Started the vllm server VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct -tp4 --seed 0 --max-model-len=100000 --host 0.0.0.0 --port 8001 --enable-auto-tool-choice --tool-call-parser llama4_pythonic --chat-template ./llama4_pythonic.jinja --limit-mm-per-prompt image=5 --max-num-seqs 20
Install llama-stack evals: git clone https://github.com/fairinternal/llama-stack-evals.git; cd llama-stack-evals;pip install -e .
run llama-stack-evals: llama-stack-evals run-tests --model meta-llama/Llama-4-Scout-17B-16E-Instruct --provider vllm
results, only two error code test failed, all other test passed:

llama-stack-evals run-tests --model meta-llama/Llama-4-Scout-17B-16E-Instruct --provider vllm

=== Running tests for provider: vllm, model: meta-llama/Llama-4-Scout-17B-16E-Instruct ===
Running command: /home/kaiwu/.conda/envs/evals/bin/python3.12 -m pytest llama_stack_evals/functional_tests/openai_api/test_chat_completion.py --model=meta-llama/Llama-4-Scout-17B-16E-Instruct -v --provider=vllm
====================================== test session starts ======================================
platform linux -- Python 3.12.9, pytest-8.3.5, pluggy-1.6.0 -- /home/kaiwu/.conda/envs/evals/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.9', 'Platform': 'Linux-6.4.3-0_fbk20_zion_2830_g3e5ab162667d-x86_64-with-glibc2.34', 'Packages': {'pytest': '8.3.5', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'metadata': '3.1.1', 'json-report': '1.5.0'}}
rootdir: /home/kaiwu/work/llama-stack-evals
configfile: pyproject.toml
plugins: anyio-4.9.0, metadata-3.1.1, json-report-1.5.0
collected 34 items                                                                              

llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_basic[meta-llama/Llama-4-Scout-17B-16E-Instruct-earth] PASSED [  2%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_basic[meta-llama/Llama-4-Scout-17B-16E-Instruct-saturn] PASSED [  5%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_basic[meta-llama/Llama-4-Scout-17B-16E-Instruct-earth] PASSED [  8%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_basic[meta-llama/Llama-4-Scout-17B-16E-Instruct-saturn] PASSED [ 11%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_image[meta-llama/Llama-4-Scout-17B-16E-Instruct-case0] PASSED [ 14%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_image[meta-llama/Llama-4-Scout-17B-16E-Instruct-case0] PASSED [ 17%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_structured_output[meta-llama/Llama-4-Scout-17B-16E-Instruct-extract] PASSED [ 20%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_structured_output[meta-llama/Llama-4-Scout-17B-16E-Instruct-extract] PASSED [ 23%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_tool_calling[meta-llama/Llama-4-Scout-17B-16E-Instruct-basic] PASSED [ 26%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_tool_calling[meta-llama/Llama-4-Scout-17B-16E-Instruct-user_provided_system_prompt] XFAIL [ 29%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_tool_calling[meta-llama/Llama-4-Scout-17B-16E-Instruct-array_param] XFAIL [ 32%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_tool_calling[meta-llama/Llama-4-Scout-17B-16E-Instruct-basic] PASSED [ 35%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_tool_calling[meta-llama/Llama-4-Scout-17B-16E-Instruct-user_provided_system_prompt] XFAIL [ 38%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_tool_calling[meta-llama/Llama-4-Scout-17B-16E-Instruct-array_param] XFAIL [ 41%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_tool_choice_required[meta-llama/Llama-4-Scout-17B-16E-Instruct-basic] XPASS [ 44%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_tool_choice_required[meta-llama/Llama-4-Scout-17B-16E-Instruct-basic] XPASS [ 47%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_tool_choice_none[meta-llama/Llama-4-Scout-17B-16E-Instruct-basic] XFAIL [ 50%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_tool_choice_none[meta-llama/Llama-4-Scout-17B-16E-Instruct-basic] XPASS [ 52%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_multi_turn_tool_calling[meta-llama/Llama-4-Scout-17B-16E-Instruct-text_then_tool] PASSED [ 55%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_multi_turn_tool_calling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tool_then_text] PASSED [ 58%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_multi_turn_tool_calling[meta-llama/Llama-4-Scout-17B-16E-Instruct-text_then_tool] XPASS [ 61%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_multi_turn_tool_calling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tool_then_text] XPASS [ 64%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_multi_turn_multiple_images[meta-llama/Llama-4-Scout-17B-16E-Instruct-stream=False] PASSED [ 67%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_multi_turn_multiple_images[meta-llama/Llama-4-Scout-17B-16E-Instruct-stream=True] PASSED [ 70%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-messages_missing] PASSED [ 73%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-messages_role_invalid] PASSED [ 76%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tool_choice_invalid] PASSED [ 79%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tool_choice_no_tools] PASSED [ 82%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid] FAILED [ 85%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-messages_missing] PASSED [ 88%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-messages_role_invalid] PASSED [ 91%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tool_choice_invalid] PASSED [ 94%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tool_choice_no_tools] PASSED [ 97%]
llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid] FAILED [100%]

=========================================== FAILURES ============================================
_ test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid] _

request = <FixtureRequest for <Function test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid]>>
openai_client = <openai.OpenAI object at 0x7fef7f10daf0>
model = 'meta-llama/Llama-4-Scout-17B-16E-Instruct', provider = 'vllm'
verification_config = {'cerebras': ProviderConfig(provider='cerebras', base_url='https://api.cerebras.ai/v1', api_key_var='CEREBRAS_API_KEY'...', 'Llama-4-Maverick-17B-128E-Instruct-FP8': 'Llama-4-Maverick-Instruct'}, test_exclusions={}, self_hosted=False), ...}
case = {'case_id': 'tools_type_invalid', 'input': {'messages': [{'content': 'Which planet do humans live on?', 'role': 'user'}], 'tools': [{'type': 'invalid'}]}, 'output': {'error': {'status_code': 400}}}

    @pytest.mark.parametrize(
        "case",
        chat_completion_test_cases["test_chat_input_validation"]["test_params"]["case"],
        ids=case_id_generator,
    )
    def test_chat_non_streaming_error_handling(request, openai_client, model, provider, verification_config, case):
        test_name_base = get_base_test_name(request)
        if should_skip_test(verification_config, provider, model, test_name_base):
            pytest.skip(f"Skipping {test_name_base} for model {model} on provider {provider} based on config.")
    
        with pytest.raises(APIError) as e:
            openai_client.chat.completions.create(
                model=model,
                messages=case["input"]["messages"],
                stream=False,
                tool_choice=case["input"]["tool_choice"] if "tool_choice" in case["input"] else None,
                tools=case["input"]["tools"] if "tools" in case["input"] else None,
            )
>       assert case["output"]["error"]["status_code"] == e.value.status_code
E       AssertionError: assert 400 == 500
E        +  where 500 = InternalServerError('Error code: 500').status_code
E        +    where InternalServerError('Error code: 500') = <ExceptionInfo InternalServerError('Error code: 500') tblen=5>.value

llama_stack_evals/functional_tests/openai_api/test_chat_completion.py:686: AssertionError
_ test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid] _

request = <FixtureRequest for <Function test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid]>>
openai_client = <openai.OpenAI object at 0x7fef7ee5eed0>
model = 'meta-llama/Llama-4-Scout-17B-16E-Instruct', provider = 'vllm'
verification_config = {'cerebras': ProviderConfig(provider='cerebras', base_url='https://api.cerebras.ai/v1', api_key_var='CEREBRAS_API_KEY'...', 'Llama-4-Maverick-17B-128E-Instruct-FP8': 'Llama-4-Maverick-Instruct'}, test_exclusions={}, self_hosted=False), ...}
case = {'case_id': 'tools_type_invalid', 'input': {'messages': [{'content': 'Which planet do humans live on?', 'role': 'user'}], 'tools': [{'type': 'invalid'}]}, 'output': {'error': {'status_code': 400}}}

    @pytest.mark.parametrize(
        "case",
        chat_completion_test_cases["test_chat_input_validation"]["test_params"]["case"],
        ids=case_id_generator,
    )
    def test_chat_streaming_error_handling(request, openai_client, model, provider, verification_config, case):
        test_name_base = get_base_test_name(request)
        if should_skip_test(verification_config, provider, model, test_name_base):
            pytest.skip(f"Skipping {test_name_base} for model {model} on provider {provider} based on config.")
    
        with pytest.raises(APIError) as e:
            response = openai_client.chat.completions.create(
                model=model,
                messages=case["input"]["messages"],
                stream=True,
                tool_choice=case["input"]["tool_choice"] if "tool_choice" in case["input"] else None,
                tools=case["input"]["tools"] if "tools" in case["input"] else None,
            )
            for _chunk in response:
                pass
>       assert str(case["output"]["error"]["status_code"]) in e.value.message
E       AssertionError: assert '400' in 'Error code: 500'
E        +  where '400' = str(400)
E        +  and   'Error code: 500' = InternalServerError('Error code: 500').message
E        +    where InternalServerError('Error code: 500') = <ExceptionInfo InternalServerError('Error code: 500') tblen=5>.value

llama_stack_evals/functional_tests/openai_api/test_chat_completion.py:709: AssertionError
==================================== short test summary info ====================================
FAILED llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid] - AssertionError: assert 400 == 500
FAILED llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid] - AssertionError: assert '400' in 'Error code: 500'
====================== 2 failed, 22 passed, 5 xfailed, 5 xpassed in 40.78s ======================

wukaixingxp · 2025-05-19T21:23:26Z

Tested using pytest -s -vv tests/tool_use --models llama4 --extended by @yeqcharlotte, only two errors but they are from model not from our code:

call get_weather instead of get_current_weather

>       assert tool_calls[0].function.name == WEATHER_TOOL["function"]["name"]
E       AssertionError: assert 'get_weather' == 'get_current_weather'
E         
E         - get_current_weather
E         + get_weather

tests/tool_use/test_tool_calls.py:41: AssertionError

Call get_weather when there is no tool:

>       assert choice.finish_reason != "tool_calls"  # "stop" or "length"
E       assert 'tool_calls' != 'tool_calls'
E        +  where 'tool_calls' = Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-09cfbb311c784573ab7cc83d14be7eb8', function=Function(arguments='{"city": "Dallas", "state": "TX", "unit": "fahrenheit"}', name='get_current_weather'), type='function')], reasoning_content=None), stop_reason=None).finish_reason

tests/tool_use/test_tool_calls.py:154: AssertionError

Signed-off-by: Kai Wu <[email protected]>

DarkLight1337 · 2025-05-21T05:24:08Z

cc @aarnphm @houseroad

Signed-off-by: Kai Wu <[email protected]>

aarnphm

Can you also fix the pre-commit problem here?

few questions, but in general this looks good. Thanks for adding this.

examples/tool_chat_template_llama4_pythonic.jinja

Signed-off-by: Kai Wu <[email protected]>

wukaixingxp · 2025-05-21T17:17:12Z

Can you also fix the pre-commit problem here?

few questions, but in general this looks good. Thanks for adding this.

Yes.. I just fixed it. Now waiting for the final PR run to be completed

yeqcharlotte

thanks for adding the doc and unit tests!

DarkLight1337 · 2025-05-21T17:28:52Z

We have temporarily paused all non-essential PRs to fix the CI. Please merge from main after #18418 is resolved

wukaixingxp · 2025-05-22T18:00:15Z

We have temporarily paused all non-essential PRs to fix the CI. Please merge from main after #18418 is resolved

@DarkLight1337 Can we me this PR now given that the issue has been fixed? CC: @yeqcharlotte @houseroad

houseroad · 2025-05-22T18:10:27Z

Can we rebase to main?

bbrowning · 2025-05-22T18:42:13Z

examples/tool_chat_template_llama4_pythonic.jinja

-{%- endif %}
-{%- if not tools_in_user_message is defined %}
-    {%- set tools_in_user_message = false %}
+    {%- set tool_definition = tool_definition ~ (tools | tojson(indent=4)) %}


Testing this chat template locally, there is a logic bug here that results in the tool_definition never making its way into the actual prompt. The tools from the ChatCompletion request come in as a tools variable, but we only set the tool_definition value if there's a custom_tools value passed in.

The line {%- set tool_definition = tool_definition ~ (tools | tojson(indent=4)) %} needs to move outside of this if statement block, and should happen after we check and set tools to none if not defined. Here's how the first few lines should look:

{{- bos_token }} {%- if custom_tools is defined and custom_tools%} {%- set tools = custom_tools %} {%- endif %} {%- if not tools is defined %} {%- set tools = none %} {%- endif %} {%- set tool_definition = tool_definition ~ (tools | tojson(indent=4)) %}

Without this change, the actual function definitions never get inserted into the model's prompt which results in it failing the majority of the bfclv3-api tests from llama-stack-evals repo.

Thanks for testing. You are right.. fixing this now!

Signed-off-by: Kai Wu <[email protected]>

wukaixingxp · 2025-05-22T19:46:14Z

Thanks to @bbrowning help, I just fixed a bug on the template, now tested with llama-stack-eval

================================================================ FAILURES =================================================================
__________________ test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid] ___________________

request = <FixtureRequest for <Function test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid]>>
openai_client = <openai.OpenAI object at 0x7fe732426f90>, model = 'meta-llama/Llama-4-Scout-17B-16E-Instruct', provider = 'vllm'
verification_config = {'cerebras': ProviderConfig(provider='cerebras', base_url='https://api.cerebras.ai/v1', api_key_var='CEREBRAS_API_KEY'...verick-17B-128E-Instruct-FP8', canonical_id='Llama-4-Maverick-Instruct')], test_exclusions={}, self_hosted=False), ...}
case = {'case_id': 'tools_type_invalid', 'input': {'messages': [{'content': 'Which planet do humans live on?', 'role': 'user'}], 'tools': [{'type': 'invalid'}]}, 'output': {'error': {'status_code': 400}}}

    @pytest.mark.parametrize(
        "case",
        chat_completion_test_cases["test_chat_input_validation"]["test_params"]["case"],
        ids=case_id_generator,
    )
    def test_chat_non_streaming_error_handling(request, openai_client, model, provider, verification_config, case):
        test_name_base = get_base_test_name(request)
        if should_skip_test(verification_config, provider, model, test_name_base):
            pytest.skip(f"Skipping {test_name_base} for model {model} on provider {provider} based on config.")
    
        with pytest.raises(APIError) as e:
            openai_client.chat.completions.create(
                model=model,
                messages=case["input"]["messages"],
                stream=False,
                tool_choice=case["input"]["tool_choice"] if "tool_choice" in case["input"] else None,
                tools=case["input"]["tools"] if "tools" in case["input"] else None,
            )
>       assert case["output"]["error"]["status_code"] == e.value.status_code
E       AssertionError: assert 400 == 500
E        +  where 500 = InternalServerError('Error code: 500').status_code
E        +    where InternalServerError('Error code: 500') = <ExceptionInfo InternalServerError('Error code: 500') tblen=5>.value

llama_stack_evals/functional_tests/openai_api/test_chat_completion.py:686: AssertionError
____________________ test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid] _____________________

request = <FixtureRequest for <Function test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid]>>
openai_client = <openai.OpenAI object at 0x7fe73258d6d0>, model = 'meta-llama/Llama-4-Scout-17B-16E-Instruct', provider = 'vllm'
verification_config = {'cerebras': ProviderConfig(provider='cerebras', base_url='https://api.cerebras.ai/v1', api_key_var='CEREBRAS_API_KEY'...verick-17B-128E-Instruct-FP8', canonical_id='Llama-4-Maverick-Instruct')], test_exclusions={}, self_hosted=False), ...}
case = {'case_id': 'tools_type_invalid', 'input': {'messages': [{'content': 'Which planet do humans live on?', 'role': 'user'}], 'tools': [{'type': 'invalid'}]}, 'output': {'error': {'status_code': 400}}}

    @pytest.mark.parametrize(
        "case",
        chat_completion_test_cases["test_chat_input_validation"]["test_params"]["case"],
        ids=case_id_generator,
    )
    def test_chat_streaming_error_handling(request, openai_client, model, provider, verification_config, case):
        test_name_base = get_base_test_name(request)
        if should_skip_test(verification_config, provider, model, test_name_base):
            pytest.skip(f"Skipping {test_name_base} for model {model} on provider {provider} based on config.")
    
        with pytest.raises(APIError) as e:
            response = openai_client.chat.completions.create(
                model=model,
                messages=case["input"]["messages"],
                stream=True,
                tool_choice=case["input"]["tool_choice"] if "tool_choice" in case["input"] else None,
                tools=case["input"]["tools"] if "tools" in case["input"] else None,
            )
            for _chunk in response:
                pass
>       assert str(case["output"]["error"]["status_code"]) in e.value.message
E       AssertionError: assert '400' in 'Error code: 500'
E        +  where '400' = str(400)
E        +  and   'Error code: 500' = InternalServerError('Error code: 500').message
E        +    where InternalServerError('Error code: 500') = <ExceptionInfo InternalServerError('Error code: 500') tblen=5>.value

llama_stack_evals/functional_tests/openai_api/test_chat_completion.py:709: AssertionError
========================================================= short test summary info =========================================================
FAILED llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_non_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid] - AssertionError: assert 400 == 500
FAILED llama_stack_evals/functional_tests/openai_api/test_chat_completion.py::test_chat_streaming_error_handling[meta-llama/Llama-4-Scout-17B-16E-Instruct-tools_type_invalid] - AssertionError: assert '400' in 'Error code: 500'
=========================================== 2 failed, 22 passed, 5 xfailed, 5 xpassed in 30.94s ===========================================
Tests failed for provider=vllm, model=meta-llama/Llama-4-Scout-17B-16E-Instruct with exit code 1

…pythonic parser (vllm-project#17917) Signed-off-by: Kai Wu <[email protected]> Signed-off-by: Yuqi Zhang <[email protected]>

* Add files via uploadAdd fused MoE kernel tuning configs (fp8_w8a8) for DeepSeek V3/R1 on a single-node 8x NVIDIA H20 96GB setup (vllm-project#18337) * [Misc] Fix typo (vllm-project#18330) * Neuron up mistral (vllm-project#18222) Signed-off-by: Satyajith Chilappagari <[email protected]> * fix CUDA_check redefinition in vllm-project#17918 (vllm-project#18287) Signed-off-by: Lucia Fang <[email protected]> Co-authored-by: Lucia (Lu) Fang <[email protected]> * [neuron] fix authorization issue (vllm-project#18364) Signed-off-by: Liangfu Chen <[email protected]> * [Misc] Allow `AutoWeightsLoader` to skip loading weights with specific substr in name (vllm-project#18358) Signed-off-by: Isotr0py <[email protected]> * [Core] [Bugfix]: tensor parallel with prompt embeds (vllm-project#18171) Signed-off-by: Nan2018 <[email protected]> Co-authored-by: Andrew Sansom <[email protected]> * [release] Change dockerhub username for TPU release (vllm-project#18389) * [Bugfix] fix adding bias twice in ipex GPTQ quantization (vllm-project#18363) Signed-off-by: rand-fly <[email protected]> * [doc] update env variable export (vllm-project#18391) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc] Add LoRA code owner (vllm-project#18387) Signed-off-by: Jee Jee Li <[email protected]> * Update cpu.txt (vllm-project#18398) Signed-off-by: 汪志鹏 <[email protected]> * [CI] Add mteb testing to test the accuracy of the embedding model (vllm-project#17175) * [Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (vllm-project#18407) Co-authored-by: 松灵 <[email protected]> * [Misc] refactor prompt embedding examples (vllm-project#18405) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Minor] Rename quantization nvfp4 to modelopt_fp4 (vllm-project#18356) Signed-off-by: mgoin <[email protected]> * [Model] use AutoWeightsLoader for bloom (vllm-project#18300) Signed-off-by: calvin chen <[email protected]> * [Kernel] update comment for KV shape in unified triton attn (vllm-project#18099) Signed-off-by: haochengxia <[email protected]> * fix:Build torch wheel inline rather than picking from nightly (vllm-project#18351) Signed-off-by: Dilip Gowda Bhagavan <[email protected]> * [TPU] Re-enable the Pallas MoE kernel (vllm-project#18025) Signed-off-by: Michael Goin <[email protected]> * [Bugfix] config.head_dim is now explicitly set to None (vllm-project#18432) Signed-off-by: Gregory Shtrasberg <[email protected]> * [Bug] Fix moe_sum signature (vllm-project#18440) Signed-off-by: Bill Nell <[email protected]> * Revert "[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (vllm-project#18407)" (vllm-project#18456) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix][Failing Test] Fix nixl connector test when promt size < block size (vllm-project#18429) Signed-off-by: wwl2755 <[email protected]> * [Misc] MultiConnector._connectors type (vllm-project#18423) Signed-off-by: nicklucche <[email protected]> * [Frontend] deprecate `--device` arg (vllm-project#18399) Signed-off-by: Kebe <[email protected]> * [V1] Fix general plugins not loaded in engine for multiproc (vllm-project#18326) Signed-off-by: Yong Hoon Shin <[email protected]> * [Misc] refactor disaggregated-prefill-v1 example (vllm-project#18474) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix][Failing Test] Fix test_events.py (vllm-project#18460) Signed-off-by: rabi <[email protected]> * [MODEL] FalconH1 (vllm-project#18406) Signed-off-by: dhia.rhaiem <[email protected]> Co-authored-by: younesbelkada <[email protected]> Co-authored-by: Ilyas Chahed <[email protected]> Co-authored-by: Jingwei Zuo <[email protected]> * [Doc] fix arg docstring in linear layers (vllm-project#18410) Signed-off-by: giantcroc <[email protected]> * [Bugfix] Reduce moe_sum test size to avoid OOM (vllm-project#18484) Signed-off-by: Bill Nell <[email protected]> * [Build] fix Dockerfile shell (vllm-project#18402) * [Misc] Update deprecation message for `--enable-reasoning` (vllm-project#18404) * [ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1 (vllm-project#17004) Signed-off-by: Hosang Yoon <[email protected]> * Remove incorrect env value * Revert "[v1] Support multiple KV cache groups in GPU model runner (vllm-project#17945) (vllm-project#18459) Signed-off-by: Mark McLoughlin <[email protected]> * [FEAT][ROCm] Upgrade AITER MLA v1 backend (vllm-project#18338) Signed-off-by: vllmellm <[email protected]> Co-authored-by: Luka Govedič <[email protected]> * [Bugfix] Consistent ascii handling in tool parsers (vllm-project#17704) Signed-off-by: Sebastian Schönnenbeck <[email protected]> * [FalconH1] Fix output dtype in RMSNorm fallback path for Falcon-H1 (e.g. 0.5B) (vllm-project#18500) Signed-off-by: dhia.rhaiem <[email protected]> Co-authored-by: younesbelkada <[email protected]> Co-authored-by: Ilyas Chahed <[email protected]> Co-authored-by: Jingwei Zuo <[email protected]> * [MISC] update project urls in pyproject.toml (vllm-project#18519) Signed-off-by: Andy Xie <[email protected]> * [CI] Fix race condition with StatelessProcessGroup.barrier (vllm-project#18506) Signed-off-by: Russell Bryant <[email protected]> * Intialize io_thread_pool attribute in the beginning. (vllm-project#18331) Signed-off-by: rabi <[email protected]> * [Bugfix] Inconsistent token calculation compared to HF in llava family (vllm-project#18479) Signed-off-by: jaycha <[email protected]> * [BugFix][DP] Send DP wave completion only from `dp_rank==0` (vllm-project#18502) Signed-off-by: Nick Hill <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> * [Bugfix][Model] Make Olmo2Model weight loading return loaded weights (vllm-project#18504) Signed-off-by: Shane A <[email protected]> * [Bugfix] Fix LoRA test (vllm-project#18518) Signed-off-by: Jee Jee Li <[email protected]> * [Doc] Fix invalid JSON in example args (vllm-project#18527) Signed-off-by: DarkLight1337 <[email protected]> * [Neuron] Update Dockerfile.neuron to use latest neuron release (2.23) (vllm-project#18512) Signed-off-by: Satyajith Chilappagari <[email protected]> * Update default neuron config for speculation (vllm-project#18274) Signed-off-by: Elaine Zhao <[email protected]> Co-authored-by: Shashwat Srijan <[email protected]> Co-authored-by: Aakash Shetty <[email protected]> * Order sequence ids + config update to support specifying custom quantization layers (vllm-project#18279) Signed-off-by: Elaine Zhao <[email protected]> Co-authored-by: Tailin Pan <[email protected]> Co-authored-by: Rishabh Rajesh <[email protected]> Co-authored-by: Yishan McNabb <[email protected]> Co-authored-by: Patrick Lange <[email protected]> Co-authored-by: Maxwell Goldberg <[email protected]> Co-authored-by: Aakash Shetty <[email protected]> * [Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (vllm-project#18526) Co-authored-by: 松灵 <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> * [Bugfix] Add kwargs to RequestOutput __init__ to be forward compatible (vllm-project#18513) Signed-off-by: Linkun <[email protected]> * [CI/Build] Update bamba test model location (vllm-project#18544) Signed-off-by: Harry Mellor <[email protected]> * [Doc] Support --stream arg in openai_completion_client.py script (vllm-project#18388) Signed-off-by: googs1025 <[email protected]> * [Bugfix] Use random hidden states in dummy sampler run (vllm-project#18543) Signed-off-by: Bowen Wang <[email protected]> * [Doc] Add stream flag for chat completion example (vllm-project#18524) Signed-off-by: calvin chen <[email protected]> * [BugFix][CPU] Fix x86 SHM distributed module initialization (vllm-project#18536) Signed-off-by: jiang.li <[email protected]> * [Misc] improve Automatic Prefix Caching example (vllm-project#18554) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc] Call `ndarray.tobytes()` directly instead of `ndarray.data.tobytes()` (vllm-project#18347) Signed-off-by: Lukas Geiger <[email protected]> * [Bugfix] make `test_openai_schema.py` pass (vllm-project#18224) Signed-off-by: David Xia <[email protected]> Co-authored-by: Harry Mellor <[email protected]> * [Platform] Move platform check to right place (vllm-project#18470) Signed-off-by: wangxiyuan <[email protected]> * [Compile][Platform] Make PiecewiseBackend pluggable and extendable (vllm-project#18076) Signed-off-by: Mengqing Cao <[email protected]> Co-authored-by: youkaichao <[email protected]> * [Build/CI] Fix CUDA 11.8 build (vllm-project#17679) Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> * [Tool] Add NIXL installation script (vllm-project#18172) Signed-off-by: Linkun <[email protected]> * [V1][Spec Decode][Bugfix] Load quantize weights for EAGLE (vllm-project#18290) * [Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser (vllm-project#17917) Signed-off-by: Kai Wu <[email protected]> * [Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization (vllm-project#17926) Signed-off-by: Sanger Steel <[email protected]> * [AMD] [P/D] Compute num gpus for ROCm correctly in run_accuracy_test.sh (vllm-project#18568) Signed-off-by: Randall Smith <[email protected]> * Re-submit: Fix: Proper RGBA -> RGB conversion for PIL images. (vllm-project#18569) Signed-off-by: Chenheli Hua <[email protected]> * [V1][Spec Decoding] Use model_loader.get_model() to load models (vllm-project#18273) Signed-off-by: Mark McLoughlin <[email protected]> * Enable hybrid attention models for Transformers backend (vllm-project#18494) Signed-off-by: Harry Mellor <[email protected]> * [Misc] refactor: simplify input validation and num_requests handling in _convert_v1_inputs (vllm-project#18482) Signed-off-by: googs1025 <[email protected]> * [BugFix] Increase TP execute_model timeout (vllm-project#18558) Signed-off-by: Nick Hill <[email protected]> * [Bugfix] Set `KVTransferConfig.engine_id` in post_init (vllm-project#18576) Signed-off-by: Linkun Chen <[email protected]> * [Spec Decode] Make EAGLE3 draft token ID mapping optional (vllm-project#18488) Signed-off-by: Benjamin Chislett <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> * [Neuron] Remove bypass on EAGLEConfig and add a test (vllm-project#18514) Signed-off-by: Elaine Zhao <[email protected]> * [Bugfix][Benchmarks] Fix a benchmark of deepspeed-mii backend to use api_key (vllm-project#17291) Signed-off-by: Teruaki Ishizaki <[email protected]> * [Misc] Replace `cuda` hard code with `current_platform` (vllm-project#16983) Signed-off-by: shen-shanshan <[email protected]> * [Hardware] correct method signatures for HPU,ROCm,XPU (vllm-project#18551) Signed-off-by: Andy Xie <[email protected]> * [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (vllm-project#18034) Signed-off-by: Ronald Xu <[email protected]> * [Feature]Add async tensor parallelism using compilation pass (vllm-project#17882) Signed-off-by: cascade812 <[email protected]> * [Doc] Update quickstart and install for cu128 using `--torch-backend=auto` (vllm-project#18505) Signed-off-by: mgoin <[email protected]> * [Feature][V1]: suupports cached_tokens in response usage (vllm-project#18149) Co-authored-by: simon-mo <[email protected]> * [Bugfix] Add half type support in reshape_and_cache_cpu_impl on x86 cpu platform (vllm-project#18430) Signed-off-by: Yuqi Zhang <[email protected]> Co-authored-by: Yuqi Zhang <[email protected]> * Migrate docs from Sphinx to MkDocs (vllm-project#18145) Signed-off-by: Harry Mellor <[email protected]> * Revert "[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (vllm-project#18034)" (vllm-project#18600) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix][Model] Fix baichuan model loader for tp (vllm-project#18597) Signed-off-by: Mengqing Cao <[email protected]> * [V0][Bugfix] Fix parallel sampling performance regression when guided decoding is enabled (vllm-project#17731) Signed-off-by: Madeesh Kannan <[email protected]> Co-authored-by: Russell Bryant <[email protected]> * Add myself as docs code owner (vllm-project#18605) Signed-off-by: Harry Mellor <[email protected]> * [Hardware][CPU] Update intel_extension_for_pytorch 2.7.0 and move to `requirements/cpu.txt` (vllm-project#18542) Signed-off-by: Kay Yan <[email protected]> * [CI] fix kv_cache_type argument (vllm-project#18594) Signed-off-by: Andy Xie <[email protected]> * [Doc] Fix indent of contributing to vllm (vllm-project#18611) Signed-off-by: Zerohertz <[email protected]> * Replace `{func}` with mkdocs style links (vllm-project#18610) Signed-off-by: Harry Mellor <[email protected]> * [CI/Build] Fix V1 flag being set in entrypoints tests (vllm-project#18598) Signed-off-by: DarkLight1337 <[email protected]> * Fix examples with code blocks in docs (vllm-project#18609) Signed-off-by: Harry Mellor <[email protected]> * [Bugfix] Fix transformers model impl ignored for mixtral quant (vllm-project#18602) Signed-off-by: Tristan Leclercq <[email protected]> * Include private attributes in API documentation (vllm-project#18614) Signed-off-by: Harry Mellor <[email protected]> * [Misc] add Haystack integration (vllm-project#18601) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix][Build/CI] Fixup CUDA compiler version check for CUDA_SUPPORTED_ARCHS (vllm-project#18579) * [Doc] Fix markdown list indentation for MkDocs rendering (vllm-project#18620) Signed-off-by: Zerohertz <[email protected]> * [Doc] Use a different color for the announcement (vllm-project#18616) Signed-off-by: DarkLight1337 <[email protected]> * Refactor pplx init logic to make it modular (prepare for deepep) (vllm-project#18200) Signed-off-by: youkaichao <[email protected]> * Fix figures in design doc (vllm-project#18612) Signed-off-by: Harry Mellor <[email protected]> * [Docs] Change mkdocs to not use directory urls (vllm-project#18622) Signed-off-by: mgoin <[email protected]> * [v1] Redo "Support multiple KV cache groups in GPU model runner (vllm-project#17945)" (vllm-project#18593) Signed-off-by: Chen Zhang <[email protected]> * [Doc] fix list formatting (vllm-project#18624) Signed-off-by: David Xia <[email protected]> * [Doc] Fix top-level API links/docs (vllm-project#18621) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Avoid documenting dynamic / internal modules (vllm-project#18626) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Fix broken links and unlinked docs, add shortcuts to home sidebar (vllm-project#18627) Signed-off-by: DarkLight1337 <[email protected]> * [V1] Support Deepseek MTP (vllm-project#18435) Signed-off-by: Rui Qiao <[email protected]> Signed-off-by: YaoJiayi <[email protected]> Co-authored-by: Rui Qiao <[email protected]> * Use prebuilt FlashInfer x86_64 PyTorch 2.7 CUDA 12.8 wheel for CI (vllm-project#18537) Signed-off-by: Huy Do <[email protected]> * [CI] Enable test_initialization to run on V1 (vllm-project#16736) Signed-off-by: mgoin <[email protected]> * [Doc] Update references to doc files (vllm-project#18637) Signed-off-by: DarkLight1337 <[email protected]> * [ModelOpt] Introduce VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var to control blockscale tensor allocation (vllm-project#18160) Signed-off-by: Pavani Majety <[email protected]> * [Bugfix] Migrate to REGEX Library to prevent catastrophic backtracking (vllm-project#18454) Signed-off-by: Crucifixion-Fxl <[email protected]> Co-authored-by: Crucifixion-Fxl <[email protected]> * [Bugfix][Nixl] Fix Preemption Bug (vllm-project#18631) Signed-off-by: [email protected] <[email protected]> * config.py: Clarify that only local GGUF checkpoints are supported. (vllm-project#18623) Signed-off-by: Mathieu Bordere <[email protected]> * FIX MOE issue in AutoRound format (vllm-project#18586) Signed-off-by: wenhuach21 <[email protected]> * [V1][Spec Decode] Small refactors to improve eagle bookkeeping performance (vllm-project#18424) Signed-off-by: qizixi <[email protected]> * [Frontend] improve vllm serve --help display (vllm-project#18643) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Model] Add support for Qwen2.5-Omni-7B-AWQ (Qwen2_5OmniForConditionalGeneration) (vllm-project#18647) * [V1][Spec Decode] Support multi-layer eagle draft model (vllm-project#18030) Signed-off-by: qizixi <[email protected]> * [Doc] Update README links, mark external links (vllm-project#18635) Signed-off-by: DarkLight1337 <[email protected]> * [MISC][pre-commit] Add pre-commit check for triton import (vllm-project#17716) Signed-off-by: Mengqing Cao <[email protected]> * [Doc] Fix indentation problems in V0 Paged Attention docs (vllm-project#18659) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Add community links (vllm-project#18657) Signed-off-by: DarkLight1337 <[email protected]> * [Model] use AutoWeightsLoader for gpt2 (vllm-project#18625) Signed-off-by: zt2370 <[email protected]> * [Doc] Reorganize user guide (vllm-project#18661) Signed-off-by: DarkLight1337 <[email protected]> * [CI/Build] `chmod +x` to `cleanup_pr_body.sh` (vllm-project#18650) Signed-off-by: DarkLight1337 <[email protected]> * [MISC] typo fix and clean import (vllm-project#18664) Signed-off-by: Andy Xie <[email protected]> * [BugFix] Fix import error for fused_moe (vllm-project#18642) Signed-off-by: wangxiyuan <[email protected]> * [CI] enforce import regex instead of re (vllm-project#18665) Signed-off-by: Aaron Pham <[email protected]> * fix(regression): clone from reference items (vllm-project#18662) Signed-off-by: Aaron Pham <[email protected]> * [CI/Build] fix permission denied issue (vllm-project#18645) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [BugFix][Spec Decode] Improve Prefix Caching Logic in Speculative Decoding (vllm-project#18668) Signed-off-by: Woosuk Kwon <[email protected]> * [V1] Fix _pickle.PicklingError: Can't pickle <class 'transformers_modules.deepseek-ai.DeepSeek-V2-Lite... (vllm-project#18640) Signed-off-by: Seiji Eicher <[email protected]> * [MISC] correct signature for LoaderFunction (vllm-project#18670) Signed-off-by: Andy Xie <[email protected]> * [Misc]Replace `cuda` hard code with `current_platform` in Ray (vllm-project#14668) Signed-off-by: noemotiovon <[email protected]> * [Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE (vllm-project#18655) Signed-off-by: Mengqing Cao <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [VLM] Initialize video input support for InternVL models (vllm-project#18499) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * Speed up the `kernels/quantization/` tests (vllm-project#18669) Signed-off-by: mgoin <[email protected]> * [BUGFIX] catch subclass first for try...except (vllm-project#18672) Signed-off-by: Andy Xie <[email protected]> * [Misc] Reduce logs on startup (vllm-project#18649) Signed-off-by: DarkLight1337 <[email protected]> * [doc] fix broken links (vllm-project#18671) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [doc] improve readability (vllm-project#18675) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment (vllm-project#18674) Signed-off-by: zzzyq <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [CI/build] fix no regex (vllm-project#18676) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc] small improve (vllm-project#18680) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] Fix profiling dummy data for Pixtral (vllm-project#18677) Signed-off-by: DarkLight1337 <[email protected]> * [Core][Multimodal] Convert PIL Image to array without data copy when hashing (vllm-project#18682) Signed-off-by: Lukas Geiger <[email protected]> * [CI/Build][Doc] Update `gte-Qwen2-1.5B-instruct` usage (vllm-project#18683) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [Misc] Fixed the abnormally high TTFT issue in the PD disaggregation example (vllm-project#18644) Signed-off-by: zhaohaidao <[email protected]> Signed-off-by: zhaohaiyuan <[email protected]> Co-authored-by: zhaohaiyuan <[email protected]> * refactor: simplify request handler, use positive condition check for handler assignment (vllm-project#18690) Signed-off-by: googs1025 <[email protected]> * [Bugfix] Fix the lm_head in gpt_bigcode in lora mode (vllm-project#6357) Signed-off-by: Max de Bayser <[email protected]> Signed-off-by: Max de Bayser <[email protected]> * [CI] add missing argument (vllm-project#18694) Signed-off-by: Andy Xie <[email protected]> * [GH] Add issue template for reporting CI failures (vllm-project#18696) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Fix issue template format (vllm-project#18699) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix Mistral-format models with sliding window (vllm-project#18693) Signed-off-by: DarkLight1337 <[email protected]> * [CI/Build] Replace `math.isclose` with `pytest.approx` (vllm-project#18703) Signed-off-by: DarkLight1337 <[email protected]> * [CI] fix dump_input for str type (vllm-project#18697) Signed-off-by: Andy Xie <[email protected]> * [Model] Add support for YARN in NemotronNAS models (vllm-project#18427) Signed-off-by: Nave Assaf <[email protected]> * [CI/Build] Split pooling and generation extended language models tests in CI (vllm-project#18705) Signed-off-by: Isotr0py <[email protected]> * [Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test to HPU CI (vllm-project#18709) Signed-off-by: Lukasz Durejko <[email protected]> * [Misc] add AutoGen integration (vllm-project#18712) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM (vllm-project#18701) * [Doc] Improve API docs (vllm-project#18713) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Move examples and further reorganize user guide (vllm-project#18666) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix Llama GGUF initialization (vllm-project#18717) Signed-off-by: DarkLight1337 <[email protected]> * [V1][Sampler] Improve performance of FlashInfer sampling by sampling logits instead of probs (vllm-project#18608) * Convert `examples` to `ruff-format` (vllm-project#18400) Signed-off-by: Harry Mellor <[email protected]> * [Model][Gemma3] Simplify image input validation (vllm-project#18710) Signed-off-by: Lukas Geiger <[email protected]> * [Misc] improve web section group title display (vllm-project#18684) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [V1][Quantization] Add CUDA graph compatible v1 GGUF support (vllm-project#18646) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Model][Gemma3] Cast image pixel values already on CPU (vllm-project#18732) Signed-off-by: Lukas Geiger <[email protected]> * [FEAT] [ROCm] Upgrade AITER Fused MoE kernels. (vllm-project#18271) Signed-off-by: vllmellm <[email protected]> * [Doc] Update OOT model docs (vllm-project#18742) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Update reproducibility doc and example (vllm-project#18741) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] improve docs (vllm-project#18734) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * feat(rocm-support): support mamba2 on rocm (vllm-project#18565) Signed-off-by: Islam Almersawi <[email protected]> Co-authored-by: Islam Almersawi <[email protected]> * [Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the same name in run-hpu-test.sh (vllm-project#18752) Signed-off-by: Lukasz Durejko <[email protected]> * [Doc] cleanup deprecated flag for doc (vllm-project#18715) Signed-off-by: calvin chen <[email protected]> * Minor fix about MooncakeStoreConnector (vllm-project#18721) Signed-off-by: baoloongmao <[email protected]> * [Build] fix cpu build missing libtbbmalloc.so (vllm-project#18744) Signed-off-by: Kebe <[email protected]> * [BUG FIX] minicpm (vllm-project#18739) Signed-off-by: huangyuxiang03 <[email protected]> Co-authored-by: huangyuxiang03 <[email protected]> * [Doc] Convert Sphinx directives ( `{class}`, `{meth}`, `{attr}`, ...) to MkDocs format for better documentation linking (vllm-project#18663) Signed-off-by: Zerohertz <[email protected]> * [CI/Build] Remove imports of built-in `re` (vllm-project#18750) Signed-off-by: DarkLight1337 <[email protected]> * [V1][Metrics] Add API for accessing in-memory Prometheus metrics (vllm-project#17010) Signed-off-by: Mark McLoughlin <[email protected]> * Disable prefix cache by default for benchmark (vllm-project#18639) Signed-off-by: cascade812 <[email protected]> * optimize get_kv_cache_torch_dtype (vllm-project#18531) Signed-off-by: idellzheng <[email protected]> * [Core] Automatically cast multi-modal input dtype (vllm-project#18756) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Mistral tool calling when content is list (vllm-project#18729) Signed-off-by: mgoin <[email protected]> --------- Signed-off-by: Satyajith Chilappagari <[email protected]> Signed-off-by: Lucia Fang <[email protected]> Signed-off-by: Liangfu Chen <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Nan2018 <[email protected]> Signed-off-by: rand-fly <[email protected]> Signed-off-by: reidliu41 <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: calvin chen <[email protected]> Signed-off-by: haochengxia <[email protected]> Signed-off-by: Dilip Gowda Bhagavan <[email protected]> Signed-off-by: Michael Goin <[email protected]> Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: Bill Nell <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: wwl2755 <[email protected]> Signed-off-by: nicklucche <[email protected]> Signed-off-by: Kebe <[email protected]> Signed-off-by: Yong Hoon Shin <[email protected]> Signed-off-by: rabi <[email protected]> Signed-off-by: dhia.rhaiem <[email protected]> Signed-off-by: giantcroc <[email protected]> Signed-off-by: Hosang Yoon <[email protected]> Signed-off-by: Mark McLoughlin <[email protected]> Signed-off-by: vllmellm <[email protected]> Signed-off-by: Sebastian Schönnenbeck <[email protected]> Signed-off-by: Andy Xie <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: jaycha <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Shane A <[email protected]> Signed-off-by: Elaine Zhao <[email protected]> Signed-off-by: Linkun <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: googs1025 <[email protected]> Signed-off-by: Bowen Wang <[email protected]> Signed-off-by: jiang.li <[email protected]> Signed-off-by: Lukas Geiger <[email protected]> Signed-off-by: David Xia <[email protected]> Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Kai Wu <[email protected]> Signed-off-by: Sanger Steel <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Chenheli Hua <[email protected]> Signed-off-by: Linkun Chen <[email protected]> Signed-off-by: Benjamin Chislett <[email protected]> Signed-off-by: Teruaki Ishizaki <[email protected]> Signed-off-by: shen-shanshan <[email protected]> Signed-off-by: Ronald Xu <[email protected]> Signed-off-by: cascade812 <[email protected]> Signed-off-by: Yuqi Zhang <[email protected]> Signed-off-by: Madeesh Kannan <[email protected]> Signed-off-by: Kay Yan <[email protected]> Signed-off-by: Zerohertz <[email protected]> Signed-off-by: Tristan Leclercq <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: Rui Qiao <[email protected]> Signed-off-by: YaoJiayi <[email protected]> Signed-off-by: Huy Do <[email protected]> Signed-off-by: Pavani Majety <[email protected]> Signed-off-by: Crucifixion-Fxl <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Mathieu Bordere <[email protected]> Signed-off-by: wenhuach21 <[email protected]> Signed-off-by: qizixi <[email protected]> Signed-off-by: zt2370 <[email protected]> Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: noemotiovon <[email protected]> Signed-off-by: zzzyq <[email protected]> Signed-off-by: zhaohaidao <[email protected]> Signed-off-by: zhaohaiyuan <[email protected]> Signed-off-by: Max de Bayser <[email protected]> Signed-off-by: Max de Bayser <[email protected]> Signed-off-by: Nave Assaf <[email protected]> Signed-off-by: Lukasz Durejko <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Islam Almersawi <[email protected]> Signed-off-by: baoloongmao <[email protected]> Signed-off-by: huangyuxiang03 <[email protected]> Signed-off-by: idellzheng <[email protected]> Co-authored-by: sunyicode0012 <[email protected]> Co-authored-by: Gong Shufan <[email protected]> Co-authored-by: Satyajith Chilappagari <[email protected]> Co-authored-by: Lucia Fang <[email protected]> Co-authored-by: Lucia (Lu) Fang <[email protected]> Co-authored-by: Liangfu Chen <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Nan Qin <[email protected]> Co-authored-by: Andrew Sansom <[email protected]> Co-authored-by: Kevin H. Luu <[email protected]> Co-authored-by: Random Fly <[email protected]> Co-authored-by: Reid <[email protected]> Co-authored-by: reidliu41 <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> Co-authored-by: wang.yuqi <[email protected]> Co-authored-by: 燃 <[email protected]> Co-authored-by: 松灵 <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Calvin Chen <[email protected]> Co-authored-by: Percy <[email protected]> Co-authored-by: Dilip Gowda Bhagavan <[email protected]> Co-authored-by: bnellnm <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: wwl2755 <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: Kebe <[email protected]> Co-authored-by: Yong Hoon Shin <[email protected]> Co-authored-by: Rabi Mishra <[email protected]> Co-authored-by: Dhia Eddine Rhaiem <[email protected]> Co-authored-by: younesbelkada <[email protected]> Co-authored-by: Ilyas Chahed <[email protected]> Co-authored-by: Jingwei Zuo <[email protected]> Co-authored-by: GiantCroc <[email protected]> Co-authored-by: Hyogeun Oh (오효근) <[email protected]> Co-authored-by: Hosang <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: vllmellm <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Co-authored-by: Sebastian Schoennenbeck <[email protected]> Co-authored-by: Ning Xie <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: youngrok cha <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: Shane A <[email protected]> Co-authored-by: aws-elaineyz <[email protected]> Co-authored-by: Shashwat Srijan <[email protected]> Co-authored-by: Aakash Shetty <[email protected]> Co-authored-by: Tailin Pan <[email protected]> Co-authored-by: Rishabh Rajesh <[email protected]> Co-authored-by: Yishan McNabb <[email protected]> Co-authored-by: Patrick Lange <[email protected]> Co-authored-by: Maxwell Goldberg <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: lkchen <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: CYJiang <[email protected]> Co-authored-by: Bowen Wang <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Lukas Geiger <[email protected]> Co-authored-by: David Xia <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> Co-authored-by: Ekagra Ranjan <[email protected]> Co-authored-by: Kai Wu <[email protected]> Co-authored-by: Sanger Steel <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: Chenheli Hua <[email protected]> Co-authored-by: Benjamin Chislett <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Teruaki Ishizaki <[email protected]> Co-authored-by: Shanshan Shen <[email protected]> Co-authored-by: RonaldBXu <[email protected]> Co-authored-by: cascade <[email protected]> Co-authored-by: Chauncey <[email protected]> Co-authored-by: simon-mo <[email protected]> Co-authored-by: Yuqi Zhang <[email protected]> Co-authored-by: Yuqi Zhang <[email protected]> Co-authored-by: Madeesh Kannan <[email protected]> Co-authored-by: Kay Yan <[email protected]> Co-authored-by: Tristan Leclercq <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Jiayi Yao <[email protected]> Co-authored-by: Rui Qiao <[email protected]> Co-authored-by: Huy Do <[email protected]> Co-authored-by: Pavani Majety <[email protected]> Co-authored-by: Feng XiaoLong <[email protected]> Co-authored-by: Crucifixion-Fxl <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Mathieu Borderé <[email protected]> Co-authored-by: Wenhua Cheng <[email protected]> Co-authored-by: qizixi <[email protected]> Co-authored-by: Yuanhao WU <[email protected]> Co-authored-by: ztang2370 <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: Seiji Eicher <[email protected]> Co-authored-by: Chenguang Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: AlexZhao <[email protected]> Co-authored-by: zhaohaiyuan <[email protected]> Co-authored-by: Maximilien de Bayser <[email protected]> Co-authored-by: Naveassaf <[email protected]> Co-authored-by: Łukasz Durejko <[email protected]> Co-authored-by: dylan <[email protected]> Co-authored-by: almersawi <[email protected]> Co-authored-by: Islam Almersawi <[email protected]> Co-authored-by: Łukasz Durejko <[email protected]> Co-authored-by: maobaolong <[email protected]> Co-authored-by: Shawn Huang <[email protected]> Co-authored-by: huangyuxiang03 <[email protected]> Co-authored-by: chunxiaozheng <[email protected]>

…pythonic parser (vllm-project#17917) Signed-off-by: Kai Wu <[email protected]> Signed-off-by: minpeter <[email protected]>

mergify bot added documentation Improvements or additions to documentation frontend tool-calling labels May 9, 2025

github-project-automation bot added this to Tool Calling May 9, 2025

houseroad requested a review from yeqcharlotte May 9, 2025 18:55

yeqcharlotte reviewed May 9, 2025

View reviewed changes

examples/tool_chat_template_llama4_pythonic.jinja Outdated Show resolved Hide resolved

vllm/entrypoints/openai/tool_parsers/pythonic_tool_parser.py Outdated Show resolved Hide resolved

bbrowning reviewed May 13, 2025

View reviewed changes

vllm/entrypoints/openai/tool_parsers/pythonic_tool_parser.py Outdated Show resolved Hide resolved

bbrowning mentioned this pull request May 14, 2025

fix: multiple tool calls in remote-vllm chat_completion llamastack/llama-stack#2161

Merged

wukaixingxp marked this pull request as ready for review May 17, 2025 01:25

wukaixingxp requested a review from yeqcharlotte May 17, 2025 01:25

wukaixingxp changed the title ~~WIP: fix_llama4_tool_call~~ [Frontend] Update llama4 pythonic jinja template and llama4_pythonic parser May 17, 2025

wukaixingxp requested a review from bbrowning May 20, 2025 22:26

rebased to solve DCO problem

0471aed

Signed-off-by: Kai Wu <[email protected]>

wukaixingxp force-pushed the fix_llama4_tool_call branch from 5be9c0e to 0471aed Compare May 20, 2025 23:06

update template and tests

8e6c2bf

Signed-off-by: Kai Wu <[email protected]>

wukaixingxp requested review from DarkLight1337, robertgshaw2-redhat and simon-mo as code owners May 21, 2025 05:22

fix readme and init file

5ae3ffb

Signed-off-by: Kai Wu <[email protected]>

aarnphm approved these changes May 21, 2025

View reviewed changes

examples/tool_chat_template_llama4_pythonic.jinja Show resolved Hide resolved

examples/tool_chat_template_llama4_pythonic.jinja Outdated Show resolved Hide resolved

wukaixingxp added 3 commits May 21, 2025 08:43

fix tests and jinja

1ad3b71

Signed-off-by: Kai Wu <[email protected]>

fix tests

1686299

Signed-off-by: Kai Wu <[email protected]>

fix tests

720dbe0

Signed-off-by: Kai Wu <[email protected]>

wukaixingxp requested a review from yeqcharlotte May 21, 2025 17:02

fix tests

43eacd3

Signed-off-by: Kai Wu <[email protected]>

yeqcharlotte approved these changes May 21, 2025

View reviewed changes

houseroad approved these changes May 21, 2025

View reviewed changes

houseroad changed the title ~~[Frontend] Update llama4 pythonic jinja template and llama4_pythonic parser~~ [Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser May 21, 2025

Merge branch 'vllm-project:main' into fix_llama4_tool_call

6bd51c0

bbrowning reviewed May 22, 2025

View reviewed changes

fix jinja tool

8a32e2a

Signed-off-by: Kai Wu <[email protected]>

houseroad merged commit c91fe7b into vllm-project:main May 22, 2025
64 checks passed

github-project-automation bot moved this to Done in Tool Calling May 22, 2025

zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025

[Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_…

7a01476

…pythonic parser (vllm-project#17917) Signed-off-by: Kai Wu <[email protected]> Signed-off-by: Yuqi Zhang <[email protected]>

upfixer mentioned this pull request May 28, 2025

update llama4 chat template and pythonic parser sgl-project/sglang#6679

Merged

6 tasks

yeqcharlotte mentioned this pull request May 31, 2025

[Usage]: Llama4 tool parser #16214

Open

1 task

Uh oh!

[Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser #17917

[Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser #17917

Uh oh!

Conversation

wukaixingxp commented May 9, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented May 9, 2025

Uh oh!

yeqcharlotte left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yeqcharlotte commented May 14, 2025

Uh oh!

FWao commented May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wukaixingxp commented May 15, 2025

Uh oh!

wukaixingxp commented May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bbrowning commented May 16, 2025

Uh oh!

wukaixingxp commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bbrowning commented May 16, 2025

Uh oh!

wukaixingxp commented May 16, 2025

Uh oh!

wukaixingxp commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wukaixingxp commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wukaixingxp commented May 19, 2025

Uh oh!

DarkLight1337 commented May 21, 2025

Uh oh!

aarnphm left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

wukaixingxp commented May 21, 2025

Uh oh!

yeqcharlotte left a comment

Choose a reason for hiding this comment

Uh oh!

DarkLight1337 commented May 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wukaixingxp commented May 22, 2025

Uh oh!

houseroad commented May 22, 2025

Uh oh!

bbrowning May 22, 2025

Choose a reason for hiding this comment

Uh oh!

wukaixingxp May 22, 2025

Choose a reason for hiding this comment

Uh oh!

wukaixingxp commented May 22, 2025

Uh oh!

Uh oh!

Uh oh!

wukaixingxp commented May 9, 2025 •

edited by github-actions bot

Loading

FWao commented May 15, 2025 •

edited

Loading

wukaixingxp commented May 15, 2025 •

edited

Loading

wukaixingxp commented May 16, 2025 •

edited

Loading

wukaixingxp commented May 16, 2025 •

edited

Loading

wukaixingxp commented May 16, 2025 •

edited

Loading

aarnphm left a comment •

edited

Loading

DarkLight1337 commented May 21, 2025 •

edited

Loading