chaunceyjiang (Collaborator) commented Apr 30, 2025

Fix #14088

Deprecation warning test:

# vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser qwen3
INFO 04-30 06:47:00 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-30 06:47:04 [arg_utils.py:60] The parameter --enable-reasoning is deprecated.
INFO 04-30 06:47:04 [api_server.py:1042] vLLM API server version 0.8.5.dev84+g9c1d5b456
INFO 04-30 06:47:04 [api_server.py:1043] args: Namespace(subparser='serve', model_tag='Qwen/Qwen3-8B', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen3-8B', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, served_model_name=None, disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, guided_decoding_backend='xgrammar', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, reasoning_parser='qwen3', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, use_v2_block_manager=True, disable_log_stats=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', additional_config=None, enable_reasoning=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, 
enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f7abd0e1da0>)
....
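For readers who want the pattern in isolation, below is a minimal argparse sketch of the deprecation behavior exercised above: the old --enable-reasoning switch is still accepted but only emits a warning, while --reasoning-parser is the flag that actually matters. This is an illustrative stand-in, not vLLM's real arg_utils.py code; the make_parser/parse_args names and the exact warning text are assumptions.

import argparse
import warnings


def make_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch; vLLM's actual CLI wiring lives in arg_utils.py.
    parser = argparse.ArgumentParser(prog="demo-serve")
    # Deprecated boolean switch, kept only so old command lines keep working.
    parser.add_argument(
        "--enable-reasoning",
        action="store_true",
        default=None,
        help="[DEPRECATED] Use --reasoning-parser instead.",
    )
    # The replacement: choosing a parser is what enables reasoning handling.
    parser.add_argument(
        "--reasoning-parser",
        choices=["deepseek_r1", "granite", "qwen3"],
        default="",
        help="Select the reasoning parser depending on the model you're using.",
    )
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = make_parser().parse_args(argv)
    if args.enable_reasoning is not None:
        # Mirrors the WARNING line shown in the log above.
        warnings.warn("The parameter --enable-reasoning is deprecated.",
                      DeprecationWarning, stacklevel=2)
    return args


if __name__ == "__main__":
    print(parse_args(["--enable-reasoning", "--reasoning-parser", "qwen3"]))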

Help text test:

# vllm serve --help
  --reasoning-parser {deepseek_r1,granite,qwen3}
                        Select the reasoning parser depending on the model that you're using. This is used to parse the reasoning content into OpenAI API format. (default: )

Serve test:

# vllm serve Qwen/Qwen3-8B  --reasoning-parser qwen3
...
INFO 04-30 08:42:01 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 04-30 08:42:01 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 04-30 08:42:01 [launcher.py:36] Route: /invocations, Methods: POST
INFO 04-30 08:42:01 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [4133173]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Client:

from pydantic import BaseModel
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "Bearer skxx"
openai_api_base = "http://localhost:8000/v1"

class Step(BaseModel):
    ground_truth_key_ideas: str
    system_response_key_ideas: str
    discussion: str
    recall: float
    precision: float


if __name__ == '__main__':
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    # client.chat.completions.create
    json_schema = Step.model_json_schema()

    chat_response = client.beta.chat.completions.parse(
        model="",
        messages=[
            {'role': 'system',
            'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n  "system_response_key_ideas": "{system_response_key_ideas}",\n  "discussion": "{discussion}",\n  "recall": "{recall}        # note: the value you produce must be a single float value",\n  "precision": "{precision}        # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n        Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
            {'role': 'user',
            'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
        ],
        temperature=0.0,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print("-----")
    print("Chat response:", chat_response.choices[0].message.reasoning_content)

Output:

# python test.py
Chat response: 
Okay, let's tackle this. First, I need to extract the key ideas from the ground truth and the system response. 

Starting with the ground truth. The main points are: the action space equals the vocabulary size, which is large (tens of thousands of tokens). Then it mentions that real-world locomotion can be condensed into three axes or combinations. Also, the authors note that typical RL problems have smaller action spaces but don't explain how they're modeled.
.....

Automated bot comment:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gaocegege (Contributor) commented:

/cc @aarnphm

Maybe you are interested.

gaocegege (Contributor) commented:

Can you also run a manual test without the reasoning parser?

vllm serve Qwen/Qwen3-8B

aarnphm (Collaborator) left a review comment:

Overall LGTM, but let me finish the other PR with respect to deprecated tags in args.

hmellor (Member) left a review comment:

.

chaunceyjiang (Collaborator, Author) commented:

Test (without reasoning parser):

vllm serve Qwen/Qwen3-8B
  curl -X POST http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{ 
             "messages": [{"role": "user", "content": "Hello, vLLM!"}],
             "max_tokens": 240
           }' |jq
{
  "id": "chatcmpl-24eee75f346e44c19342da94081275f8",
  "object": "chat.completion",
  "created": 1746025506,
  "model": "Qwen/Qwen3-8B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "<think>\nOkay, the user said \"Hello, vLLM!\" and I need to respond appropriately. First, I should acknowledge their greeting. Since vLLM is a large language model, I should mention that I'm a Qwen model. Maybe they confused the name, so I should clarify that. I should keep the response friendly and open-ended, inviting them to ask questions. Let me check if there's anything else I need to consider. No, just a simple greeting and clarification should do. Alright, time to put it all together.\n</think>\n\nHello! I'm Qwen, a large language model developed by Alibaba Cloud. If you have any questions or need assistance, feel free to ask me! 😊",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "total_tokens": 161,
    "completion_tokens": 147,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Test (with reasoning parser):

vllm serve Qwen/Qwen3-8B --reasoning-parser qwen3
  curl -X POST http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{ 
             "messages": [{"role": "user", "content": "Hello, vLLM!"}],
             "max_tokens": 240
           }' |jq
{
  "id": "chatcmpl-ab9fdf511b3f4791ac4dd52a6b6d42b2",
  "object": "chat.completion",
  "created": 1746025382,
  "model": "Qwen/Qwen3-8B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "\nOkay, the user said \"Hello, vLLM!\" So first, I need to figure out what they're asking. They might be greeting me, but I should check if they're referring to the vLLM framework or maybe another model. Wait, I'm Qwen, not vLLM. Maybe they confused me with vLLM. I should clarify that. Let me make sure I respond correctly. I should greet them back and explain that I'm Qwen, not vLLM. Then offer assistance with their needs. Keep it friendly and helpful.\n",
        "content": "\n\nHello! I'm Qwen, a large language model developed by Alibaba Cloud. I'm not vLLM, but I'm here to help you with any questions or tasks you might have. How can I assist you today? 😊",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "total_tokens": 182,
    "completion_tokens": 168,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
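To make the difference between the two runs above explicit on the client side: without --reasoning-parser the reasoning stays inline inside <think>...</think> in content and reasoning_content is null, whereas with --reasoning-parser qwen3 the server splits it out into reasoning_content. The sketch below handles both cases; it assumes the same local server and model as the curl tests, and the regex fallback is only an illustration, not something this PR adds.

import re

import requests

# Assumes the local server started in the manual tests above.
URL = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Hello, vLLM!"}],
    "max_tokens": 240,
}

message = requests.post(URL, json=payload, timeout=60).json()["choices"][0]["message"]

reasoning = message.get("reasoning_content")
content = message["content"]
if reasoning is None:
    # Server launched without --reasoning-parser: the reasoning is still
    # embedded in the content as <think>...</think>, so strip it out here.
    match = re.search(r"<think>(.*?)</think>", content, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        content = content[match.end():].strip()

print("reasoning:", reasoning)
print("answer:", content)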

hmellor (Member) commented May 1, 2025:

The failing V1 test looks unrelated and has been fixed in #17500.

Let's wait and see for the entrypoints test.

DarkLight1337 (Member) commented:

The test failure is persistent, PTAL

auto-merge was automatically disabled May 1, 2025 10:37

Head branch was pushed to by a user without write access

chaunceyjiang force-pushed the remove-enable-reasoning branch from 36b464b to 91704e2 on May 1, 2025 10:37
chaunceyjiang (Collaborator, Author) commented:

The test case distributed-tests-2-gpus seems unrelated to my PR.

vllm-bot merged commit 98060b0 into vllm-project:main on May 1, 2025
54 of 56 checks passed
chaunceyjiang deleted the remove-enable-reasoning branch on May 1, 2025 13:48
radeksm pushed a commit to radeksm/vllm that referenced this pull request May 2, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Labels: documentation, frontend, ready, structured-output, tool-calling

Projects: Done

Development: Successfully merging this pull request may close the following issue:

[Feature][Frontend]: Deprecate --enable-reasoning

6 participants