chaunceyjiang (Collaborator) commented Apr 30, 2025

Fix #14088

Deprecation warning test:

# vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser qwen3
INFO 04-30 06:47:00 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-30 06:47:04 [arg_utils.py:60] The parameter --enable-reasoning is deprecated.
INFO 04-30 06:47:04 [api_server.py:1042] vLLM API server version 0.8.5.dev84+g9c1d5b456
INFO 04-30 06:47:04 [api_server.py:1043] args: Namespace(subparser='serve', model_tag='Qwen/Qwen3-8B', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen3-8B', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, served_model_name=None, disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, guided_decoding_backend='xgrammar', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, reasoning_parser='qwen3', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, use_v2_block_manager=True, disable_log_stats=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', additional_config=None, enable_reasoning=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, 
enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f7abd0e1da0>)
....
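For readers who want the pattern in isolation, below is a minimal argparse sketch of the deprecation behavior exercised above: the old --enable-reasoning switch is still accepted but only emits a warning, while --reasoning-parser is the flag that actually matters. This is an illustrative stand-in, not vLLM's real arg_utils.py code; the make_parser/parse_args names and the exact warning text are assumptions.

import argparse
import warnings


def make_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch; vLLM's actual CLI wiring lives in arg_utils.py.
    parser = argparse.ArgumentParser(prog="demo-serve")
    # Deprecated boolean switch, kept only so old command lines keep working.
    parser.add_argument(
        "--enable-reasoning",
        action="store_true",
        default=None,
        help="[DEPRECATED] Use --reasoning-parser instead.",
    )
    # The replacement: choosing a parser is what enables reasoning handling.
    parser.add_argument(
        "--reasoning-parser",
        choices=["deepseek_r1", "granite", "qwen3"],
        default="",
        help="Select the reasoning parser depending on the model you're using.",
    )
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = make_parser().parse_args(argv)
    if args.enable_reasoning is not None:
        # Mirrors the WARNING line shown in the log above.
        warnings.warn("The parameter --enable-reasoning is deprecated.",
                      DeprecationWarning, stacklevel=2)
    return args


if __name__ == "__main__":
    print(parse_args(["--enable-reasoning", "--reasoning-parser", "qwen3"]))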

Help text test:

# vllm serve --help
  --reasoning-parser {deepseek_r1,granite,qwen3}
                        Select the reasoning parser depending on the model that you're using. This is used to parse the reasoning content into OpenAI API format. (default: )

Serve test:

# vllm serve Qwen/Qwen3-8B  --reasoning-parser qwen3
...
INFO 04-30 08:42:01 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 04-30 08:42:01 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 04-30 08:42:01 [launcher.py:36] Route: /invocations, Methods: POST
INFO 04-30 08:42:01 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [4133173]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Client:

from pydantic import BaseModel
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "Bearer skxx"
openai_api_base = "http://localhost:8000/v1"

class Step(BaseModel):
    ground_truth_key_ideas: str
    system_response_key_ideas: str
    discussion: str
    recall: float
    precision: float


if __name__ == '__main__':
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    # client.chat.completions.create
    json_schema = Step.model_json_schema()

    chat_response = client.beta.chat.completions.parse(
        model="",
        messages=[
            {'role': 'system',
            'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n  "system_response_key_ideas": "{system_response_key_ideas}",\n  "discussion": "{discussion}",\n  "recall": "{recall}        # note: the value you produce must be a single float value",\n  "precision": "{precision}        # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n        Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
            {'role': 'user',
            'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
        ],
        temperature=0.0,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print("-----")
    print("Chat response:", chat_response.choices[0].message.reasoning_content)

Output:

# python test.py
Chat response: 
Okay, let's tackle this. First, I need to extract the key ideas from the ground truth and the system response. 

Starting with the ground truth. The main points are: the action space equals the vocabulary size, which is large (tens of thousands of tokens). Then it mentions that real-world locomotion can be condensed into three axes or combinations. Also, the authors note that typical RL problems have smaller action spaces but don't explain how they're modeled.
.....

Automated bot comment:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gaocegege (Contributor) commented:

/cc @aarnphm

Maybe you are interested.

gaocegege (Contributor) commented:

Can you also run a manual test without the reasoning parser?

vllm serve Qwen/Qwen3-8B

aarnphm (Collaborator) left a review comment:

Overall LGTM, but let me finish the other PR with respect to deprecated tags in args.

hmellor (Member) left a review comment:

.

chaunceyjiang (Collaborator, Author) commented:

Test (without reasoning parser):

vllm serve Qwen/Qwen3-8B
  curl -X POST http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{ 
             "messages": [{"role": "user", "content": "Hello, vLLM!"}],
             "max_tokens": 240
           }' |jq
{
  "id": "chatcmpl-24eee75f346e44c19342da94081275f8",
  "object": "chat.completion",
  "created": 1746025506,
  "model": "Qwen/Qwen3-8B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "<think>\nOkay, the user said \"Hello, vLLM!\" and I need to respond appropriately. First, I should acknowledge their greeting. Since vLLM is a large language model, I should mention that I'm a Qwen model. Maybe they confused the name, so I should clarify that. I should keep the response friendly and open-ended, inviting them to ask questions. Let me check if there's anything else I need to consider. No, just a simple greeting and clarification should do. Alright, time to put it all together.\n</think>\n\nHello! I'm Qwen, a large language model developed by Alibaba Cloud. If you have any questions or need assistance, feel free to ask me! 😊",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "total_tokens": 161,
    "completion_tokens": 147,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Test (with reasoning parser):

vllm serve Qwen/Qwen3-8B --reasoning-parser qwen3
  curl -X POST http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{ 
             "messages": [{"role": "user", "content": "Hello, vLLM!"}],
             "max_tokens": 240
           }' |jq
{
  "id": "chatcmpl-ab9fdf511b3f4791ac4dd52a6b6d42b2",
  "object": "chat.completion",
  "created": 1746025382,
  "model": "Qwen/Qwen3-8B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "\nOkay, the user said \"Hello, vLLM!\" So first, I need to figure out what they're asking. They might be greeting me, but I should check if they're referring to the vLLM framework or maybe another model. Wait, I'm Qwen, not vLLM. Maybe they confused me with vLLM. I should clarify that. Let me make sure I respond correctly. I should greet them back and explain that I'm Qwen, not vLLM. Then offer assistance with their needs. Keep it friendly and helpful.\n",
        "content": "\n\nHello! I'm Qwen, a large language model developed by Alibaba Cloud. I'm not vLLM, but I'm here to help you with any questions or tasks you might have. How can I assist you today? 😊",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "total_tokens": 182,
    "completion_tokens": 168,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
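To make the difference between the two runs above explicit on the client side: without --reasoning-parser the reasoning stays inline inside <think>...</think> in content and reasoning_content is null, whereas with --reasoning-parser qwen3 the server splits it out into reasoning_content. The sketch below handles both cases; it assumes the same local server and model as the curl tests, and the regex fallback is only an illustration, not something this PR adds.

import re

import requests

# Assumes the local server started in the manual tests above.
URL = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Hello, vLLM!"}],
    "max_tokens": 240,
}

message = requests.post(URL, json=payload, timeout=60).json()["choices"][0]["message"]

reasoning = message.get("reasoning_content")
content = message["content"]
if reasoning is None:
    # Server launched without --reasoning-parser: the reasoning is still
    # embedded in the content as <think>...</think>, so strip it out here.
    match = re.search(r"<think>(.*?)</think>", content, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        content = content[match.end():].strip()

print("reasoning:", reasoning)
print("answer:", content)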

hmellor (Member) commented May 1, 2025:

The failing V1 test looks unrelated and has been fixed in #17500.

Let's wait and see for the entrypoints test.

DarkLight1337 (Member) commented:

The test failure is persistent, PTAL

auto-merge was automatically disabled May 1, 2025 10:37

Head branch was pushed to by a user without write access

chaunceyjiang force-pushed the remove-enable-reasoning branch from 36b464b to 91704e2 on May 1, 2025 10:37
chaunceyjiang (Collaborator, Author) commented:

The test case distributed-tests-2-gpus seems unrelated to my PR.

vllm-bot merged commit 98060b0 into vllm-project:main on May 1, 2025
54 of 56 checks passed
chaunceyjiang deleted the remove-enable-reasoning branch on May 1, 2025 13:48
radeksm pushed a commit to radeksm/vllm that referenced this pull request May 2, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Labels: documentation, frontend, ready, structured-output, tool-calling

Projects: Done

Development: Successfully merging this pull request may close the following issue:

[Feature][Frontend]: Deprecate --enable-reasoning

6 participants