
@TheCodeWrangler TheCodeWrangler commented Apr 24, 2025

Add Priority Request Support for vLLM Async Engine

Description

This PR adds support for priority-based request scheduling in the vLLM async engine. When the engine is configured with the scheduler policy set to priority, the .generate() method accepts a per-request priority, where lower values are scheduled first. This PR exposes that through an optional priority input tensor (defaulting to 0), which is forwarded to the engine's generate() call.

Motivation

In applications where multiple sources submit work to the vLLM backend with different priorities, it is desirable to have the most time-sensitive work performed first. This feature allows users to:

  • Prioritize critical requests over background tasks
  • Implement different service level agreements (SLAs) for different types of requests
  • Better manage system resources by processing high-priority requests first

Changes

  1. Added an optional priority input tensor to the model configuration:

    {
        "name": "priority",
        "data_type": "TYPE_INT32",
        "dims": [1],
        "optional": True
    }
  2. Modified the _generate method to handle the priority parameter:

    # Default to priority 0 when the optional input is absent.
    # Use an explicit None check: "if not priority" would also
    # (pointlessly) reassign a legitimate priority of 0.
    if priority is None:
        priority = 0
    response_iterator = self._llm_engine.generate(
        prompt,
        sampling_params,
        request_id,
        lora_request=lora_request,
        priority=priority,
    )
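For context, extracting the optional input on the backend side might look like the sketch below. The `extract_priority` helper and its argument are illustrative, not the PR's exact code; in the actual Triton Python backend the tensor would come from `pb_utils.get_input_tensor_by_name(request, "priority")`, which returns None when the client omits an optional input.

```python
import numpy as np

def extract_priority(priority_tensor):
    """Return an int priority from an optional INT32 [1] tensor.

    priority_tensor is None when the client omitted the optional
    input; fall back to the default priority of 0 in that case.
    """
    if priority_tensor is None:
        return 0
    # The input is declared with dims [1], so take the single element.
    return int(np.asarray(priority_tensor).reshape(-1)[0])
```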

Testing

  • Added unit tests for priority handling
  • Verified that requests with different priorities are processed in the correct order
  • Confirmed that default priority (0) works when priority is not specified
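The ordering these tests verify can be illustrated with a small stand-alone sketch: a priority policy where lower values run first behaves like a min-heap keyed on (priority, arrival order). The class below is a simplified stand-in for illustration only, not vLLM's actual scheduler.

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Minimal stand-in for a priority scheduler: lower priority value
    runs first; ties are broken by arrival order (FIFO)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # arrival sequence for tie-breaking

    def submit(self, request_id, priority=0):
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))

    def next_request(self):
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

q = PriorityRequestQueue()
q.submit("background-job", priority=10)
q.submit("interactive-a")   # default priority 0
q.submit("interactive-b")   # same priority, arrives later
```

Draining the queue yields "interactive-a", then "interactive-b", then "background-job": the default-priority requests run before the background job, and equal priorities preserve submission order.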

Documentation

  • Updated model configuration documentation
  • Added examples of priority usage

