Proposal: LLMRateLimitingPlugin to simulate token-based throttling #1309

@waldekmastykarz

Description

Proposal

Introduce a new plugin named LLMRateLimitingPlugin that simulates throttling based on the number of input and output tokens consumed within a specified timeframe. This plugin will let developers test how their applications behave when token limits are exceeded, similar to how the existing RateLimitingPlugin works for request/response rates.

Key Features:

  • Simulate throttling based on configurable token limits (input and output tokens) within a user-defined timeframe.
  • Allow developers to configure thresholds and time windows for token consumption.
  • Provide feedback/response when token limits are exceeded, mirroring real-world LLM API behavior.
  • Useful for preparing applications for production LLM rate limits and improving resilience.
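As a rough illustration of the configurable thresholds listed above, the plugin's configuration might follow the same JSON style as the existing RateLimitingPlugin. All property names below are hypothetical; only the general shape (a plugin entry plus a named config section) mirrors how existing plugins are configured:

```json
{
  "plugins": [
    {
      "name": "LLMRateLimitingPlugin",
      "enabled": true,
      "configSection": "llmRateLimiting"
    }
  ],
  "llmRateLimiting": {
    "promptTokenLimit": 5000,
    "completionTokenLimit": 5000,
    "resetTimeWindowSeconds": 60
  }
}
```

Separate limits for input (prompt) and output (completion) tokens are shown here because many LLM providers meter the two independently.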

Motivation

The existing RateLimitingPlugin focuses on request/response rates. Many LLM providers, however, enforce limits on tokens rather than request counts. This new plugin will help developers proactively identify and handle token-based throttling scenarios before they occur in production.

Reference

  • The new plugin should be modeled similarly to the existing RateLimitingPlugin, but should focus on tokens instead of requests/responses.
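To make the intended behavior concrete, the token-window accounting could be sketched as a sliding window of recent token usage. This is a minimal illustration, not the plugin's actual implementation; all names (TokenRateLimiter, record, should_throttle) are hypothetical:

```python
import time
from collections import deque


class TokenRateLimiter:
    """Sliding-window token budget (hypothetical sketch of the proposal).

    Records the tokens consumed by each request and reports when the
    configured budget for the current time window is exhausted, at which
    point the plugin would return a throttled (e.g. 429-style) response.
    """

    def __init__(self, tokens_per_window: int, window_seconds: float):
        self.tokens_per_window = tokens_per_window
        self.window_seconds = window_seconds
        self._events: deque = deque()  # (timestamp, tokens) pairs

    def _prune(self, now: float) -> None:
        # Drop token usage that has aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            self._events.popleft()

    def record(self, tokens: int, now: float = None) -> None:
        # Count the input + output tokens of a completed request.
        now = time.monotonic() if now is None else now
        self._prune(now)
        self._events.append((now, tokens))

    def should_throttle(self, now: float = None) -> bool:
        # True once the window's token budget is exhausted.
        now = time.monotonic() if now is None else now
        self._prune(now)
        return sum(t for _, t in self._events) >= self.tokens_per_window
```

For example, with a budget of 100 tokens per 60 seconds, two requests consuming 40 and 70 tokens would trip the limiter, and it would clear again once the window has passed.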
