Proposal
Introduce a new plugin named LLMRateLimitingPlugin that simulates throttling based on the number of input and output tokens within a specified timeframe. This plugin will enable developers to verify and test how their applications behave when token limits are exceeded, similar to how the existing RateLimitingPlugin works for request/response rates.
Key Features:
- Simulate throttling based on configurable token limits (input and output tokens) within a user-defined timeframe.
- Allow developers to configure thresholds and time windows for token consumption.
- Provide feedback/response when token limits are exceeded, mirroring real-world LLM API behavior.
- Useful for preparing applications for production LLM rate limits and improving resilience.
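The core mechanic the features above describe is a token budget tracked over a sliding time window. A minimal sketch of that logic in Python follows; the class name, method names, and parameters are all hypothetical illustrations, not part of any existing Dev Proxy API.

```python
import time
from collections import deque


class TokenRateLimiter:
    """Hypothetical sketch: tracks input and output tokens consumed
    within a sliding time window and reports when the configured
    budget is exhausted."""

    def __init__(self, token_limit: int, window_seconds: float):
        self.token_limit = token_limit        # max tokens allowed per window
        self.window_seconds = window_seconds  # length of the sliding window
        self._events = deque()                # (timestamp, tokens) pairs

    def _prune(self, now: float) -> None:
        # Drop token counts that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            self._events.popleft()

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Both prompt (input) and completion (output) tokens count
        # against the same budget, mirroring provider behavior.
        now = time.monotonic()
        self._prune(now)
        self._events.append((now, input_tokens + output_tokens))

    def tokens_used(self) -> int:
        self._prune(time.monotonic())
        return sum(tokens for _, tokens in self._events)

    def is_throttled(self) -> bool:
        # True once the budget for the current window is exhausted.
        return self.tokens_used() >= self.token_limit
```

For example, with a budget of 1,000 tokens per 60-second window, a request consuming 400 input and 500 output tokens leaves the limiter open, while a subsequent 120-token request tips it into the throttled state until old entries age out.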
Motivation
The existing RateLimitingPlugin focuses on request/response rates, but many LLM providers enforce limits on the number of tokens consumed rather than on request counts. This new plugin will help developers proactively identify and handle token-based throttling scenarios.
Reference
- The new plugin should be modeled on the existing RateLimitingPlugin, but should track tokens instead of requests/responses.
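When the budget is exceeded, the plugin could return a 429-style response, as the feature list suggests. The shape below is a hedged sketch loosely modeled on common LLM provider throttling responses; the function name, header, and body fields are assumptions, not a specification of the plugin's actual output.

```python
def build_throttled_response(retry_after_seconds: int) -> dict:
    """Hypothetical 429 response the plugin might simulate when the
    token budget for the current window is exhausted."""
    return {
        "statusCode": 429,
        "headers": {
            # Tells the client how long to wait before retrying,
            # i.e. until enough tokens age out of the window.
            "Retry-After": str(retry_after_seconds),
        },
        "body": {
            "error": {
                "code": "429",
                "message": (
                    "Token rate limit exceeded. "
                    f"Retry after {retry_after_seconds} seconds."
                ),
            }
        },
    }
```

Returning a `Retry-After` header lets client-side resilience logic (backoff, queuing) be exercised exactly as it would be against a real provider.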