Description
🚀 The feature, motivation and pitch
I'm running vLLM for production LLM hosting and would like to cap max_tokens (the total number of generated output tokens) for all requests. Currently, when using the OpenAI API server, default_max_tokens is calculated as context_window - prompt_tokens. However, for models like Llama-3.1, which has a 128K context window, this is far too large.
Alternatives
One potential solution would be to allow max_new_tokens to be specified in the generation_config.json file, which is read at launch time. This value could then become the server's default max_tokens. Currently, only repetition_penalty, temperature, top_k, top_p, and min_p seem to be supported from generation_config.json.
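For illustration, a minimal sketch of how such a field could be read at launch time; the max_new_tokens field name and the helper itself are assumptions for this proposal, not existing vLLM behavior:

```python
import json
from pathlib import Path
from typing import Optional


def load_generation_max_tokens(model_path: str) -> Optional[int]:
    """Read an optional max_new_tokens cap from the model's generation_config.json.

    Returns None if the file or the (hypothetical) field is absent.
    """
    config_file = Path(model_path) / "generation_config.json"
    if not config_file.is_file():
        return None
    with config_file.open() as f:
        config = json.load(f)
    return config.get("max_new_tokens")
```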
The code in openai/serving_completion would need to take the minimum of max_model_len - prompt_tokens and generation_max_tokens when computing the default.
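Roughly, with generation_max_tokens being the value loaded above (names are illustrative only):

```python
from typing import Optional


def default_max_tokens(max_model_len: int,
                       prompt_tokens: int,
                       generation_max_tokens: Optional[int]) -> int:
    """Default output budget: the remaining context, optionally capped by the config value."""
    remaining = max_model_len - prompt_tokens
    if generation_max_tokens is None:
        return remaining
    return min(remaining, generation_max_tokens)
```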
In addition, openai/protocol would need to ensure that a client's requested max_tokens cannot exceed this default_max_tokens value.
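On the protocol side, a request asking for more than the server-side default could be clamped (or alternatively rejected with a validation error); a sketch of the clamping variant:

```python
from typing import Optional


def resolve_request_max_tokens(requested: Optional[int], server_default: int) -> int:
    """Clamp the client's max_tokens to the server-side default (sketch only)."""
    if requested is None:
        return server_default
    return min(requested, server_default)
```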
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.