Description
🚀 The feature, motivation and pitch
I'm running vLLM for production LLM hosting and would like to cap max_tokens (the total number of generated output tokens) for all requests. Currently, when using the OpenAI API server, default_max_tokens is calculated as context_window - prompt_tokens. However, for models like Llama-3.1, which has a 128K context window, this is far too large.
Alternatives
One potential solution would be to allow max_new_tokens to be specified in the generation_config.json file, which is read at launch time. This value could then become the server's default max_tokens. Currently, only repetition_penalty, temperature, top_k, top_p, and min_p seem to be supported from generation_config.json.
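For illustration, a minimal sketch of how such a field could be read at launch time; the max_new_tokens field name and the helper itself are assumptions for this proposal, not existing vLLM behavior:

```python
import json
from pathlib import Path
from typing import Optional


def load_generation_max_tokens(model_path: str) -> Optional[int]:
    """Read an optional max_new_tokens cap from the model's generation_config.json.

    Returns None if the file or the (hypothetical) field is absent.
    """
    config_file = Path(model_path) / "generation_config.json"
    if not config_file.is_file():
        return None
    with config_file.open() as f:
        config = json.load(f)
    return config.get("max_new_tokens")
```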
The code in openai/serving_completion would need to take the minimum of max_model_len - prompt_tokens and generation_max_tokens when computing the default.
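Roughly, with generation_max_tokens being the value loaded above (names are illustrative only):

```python
from typing import Optional


def default_max_tokens(max_model_len: int,
                       prompt_tokens: int,
                       generation_max_tokens: Optional[int]) -> int:
    """Default output budget: the remaining context, optionally capped by the config value."""
    remaining = max_model_len - prompt_tokens
    if generation_max_tokens is None:
        return remaining
    return min(remaining, generation_max_tokens)
```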
In addition, openai/protocol would need to ensure that a client's requested max_tokens cannot exceed this default_max_tokens value.
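On the protocol side, a request asking for more than the server-side default could be clamped (or alternatively rejected with a validation error); a sketch of the clamping variant:

```python
from typing import Optional


def resolve_request_max_tokens(requested: Optional[int], server_default: int) -> int:
    """Clamp the client's max_tokens to the server-side default (sketch only)."""
    if requested is None:
        return server_default
    return min(requested, server_default)
```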
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.