[Frontend] Add chunked processing to handle long inputs in embedding models #20837

x22x22 · 2025-07-11T18:58:03Z

…g, and update relevant documentation and examples. New example scripts and service startup scripts are added to demonstrate how to configure and use chunked processing. Update the model configuration to support long-text processing and implement the chunked processing logic in the code.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Add chunked processing support for long text embeddings to resolve CUDA crashes when the input text exceeds the model's maximum context length.

Problem Solved

CUDA crashes: vLLM embedding service crashes when processing text longer than max_model_len
Limited input length: No native support for handling arbitrarily long text in embedding models
Memory constraints: Large inputs cause out-of-memory errors during embedding generation

Solution

This PR implements automatic chunked processing at the serving layer that:

✅ Automatically detects when input exceeds model limits
✅ Splits long text into manageable chunks at token boundaries
✅ Processes each chunk independently to avoid memory issues
✅ Aggregates results using FastChat-style weighted averaging
✅ Maintains backward compatibility for short text inputs
✅ Requires zero changes to existing model implementations
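The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the PR's implementation: `split_into_chunks`, `weighted_average`, `embed_long_input`, and the `embed_chunk` callback are hypothetical names, and the toy embedding function stands in for a real model call.

```python
from typing import Callable, List, Sequence


def split_into_chunks(token_ids: Sequence[int], max_len: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(token_ids[i:i + max_len]) for i in range(0, len(token_ids), max_len)]


def weighted_average(embeddings: Sequence[Sequence[float]],
                     weights: Sequence[int]) -> List[float]:
    """Average chunk embeddings, weighting each one by its token count."""
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(vec[d] * w for vec, w in zip(embeddings, weights)) / total
        for d in range(dim)
    ]


def embed_long_input(token_ids: Sequence[int], max_len: int,
                     embed_chunk: Callable[[List[int]], List[float]]) -> List[float]:
    """Embed directly when the input fits; otherwise chunk, embed, and aggregate."""
    if len(token_ids) <= max_len:
        # Short inputs take the original single-request path.
        return list(embed_chunk(list(token_ids)))
    chunks = split_into_chunks(token_ids, max_len)
    vectors = [embed_chunk(chunk) for chunk in chunks]
    return weighted_average(vectors, [len(chunk) for chunk in chunks])


def toy_embed(chunk: List[int]) -> List[float]:
    """Hypothetical stand-in for a model call: (chunk length, mean token id)."""
    return [float(len(chunk)), sum(chunk) / len(chunk)]


tokens = list(range(10))
chunks = split_into_chunks(tokens, 4)
result = embed_long_input(tokens, 4, toy_embed)
print(chunks)
print(result)
```

With the toy embedding, the second component of `result` is 4.5, the mean of all ten token ids. That is the point of weighting by token count: a short trailing chunk does not pull the aggregate as hard as a full-length chunk.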

Key Features

Zero model code modification: All logic implemented in serving layer
Configurable: Enabled via enable_chunked_processing: true in pooler config
Smart aggregation: Token count-based weighted averaging preserves semantic quality
Production ready: Comprehensive error handling and logging
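As a rough illustration of the configuration switch, the sketch below builds a pooler config containing the `enable_chunked_processing` key named above and shows one way the trigger condition could be expressed. Only that key and `max_model_len` come from this description; the `needs_chunking` helper is hypothetical, and how the JSON reaches the server is not shown here.

```python
import json
from typing import Any, Dict

# Only `enable_chunked_processing` is taken from the PR description.
pooler_config: Dict[str, Any] = {"enable_chunked_processing": True}

# Serialized form, e.g. for passing the config as a JSON string.
pooler_config_json = json.dumps(pooler_config)


def needs_chunking(num_tokens: int, max_model_len: int,
                   config: Dict[str, Any]) -> bool:
    """Hypothetical check: chunk only when enabled and the input is too long."""
    enabled = bool(config.get("enable_chunked_processing", False))
    return enabled and num_tokens > max_model_len


print(pooler_config_json)
print(needs_chunking(2000, 512, pooler_config))  # long input, feature enabled
print(needs_chunking(100, 512, pooler_config))   # short input, unchanged path
```

Leaving the key out (or setting it to false) makes `needs_chunking` return `False` for every input, which mirrors the opt-in behavior described in the list above.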

Supported Models

intfloat/multilingual-e5-large (initially)
Extensible architecture for other embedding models

This enables vLLM to handle embedding requests of any length without crashes, significantly expanding its utility for RAG applications and long document processing.

Test Plan

Long Text Embedding with Chunked Processing

Test Result

Before modification

serve

ERROR 07-12 02:52:36 [engine.py:165] RuntimeError('CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, alpha_ptr, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, beta_ptr, c, CUDA_R_16F, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`')
ERROR 07-12 02:52:36 [engine.py:165] Traceback (most recent call last):
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/engine/multiprocessing/engine.py", line 163, in start 
ERROR 07-12 02:52:36 [engine.py:165]     self.run_engine_loop()
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/engine/multiprocessing/engine.py", line 226, in run_engine_loop
ERROR 07-12 02:52:36 [engine.py:165]     request_outputs = self.engine_step()
ERROR 07-12 02:52:36 [engine.py:165]                       ^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/engine/multiprocessing/engine.py", line 252, in engine_step
ERROR 07-12 02:52:36 [engine.py:165]     raise e
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/engine/multiprocessing/engine.py", line 235, in engine_step
ERROR 07-12 02:52:36 [engine.py:165]     return self.engine.step()
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/engine/llm_engine.py", line 1356, in step
ERROR 07-12 02:52:36 [engine.py:165]     outputs = self.model_executor.execute_model(
ERROR 07-12 02:52:36 [engine.py:165]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/executor/executor_base.py", line 141, in execute_model
ERROR 07-12 02:52:36 [engine.py:165]     output = self.collective_rpc("execute_model",
ERROR 07-12 02:52:36 [engine.py:165]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-12 02:52:36 [engine.py:165]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-12 02:52:36 [engine.py:165]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/utils/__init__.py", line 2943, in run_method
ERROR 07-12 02:52:36 [engine.py:165]     return func(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 07-12 02:52:36 [engine.py:165]     output = self.model_runner.execute_model(
ERROR 07-12 02:52:36 [engine.py:165]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-12 02:52:36 [engine.py:165]     return func(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/worker/pooling_model_runner.py", line 119, in execute_model
ERROR 07-12 02:52:36 [engine.py:165]     hidden_or_intermediate_states = model_executable(
ERROR 07-12 02:52:36 [engine.py:165]                                     ^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return self._call_impl(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return forward_call(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/model_executor/models/bert.py", line 415, in forward
ERROR 07-12 02:52:36 [engine.py:165]     return self.model(input_ids=input_ids,
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return self._call_impl(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return forward_call(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/model_executor/models/bert.py", line 350, in forward
ERROR 07-12 02:52:36 [engine.py:165]     return self.encoder(hidden_states)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/compilation/decorators.py", line 246, in __call__
ERROR 07-12 02:52:36 [engine.py:165]     model_output = self.forward(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/model_executor/models/bert.py", line 114, in forward
ERROR 07-12 02:52:36 [engine.py:165]     def forward(
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return self._call_impl(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return forward_call(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 07-12 02:52:36 [engine.py:165]     return fn(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/fx/graph_module.py", line 830, in call_wrapped
ERROR 07-12 02:52:36 [engine.py:165]     return self._wrapped_call(self, *args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/fx/graph_module.py", line 406, in __call__
ERROR 07-12 02:52:36 [engine.py:165]     raise e
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/fx/graph_module.py", line 393, in __call__
ERROR 07-12 02:52:36 [engine.py:165]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return self._call_impl(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return forward_call(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "<eval_with_key>.2", line 294, in forward
ERROR 07-12 02:52:36 [engine.py:165]     submod_0 = self.submod_0(l_hidden_states_,...l_self_modules_layer_modules_23_modules_output_modules_layer_norm_parameters_bias_ = None
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/compilation/cuda_piecewise_backend.py", line 117, in __call__
ERROR 07-12 02:52:36 [engine.py:165]     return self.compiled_graph_for_general_shape(*args)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 2143, in wrapper
ERROR 07-12 02:52:36 [engine.py:165]     return pytree.tree_unflatten(compiled_fn(*args, **kwargs), spec)
ERROR 07-12 02:52:36 [engine.py:165]                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 07-12 02:52:36 [engine.py:165]     return fn(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 1201, in forward
ERROR 07-12 02:52:36 [engine.py:165]     return compiled_fn(full_args)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
ERROR 07-12 02:52:36 [engine.py:165]     all_outs = call_func_at_runtime_with_args(
ERROR 07-12 02:52:36 [engine.py:165]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
ERROR 07-12 02:52:36 [engine.py:165]     out = normalize_as_list(f(args))
ERROR 07-12 02:52:36 [engine.py:165]                             ^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
ERROR 07-12 02:52:36 [engine.py:165]     outs = compiled_fn(args)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
ERROR 07-12 02:52:36 [engine.py:165]     return compiled_fn(runtime_args)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_inductor/output_code.py", line 460, in __call__
ERROR 07-12 02:52:36 [engine.py:165]     return self.current_callable(inputs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_inductor/utils.py", line 2404, in run
ERROR 07-12 02:52:36 [engine.py:165]     return model(new_inputs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/hs_data/.cache/vllm/torch_compile_cache/12188d34d2/rank_0_0/inductor_cache/xq/cxqsnh7zlyb6wqrdkusizoacfp34wawoczfn2qrddhljgmde7x2e.py", line 520, in call
ERROR 07-12 02:52:36 [engine.py:165]     extern_kernels.mm(reinterpret_tensor(buf1, (s0, 1024), (1024, 1), 0), reinterpret_tensor(arg4_1, (1024, 1024), (1, 1024), 0), out=buf4)
ERROR 07-12 02:52:36 [engine.py:165] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, alpha_ptr, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, beta_ptr, c, CUDA_R_16F, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[rank0]:[W712 02:52:37.923419125 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2509407]

After modification

serve

INFO 07-12 02:20:40 [logger.py:43] Received request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-0: prompt: '', params: PoolingParams(dimensions=None, use_cross_encoder=False, additional_metadata=None), prompt_token_ids: [0, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 
7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-12 02:20:40 [logger.py:43] Received request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-1: prompt: '', params: PoolingParams(dimensions=None, use_cross_encoder=False, additional_metadata=None), prompt_token_ids: [7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 
214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-12 02:20:40 [logger.py:43] Received request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-2: prompt: '', params: PoolingParams(dimensions=None, use_cross_encoder=False, additional_metadata=None), prompt_token_ids: [214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 
6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-12 02:20:40 [logger.py:43] Received request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-3: prompt: '', params: PoolingParams(dimensions=None, use_cross_encoder=False, additional_metadata=None), prompt_token_ids: [6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 
7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 2], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-12 02:20:40 [engine.py:317] Added request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-0.
INFO 07-12 02:20:40 [engine.py:317] Added request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-1.
INFO 07-12 02:20:40 [engine.py:317] Added request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-2.
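The log above shows each chunk submitted as its own request, with the original request ID extended by a `-chunk-N` suffix. A small sketch of that naming scheme follows; the helper names are hypothetical and the format is inferred only from the log lines, not from the PR's code.

```python
from typing import List, Tuple


def chunk_request_ids(base_request_id: str, num_chunks: int) -> List[str]:
    """Build per-chunk request IDs in the '<base>-chunk-<index>' form seen in the log."""
    return [f"{base_request_id}-chunk-{i}" for i in range(num_chunks)]


def parse_chunk_request_id(request_id: str) -> Tuple[str, int]:
    """Recover the base request ID and chunk index from a per-chunk ID."""
    base, _, index = request_id.rpartition("-chunk-")
    if not base or not index.isdigit():
        raise ValueError(f"not a chunk request ID: {request_id!r}")
    return base, int(index)


ids = chunk_request_ids("embd-b02f362e260a4e218c570cc6ab1fb346", 4)
print(ids[0])
print(parse_chunk_request_id(ids[-1]))
```

Being able to map a chunk ID back to its base ID and index is what lets the serving layer group the chunk results and aggregate them in order.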

client

# python ./examples/online_serving/openai_embedding_long_text_client.py
🚀 vLLM Long Text Embedding Client
📡 Connecting to: http://localhost:31090/v1
🤖 Model: multilingual-e5-large
🔑 API Key: ********-key
🧪 Testing vLLM Long Text Embedding with Chunked Processing
======================================================================

📝 Test 1: Short Text
Text length: 42 characters
✅ Success!
   - Embedding dimension: 1024
   - Processing time: 0.54s
   - Expected chunks: ~1
   - First 5 values: [0.01232257578521967, 0.009728744626045227, -0.014059314504265785, -0.03867439180612564, 0.037110574543476105]

📝 Test 2: Medium Text
Text length: 3200 characters
✅ Success!
   - Embedding dimension: 1024
   - Processing time: 0.04s
   - Expected chunks: ~1
   - First 5 values: [0.04108031839132309, -0.009568133391439915, -0.028527623042464256, -0.04032902047038078, 0.020682798698544502]

📝 Test 3: Long Text (2 chunks)
Text length: 27250 characters
✅ Success!
   - Embedding dimension: 1024
   - Processing time: 0.07s
   - Expected chunks: ~2
   - First 5 values: [0.04508449137210846, -0.017967931926250458, -0.014230169355869293, -0.03835897892713547, 0.003280746517702937]

📝 Test 4: Very Long Text (3+ chunks)
Text length: 88000 characters
✅ Success!
   - Embedding dimension: 1024
   - Processing time: 0.16s
   - Expected chunks: ~3
   - First 5 values: [0.03270554542541504, 0.0007968051359057426, -0.016265524551272392, -0.03590775281190872, -0.009043066762387753]

🔄 Testing Batch Embedding with Mixed Lengths
==================================================
✅ Batch processing successful!
   - Number of inputs: 4
   - Number of embeddings: 4
   - Total processing time: 0.08s
   - Average time per input: 0.02s
   - Input 1: 12 chars → 1024D embedding
   - Input 2: 860 chars → 1024D embedding
   - Input 3: 18 chars → 1024D embedding
   - Input 4: 20000 chars → 1024D embedding

🔍 Testing Embedding Consistency
========================================
   - Generated embedding 1
   - Generated embedding 2
   - Generated embedding 3
✅ Consistency test completed!
   - Cosine similarity between runs: 1.000000
   - Expected: ~1.0 (identical embeddings)
   - ✅ High consistency achieved!

======================================================================
🎉 All tests completed!

💡 Key Features Demonstrated:
   - ✅ Automatic chunked processing for long text
   - ✅ Seamless handling of mixed-length batches
   - ✅ Consistent embedding generation
   - ✅ Backward compatibility with short text

📚 For more information, see:
   - Documentation: https://docs.vllm.ai/en/latest/models/pooling_models.html
   - Chunked Processing Guide: openai_embedding_long_text.md
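The consistency check in the client output above reports a cosine similarity of 1.000000 between repeated runs. The client script itself is not reproduced here, so the following is only a sketch of how such a comparison is typically computed; `cosine_similarity` is a hypothetical helper, and the sample vector reuses the first five values printed for Test 1 rather than a full 1024-dimensional embedding.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# First five values from Test 1 above, used purely as sample data.
run_1 = [0.01232257578521967, 0.009728744626045227, -0.014059314504265785,
         -0.03867439180612564, 0.037110574543476105]
run_2 = list(run_1)  # an identical second run

similarity = cosine_similarity(run_1, run_2)
print(f"{similarity:.6f}")
```

Identical embeddings give a similarity of 1 up to floating-point rounding, so a value near 1.0 across repeated runs indicates that chunking and aggregation are deterministic for the same input.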

(Optional) Documentation Update

github-actions · 2025-07-11T18:58:14Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Summary of Changes

Hello @x22x22, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to vLLM's embedding capabilities by adding support for chunked processing of long text inputs. This feature directly addresses issues like CUDA crashes and memory constraints when processing text longer than a model's max_model_len, allowing vLLM to generate embeddings for arbitrarily long documents without requiring changes to the underlying models. The solution is implemented at the serving layer, providing a configurable and robust mechanism for handling large inputs while maintaining embedding quality through smart aggregation.

Highlights

New Feature: Chunked Processing for Embedding Models: Introduced automatic chunked processing at the serving layer for embedding models. This enables vLLM to handle text inputs that exceed the model's maximum context length by splitting them into manageable chunks, processing each independently, and aggregating the results.
Configuration and Activation: The chunked processing feature is configurable via the PoolerConfig by setting enable_chunked_processing: true. It automatically detects when input exceeds max_model_len and triggers the chunking logic.
Intelligent Aggregation: Implemented a FastChat-style weighted averaging algorithm to combine embeddings from multiple chunks. This method uses token counts as weights, ensuring that longer chunks contribute proportionally more to the final aggregated embedding, preserving semantic quality.
Backward Compatibility and Model Support: The implementation maintains backward compatibility for short text inputs and requires zero modifications to existing model code. Initially, intfloat/multilingual-e5-large is explicitly supported, with an extensible architecture for other embedding models.
Documentation and Examples: Added comprehensive documentation detailing the feature, its configuration, how it works, performance characteristics, and limitations. New example scripts (server and client) are provided to demonstrate how to configure and utilize chunking processing for long text embeddings.
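The token-count-weighted averaging mentioned in the highlights can be sketched as follows. This is a minimal pure-Python version under the assumption that each chunk yields one embedding vector and its token count is known; the name `aggregate_chunk_embeddings` is hypothetical, and the sketch does not re-normalize the result.

```python
from typing import List, Sequence


def aggregate_chunk_embeddings(
    chunk_embeddings: Sequence[Sequence[float]],
    chunk_token_counts: Sequence[int],
) -> List[float]:
    """Token-count-weighted average of per-chunk embeddings."""
    if not chunk_embeddings or len(chunk_embeddings) != len(chunk_token_counts):
        raise ValueError("need one token count per chunk embedding")
    total_weight = sum(chunk_token_counts)
    if total_weight <= 0:
        raise ValueError("total token count must be positive")
    dim = len(chunk_embeddings[0])
    weighted_sum = [0.0] * dim
    for embedding, weight in zip(chunk_embeddings, chunk_token_counts):
        for i, value in enumerate(embedding):
            weighted_sum[i] += weight * value
    # Same shape as the formula quoted later in the thread:
    # aggregated_embedding = weighted_sum / total_weight
    return [component / total_weight for component in weighted_sum]


if __name__ == "__main__":
    # Two chunks of 300 and 100 tokens: the longer chunk dominates.
    print(aggregate_chunk_embeddings([[1.0, 0.0], [0.0, 1.0]], [300, 100]))
    # -> [0.75, 0.25]
```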

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a valuable feature for handling long text embeddings by implementing chunked processing. The code is generally well-written, and the documentation and examples are thorough. I've identified a critical bug in the batch processing logic that could lead to incorrect results when multiple long prompts are sent in a single request. I've also provided several suggestions to improve code quality, maintainability, and performance. Once the critical issue is addressed, this will be a great addition to the project.

vllm/entrypoints/openai/serving_embedding.py

examples/online_serving/openai_embedding_long_text.md

examples/online_serving/openai_embedding_long_text_client.py

examples/online_serving/openai_embedding_long_text_service.sh

vllm/entrypoints/openai/serving_embedding.py

DarkLight1337 · 2025-07-12T04:18:38Z

cc @maxdebayser @22quinn @noooop

noooop · 2025-07-12T04:56:46Z

In fact, embedding models are not very suitable for handling extremely long inputs, as too much content can lead to embeddings that are not able to effectively distinguish between similar content.

Here's a simple way to confirm that automatic chunked processing is working effectively:

Reference mteb_test_embed_models in vllm/tests/models/language/pooling/mteb_utils.py and https://github.com/noooop/snippet/blob/main/benchmarks/test_mteb/test_speed.py

Keeping only the very front part of a long context, such as the first 2048 or even 512 tokens, is an extremely strong baseline.
Refer to LongEmbed: Extending Embedding Models for Long Context Retrieval:

However, it still suffers from biased distribution of key information, as demonstrated in Figure 2. With only 512 context length, E5Base achieves >85% nDCG scores on 3 out of 8 publicly available LoCo tasks.

Run the following three comparative experiments:

max_model_len = 2048
max_model_len = 8102
max_model_len = 2048 + automatic chunked processing

If automatic chunked processing with multilingual-e5-large achieves comparable results on the mteb/T2Reranking dataset (or any test with contexts exceeding 8K), that indicates automatic chunked processing is effective.

x22x22 · 2025-07-12T06:34:28Z

@noooop I've manually tested using text chunks exceeding 1,000 tokens in vector databases, and confirmed that short user queries or task descriptions (~100 tokens) can successfully retrieve relevant text fragments.

While this verification isn't scientifically rigorous, it demonstrates a viable practical solution. I'll allocate time later to run the benchmark tests you recommended - appreciate the suggestion.

noooop · 2025-07-13T07:40:08Z

@x22x22

After some investigation, intfloat/multilingual-e5-large uses the classic BERT architecture with a context length of 512, which appears very weak in 2025. Please perform a comparative test using jina-embeddings-v3, which has a maximum context length of 8192 and uses mean pooling.

Unless you use VLLM_ALLOW_LONG_MAX_MODEL_LEN or similar, you should not be allowed to set the context length of intfloat/multilingual-e5-large beyond 512, as it will exceed position_embeddings and cause an out-of-bounds error. This is not a bug. Please tone down or remove the content related to CUDA crashes.
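To make the out-of-bounds point concrete, here is a toy sketch of an absolute position-embedding lookup. It assumes a table with exactly 512 rows, matching the context length quoted above; `position_table` and `lookup_positions` are illustrative names, not vLLM or model code. On a GPU the same out-of-range lookup may surface as a device-side error rather than a Python IndexError.

```python
MAX_POSITION_EMBEDDINGS = 512  # context length quoted in the comment above

# Stand-in for a learned position-embedding table: one row per position.
position_table = [[float(pos)] for pos in range(MAX_POSITION_EMBEDDINGS)]


def lookup_positions(num_tokens: int):
    """Fetch one position embedding per token, as an absolute-position model would."""
    return [position_table[pos] for pos in range(num_tokens)]


if __name__ == "__main__":
    print(len(lookup_positions(512)))  # 512 tokens fit the table
    try:
        lookup_positions(513)          # position 512 has no row
    except IndexError as exc:
        print("out of bounds:", exc)
```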

x22x22 · 2025-07-13T08:23:09Z

@noooop
This enhancement specifically leverages VLLM_ALLOW_LONG_MAX_MODEL_LEN, and you can see the corresponding launch code in my test script here:
https://github.com/vllm-project/vllm/blob/da812672715ac5bb09a4e5e4acb1d6d2d59feca7/examples/online_serving/openai_embedding_long_text_service.sh

The purpose is to enable models like multilingual-e5-large to support longer contexts through sharding without modifying the model's original code. The same principle applies to other embedding models - for example, if you want jina-embeddings-v3 to support beyond its native 8192 context length, simply adjusting the MAX_MODEL_LEN parameter would achieve this.

While this approach may not deliver optimal embedding performance, it provides a practical low-cost solution for RAG scenarios requiring simultaneous processing of both short and long texts. Crucially, no performance penalty occurs when input stays within a model's native context limit (e.g. ≤512 for E5, ≤8192 for Jina), as no special chunking gets triggered.

Would you be open to continuing this discussion more efficiently via https://slack.vllm.ai? I've requested access to the Slack workspace but haven't received approval yet - perhaps we could connect there once I'm onboarded.

noooop · 2025-07-13T08:34:08Z

I looked through the code carefully.

You can add a new parameter such as max_embed_len, but do not modify any code related to max_model_len; that will cause a huge number of bugs.

And do not use VLLM_ALLOW_LONG_MAX_MODEL_LEN.

I think we should remove VLLM_ALLOW_LONG_MAX_MODEL_LEN. I can’t think of any use case that would require this flag.

x22x22 · 2025-07-13T08:46:33Z

@noooop
Understood - I'll modify the code tomorrow following your guidance. Instead of the VLLM_ALLOW_LONG_MAX_MODEL_LEN approach, I'll implement a dedicated max_embed_len parameter to handle extended context lengths for embedding models. This will avoid any interference with the core max_model_len logic and prevent potential side effects.

Regarding communication, would you be open to continuing this discussion through a more efficient channel? I'd appreciate if we could connect either via:

https://slack.vllm.ai (I'm still awaiting access approval), or
WeChat if that's more convenient

Would either of these options work better for real-time collaboration? Thank you for your guidance on this implementation!

noooop · 2025-07-13T08:51:02Z

I’m extremely socially anxious.

x22x22 · 2025-07-13T09:39:44Z

@noooop

I completely understand, I also have social anxiety. This way of communicating is pretty good too 😄

I'll modify the code according to your suggestions, expecting to have it done by tomorrow~ If there's anything else I need to pay attention to, please feel free to communicate anytime, thank you!

x22x22 · 2025-07-13T15:41:15Z

@noooop
I've addressed both concerns:

I've removed the dependency on VLLM_ALLOW_LONG_MAX_MODEL_LEN
Instead of modifying max_model_len, we now configure max_embed_len through the --override-pooler-config parameter as follows:

{
  "pooling_type": "CLS",
  "normalize": true,
  "enable_chunked_processing": true,
  "max_embed_len": 10240
}
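One way to picture how the two limits might interact is sketched below. The JSON is copied from the comment above; the routing function `plan_request`, its return labels, and the treatment of a missing `max_embed_len` as "no chunking" are assumptions for illustration, not the PR's confirmed behavior.

```python
import json

# Pooler config exactly as posted in the comment above.
POOLER_CONFIG_JSON = """
{
  "pooling_type": "CLS",
  "normalize": true,
  "enable_chunked_processing": true,
  "max_embed_len": 10240
}
"""


def plan_request(num_tokens: int, max_model_len: int, pooler_config: dict) -> str:
    """Return 'direct', 'chunked', or 'reject' for an input of num_tokens tokens."""
    if num_tokens <= max_model_len:
        return "direct"  # short input: unchanged code path
    chunking_enabled = pooler_config.get("enable_chunked_processing", False)
    max_embed_len = pooler_config.get("max_embed_len")
    # Assumption: chunking only applies when explicitly enabled and bounded.
    if chunking_enabled and max_embed_len is not None and num_tokens <= max_embed_len:
        return "chunked"
    return "reject"


if __name__ == "__main__":
    config = json.loads(POOLER_CONFIG_JSON)
    for n in (300, 5000, 20000):
        print(n, plan_request(n, max_model_len=512, pooler_config=config))
```

Keeping max_model_len untouched and bounding long inputs with a separate max_embed_len is the separation requested earlier in the thread.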

maxdebayser · 2025-07-14T14:35:51Z

@x22x22, please correct me if I'm wrong, but it seems that the aggregation is based on the assumption that taking the mean of the chunk embeddings would be correct:

aggregated_embedding = weighted_sum / total_weight

For this to work, two requirements must hold:

That the pooling type is MEAN (and not CLS, LAST or others)
That the model uses a causal attention mask so that the tokens can only attend to previous tokens.

However, the BERT-type models don't satisfy the second requirement.
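The arithmetic behind the first requirement can be checked with a small sketch. It holds the per-token vectors fixed, which is exactly the idealization the second requirement is about: in a real model each token's hidden state depends on what it can attend to, so chunking generally changes those vectors. All names here are illustrative.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """MEAN pooling: average the token vectors component-wise."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cls_pool(token_vectors: List[List[float]]) -> List[float]:
    """CLS-style pooling: take the first token's vector."""
    return list(token_vectors[0])


def weighted_chunk_average(token_vectors, chunk_len, pool) -> List[float]:
    """Pool each chunk, then average the results weighted by chunk length."""
    total = len(token_vectors)
    result = [0.0] * len(token_vectors[0])
    for start in range(0, total, chunk_len):
        chunk = token_vectors[start:start + chunk_len]
        pooled = pool(chunk)
        for d, value in enumerate(pooled):
            result[d] += len(chunk) * value
    return [value / total for value in result]


if __name__ == "__main__":
    # Ten fixed "hidden states" of dimension 2.
    tokens = [[float(i), float(i * i)] for i in range(10)]
    print(mean_pool(tokens))                             # [4.5, 28.5]
    print(weighted_chunk_average(tokens, 4, mean_pool))  # [4.5, 28.5]
    # With CLS pooling the chunked result no longer matches the full input.
    print(cls_pool(tokens) == weighted_chunk_average(tokens, 4, cls_pool))  # False
```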

As @noooop mentioned, there are newer models that are decoder models. For these models, if the pooling type is LAST, we already support chunked prefill.

Regarding "I think we should remove VLLM_ALLOW_LONG_MAX_MODEL_LEN. I can't think of any use case that would require this flag.":

@noooop, this variable is useful for testing or for bypassing restrictions of misconfigured models.

vllm/entrypoints/openai/serving_embedding.py

x22x22 · 2025-07-14T16:05:48Z

@maxdebayser

Thank you for your feedback - you're absolutely right! I've updated the implementation to use MEAN pooling and made the following improvements:

Correct pooling configuration: multilingual-e5-large now uses MEAN pooling by default

Automatic detection: Added support for automatic configuration of various popular models

Manual specification: Users can manually specify the pooling type with helpful prompts and guidance

Safety warnings: Users are now alerted about potential impacts when using non-MEAN pooling methods

Flexible configuration: All parameters can be customized through environment variables

This ensures users can safely utilize the chunked processing functionality without worrying about pooling type mismatches!

For usage reference, please check out the new example startup script:
https://github.com/vllm-project/vllm/blob/a5432ac40c23dcbeba8ce3bb6af4084591dd0f47/examples/online_serving/openai_embedding_long_text_service.sh

Your point about BERT-type models not satisfying the causal attention requirement is particularly important. The weighted aggregation approach works best with models that use mean pooling and have the appropriate attention patterns. The automatic detection and safety warnings should help users avoid potential issues with incompatible model architectures.

…ized, with the use of mean aggregation enforced and support for other aggregation types removed. Relevant log information has been updated to reflect the new processing approach. Signed-off-by: x22x22 <[email protected]>

…n aggregation is uniformly adopted, and support for other aggregation types has been removed. Relevant documents and configurations have been updated to reflect the new processing approach. Configuration options that are no longer in use have been removed to ensure the code's cleanliness. Signed-off-by: x22x22 <[email protected]>

…wline Files should end with a single newline character 117 Error: docs/models/supported_models.md:777:265 MD047/single-trailing-newline Files should end with a single newline character 118 Error: examples/online_serving/openai_embedding_long_text.md:96 MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "- **Academic papers**: Full re..."] 119 Error: examples/online_serving/openai_embedding_long_text.md:130 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"] 120 Error: examples/online_serving/openai_embedding_long_text.md:138 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"] 121 Error: examples/online_serving/openai_embedding_long_text.md:146 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"] 122 Error: examples/online_serving/openai_embedding_long_text.md:159 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"] 123 Signed-off-by: x22x22 <[email protected]>

…vice scripts, incorporating chunk processing support. The README documentation has been revised to include a quick start guide and comprehensive configuration instructions. Server startup scripts have been enhanced with automatic detection of optimal pooling types, significantly improving performance and compatibility for long-text processing. Signed-off-by: x22x22 <[email protected]>

mergify · 2025-08-05T21:42:07Z

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @x22x22.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

x22x22 requested review from hmellor, simon-mo, WoosukKwon, youkaichao, robertgshaw2-redhat, mgoin, tlrmchlsmth, houseroad and aarnphm as code owners July 11, 2025 18:58

gemini-code-assist bot reviewed Jul 11, 2025

mergify bot added documentation Improvements or additions to documentation frontend labels Jul 11, 2025

gemini-code-assist bot reviewed Jul 11, 2025

x22x22 force-pushed the feat/support-long-text-embedding branch from b5f245d to 5398bbd Compare July 11, 2025 19:01

x22x22 changed the title ~~[Core] Add chunked processing to handle long inputs in embedding models~~ [Frontend] Add chunked processing to handle long inputs in embedding models Jul 11, 2025

noooop mentioned this pull request Jul 14, 2025

[Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN #20904

Merged

4 tasks

maxdebayser reviewed Jul 14, 2025

vllm/entrypoints/openai/serving_embedding.py

njhill and others added 21 commits August 6, 2025 05:33

[Misc] Support more collective_rpc return types (vllm-project#21845)

45447ab

Signed-off-by: Nick Hill <[email protected]> Signed-off-by: x22x22 <[email protected]>

For VLLM_USE_PRECOMPILED, only compiled .so files should be extracted (…

e8e693a

…vllm-project#21964) Signed-off-by: x22x22 <[email protected]>

[Misc] Use dracut on CentOS and skip clone if repo exists for EP kern…

ace708f

…el installation (vllm-project#21635) Signed-off-by: Ming Yang <[email protected]> Signed-off-by: x22x22 <[email protected]>

[Feature] Add async tensor parallelism for scaled mm (vllm-project#20155

d8a2eae

) Signed-off-by: cascade812 <[email protected]> Signed-off-by: x22x22 <[email protected]>

[Bugfix] Fix None value handling in trace span creation for cancelled…

181202f

… requests (vllm-project#20272) Signed-off-by: x22x22 <[email protected]>

[Core] Move EngineCoreRequest to Request conversion out of EngineCore (…

3b91b17

…vllm-project#21627) Signed-off-by: linzebing <[email protected]> Signed-off-by: x22x22 <[email protected]>

[Example] Add async_llm_streaming.py example for AsyncLLM streaming…

f261288

… in python (vllm-project#21763) Signed-off-by: mgoin <[email protected]> Signed-off-by: x22x22 <[email protected]>

[Bugfix] Relax lang pin for voxtral (vllm-project#21833)

1843059

Signed-off-by: Sanchit Gandhi <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: x22x22 <[email protected]>

[UX] Rename CUTLASS_MLA_VLLM_V1 to CUTLASS_MLA (vllm-project#21966)

474f25d

Signed-off-by: mgoin <[email protected]> Signed-off-by: x22x22 <[email protected]>

[Misc] Expand SUPPORTED_HIDDEN_SIZES for DeepEP low-latency kernels (v…

07be3b9

…llm-project#21818) Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: x22x22 <[email protected]>

[CI Bugfix] Fix CI OOM for test_shared_storage_connector_hashes (vl…

52f1a7e

…lm-project#21973) Signed-off-by: mgoin <[email protected]> Signed-off-by: x22x22 <[email protected]>

[Bugfix]: fix metadata file copy in test_sharded_state_loader (vllm-p…

3eee204

…roject#21830) Signed-off-by: Andy Xie <[email protected]> Signed-off-by: x22x22 <[email protected]>

[Deprecation] Remove deprecated args and methods (vllm-project#21907)

6967d0e

Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: x22x22 <[email protected]>

[CI/Build] get rid of unused VLLM_FA_CMAKE_GPU_ARCHES (vllm-project#2…

f97ff9b

…1599) Signed-off-by: Daniele Trifirò <[email protected]> Signed-off-by: x22x22 <[email protected]>

merge mian to feat/support-long-text-embedding

b28069d

Signed-off-by: x22x22 <[email protected]>

The files diff_config.py and diff_serving_embedding.py have been …

a0955f8

…deleted, and the code and configurations that are no longer in use have been cleaned up. Signed-off-by: x22x22 <[email protected]>

The files diff_config.py and diff_serving_embedding.py have been …

5478169

…deleted, and the code and configurations that are no longer in use have been cleaned up. Signed-off-by: x22x22 <[email protected]>

x22x22 force-pushed the feat/support-long-text-embedding branch from ca252d7 to b1562c5 Compare August 5, 2025 21:33

mergify bot added the needs-rebase label Aug 5, 2025

x22x22 force-pushed the feat/support-long-text-embedding branch from e10ef93 to a835f52 Compare August 5, 2025 21:47

x22x22 closed this Aug 5, 2025

x22x22 deleted the feat/support-long-text-embedding branch August 5, 2025 21:53

github-project-automation bot moved this to Done in Tool Calling Aug 5, 2025

github-project-automation bot moved this to Done in Structured Output Aug 5, 2025

x22x22 mentioned this pull request Aug 5, 2025

[Frontend] Add chunked processing to handle long inputs in embedding models #22280

Merged
