[Bug]: v0.10.2 no longer supports Qwen/Qwen3-Embedding-0.6B #24827

@jhsmith409

Description

Your current environment

v0.10.2 works for other models.
Qwen/Qwen3-Embedding-0.6B worked under v0.10.1.1.

docker compose up
[+] Running 2/2
✔ Network vllm-embed_default Created 0.1s
✔ Container vllm-embed-vllm-1 Created 0.0s
Attaching to vllm-1
vllm-1 | /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
vllm-1 | import pynvml # type: ignore[import]
vllm-1 | INFO 09-14 02:33:54 [__init__.py:216] Automatically detected platform cuda.
vllm-1 | WARNING 09-14 02:33:57 [__init__.py:1766] argument 'task' is deprecated
vllm-1 | (APIServer pid=1) INFO 09-14 02:33:57 [api_server.py:1896] vLLM API server version 0.10.2
vllm-1 | (APIServer pid=1) INFO 09-14 02:33:57 [utils.py:328] non-default args: {'host': '0.0.0.0', 'port': 8002, 'model': 'Qwen/Qwen3-Embedding-0.6B', 'task': 'embed', 'max_model_len': 8192}
vllm-1 | (APIServer pid=1) INFO 09-14 02:34:08 [__init__.py:742] Resolved architecture: Qwen3ForCausalLM
vllm-1 | (APIServer pid=1) INFO 09-14 02:34:08 [config.py:708] Found sentence-transformers modules configuration.
vllm-1 | (APIServer pid=1) INFO 09-14 02:34:08 [config.py:728] Found pooling configuration.
vllm-1 | (APIServer pid=1) torch_dtype is deprecated! Use dtype instead!
vllm-1 | (APIServer pid=1) INFO 09-14 02:34:08 [__init__.py:1815] Using max model len 8192
vllm-1 | (APIServer pid=1) INFO 09-14 02:34:09 [api_server.py:296] Started engine process with PID 172
vllm-1 | /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
vllm-1 | import pynvml # type: ignore[import]
vllm-1 | INFO 09-14 02:34:14 [__init__.py:216] Automatically detected platform cuda.
vllm-1 | INFO 09-14 02:34:16 [llm_engine.py:221] Initializing a V0 LLM engine (v0.10.2) with config: model='Qwen/Qwen3-Embedding-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-Embedding-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen/Qwen3-Embedding-0.6B, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, activation=None, logit_bias=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True,
vllm-1 | INFO 09-14 02:34:18 [cuda.py:456] Using Flash Attention backend.
vllm-1 | [W914 02:34:19.043011671 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | INFO 09-14 02:34:19 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
vllm-1 | INFO 09-14 02:34:19 [model_runner.py:1051] Starting to load model Qwen/Qwen3-Embedding-0.6B...
vllm-1 | INFO 09-14 02:34:19 [weight_utils.py:348] Using model weights format ['*.safetensors']
vllm-1 | INFO 09-14 02:34:19 [weight_utils.py:406] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.95it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.94it/s]
vllm-1 |
vllm-1 | INFO 09-14 02:34:20 [default_loader.py:268] Loading weights took 0.28 seconds
vllm-1 | INFO 09-14 02:34:20 [model_runner.py:1083] Model loading took 1.1177 GiB and 0.584041 seconds
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] 'Qwen3ForEmbedding' object has no attribute 'logits_processor'
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] Traceback (most recent call last):
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 456, in run_mp_engine
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] engine = MQLLMEngine.from_vllm_config(
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/init.py", line 1589, in inner
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return fn(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 144, in from_vllm_config
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return cls(
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 88, in init
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self.engine = LLMEngine(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 262, in init
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self._initialize_kv_caches()
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 405, in _initialize_kv_caches
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self.model_executor.determine_num_available_blocks())
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 105, in determine_num_available_blocks
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] results = self.collective_rpc("determine_num_available_blocks")
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] answer = run_method(self.driver_worker, method, args, kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/init.py", line 3060, in run_method
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return func(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return func(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 312, in determine_num_available_blocks
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] available_kv_cache_memory = self.determine_available_kv_cache_memory(
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return func(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 261, in determine_available_kv_cache_memory
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self.model_runner.profile_run()
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return func(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1175, in profile_run
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self._dummy_run(max_num_batched_tokens, max_num_seqs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1301, in _dummy_run
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self.execute_model(model_input, kv_caches, intermediate_tensors)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return func(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1723, in execute_model
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] logits = self.model.compute_logits(hidden_or_intermediate_states,
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3.py", line 333, in compute_logits
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] logits = self.logits_processor(self.lm_head, hidden_states,
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1962, in getattr
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] raise AttributeError(
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] AttributeError: 'Qwen3ForEmbedding' object has no attribute 'logits_processor'
vllm-1 | Process SpawnProcess-1:
vllm-1 | Traceback (most recent call last):
vllm-1 | File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
vllm-1 | self.run()
vllm-1 | File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
vllm-1 | self._target(*self._args, **self._kwargs)
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 470, in run_mp_engine
vllm-1 | raise e from None
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 456, in run_mp_engine
vllm-1 | engine = MQLLMEngine.from_vllm_config(
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/utils/init.py", line 1589, in inner
vllm-1 | return fn(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 144, in from_vllm_config
vllm-1 | return cls(
vllm-1 | ^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 88, in init
vllm-1 | self.engine = LLMEngine(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 262, in init
vllm-1 | self._initialize_kv_caches()
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 405, in _initialize_kv_caches
vllm-1 | self.model_executor.determine_num_available_blocks())
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 105, in determine_num_available_blocks
vllm-1 | results = self.collective_rpc("determine_num_available_blocks")
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
vllm-1 | answer = run_method(self.driver_worker, method, args, kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/utils/init.py", line 3060, in run_method
vllm-1 | return func(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | return func(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 312, in determine_num_available_blocks
vllm-1 | available_kv_cache_memory = self.determine_available_kv_cache_memory(
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | return func(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 261, in determine_available_kv_cache_memory
vllm-1 | self.model_runner.profile_run()
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | return func(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1175, in profile_run
vllm-1 | self._dummy_run(max_num_batched_tokens, max_num_seqs)
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1301, in _dummy_run
vllm-1 | self.execute_model(model_input, kv_caches, intermediate_tensors)
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | return func(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1723, in execute_model
vllm-1 | logits = self.model.compute_logits(hidden_or_intermediate_states,
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3.py", line 333, in compute_logits
vllm-1 | logits = self.logits_processor(self.lm_head, hidden_states,
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1962, in getattr
vllm-1 | raise AttributeError(
vllm-1 | AttributeError: 'Qwen3ForEmbedding' object has no attribute 'logits_processor'
vllm-1 | [rank0]:[W914 02:34:39.035746729 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
vllm-1 | (APIServer pid=1) Traceback (most recent call last):
vllm-1 | (APIServer pid=1) File "", line 198, in _run_module_as_main
vllm-1 | (APIServer pid=1) File "", line 88, in _run_code
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 2011, in
vllm-1 | (APIServer pid=1) uvloop.run(run_server(args))
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 109, in run
vllm-1 | (APIServer pid=1) return __asyncio.run(
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
vllm-1 | (APIServer pid=1) return runner.run(main)
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
vllm-1 | (APIServer pid=1) return self._loop.run_until_complete(task)
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 61, in wrapper
vllm-1 | (APIServer pid=1) return await main
vllm-1 | (APIServer pid=1) ^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
vllm-1 | (APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
vllm-1 | (APIServer pid=1) async with build_async_engine_client(
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
vllm-1 | (APIServer pid=1) return await anext(self.gen)
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
vllm-1 | (APIServer pid=1) async with build_async_engine_client_from_engine_args(
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
vllm-1 | (APIServer pid=1) return await anext(self.gen)
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 319, in build_async_engine_client_from_engine_args
vllm-1 | (APIServer pid=1) raise RuntimeError(
vllm-1 | (APIServer pid=1) RuntimeError: Engine process failed to start. See stack trace for the root cause.
vllm-1 exited with code 1
Gracefully Stopping... press Ctrl+C again to force
Container vllm-embed-vllm-1 Stopping
Container vllm-embed-vllm-1 Stopped
jhsmith@dell5810:/vllm-embed$ docker compose down
[+] Running 2/2
✔ Container vllm-embed-vllm-1 Removed 0.1s
✔ Network vllm-embed_default Removed 0.2s
jhsmith@dell5810:/vllm-embed$ cat docker-compose.yml
services:
  vllm:
    build:
      context: . # Specifies the directory containing the Dockerfile (current directory)
    image: vllm/vllm-openai:v0.10.2
    restart: unless-stopped
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_GPU_MEMORY_UTILIZATION=0.85
      - VLLM_FLASH_ATTN_VERSION=2
      - VLLM_USE_V1=0
    ports:
      - "8002:8002"
    ipc: host
    command:
      - --model
      - Qwen/Qwen3-Embedding-0.6B
      - --tensor-parallel-size
      - "1"
      - --max-model-len
      - "8192"
      - --host
      - 0.0.0.0
      - --port
      - "8002"
      - --task
      - embed

networks:
  llm-network:
    external: true
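
For completeness, this is how the endpoint gets exercised once the container is up (it served embeddings this way under v0.10.1.1). A minimal client sketch, assuming the requests package is installed; the input text and dimension check are illustrative:

# Minimal client for the OpenAI-compatible /v1/embeddings endpoint that the
# compose file exposes on port 8002. Assumes `pip install requests`.
import requests

resp = requests.post(
    "http://localhost:8002/v1/embeddings",
    json={
        "model": "Qwen/Qwen3-Embedding-0.6B",
        "input": ["What is the capital of France?"],  # illustrative input
    },
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # expected 1024 dims (the model's hidden size)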

🐛 Describe the bug

Qwen/Qwen3-Embedding-0.6B worked under v0.10.1.1.
Other models work under v0.10.2, but Qwen/Qwen3-Embedding-0.6B now fails at engine startup with AttributeError: 'Qwen3ForEmbedding' object has no attribute 'logits_processor' (full log above).
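
The failure should also reproduce without Docker. A minimal sketch, assuming vLLM v0.10.2 is installed locally; VLLM_USE_V1=0 mirrors the compose environment and selects the V0 engine, whose profiling dummy run is what calls compute_logits and hits the missing logits_processor in the traceback above:

# Minimal repro sketch, assuming `pip install vllm==0.10.2`.
import os
os.environ["VLLM_USE_V1"] = "0"  # mirrors the compose file: force the V0 engine

from vllm import LLM

# Same arguments as the server invocation above ('task' is deprecated in
# 0.10.x but still accepted, per the warning in the log).
llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed", max_model_len=8192)

# Never reached on v0.10.2: engine init fails during the profiling dummy run
# with AttributeError: 'Qwen3ForEmbedding' object has no attribute 'logits_processor'
out = llm.embed(["hello world"])
print(len(out[0].outputs.embedding))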

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
