Your current environment

v0.10.2 works for other models, and Qwen/Qwen3-Embedding-0.6B worked under v0.10.1.1; under v0.10.2 the same model fails at engine startup.

```text
docker compose up
[+] Running 2/2
✔ Network vllm-embed_default Created 0.1s
✔ Container vllm-embed-vllm-1 Created 0.0s
Attaching to vllm-1
vllm-1 | /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
vllm-1 | import pynvml # type: ignore[import]
vllm-1 | INFO 09-14 02:33:54 [__init__.py:216] Automatically detected platform cuda.
vllm-1 | WARNING 09-14 02:33:57 [__init__.py:1766] argument 'task' is deprecated
vllm-1 | (APIServer pid=1) INFO 09-14 02:33:57 [api_server.py:1896] vLLM API server version 0.10.2
vllm-1 | (APIServer pid=1) INFO 09-14 02:33:57 [utils.py:328] non-default args: {'host': '0.0.0.0', 'port': 8002, 'model': 'Qwen/Qwen3-Embedding-0.6B', 'task': 'embed', 'max_model_len': 8192}
vllm-1 | (APIServer pid=1) INFO 09-14 02:34:08 [__init__.py:742] Resolved architecture: Qwen3ForCausalLM
vllm-1 | (APIServer pid=1) INFO 09-14 02:34:08 [config.py:708] Found sentence-transformers modules configuration.
vllm-1 | (APIServer pid=1) INFO 09-14 02:34:08 [config.py:728] Found pooling configuration.
vllm-1 | (APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
vllm-1 | (APIServer pid=1) INFO 09-14 02:34:08 [__init__.py:1815] Using max model len 8192
vllm-1 | (APIServer pid=1) INFO 09-14 02:34:09 [api_server.py:296] Started engine process with PID 172
vllm-1 | /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
vllm-1 | import pynvml # type: ignore[import]
vllm-1 | INFO 09-14 02:34:14 [__init__.py:216] Automatically detected platform cuda.
vllm-1 | INFO 09-14 02:34:16 [llm_engine.py:221] Initializing a V0 LLM engine (v0.10.2) with config: model='Qwen/Qwen3-Embedding-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-Embedding-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen/Qwen3-Embedding-0.6B, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, activation=None, logit_bias=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True,
vllm-1 | INFO 09-14 02:34:18 [cuda.py:456] Using Flash Attention backend.
vllm-1 | [W914 02:34:19.043011671 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm-1 | INFO 09-14 02:34:19 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
vllm-1 | INFO 09-14 02:34:19 [model_runner.py:1051] Starting to load model Qwen/Qwen3-Embedding-0.6B...
vllm-1 | INFO 09-14 02:34:19 [weight_utils.py:348] Using model weights format ['*.safetensors']
vllm-1 | INFO 09-14 02:34:19 [weight_utils.py:406] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.95it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.94it/s]
vllm-1 |
vllm-1 | INFO 09-14 02:34:20 [default_loader.py:268] Loading weights took 0.28 seconds
vllm-1 | INFO 09-14 02:34:20 [model_runner.py:1083] Model loading took 1.1177 GiB and 0.584041 seconds
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] 'Qwen3ForEmbedding' object has no attribute 'logits_processor'
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] Traceback (most recent call last):
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 456, in run_mp_engine
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] engine = MQLLMEngine.from_vllm_config(
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1589, in inner
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return fn(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 144, in from_vllm_config
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return cls(
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 88, in __init__
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self.engine = LLMEngine(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 262, in __init__
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self._initialize_kv_caches()
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 405, in _initialize_kv_caches
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self.model_executor.determine_num_available_blocks())
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 105, in determine_num_available_blocks
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] results = self.collective_rpc("determine_num_available_blocks")
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] answer = run_method(self.driver_worker, method, args, kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3060, in run_method
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return func(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return func(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 312, in determine_num_available_blocks
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] available_kv_cache_memory = self.determine_available_kv_cache_memory(
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return func(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 261, in determine_available_kv_cache_memory
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self.model_runner.profile_run()
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return func(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1175, in profile_run
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self._dummy_run(max_num_batched_tokens, max_num_seqs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1301, in _dummy_run
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] self.execute_model(model_input, kv_caches, intermediate_tensors)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] return func(*args, **kwargs)
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1723, in execute_model
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] logits = self.model.compute_logits(hidden_or_intermediate_states,
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3.py", line 333, in compute_logits
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] logits = self.logits_processor(self.lm_head, hidden_states,
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1962, in __getattr__
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] raise AttributeError(
vllm-1 | ERROR 09-14 02:34:38 [engine.py:468] AttributeError: 'Qwen3ForEmbedding' object has no attribute 'logits_processor'
vllm-1 | Process SpawnProcess-1:
vllm-1 | Traceback (most recent call last):
vllm-1 | File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
vllm-1 | self.run()
vllm-1 | File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
vllm-1 | self._target(*self._args, **self._kwargs)
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 470, in run_mp_engine
vllm-1 | raise e from None
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 456, in run_mp_engine
vllm-1 | engine = MQLLMEngine.from_vllm_config(
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1589, in inner
vllm-1 | return fn(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 144, in from_vllm_config
vllm-1 | return cls(
vllm-1 | ^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 88, in __init__
vllm-1 | self.engine = LLMEngine(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 262, in __init__
vllm-1 | self._initialize_kv_caches()
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 405, in _initialize_kv_caches
vllm-1 | self.model_executor.determine_num_available_blocks())
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 105, in determine_num_available_blocks
vllm-1 | results = self.collective_rpc("determine_num_available_blocks")
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
vllm-1 | answer = run_method(self.driver_worker, method, args, kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3060, in run_method
vllm-1 | return func(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | return func(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 312, in determine_num_available_blocks
vllm-1 | available_kv_cache_memory = self.determine_available_kv_cache_memory(
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | return func(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 261, in determine_available_kv_cache_memory
vllm-1 | self.model_runner.profile_run()
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | return func(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1175, in profile_run
vllm-1 | self._dummy_run(max_num_batched_tokens, max_num_seqs)
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1301, in _dummy_run
vllm-1 | self.execute_model(model_input, kv_caches, intermediate_tensors)
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
vllm-1 | return func(*args, **kwargs)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1723, in execute_model
vllm-1 | logits = self.model.compute_logits(hidden_or_intermediate_states,
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3.py", line 333, in compute_logits
vllm-1 | logits = self.logits_processor(self.lm_head, hidden_states,
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1962, in __getattr__
vllm-1 | raise AttributeError(
vllm-1 | AttributeError: 'Qwen3ForEmbedding' object has no attribute 'logits_processor'
vllm-1 | [rank0]:[W914 02:34:39.035746729 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
vllm-1 | (APIServer pid=1) Traceback (most recent call last):
vllm-1 | (APIServer pid=1) File "<frozen runpy>", line 198, in _run_module_as_main
vllm-1 | (APIServer pid=1) File "<frozen runpy>", line 88, in _run_code
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 2011, in <module>
vllm-1 | (APIServer pid=1) uvloop.run(run_server(args))
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
vllm-1 | (APIServer pid=1) return __asyncio.run(
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
vllm-1 | (APIServer pid=1) return runner.run(main)
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
vllm-1 | (APIServer pid=1) return self._loop.run_until_complete(task)
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
vllm-1 | (APIServer pid=1) return await main
vllm-1 | (APIServer pid=1) ^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
vllm-1 | (APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
vllm-1 | (APIServer pid=1) async with build_async_engine_client(
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
vllm-1 | (APIServer pid=1) return await anext(self.gen)
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
vllm-1 | (APIServer pid=1) async with build_async_engine_client_from_engine_args(
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
vllm-1 | (APIServer pid=1) return await anext(self.gen)
vllm-1 | (APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 319, in build_async_engine_client_from_engine_args
vllm-1 | (APIServer pid=1) raise RuntimeError(
vllm-1 | (APIServer pid=1) RuntimeError: Engine process failed to start. See stack trace for the root cause.
vllm-1 exited with code 1
Gracefully Stopping... press Ctrl+C again to force
Container vllm-embed-vllm-1 Stopping
Container vllm-embed-vllm-1 Stopped
jhsmith@dell5810:/vllm-embed$ docker compose down
[+] Running 2/2
 ✔ Container vllm-embed-vllm-1 Removed 0.1s
 ✔ Network vllm-embed_default Removed 0.2s
jhsmith@dell5810:/vllm-embed$ cat docker-compose.yml
```

```yaml
services:
  vllm:
    build:
      context: . # Specifies the directory containing the Dockerfile (current directory)
    image: vllm/vllm-openai:v0.10.2
    restart: unless-stopped
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_GPU_MEMORY_UTILIZATION=0.85
      - VLLM_FLASH_ATTN_VERSION=2
      - VLLM_USE_V1=0
    ports:
      - "8002:8002"
    ipc: host
    command:
      - --model
      - Qwen/Qwen3-Embedding-0.6B
      - --tensor-parallel-size
      - "1"
      - --max-model-len
      - "8192"
      - --host
      - 0.0.0.0
      - --port
      - "8002"
      - --task
      - embed

networks:
  llm-network:
    external: true
```
🐛 Describe the bug

Qwen/Qwen3-Embedding-0.6B worked under v0.10.1.1. Under v0.10.2, other models still load and serve correctly, but this model fails during engine startup with the `AttributeError: 'Qwen3ForEmbedding' object has no attribute 'logits_processor'` traceback shown above.
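The failure should also be reproducible without Docker via the offline API, since the traceback comes from the V0 engine's profile run calling `compute_logits` on the pooling model. A minimal, unverified repro sketch under that assumption (same model and V0 engine as the compose file):

```python
import os

# Matches VLLM_USE_V1=0 from the compose file; the traceback above is from
# the V0 engine's profile_run path. Must be set before importing vllm.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

# task="embed" is deprecated in 0.10.x but still accepted
# (see the "argument 'task' is deprecated" warning in the log).
llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed", max_model_len=8192)

# On v0.10.2 this is expected to fail during engine startup with the
# AttributeError above; on v0.10.1.1 it should print the embedding size.
outputs = llm.embed(["hello world"])
print(len(outputs[0].outputs.embedding))
```
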
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.