Your current environment
Nightly
🐛 Describe the bug
vllm --version
INFO 03-17 04:56:52 [__init__.py:256] Automatically detected platform cuda.
0.7.4.dev497+ga73e183e
Use a model from the DeepSeek family, which ships a custom configuration_deepseek.py and is therefore loaded with --trust-remote-code. With the V0 engine, the server fails with the pickling error below (a minimal standalone sketch of the pickle behavior follows the log).
vllm serve /home/vllm-dev/DeepSeek-V2-Lite --trust-remote-code --tensor-parallel-size 2
INFO 03-17 04:59:17 [__init__.py:256] Automatically detected platform cuda.
INFO 03-17 04:59:18 [api_server.py:972] vLLM API server version 0.7.4.dev497+ga73e183e
INFO 03-17 04:59:18 [api_server.py:973] args: Namespace(subparser='serve', model_tag='/home/vllm-dev/DeepSeek-V2-Lite', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/vllm-dev/DeepSeek-V2-Lite', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', 
generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f0d0102f640>)
INFO 03-17 04:59:18 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-17 04:59:24 [config.py:583] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
INFO 03-17 04:59:24 [arg_utils.py:1763] MLA is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
WARNING 03-17 04:59:24 [arg_utils.py:1639] The model has a long context length (163840). This may cause OOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 03-17 04:59:24 [config.py:1499] Defaulting to use mp for distributed inference
INFO 03-17 04:59:24 [cuda.py:159] Forcing kv cache block size to 64 for FlashMLA backend.
INFO 03-17 04:59:24 [api_server.py:236] Started engine process with PID 1945303
INFO 03-17 04:59:27 [__init__.py:256] Automatically detected platform cuda.
INFO 03-17 04:59:28 [llm_engine.py:241] Initializing a V0 LLM engine (v0.7.4.dev497+ga73e183e) with config: model='/home/vllm-dev/DeepSeek-V2-Lite', speculative_config=None, tokenizer='/home/vllm-dev/DeepSeek-V2-Lite', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=163840, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/vllm-dev/DeepSeek-V2-Lite, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 03-17 04:59:29 [multiproc_worker_utils.py:310] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-17 04:59:29 [custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
ERROR 03-17 04:59:29 [engine.py:443] Can't pickle <class 'transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config'>: it's not the same object as transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config
ERROR 03-17 04:59:29 [engine.py:443] Traceback (most recent call last):
ERROR 03-17 04:59:29 [engine.py:443] File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 431, in run_mp_engine
ERROR 03-17 04:59:29 [engine.py:443] engine = MQLLMEngine.from_vllm_config(
ERROR 03-17 04:59:29 [engine.py:443] File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 126, in from_vllm_config
ERROR 03-17 04:59:29 [engine.py:443] return cls(
ERROR 03-17 04:59:29 [engine.py:443] File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 80, in __init__
ERROR 03-17 04:59:29 [engine.py:443] self.engine = LLMEngine(*args, **kwargs)
ERROR 03-17 04:59:29 [engine.py:443] File "/home/vllm-dev/simon/bench/vllm/vllm/engine/llm_engine.py", line 280, in __init__
ERROR 03-17 04:59:29 [engine.py:443] self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 03-17 04:59:29 [engine.py:443] File "/home/vllm-dev/simon/bench/vllm/vllm/executor/executor_base.py", line 271, in __init__
ERROR 03-17 04:59:29 [engine.py:443] super().__init__(*args, **kwargs)
ERROR 03-17 04:59:29 [engine.py:443] File "/home/vllm-dev/simon/bench/vllm/vllm/executor/executor_base.py", line 52, in __init__
ERROR 03-17 04:59:29 [engine.py:443] self._init_executor()
ERROR 03-17 04:59:29 [engine.py:443] File "/home/vllm-dev/simon/bench/vllm/vllm/executor/mp_distributed_executor.py", line 90, in _init_executor
ERROR 03-17 04:59:29 [engine.py:443] worker = ProcessWorkerWrapper(result_handler,
ERROR 03-17 04:59:29 [engine.py:443] File "/home/vllm-dev/simon/bench/vllm/vllm/executor/multiproc_worker_utils.py", line 171, in __init__
ERROR 03-17 04:59:29 [engine.py:443] self.process.start()
ERROR 03-17 04:59:29 [engine.py:443] File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
ERROR 03-17 04:59:29 [engine.py:443] self._popen = self._Popen(self)
ERROR 03-17 04:59:29 [engine.py:443] File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
ERROR 03-17 04:59:29 [engine.py:443] return Popen(process_obj)
ERROR 03-17 04:59:29 [engine.py:443] File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
ERROR 03-17 04:59:29 [engine.py:443] super().__init__(process_obj)
ERROR 03-17 04:59:29 [engine.py:443] File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
ERROR 03-17 04:59:29 [engine.py:443] self._launch(process_obj)
ERROR 03-17 04:59:29 [engine.py:443] File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
ERROR 03-17 04:59:29 [engine.py:443] reduction.dump(process_obj, fp)
ERROR 03-17 04:59:29 [engine.py:443] File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ERROR 03-17 04:59:29 [engine.py:443] ForkingPickler(file, protocol).dump(obj)
ERROR 03-17 04:59:29 [engine.py:443] _pickle.PicklingError: Can't pickle <class 'transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config'>: it's not the same object as transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 445, in run_mp_engine
raise e
File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 431, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 126, in from_vllm_config
return cls(
File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 80, in __init__
self.engine = LLMEngine(*args, **kwargs)
File "/home/vllm-dev/simon/bench/vllm/vllm/engine/llm_engine.py", line 280, in __init__
self.model_executor = executor_class(vllm_config=vllm_config, )
File "/home/vllm-dev/simon/bench/vllm/vllm/executor/executor_base.py", line 271, in __init__
super().__init__(*args, **kwargs)
File "/home/vllm-dev/simon/bench/vllm/vllm/executor/executor_base.py", line 52, in __init__
self._init_executor()
File "/home/vllm-dev/simon/bench/vllm/vllm/executor/mp_distributed_executor.py", line 90, in _init_executor
worker = ProcessWorkerWrapper(result_handler,
File "/home/vllm-dev/simon/bench/vllm/vllm/executor/multiproc_worker_utils.py", line 171, in __init__
self.process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config'>: it's not the same object as transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config
Traceback (most recent call last):
File "/home/vllm-dev/simon/bench/.venv/bin/vllm", line 10, in <module>
sys.exit(main())
File "/home/vllm-dev/simon/bench/vllm/vllm/entrypoints/cli/main.py", line 75, in main
args.dispatch_function(args)
File "/home/vllm-dev/simon/bench/vllm/vllm/entrypoints/cli/serve.py", line 33, in cmd
uvloop.run(run_server(args))
File "/home/vllm-dev/simon/bench/.venv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/vllm-dev/simon/bench/.venv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/home/vllm-dev/simon/bench/vllm/vllm/entrypoints/openai/api_server.py", line 1007, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/home/vllm-dev/simon/bench/vllm/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/home/vllm-dev/simon/bench/vllm/vllm/entrypoints/openai/api_server.py", line 259, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
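For reference, here is a minimal standalone sketch (not vLLM code; the module and class names are made up) of how pickle ends up raising "it's not the same object": classes are pickled by reference, so if a dynamically created module is re-registered and the name lookup resolves to a different class object than the one being pickled, the identity check fails.

```python
# Hypothetical minimal repro (not vLLM code): pickling a class fails with
# "it's not the same object" when the module-level name lookup no longer
# resolves to the identical class object.
import pickle
import sys
import types


def register_dynamic_config():
    """Emulate dynamic module loading: create a module at runtime and
    define a config class inside it (names here are made up)."""
    mod = types.ModuleType("transformers_modules_demo")

    class DemoConfig:
        pass

    # Make the class look like a top-level attribute of the dynamic module,
    # which is how pickle will try to locate it.
    DemoConfig.__module__ = mod.__name__
    DemoConfig.__qualname__ = "DemoConfig"
    mod.DemoConfig = DemoConfig
    sys.modules[mod.__name__] = mod
    return DemoConfig


cls_first = register_dynamic_config()
# A second registration (e.g. the dynamic module being rebuilt) replaces the
# module entry with a *new* class object of the same name.
cls_second = register_dynamic_config()

try:
    # pickle saves classes by reference: it imports cls_first.__module__,
    # looks up "DemoConfig", finds cls_second, and rejects the mismatch.
    pickle.dumps(cls_first)
except pickle.PicklingError as err:
    print(err)  # Can't pickle <class ...>: it's not the same object as ...
```

The sketch only shows the pickle-level mechanism; in the report above, the class involved is the remote-code DeepseekV2Config, which has to be pickled when the tensor-parallel worker process is started with the spawn method, and the lookup through transformers_modules apparently no longer returns the identical class object.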