[Bug]: V0+TP doesn't work with trust remote code for custom configuration #14925

@simon-mo

Description

Your current environment

Nightly

🐛 Describe the bug

 vllm --version
INFO 03-17 04:56:52 [__init__.py:256] Automatically detected platform cuda.
0.7.4.dev497+ga73e183e

Use a model such as one from DeepSeek's family of models, which ship a custom configuration_deepseek.py. In the V0 engine, we see the following error:

vllm serve /home/vllm-dev/DeepSeek-V2-Lite --trust-remote-code --tensor-parallel-size 2
INFO 03-17 04:59:17 [__init__.py:256] Automatically detected platform cuda.
INFO 03-17 04:59:18 [api_server.py:972] vLLM API server version 0.7.4.dev497+ga73e183e
INFO 03-17 04:59:18 [api_server.py:973] args: Namespace(subparser='serve', model_tag='/home/vllm-dev/DeepSeek-V2-Lite', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/vllm-dev/DeepSeek-V2-Lite', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', 
generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f0d0102f640>)
INFO 03-17 04:59:18 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-17 04:59:24 [config.py:583] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
INFO 03-17 04:59:24 [arg_utils.py:1763] MLA is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
WARNING 03-17 04:59:24 [arg_utils.py:1639] The model has a long context length (163840). This may causeOOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 03-17 04:59:24 [config.py:1499] Defaulting to use mp for distributed inference
INFO 03-17 04:59:24 [cuda.py:159] Forcing kv cache block size to 64 for FlashMLA backend.
INFO 03-17 04:59:24 [api_server.py:236] Started engine process with PID 1945303
INFO 03-17 04:59:27 [__init__.py:256] Automatically detected platform cuda.
INFO 03-17 04:59:28 [llm_engine.py:241] Initializing a V0 LLM engine (v0.7.4.dev497+ga73e183e) with config: model='/home/vllm-dev/DeepSeek-V2-Lite', speculative_config=None, tokenizer='/home/vllm-dev/DeepSeek-V2-Lite', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=163840, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/vllm-dev/DeepSeek-V2-Lite, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
WARNING 03-17 04:59:29 [multiproc_worker_utils.py:310] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-17 04:59:29 [custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
ERROR 03-17 04:59:29 [engine.py:443] Can't pickle <class 'transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config'>: it's not the same object as transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config
ERROR 03-17 04:59:29 [engine.py:443] Traceback (most recent call last):
ERROR 03-17 04:59:29 [engine.py:443]   File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 431, in run_mp_engine
ERROR 03-17 04:59:29 [engine.py:443]     engine = MQLLMEngine.from_vllm_config(
ERROR 03-17 04:59:29 [engine.py:443]   File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 126, in from_vllm_config
ERROR 03-17 04:59:29 [engine.py:443]     return cls(
ERROR 03-17 04:59:29 [engine.py:443]   File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 80, in __init__
ERROR 03-17 04:59:29 [engine.py:443]     self.engine = LLMEngine(*args, **kwargs)
ERROR 03-17 04:59:29 [engine.py:443]   File "/home/vllm-dev/simon/bench/vllm/vllm/engine/llm_engine.py", line 280, in __init__
ERROR 03-17 04:59:29 [engine.py:443]     self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 03-17 04:59:29 [engine.py:443]   File "/home/vllm-dev/simon/bench/vllm/vllm/executor/executor_base.py", line 271, in __init__
ERROR 03-17 04:59:29 [engine.py:443]     super().__init__(*args, **kwargs)
ERROR 03-17 04:59:29 [engine.py:443]   File "/home/vllm-dev/simon/bench/vllm/vllm/executor/executor_base.py", line 52, in __init__
ERROR 03-17 04:59:29 [engine.py:443]     self._init_executor()
ERROR 03-17 04:59:29 [engine.py:443]   File "/home/vllm-dev/simon/bench/vllm/vllm/executor/mp_distributed_executor.py", line 90, in _init_executor
ERROR 03-17 04:59:29 [engine.py:443]     worker = ProcessWorkerWrapper(result_handler,
ERROR 03-17 04:59:29 [engine.py:443]   File "/home/vllm-dev/simon/bench/vllm/vllm/executor/multiproc_worker_utils.py", line 171, in __init__
ERROR 03-17 04:59:29 [engine.py:443]     self.process.start()
ERROR 03-17 04:59:29 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
ERROR 03-17 04:59:29 [engine.py:443]     self._popen = self._Popen(self)
ERROR 03-17 04:59:29 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
ERROR 03-17 04:59:29 [engine.py:443]     return Popen(process_obj)
ERROR 03-17 04:59:29 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
ERROR 03-17 04:59:29 [engine.py:443]     super().__init__(process_obj)
ERROR 03-17 04:59:29 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
ERROR 03-17 04:59:29 [engine.py:443]     self._launch(process_obj)
ERROR 03-17 04:59:29 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
ERROR 03-17 04:59:29 [engine.py:443]     reduction.dump(process_obj, fp)
ERROR 03-17 04:59:29 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ERROR 03-17 04:59:29 [engine.py:443]     ForkingPickler(file, protocol).dump(obj)
ERROR 03-17 04:59:29 [engine.py:443] _pickle.PicklingError: Can't pickle <class 'transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config'>: it's not the same object as transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 445, in run_mp_engine
    raise e
  File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 431, in run_mp_engine
    engine = MQLLMEngine.from_vllm_config(
  File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 126, in from_vllm_config
    return cls(
  File "/home/vllm-dev/simon/bench/vllm/vllm/engine/multiprocessing/engine.py", line 80, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/home/vllm-dev/simon/bench/vllm/vllm/engine/llm_engine.py", line 280, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
  File "/home/vllm-dev/simon/bench/vllm/vllm/executor/executor_base.py", line 271, in __init__
    super().__init__(*args, **kwargs)
  File "/home/vllm-dev/simon/bench/vllm/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/home/vllm-dev/simon/bench/vllm/vllm/executor/mp_distributed_executor.py", line 90, in _init_executor
    worker = ProcessWorkerWrapper(result_handler,
  File "/home/vllm-dev/simon/bench/vllm/vllm/executor/multiproc_worker_utils.py", line 171, in __init__
    self.process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config'>: it's not the same object as transformers_modules.DeepSeek-V2-Lite.configuration_deepseek.DeepseekV2Config
Traceback (most recent call last):
  File "/home/vllm-dev/simon/bench/.venv/bin/vllm", line 10, in <module>
    sys.exit(main())
  File "/home/vllm-dev/simon/bench/vllm/vllm/entrypoints/cli/main.py", line 75, in main
    args.dispatch_function(args)
  File "/home/vllm-dev/simon/bench/vllm/vllm/entrypoints/cli/serve.py", line 33, in cmd
    uvloop.run(run_server(args))
  File "/home/vllm-dev/simon/bench/.venv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/vllm-dev/simon/bench/.venv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/vllm-dev/simon/bench/vllm/vllm/entrypoints/openai/api_server.py", line 1007, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/vllm-dev/simon/bench/vllm/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/vllm-dev/simon/bench/vllm/vllm/entrypoints/openai/api_server.py", line 259, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
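
For context on the failure mode, here is a minimal, generic sketch (not vLLM's actual code path; the dynamic_mod module and DummyConfig class are made up for illustration) of how pickle's by-reference class lookup produces the same "it's not the same object as ..." error when a dynamically created module, like transformers_modules for trust-remote-code configs, is re-created before an object holding the class gets serialized:

# Hypothetical stand-alone illustration of the PicklingError above.
# Pickle serializes a class by module path + qualified name and verifies that
# the name still resolves to the *same* class object; re-creating the dynamic
# module (as trust_remote_code loading of transformers_modules can do) breaks
# that identity check.
import pickle
import sys
import types


def install_dynamic_module():
    """Create (or re-create) a fake dynamic module holding a config class."""
    mod = types.ModuleType("dynamic_mod")  # stands in for transformers_modules.*
    cls = type("DummyConfig", (), {"__module__": "dynamic_mod"})
    mod.DummyConfig = cls
    sys.modules["dynamic_mod"] = mod
    return cls


config_cls = install_dynamic_module()
config = config_cls()

# Re-registering the module replaces the class object that pickle will find
# when it looks up dynamic_mod.DummyConfig by name.
install_dynamic_module()

try:
    pickle.dumps(config)
except pickle.PicklingError as exc:
    # Can't pickle <class 'dynamic_mod.DummyConfig'>:
    # it's not the same object as dynamic_mod.DummyConfig
    print(exc)

In the traceback above, the same identity check appears to fail when the spawn-based ProcessWorkerWrapper pickles the engine config that holds the remote-code DeepseekV2Config for the tensor-parallel worker.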

Before submitting a new issue...

  • Make sure you have already searched for relevant issues and asked the chatbot at the bottom right corner of the documentation page, which can answer many frequently asked questions.
