
Conversation

WoosukKwon
Collaborator

Fixes #6269

However, I'm still not sure how #4645 passed the Neuron CI test.

@WoosukKwon added the aws-neuron (Related to AWS Inferentia & Trainium) label on Jul 10, 2024
@WoosukKwon
Collaborator Author

@liangfu #6269 might mean that the Neuron CI is not working correctly. Could you please take a look?

@areanddee

@WoosukKwon Thanks for the prompt response to my issue #6269! When the PR is approved, can you please follow up with a procedure to update my install to run the patched vLLM on Neuron systems? I urgently need this for a project I am working on.

@liangfu
Contributor

liangfu commented Jul 10, 2024

Thanks for the fix. The current Neuron CI only tests online inference; offline inference is currently not tested for the Neuron backend.
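
For reference, the kind of offline smoke test that is missing would only take a few lines. A minimal sketch, loosely following the repo's offline Neuron example (the max_num_seqs/max_model_len values here are illustrative assumptions, not the actual CI test):

# Offline inference smoke test for the Neuron backend (illustrative).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    device="neuron",    # select the Neuron backend explicitly
    max_num_seqs=8,     # assumed value for a small smoke test
    max_model_len=128,  # assumed value for a small smoke test
)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)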

@areanddee

> Thanks for the fix. The current Neuron CI only tests online inference; offline inference is currently not tested for the Neuron backend.

Well, the online inference also appears to be broken:

python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m
WARNING 07-10 22:25:24 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 07-10 22:25:29 api_server.py:206] vLLM API server version 0.5.1
INFO 07-10 22:25:29 api_server.py:207] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='facebook/opt-125m', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-10 22:25:31 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, enable_prefix_caching=False)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
    engine = cls(
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 243, in __init__
    self.model_executor = executor_class(
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 128, in __init__
    super().__init__(model_config, cache_config, parallel_config,
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 42, in __init__
    self._init_executor()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/executor/neuron_executor.py", line 21, in _init_executor
    self._init_worker()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/executor/neuron_executor.py", line 26, in _init_worker
    self.driver_worker = NeuronWorker(
TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker

@WoosukKwon
Collaborator Author

@liangfu As @areanddee pointed out, the error happens when the NeuronExecutor is initialized, because NeuronWorker does not implement the abstract execute_worker method. The error should occur for both the offline and online entry points.
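
A reduced sketch (not vLLM's actual class hierarchy) of the failure mode: Python refuses to instantiate any class that leaves an abstractmethod unimplemented, no matter which entry point constructs it, so supplying the missing override is the essence of the fix:

from abc import ABC, abstractmethod

class WorkerBase(ABC):
    # stand-in for vLLM's worker base class, which declares execute_worker
    @abstractmethod
    def execute_worker(self, worker_input) -> None:
        ...

class BrokenNeuronWorker(WorkerBase):
    pass  # no override -> cannot be instantiated

class FixedNeuronWorker(WorkerBase):
    def execute_worker(self, worker_input) -> None:
        # the Neuron backend has no cache swap/copy work to issue here, so
        # a no-op override is plausible (an assumption, not the PR's diff)
        pass

try:
    BrokenNeuronWorker()
except TypeError as e:
    print(e)  # Can't instantiate abstract class BrokenNeuronWorker ...

FixedNeuronWorker()  # fine once the abstract method is implemented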

@areanddee

I saw that the patch in #6313 was merged to main, so I did a git pull origin main to update to the latest. The behavior seen in #6269 is still present. I am posting this here because it was speculated that #6313 would fix #6269 as well.
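
One caveat worth checking, since the traceback above resolves vllm from site-packages: a git pull in a source checkout only changes the imported code if vLLM was installed from that checkout (for example, as an editable install). A quick way to confirm which copy is actually running:

import vllm
# a path under site-packages means the pip-installed copy is imported,
# not the freshly pulled source checkout
print(vllm.__version__, vllm.__file__)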

