[Bug]: KeyError: 'type' when running inference with Llama 3.2 3B Instruct

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text
--2024-09-26 15:08:57--  https://github.com/raw/vllm-project/vllm/main/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25389 (25K) [text/plain]
Saving to: ‘collect_env.py’

collect_env.py      100%[===================>]  24.79K  --.-KB/s    in 0s      

2024-09-26 15:08:58 (94.2 MB/s) - ‘collect_env.py’ saved [25389/25389]

Collecting environment information...
2024-09-26 15:09:06.936526: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-26 15:09:07.195978: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-26 15:09:07.266656: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-26 15:09:07.688307: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-26 15:09:09.928315: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.85+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               2
On-line CPU(s) list:                  0,1
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family:                           6
Model:                                79
Thread(s) per core:                   2
Core(s) per socket:                   1
Socket(s):                            1
Stepping:                             0
BogoMIPS:                             4399.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            32 KiB (1 instance)
L1i cache:                            32 KiB (1 instance)
L2 cache:                             256 KiB (1 instance)
L3 cache:                             55 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0,1
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable; SMT Host state unknown
Vulnerability Meltdown:               Vulnerable
Vulnerability Mmio stale data:        Vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:             Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable (Syscall hardening enabled)
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvcc-cu12==12.6.68
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] optree==0.12.1
[pip3] pyzmq==24.0.1
[pip3] torch==2.4.0
[pip3] torchaudio==2.4.1+cu121
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-1		N/A		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

</details>


### Model Input Dumps

_No response_

### 🐛 Describe the bug

Error:
```
KeyError                                  Traceback (most recent call last)
[<ipython-input-41-4c7e309514ca>](https://localhost:8080/#) in <cell line: 16>()
     14 
     15 # Create an LLM.
---> 16 llm = LLM(
     17     model="meta-llama/Llama-3.2-3B-Instruct",
     18     tokenizer="meta-llama/Llama-3.2-3B-Instruct",

4 frames
[/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py](https://localhost:8080/#) in __init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, **kwargs)
     91             use `float16` instead.
     92         quantization: The method used to quantize the model weights. Currently,
---> 93             we support "awq", "gptq", and "fp8" (experimental).
     94             If None, we first check the `quantization_config` attribute in the
     95             model config file. If that is None, we assume the model weights are

[/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py](https://localhost:8080/#) in from_engine_args(cls, engine_args)
    238             "decoding_config=%r, observability_config=%r, "
    239             "seed=%d, served_model_name=%s, use_v2_block_manager=%s, "
--> 240             "num_scheduler_steps=%d, multi_step_stream_outputs=%s, "
    241             "enable_prefix_caching=%s, use_async_output_proc=%s, "
    242             "use_cached_outputs=%s, mm_processor_kwargs=%s)",

[/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py](https://localhost:8080/#) in create_engine_configs(self)
    196             '--tokenizer',
    197             type=nullable_str,
--> 198             default=EngineArgs.tokenizer,
    199             help='Name or path of the huggingface tokenizer to use. '
    200             'If unspecified, model name or path will be used.')

[/usr/local/lib/python3.10/dist-packages/vllm/config.py](https://localhost:8080/#) in __init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, download_dir, load_format, dtype, seed, revision, tokenizer_revision, max_model_len, quantization)
     91         quantization_param_path: Path to JSON file containing scaling factors.
     92             Used to load KV cache scaling factors into the model when KV cache
---> 93             type is FP8_E4M3 on ROCm (AMD GPU). In the future these will also
     94             be used to load activation and weight scaling factors when the
     95             model dtype is FP8_E4M3 on ROCm.

[/usr/local/lib/python3.10/dist-packages/vllm/config.py](https://localhost:8080/#) in _get_and_verify_max_len(hf_config, max_model_len)
    491             return self.hf_config.num_attention_heads
    492         if self.hf_config.model_type == "dbrx":
--> 493             return getattr(self.hf_config.attn_config, "kv_n_heads",
    494                            self.hf_config.num_attention_heads)
    495 

KeyError: 'type'
```
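
For context, `KeyError: 'type'` is what Python raises when a dict is indexed with the key `"type"` and that key is absent. The sketch below is only an illustration under an assumption the traceback does not confirm: that the failing lookup reads a RoPE scaling dict which stores its type under `rope_type` rather than `type`. The dict contents and the helper `get_rope_type` are hypothetical and are not vLLM code.

```python
# Hypothetical config fragment, for illustration only (not copied from the
# model's config.json and not taken from vLLM).
rope_scaling = {"rope_type": "extended", "factor": 8.0}


def get_rope_type(cfg: dict) -> str:
    """Read the scaling type, accepting either key spelling."""
    for key in ("rope_type", "type"):
        if key in cfg:
            return cfg[key]
    raise KeyError("neither 'rope_type' nor 'type' found in rope_scaling")


try:
    rope_scaling["type"]  # strict lookup of a key that is absent
except KeyError as exc:
    print(f"KeyError: {exc}")  # prints: KeyError: 'type'

print(get_rope_type(rope_scaling))  # prints: extended
```

Whether this is the actual cause here is an open question; the traceback alone does not say which dict was being indexed.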

Reproduction (T4 on Colab):
```
!pip install vllm transformers -qU

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["HF_TOKEN"]=input()

from vllm import LLM, SamplingParams
import torch

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0)

# Create an LLM.
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    # trust_remote_code=True,
    # dtype="float16",
    # rope_scaling={"type": "extended", "factor": 8.0},
    )

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    answer = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {answer!r}")
```
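
Since `collect_env.py` reports `vLLM Version: N/A` above, the installed package versions can be read directly from the package metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Report the packages involved in the reproduction above.
for name in ("vllm", "transformers", "torch"):
    print(f"{name}: {installed_version(name)}")
```

Note that the source lines shown in the tracebacks do not line up with the frame names. That can happen when the files on disk change after a module has already been imported, for example when `pip install -U` is run in a live notebook session. Restarting the runtime after the install and then re-running the cell rules this out.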

Other models fail with different errors. For example, `google/gemma-2-2b-it` gives:
```
ImportError                               Traceback (most recent call last)
[<ipython-input-43-78ec61c78513>](https://localhost:8080/#) in <cell line: 16>()
     14 
     15 # Create an LLM.
---> 16 llm = LLM(
     17     # model="meta-llama/Llama-3.2-3B-Instruct",
     18     model="google/gemma-2-2b-it",

4 frames
[/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py](https://localhost:8080/#) in <module>
      8 
      9 import vllm.envs as envs
---> 10 from vllm.config import (CacheConfig, DeviceConfig, LoadConfig, LoRAConfig,
     11                          ModelConfig, ObservabilityConfig, ParallelConfig,
     12                          PromptAdapterConfig, SchedulerConfig,

ImportError: cannot import name 'DeviceConfig' from 'vllm.config' (/usr/local/lib/python3.10/dist-packages/vllm/config.py)

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
```
And `microsoft/Phi-3-mini-128k-instruct` gives:
```
AssertionError                            Traceback (most recent call last)
[<ipython-input-44-d902b9f3a8ff>](https://localhost:8080/#) in <cell line: 16>()
     14 
     15 # Create an LLM.
---> 16 llm = LLM(
     17     # model="meta-llama/Llama-3.2-3B-Instruct",
     18     model="microsoft/Phi-3-mini-128k-instruct",

4 frames
[/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py](https://localhost:8080/#) in __init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, **kwargs)
     91             use `float16` instead.
     92         quantization: The method used to quantize the model weights. Currently,
---> 93             we support "awq", "gptq", and "fp8" (experimental).
     94             If None, we first check the `quantization_config` attribute in the
     95             model config file. If that is None, we assume the model weights are

[/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py](https://localhost:8080/#) in from_engine_args(cls, engine_args)
    238             "decoding_config=%r, observability_config=%r, "
    239             "seed=%d, served_model_name=%s, use_v2_block_manager=%s, "
--> 240             "num_scheduler_steps=%d, multi_step_stream_outputs=%s, "
    241             "enable_prefix_caching=%s, use_async_output_proc=%s, "
    242             "use_cached_outputs=%s, mm_processor_kwargs=%s)",

[/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py](https://localhost:8080/#) in create_engine_configs(self)
    196             '--tokenizer',
    197             type=nullable_str,
--> 198             default=EngineArgs.tokenizer,
    199             help='Name or path of the huggingface tokenizer to use. '
    200             'If unspecified, model name or path will be used.')

[/usr/local/lib/python3.10/dist-packages/vllm/config.py](https://localhost:8080/#) in __init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, download_dir, load_format, dtype, seed, revision, tokenizer_revision, max_model_len, quantization)
     91         quantization_param_path: Path to JSON file containing scaling factors.
     92             Used to load KV cache scaling factors into the model when KV cache
---> 93             type is FP8_E4M3 on ROCm (AMD GPU). In the future these will also
     94             be used to load activation and weight scaling factors when the
     95             model dtype is FP8_E4M3 on ROCm.

[/usr/local/lib/python3.10/dist-packages/vllm/config.py](https://localhost:8080/#) in _get_and_verify_max_len(hf_config, max_model_len)
    489             if "kv_n_heads" in self.hf_config.attn_config:
    490                 return self.hf_config.attn_config["kv_n_heads"]
--> 491             return self.hf_config.num_attention_heads
    492         if self.hf_config.model_type == "dbrx":
    493             return getattr(self.hf_config.attn_config, "kv_n_heads",

AssertionError:
```

### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
