[Bug]: Llama with LoRA fails to start

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text

Collecting environment information...
/workspace/my-vllm/lib64/python3.12/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux 9.4 (Plow) (x86_64)
GCC version: (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.34

Python version: 3.12.1 (main, Aug 23 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (64-bit runtime)
Python platform: Linux-4.18.0-372.46.1.el8_6.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          80
On-line CPU(s) list:             0-79
Vendor ID:                       GenuineIntel
Model name:                      Intel Xeon Processor (Icelake)
CPU family:                      6
Model:                           134
Thread(s) per core:              2
Core(s) per socket:              20
Socket(s):                       2
Stepping:                        0
BogoMIPS:                        5600.04
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       2.5 MiB (80 instances)
L1i cache:                       2.5 MiB (80 instances)
L2 cache:                        160 MiB (40 instances)
L3 cache:                        32 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-39
NUMA node1 CPU(s):               40-79
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu124torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev71+g6a5d85f4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	40-79	1		N/A
NIC0	SYS	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0



```

</details>


### Model Input Dumps

_No response_

### 🐛 Describe the bug

```
vllm serve meta-llama/Llama-3.2-1B --enable-lora
```

This fails during engine startup with:

```
RuntimeError: The size of tensor a (2048) must match the size of tensor b (128512) at non-singleton dimension 1
```
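
One possible reading of this message (an assumption, not a confirmed diagnosis): the failing call in the stack trace is a sliced `copy_` in `set_lora`, where the destination slice is taken as `[0, :lora_a.shape[1], :lora_a.shape[0]]`. A slice that asks for more elements than the buffer holds is clamped silently, so a buffer dimension of 2048 sliced with `:128512` still yields 2048, and the copy then sees 2048 against 128512 on dimension 1. The sketch below models only that shape arithmetic in plain Python. It is not vLLM or PyTorch code; the buffer shape, the LoRA matrix shape, the rank of 16, and the idea that the source is the transposed matrix are all hypothetical, chosen so the numbers match the error.

```python
# Simplified, pure-Python model of a sliced in-place copy. This is NOT vLLM or
# PyTorch code; all shapes below are hypothetical, chosen to match the message.

def sliced_extent(dim_size, stop):
    """Length of x[:stop] along a dimension of size dim_size (slices clamp)."""
    return len(range(dim_size)[:stop])


def first_copy_mismatch(dst_shape, src_shape):
    """Return (dim, dst_size, src_size) for the first dimension where src cannot
    be broadcast into dst, or None if the copy is shape-compatible.
    Simplification: both shapes must have the same rank."""
    if len(dst_shape) != len(src_shape):
        raise ValueError("this sketch only handles equal-rank shapes")
    for dim, (dst_size, src_size) in enumerate(zip(dst_shape, src_shape)):
        if dst_size != src_size and src_size != 1:
            return dim, dst_size, src_size
    return None


# Hypothetical shapes: a per-slot buffer of (rank, 2048) and a LoRA A matrix
# of (128512, rank), with rank = 16 picked arbitrarily.
rank = 16
buffer_slot_shape = (rank, 2048)
lora_a_shape = (128512, rank)

# buffer[:lora_a.shape[1], :lora_a.shape[0]] clamps the second slice to 2048.
dst_shape = (
    sliced_extent(buffer_slot_shape[0], lora_a_shape[1]),
    sliced_extent(buffer_slot_shape[1], lora_a_shape[0]),
)
# Assumed source: the transposed LoRA A matrix.
src_shape = (lora_a_shape[1], lora_a_shape[0])

print(dst_shape, src_shape, first_copy_mismatch(dst_shape, src_shape))
```

If this reading is right, the buffer and the dummy LoRA weights created during the profile run disagree about which dimension is the hidden size and which is the vocabulary size. The stack trace is the authoritative record; the sketch only shows how the two numbers in the message could end up on the same dimension.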

Full stacktrace:
```
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/worker/model_runner.py", line 1607, in execute_model
    self.set_active_loras(model_input.lora_requests,
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/worker/model_runner.py", line 1303, in set_active_loras
    self.lora_manager.set_active_adapters(lora_requests, lora_mapping)
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/lora/worker_manager.py", line 136, in set_active_adapters
    set_active_adapters_worker(requests, mapping, self._apply_adapters,
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/adapter_commons/utils.py", line 52, in set_active_adapters_worker
    apply_adapters_func(requests)
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/lora/worker_manager.py", line 195, in _apply_adapters
    self.add_adapter(lora)
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/lora/worker_manager.py", line 211, in add_adapter
    self._adapter_manager.activate_adapter(lora_request.lora_int_id)
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/lora/models.py", line 706, in activate_adapter
    result = super().activate_adapter(lora_id)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/lora/models.py", line 391, in activate_adapter
    module.set_lora(index, module_lora.lora_a, module_lora.lora_b,
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/lora/layers.py", line 1430, in set_lora
    0, :lora_a.shape[1], :lora_a.shape[0]].copy_(
                                           ^^^^^^
RuntimeError: The size of tensor a (2048) must match the size of tensor b (128512) at non-singleton dimension 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib64/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
    return cls(
           ^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
                  ^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 349, in __init__
    self._initialize_kv_caches()
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/workspace/my-vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/worker/model_runner.py", line 1290, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/workspace/my-vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
    raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241009-221633.pkl): The size of tensor a (2048) must match the size of tensor b (128512) at non-singleton dimension 1
Traceback (most recent call last):
  File "/workspace/my-vllm/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/scripts.py", line 191, in main
    args.dispatch_function(args)
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/scripts.py", line 40, in serve
    uvloop.run(run_server(args))
  File "/workspace/my-vllm/lib64/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/workspace/my-vllm/lib64/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 548, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 106, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 193, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
```

### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
