Conversation

jacobthebanana
Contributor

@jacobthebanana jacobthebanana commented Mar 7, 2024

Ensures the LoRA ID is a part of the hash used for prefix blocks.
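For reference, a minimal self-contained sketch of the idea (not the literal diff in this PR): the prefix-block hash is computed from both the block's token prefix and the sequence's LoRA integer ID, so identical prompts served through different adapters no longer collide on the same cached block. The _Seq class and hash_of_block helper below are hypothetical stand-ins for vLLM's Sequence API.

    import dataclasses
    from typing import List

    @dataclasses.dataclass
    class _Seq:
        # Hypothetical stand-in for vLLM's Sequence.
        token_ids: List[int]
        lora_int_id: int  # 0 when no adapter is attached

    def hash_of_block(seq: _Seq, num_hashed_tokens: int) -> int:
        # Fold the LoRA ID into the hash so identical prompts served through
        # different adapters land in different prefix-cache blocks.
        return hash((tuple(seq.token_ids[:num_hashed_tokens]), seq.lora_int_id))

    # Same 16-token prompt, different adapters -> different block hashes.
    assert hash_of_block(_Seq(list(range(16)), 0), 16) != \
           hash_of_block(_Seq(list(range(16)), 1), 16)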

@jacobthebanana
Contributor Author

Example unit test output with the revised test case but without the fix (see commit 3441735). A simplified sketch of the test parametrization follows the output below.

  • test_auto_prefix_caching passes when the requests all specify a single LoRA adapter, or when no adapter is requested.
  • test_auto_prefix_caching fails when subsequent requests specify different adapters (or when one request uses no adapter and another has a LoRA adapter enabled).
$ git reset --hard 3441735
> HEAD is now at 3441735 Added test case of lora block_hash conflict.
$ pytest tests/test_cache_block_hashing.py
============================================================= test session starts ==============================================================
platform linux -- Python 3.10.12, pytest-8.0.2, pluggy-1.4.0
plugins: forked-1.6.0, anyio-4.3.0, rerunfailures-13.0, asyncio-0.23.5
asyncio: mode=strict
collected 5 items                                                                                                                              

tests/test_cache_block_hashing.py ..FFF                                                                                                  [100%]

=================================================================== FAILURES ===================================================================
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids2-256-16-facebook/opt-125m] __________________________________

model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [None, 1]

...

        for hash0, hash1 in zip(flatten_2d(hashes[0]), flatten_2d(hashes[1])):
>           assert (hash0 != hash1)
E           assert 6230683134333785342 != 6230683134333785342

tests/test_cache_block_hashing.py:84: AssertionError
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids3-256-16-facebook/opt-125m] __________________________________

model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [None, 1, 2]
...

tests/test_cache_block_hashing.py:84: AssertionError
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids4-256-16-facebook/opt-125m] __________________________________

model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [1, 2]
...

tests/test_cache_block_hashing.py:84: AssertionError
=========================================================== short test summary info ============================================================
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids2-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids3-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids4-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
==================================================== 3 failed, 2 passed, 1 warning in 1.47s ====================================================
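For context, a simplified, stand-alone sketch of the conflict the revised test case checks (the real test in tests/test_cache_block_hashing.py builds full vLLM sequences and hashes every logical block): each concurrent_lora_int_ids entry lists the LoRA IDs of concurrent requests over the same prompt, and the per-block hashes must differ whenever the IDs differ.

    import pytest

    def block_hash(token_ids, lora_int_id):
        # Mirrors the fix: the LoRA ID is part of the prefix-block hash.
        return hash((tuple(token_ids), lora_int_id))

    @pytest.mark.parametrize("concurrent_lora_int_ids",
                             [[None, 1], [None, 1, 2], [1, 2]])
    def test_lora_block_hash_conflict(concurrent_lora_int_ids):
        prompt = list(range(16))  # one full block of identical tokens
        hashes = [block_hash(prompt, lora_id or 0)
                  for lora_id in concurrent_lora_int_ids]
        # Without the fix all hashes collide; with it they must all differ.
        assert len(set(hashes)) == len(hashes)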

@jacobthebanana jacobthebanana marked this pull request as ready for review March 7, 2024 22:02
@jacobthebanana
Contributor Author

This PR closes #3264

Collaborator

@Yard1 Yard1 left a comment


Thanks, that's exactly how it should be implemented!

@Yard1 Yard1 enabled auto-merge (squash) March 7, 2024 22:06
@Yard1 Yard1 merged commit 8cbba46 into vllm-project:main Mar 7, 2024
AdrianAbeyta pushed a commit to AdrianAbeyta/vllm that referenced this pull request Mar 8, 2024
dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request Mar 26, 2024
@JJEccles

Hi, I'm looking for a solution to this issue, but for OpenAI server calls where I request the LoRA adapter in my POST request. This is the command I use to start my server:

vllm serve unsloth/Llama-3.2-3B \
  --tokenizer unsloth/Llama-3.2-3B \
  --port 8000 \
  --max-model-len 2048 \
  --enable-lora \
  --lora-modules profile_adapter=adapters_tokenizer_profile ingredientslist_adapter=adapters_tokenizer_list_ing \
  --max-lora-rank 64

I was wondering if it's possible to either adjust this server command, or change something in the inference request on the client side, so that caching no longer affects the responses when switching directly from one adapter to another between inference calls. I'm hoping there is something I can add to the server launch command that solves this. If anyone could point me in the right direction it would be much appreciated!
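For reference, here is a minimal example of the kind of POST request being described, selecting one of the adapters registered via --lora-modules by naming it in the model field. The endpoint and adapter name are taken from the command above; the prompt and the use of the requests library are illustrative assumptions, and exact payload fields may vary by vLLM version.

    # Illustrative only: pick a LoRA adapter per request by naming it as the model.
    import requests

    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "profile_adapter",  # or "ingredientslist_adapter"
            "prompt": "Write a short profile for a home cook.",
            "max_tokens": 64,
        },
    )
    print(resp.json())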
