
Conversation

zhuohan123
Member

Speed before this PR:

ubuntu@ray-zhuohan-cf-head-d95da8d2-compute:~/nfs/cacheflow/cacheflow$ python benchmark/benchmark_latency.py --model facebook/opt-13b
Namespace(batch_size=8, block_size=8, dtype='half', input_len=32, max_batch_size=2560, model='facebook/opt-13b', model_path='~/.cacheflow/model_weights', output_len=128, pipeline_parallel_size=1, seed=0, swap_space=20, tensor_parallel_size=1)
2023-03-31 14:17:41,580 INFO worker.py:1535 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
# GPU blocks: 1975, # CPU blocks: 3276
Warm up step
Profile step: 100%|██████████████████████████████████████████████████████████████| 3/3 [00:15<00:00,  5.18s/it]
Avg latency: 5.184098243713379 seconds

Speed after this PR:

ubuntu@ray-zhuohan-cf-head-d95da8d2-compute:~/nfs/cacheflow/cacheflow$ python benchmark/benchmark_latency.py --model facebook/opt-13b
Namespace(batch_size=8, block_size=8, dtype='half', input_len=32, max_batch_size=2560, model='facebook/opt-13b', model_path='~/.cacheflow/model_weights', output_len=128, pipeline_parallel_size=1, seed=0, swap_space=20, tensor_parallel_size=1)
2023-03-31 15:20:04,885 INFO worker.py:1535 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
# GPU blocks: 1975, # CPU blocks: 3276
Warm up step
Profile step: 100%|██████████████████████████████████████████████████████████████| 3/3 [00:10<00:00,  3.49s/it]
Avg latency: 3.492198626200358 seconds
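For reference, the two reported averages work out to roughly a 1.48x speedup, i.e. about a 33% latency reduction. A quick sketch of the arithmetic (values copied from the logs above):

```python
before = 5.184098243713379  # avg latency before this PR (seconds)
after = 3.492198626200358   # avg latency after this PR (seconds)

speedup = before / after        # how many times faster one iteration is
reduction = 1 - after / before  # fraction of latency shaved off

print(f"speedup: {speedup:.2f}x, latency reduction: {reduction:.1%}")
```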

@zhuohan123 zhuohan123 requested a review from WoosukKwon March 31, 2023 15:32
Collaborator

@WoosukKwon WoosukKwon left a comment


Awesome! Thanks for the effort.

@zhuohan123 zhuohan123 merged commit c45f3c3 into main Mar 31, 2023
@zhuohan123 zhuohan123 deleted the optimize-tp-speed branch June 18, 2023 07:22
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
AdrianAbeyta referenced this pull request in ROCm/vllm Mar 8, 2024
Rebase fp8_kv branch with upstream (3-07-2024)
z103cb referenced this pull request in z103cb/opendatahub_vllm Apr 22, 2024
These Dockerfile changes:
- Update the release stage to work with the recently refactored `requirements-common.txt` / `requirements-cuda.txt` split
- Fix up the kernel compilation in the `build` stage to correctly pick up CUDA
- Install the kernels from this Docker build rather than pulling a precompiled wheel; we can swap that back once a new wheel is available with the correct PyTorch version and updated interfaces

---------

Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024
[ROCm] adding a missing triton autotune config
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025
### What this PR does / why we need it?
Add a dispatch key for NPU so that the log is printed correctly.

Before this PR:
```
executor_base.py:110] # CPU blocks: 220478, # CPU blocks: 21845
```

After this PR:
```
executor_base.py:110] # NPU blocks: 220478, # CPU blocks: 21845
```
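The fix amounts to deriving the log label from the active device type instead of hardcoding it. A minimal sketch of the idea; the function name and parameters below are illustrative, not vLLM's actual API:

```python
def block_report(device_type: str, num_device_blocks: int, num_cpu_blocks: int) -> str:
    """Format the cache-block log line using the active device's name.

    Hypothetical helper: before the fix, the device label was effectively
    hardcoded, so an NPU run logged "# CPU blocks" for both pools.
    """
    label = device_type.upper()  # e.g. "npu" -> "NPU", "gpu" -> "GPU"
    return f"# {label} blocks: {num_device_blocks}, # CPU blocks: {num_cpu_blocks}"

print(block_report("npu", 220478, 21845))
```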

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed and the log printed as above.

Signed-off-by: MengqingCao <[email protected]>
robertgshaw2-redhat added a commit that referenced this pull request Jul 7, 2025
Load balance across multiple workers
zyongye pushed a commit to zyongye/vllm that referenced this pull request Aug 5, 2025
zyongye pushed a commit to zyongye/vllm that referenced this pull request Aug 6, 2025
heheda12345 added a commit to heheda12345/vllm that referenced this pull request Sep 29, 2025
* prefill mla

Signed-off-by: Chen Zhang <[email protected]>

* can run now

Signed-off-by: Chen Zhang <[email protected]>

* tmp

Signed-off-by: Chen Zhang <[email protected]>

* can output the first token

Signed-off-by: Chen Zhang <[email protected]>

* fix bug

Signed-off-by: Chen Zhang <[email protected]>

* remove some debug

Signed-off-by: Chen Zhang <[email protected]>

* update

Signed-off-by: Chen Zhang <[email protected]>

* hack through cu_seqlen_ks exploding issue

* update basic.py

Signed-off-by: Chen Zhang <[email protected]>

* remove some unnecessary changes

Signed-off-by: Chen Zhang <[email protected]>

* clean up

Signed-off-by: Chen Zhang <[email protected]>

---------

Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: Yongye Zhu <[email protected]>