
Conversation

zhuohan123
Member

Speed before this PR:

ubuntu@ray-zhuohan-cf-head-d95da8d2-compute:~/nfs/cacheflow/cacheflow$ python benchmark/benchmark_latency.py --model facebook/opt-13b
Namespace(batch_size=8, block_size=8, dtype='half', input_len=32, max_batch_size=2560, model='facebook/opt-13b', model_path='~/.cacheflow/model_weights', output_len=128, pipeline_parallel_size=1, seed=0, swap_space=20, tensor_parallel_size=1)
2023-03-31 14:17:41,580 INFO worker.py:1535 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
# GPU blocks: 1975, # CPU blocks: 3276
Warm up step
Profile step: 100%|██████████████████████████████████████████████████████████████| 3/3 [00:15<00:00,  5.18s/it]
Avg latency: 5.184098243713379 seconds

Speed after this PR:

ubuntu@ray-zhuohan-cf-head-d95da8d2-compute:~/nfs/cacheflow/cacheflow$ python benchmark/benchmark_latency.py --model facebook/opt-13b
Namespace(batch_size=8, block_size=8, dtype='half', input_len=32, max_batch_size=2560, model='facebook/opt-13b', model_path='~/.cacheflow/model_weights', output_len=128, pipeline_parallel_size=1, seed=0, swap_space=20, tensor_parallel_size=1)
2023-03-31 15:20:04,885 INFO worker.py:1535 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
# GPU blocks: 1975, # CPU blocks: 3276
Warm up step
Profile step: 100%|██████████████████████████████████████████████████████████████| 3/3 [00:10<00:00,  3.49s/it]
Avg latency: 3.492198626200358 seconds
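For reference, the two reported averages work out to roughly a 1.48x speedup, i.e. about a 33% latency reduction. A quick sketch of the arithmetic (values copied from the logs above):

```python
before = 5.184098243713379  # avg latency before this PR (seconds)
after = 3.492198626200358   # avg latency after this PR (seconds)

speedup = before / after        # how many times faster one iteration is
reduction = 1 - after / before  # fraction of latency shaved off

print(f"speedup: {speedup:.2f}x, latency reduction: {reduction:.1%}")
```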

@zhuohan123 zhuohan123 requested a review from WoosukKwon March 31, 2023 15:32
Collaborator

@WoosukKwon WoosukKwon left a comment


Awesome! Thanks for the effort.

@zhuohan123 zhuohan123 merged commit c45f3c3 into main Mar 31, 2023
@zhuohan123 zhuohan123 deleted the optimize-tp-speed branch June 18, 2023 07:22
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
AdrianAbeyta referenced this pull request in ROCm/vllm Mar 8, 2024
Rebase fp8_kv branch with upstream (3-07-2024)
z103cb referenced this pull request in z103cb/opendatahub_vllm Apr 22, 2024
These Dockerfile changes:
- Update the release stage to work with the recently refactored `requirements-common.txt` / `requirements-cuda.txt` split
- Fix up the kernel compilation in the `build` stage to correctly pick up CUDA
- Install the kernels from this Docker build rather than pulling a precompiled wheel; we can swap that back once a new wheel is available with the correct PyTorch version and updated interfaces

---------

Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024
[ROCm] adding a missing triton autotune config
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025
### What this PR does / why we need it?
Add a dispatch key for NPU so that the log is printed correctly.

Before this PR:
```
executor_base.py:110] # CPU blocks: 220478, # CPU blocks: 21845
```

After this PR:
```
executor_base.py:110] # NPU blocks: 220478, # CPU blocks: 21845
```
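The fix amounts to deriving the log label from the active device type instead of hardcoding it. A minimal sketch of the idea; the function name and parameters below are illustrative, not vLLM's actual API:

```python
def block_report(device_type: str, num_device_blocks: int, num_cpu_blocks: int) -> str:
    """Format the cache-block log line using the active device's name.

    Hypothetical helper: before the fix, the device label was effectively
    hardcoded, so an NPU run logged "# CPU blocks" for both pools.
    """
    label = device_type.upper()  # e.g. "npu" -> "NPU", "gpu" -> "GPU"
    return f"# {label} blocks: {num_device_blocks}, # CPU blocks: {num_cpu_blocks}"

print(block_report("npu", 220478, 21845))
```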

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed and the log printed as above.

Signed-off-by: MengqingCao <[email protected]>
robertgshaw2-redhat added a commit that referenced this pull request Jul 7, 2025
Load balance across multiple workers
zyongye pushed a commit to zyongye/vllm that referenced this pull request Aug 5, 2025
zyongye pushed a commit to zyongye/vllm that referenced this pull request Aug 6, 2025
heheda12345 added a commit to heheda12345/vllm that referenced this pull request Sep 29, 2025
* prefill mla

Signed-off-by: Chen Zhang <[email protected]>

* can run now

Signed-off-by: Chen Zhang <[email protected]>

* tmp

Signed-off-by: Chen Zhang <[email protected]>

* can output the first token

Signed-off-by: Chen Zhang <[email protected]>

* fix bug

Signed-off-by: Chen Zhang <[email protected]>

* remove some debug

Signed-off-by: Chen Zhang <[email protected]>

* update

Signed-off-by: Chen Zhang <[email protected]>

* hack through cu_seqlen_ks exploding issue

* update basic.py

Signed-off-by: Chen Zhang <[email protected]>

* remove some unnecessary changes

Signed-off-by: Chen Zhang <[email protected]>

* clean up

Signed-off-by: Chen Zhang <[email protected]>

---------

Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: Yongye Zhu <[email protected]>