@robertgshaw2-redhat robertgshaw2-redhat commented Jan 7, 2025

SUMMARY:

  • Support compressed-tensors W8A8 models on TPU.
  • To run, load a W8A8 model:
from vllm import LLM
model = LLM("neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", max_model_len=2048)
model.generate("Hello my name is")
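
For readers unfamiliar with the term, W8A8 means 8-bit weights and 8-bit activations. The following is a minimal pure-Python sketch of the arithmetic, assuming symmetric int8 quantization with one scale per weight output channel and one dynamically computed scale per token. It is not the TPU kernel added in this PR, and the helper names (`quantize_rows_int8`, `w8a8_linear`) are illustrative only.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_rows_int8(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per row."""
    q_rows, scales = [], []
    for row in x:
        max_abs = max(abs(v) for v in row)
        # Guard against an all-zero row so the scale is never zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x: Matrix, w: Matrix) -> Matrix:
    """Compute x @ w.T with int8 operands and integer accumulation.

    x has shape (tokens, in_features); w has shape (out_features, in_features).
    """
    xq, x_scales = quantize_rows_int8(x)  # activations: one scale per token
    wq, w_scales = quantize_rows_int8(w)  # weights: one scale per output channel
    out = []
    for xi, sx in zip(xq, x_scales):
        row = []
        for wj, sw in zip(wq, w_scales):
            acc = sum(a * b for a, b in zip(xi, wj))  # exact integer accumulation
            row.append(acc * sx * sw)  # dequantize with both scales
        out.append(row)
    return out


def reference_linear(x: Matrix, w: Matrix) -> Matrix:
    """Full-precision x @ w.T for comparison."""
    return [[sum(a * b for a, b in zip(xi, wj)) for wj in w] for xi in x]


x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w = [[0.2, -0.4, 0.6], [1.0, 0.5, -0.5]]
approx = w8a8_linear(x, w)
exact = reference_linear(x, w)
max_err = max(
    abs(a - e)
    for row_a, row_e in zip(approx, exact)
    for a, e in zip(row_a, row_e)
)
```

Here `max_err` is the largest absolute difference between the quantized and full-precision results; it stays small because each operand is rounded by at most half a quantization step.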

TESTING:

  • verified accuracy on TPU for Llama-8B at TP=1 (same score as on GPU)
  • verified accuracy on TPU for Llama-8B at TP=4 (same score as on GPU)
  • verified accuracy on TPU for Llama-70B at TP=1 (same score as on GPU)
  • verified accuracy on TPU for Qwen at TP=1 (same score as on GPU); note: this model includes bias terms
  • confirmed all schemes still work on GPU, including:
    • nm-testing/Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Asym
    • nm-testing/Meta-Llama-3-8B-Instruct-W8A8-Static-Per-Tensor-Sym
    • nm-testing/Meta-Llama-3-8B-Instruct-W8A8-Static-Per-Tensor-Asym
  • add Llama TP=1 tests to CI/CD
    • FOLLOW UP: Add more than one model once we enable lm-eval framework on TPU
    • FOLLOW UP: add TP>1 once we enable this machine type in the CI
  • figure out workaround for user warning re: cond
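
The GPU checkpoint names in the list above distinguish symmetric ("Sym") from asymmetric ("Asym") schemes. As general background, and assuming the usual int8 definitions rather than anything specific to this PR, the sketch below contrasts the two: symmetric quantization uses only a scale, while asymmetric quantization adds a zero point so that a skewed value range uses all 256 levels. The function names are illustrative.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float]) -> Tuple[List[int], float]:
    """Map values to [-127, 127] with a scale only (zero point fixed at 0)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def quantize_asymmetric(values: List[float]) -> Tuple[List[int], float, int]:
    """Map [min, max] onto [-128, 127] with a scale and an integer zero point."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q: List[int], scale: float, zero_point: int = 0) -> List[float]:
    return [(qi - zero_point) * scale for qi in q]


def max_abs_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


# A one-sided range, such as non-negative activations.
values = [0.0, 0.1, 0.2, 3.0, 6.0]

q_sym, s_sym = quantize_symmetric(values)
q_asym, s_asym, zp = quantize_asymmetric(values)

sym_err = max_abs_error(values, dequantize(q_sym, s_sym))
asym_err = max_abs_error(values, dequantize(q_asym, s_asym, zp))
```

For this one-sided input the asymmetric step size is roughly half the symmetric one, so `asym_err` comes out smaller than `sym_err`; for ranges centred on zero the two schemes behave almost identically.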

FOLLOW UP

  • [TPU] Mixed precision
  • [TPU] Fix the elevated memory-usage estimate caused by peak_bytes capturing some intermediate tensors.
  • [Software Quality] Add TritonScaledMMLinear abstraction
  • [Software Quality] Convert Fp8 methods to use Kernel abstraction
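
The two [Software Quality] items refer to a kernel abstraction. The PR text does not show its interface, so the following is only a hypothetical sketch of the general pattern: each backend implements a common interface and reports whether it supports the current platform, and a selector picks the first usable one. Every name here (`LinearKernel`, `unsupported_reason`, `choose_kernel`, and the two kernel classes) is invented for illustration and is not vLLM's API.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence, Type


class LinearKernel(ABC):
    """Hypothetical interface; one subclass per backend."""

    @classmethod
    @abstractmethod
    def unsupported_reason(cls, platform: str) -> Optional[str]:
        """Return None when the platform is supported, else a reason string."""

    @abstractmethod
    def apply(self, x: List[float], weight: List[List[float]]) -> List[float]:
        """Compute weight @ x for a single input vector."""


def _matvec(x: List[float], weight: List[List[float]]) -> List[float]:
    return [sum(a * b for a, b in zip(row, x)) for row in weight]


class GpuKernel(LinearKernel):
    @classmethod
    def unsupported_reason(cls, platform: str) -> Optional[str]:
        return None if platform == "cuda" else "requires a CUDA device"

    def apply(self, x, weight):
        return _matvec(x, weight)  # placeholder for a device-specific call


class TpuKernel(LinearKernel):
    @classmethod
    def unsupported_reason(cls, platform: str) -> Optional[str]:
        return None if platform == "tpu" else "requires a TPU"

    def apply(self, x, weight):
        return _matvec(x, weight)  # placeholder for a device-specific call


def choose_kernel(
    platform: str, candidates: Sequence[Type[LinearKernel]]
) -> Type[LinearKernel]:
    """Return the first candidate that supports the platform."""
    reasons = []
    for kernel in candidates:
        reason = kernel.unsupported_reason(platform)
        if reason is None:
            return kernel
        reasons.append(f"{kernel.__name__}: {reason}")
    raise ValueError(f"no kernel for platform {platform!r}: " + "; ".join(reasons))


kernel_cls = choose_kernel("tpu", [GpuKernel, TpuKernel])
result = kernel_cls().apply([1.0, 2.0], [[3.0, 4.0], [0.5, 0.5]])
```

Keeping the support check on the class rather than the instance lets the selector reject a backend before any weights are allocated for it, and the collected reasons make a failed selection easier to diagnose.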

@robertgshaw2-redhat

@mgoin this is ready to go.

@@ -0,0 +1,74 @@
from typing import List, Optional, Type

NOTE for reviewer - this file is not changed, it is just moved

@mgoin mgoin left a comment

LGTM, excellent work

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) January 8, 2025 18:31
@robertgshaw2-redhat robertgshaw2-redhat merged commit 56fe4c2 into vllm-project:main Jan 8, 2025
56 checks passed
gshtras added a commit to ROCm/vllm that referenced this pull request Jan 14, 2025
* [Bugfix][V1] Fix molmo text-only inputs (vllm-project#11676)

Signed-off-by: Jee Jee Li <[email protected]>

* [Kernel] Move attn_type to Attention.__init__() (vllm-project#11690)

Signed-off-by: Chen Zhang <[email protected]>

* [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (vllm-project#11685)

Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (vllm-project#11772)

Signed-off-by: DarkLight1337 <[email protected]>

* [Model] Future-proof Qwen2-Audio multi-modal processor (vllm-project#11776)

Signed-off-by: DarkLight1337 <[email protected]>

* [XPU] Make pp group initilized for pipeline-parallelism (vllm-project#11648)

Signed-off-by: yisheng <[email protected]>

* [Doc][3/N] Reorganize Serving section (vllm-project#11766)

Signed-off-by: DarkLight1337 <[email protected]>

* [Kernel][LoRA]Punica prefill  kernels fusion (vllm-project#11234)

Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Abatom <[email protected]>
Co-authored-by: Zhonghua Deng <[email protected]>

* [Bugfix] Update attention interface in `Whisper` (vllm-project#11784)

Signed-off-by: Roger Wang <[email protected]>

* [CI] Fix neuron CI and run offline tests (vllm-project#11779)

Signed-off-by: Liangfu Chen <[email protected]>

* fix init error for MessageQueue when n_local_reader is zero (vllm-project#11768)

* [Doc] Create a vulnerability management team (vllm-project#9925)

Signed-off-by: Russell Bryant <[email protected]>

* [CI][CPU] adding build number to docker image name (vllm-project#11788)

Signed-off-by: Yuan Zhou <[email protected]>

* [V1][Doc] Update V1 support for `LLaVa-NeXT-Video` (vllm-project#11798)

Signed-off-by: Roger Wang <[email protected]>

* [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation (vllm-project#11800)

Signed-off-by: DarkLight1337 <[email protected]>

* [doc] add doc to explain how to use uv (vllm-project#11773)

Signed-off-by: youkaichao <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [V1] Support audio language models on V1 (vllm-project#11733)

Signed-off-by: Roger Wang <[email protected]>

* [doc] update how pip can install nightly wheels (vllm-project#11806)

Signed-off-by: youkaichao <[email protected]>

* [Doc] Add note to `gte-Qwen2` models (vllm-project#11808)

Signed-off-by: DarkLight1337 <[email protected]>

* [optimization] remove python function call for custom op (vllm-project#11750)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix] update the prefix for qwen2 (vllm-project#11795)

Co-authored-by: jiadi.jjd <[email protected]>

* [Doc]Add documentation for using EAGLE in vLLM (vllm-project#11417)

Signed-off-by: Sourashis Roy <[email protected]>

* [Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 (vllm-project#11794)

* [Doc] Group examples into categories (vllm-project#11782)

Signed-off-by: Harry Mellor <[email protected]>

* [Bugfix] Fix image input for Pixtral-HF (vllm-project#11741)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] sort torch profiler table by kernel timing (vllm-project#11813)

* Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… (vllm-project#11824)

* Fixed docker build for ppc64le (vllm-project#11518)

Signed-off-by: Nishidha Panpaliya <[email protected]>

* [OpenVINO] Fixed Docker.openvino build (vllm-project#11732)

Signed-off-by: Ilya Lavrenov <[email protected]>

* [Bugfix] Add checks for LoRA and CPU offload (vllm-project#11810)

Signed-off-by: Jee Jee Li <[email protected]>

* [Docs] reorganize sponsorship page (vllm-project#11639)

Signed-off-by: simon-mo <[email protected]>

* [Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used (vllm-project#11825)

Signed-off-by: DarkLight1337 <[email protected]>

* [misc] improve memory profiling (vllm-project#11809)

Signed-off-by: youkaichao <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [doc] update wheels url (vllm-project#11830)

Signed-off-by: youkaichao <[email protected]>

* [Docs] Update sponsor name: 'Novita' to 'Novita AI' (vllm-project#11833)

* [Hardware][Apple] Native support for macOS Apple Silicon (vllm-project#11696)

Signed-off-by: Wallas Santos <[email protected]>
Co-authored-by: Michael Goin <[email protected]>

* [torch.compile] consider relevant code in compilation cache (vllm-project#11614)

Signed-off-by: youkaichao <[email protected]>

* [VLM] Reorganize profiling/processing-related code (vllm-project#11812)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Move examples into categories (vllm-project#11840)

Signed-off-by: Harry Mellor <[email protected]>

* [Doc][4/N] Reorganize API Reference (vllm-project#11843)

Signed-off-by: DarkLight1337 <[email protected]>

* [CI/Build][Bugfix] Fix CPU CI image clean up (vllm-project#11836)

Signed-off-by: jiang1.li <[email protected]>

* [Bugfix][XPU] fix silu_and_mul (vllm-project#11823)

Signed-off-by: yan ma <[email protected]>

* [Misc] Move some model utils into vision file (vllm-project#11848)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Expand Multimodal API Reference (vllm-project#11852)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc]add some explanations for BlockHashType (vllm-project#11847)

* [TPU][Quantization] TPU `W8A8` (vllm-project#11785)

Co-authored-by: Woosuk Kwon <[email protected]>

* [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (vllm-project#11698)

Signed-off-by: Randall Smith <[email protected]>

* [Docs] Add Google Cloud Meetup (vllm-project#11864)

* [CI] Turn on basic correctness tests for V1 (vllm-project#10864)

* treat do_lower_case in the same way as the sentence-transformers library (vllm-project#11815)

Signed-off-by: Max de Bayser <[email protected]>

* [Doc] Recommend uv and python 3.12 for quickstart guide (vllm-project#11849)

Signed-off-by: mgoin <[email protected]>

* [Misc] Move `print_*_once` from utils to logger (vllm-project#11298)

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
Co-authored-by: Maxime Fournioux <[email protected]>

* [Doc] Intended links Python multiprocessing library (vllm-project#11878)

* [perf]fix current stream (vllm-project#11870)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix] Override dunder methods of placeholder modules (vllm-project#11882)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] fix beam search input errors and latency benchmark script (vllm-project#11875)

Signed-off-by: Ye Qi <[email protected]>
Co-authored-by: yeq <[email protected]>

* [Doc] Add model development API Reference (vllm-project#11884)

Signed-off-by: DarkLight1337 <[email protected]>

* [platform] Allow platform specify attention backend (vllm-project#11609)

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>

* [ci]try to fix flaky multi-step tests (vllm-project#11894)

Signed-off-by: youkaichao <[email protected]>

* [Misc] Provide correct Pixtral-HF chat template (vllm-project#11891)

Signed-off-by: DarkLight1337 <[email protected]>

* [Docs] Add Modal to deployment frameworks (vllm-project#11907)

* [Doc][5/N] Move Community and API Reference to the bottom (vllm-project#11896)

Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: Simon Mo <[email protected]>

* [VLM] Enable tokenized inputs for merged multi-modal processor (vllm-project#11900)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Show default pooling method in a table (vllm-project#11904)

Signed-off-by: DarkLight1337 <[email protected]>

* [torch.compile] Hide KV cache behind torch.compile boundary (vllm-project#11677)

Signed-off-by: Chen Zhang <[email protected]>

* [Bugfix] Validate lora adapters to avoid crashing server (vllm-project#11727)

Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [BUGFIX] Fix `UnspecifiedPlatform` package name (vllm-project#11916)

Signed-off-by: Kunshang Ji <[email protected]>

* [ci] fix gh200 tests (vllm-project#11919)

Signed-off-by: youkaichao <[email protected]>

* [misc] remove python function call for custom activation op (vllm-project#11885)

Co-authored-by: youkaichao <[email protected]>

* [platform] support pytorch custom op pluggable (vllm-project#11328)

Signed-off-by: wangxiyuan <[email protected]>

* Replace "online inference" with "online serving" (vllm-project#11923)

Signed-off-by: Harry Mellor <[email protected]>

* [ci] Fix sampler tests (vllm-project#11922)

Signed-off-by: youkaichao <[email protected]>

* [Doc] [1/N] Initial guide for merged multi-modal processor (vllm-project#11925)

Signed-off-by: DarkLight1337 <[email protected]>

* [platform] support custom torch.compile backend key (vllm-project#11318)

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Co-authored-by: youkaichao <[email protected]>

* [Doc] Rename offline inference examples (vllm-project#11927)

Signed-off-by: Harry Mellor <[email protected]>

* [Docs] Fix docstring in `get_ip` function (vllm-project#11932)

Signed-off-by: Kuntai Du <[email protected]>

* Doc fix in `benchmark_long_document_qa_throughput.py` (vllm-project#11933)

Signed-off-by: Kuntai Du <[email protected]>

* [Hardware][CPU] Support MOE models on x86 CPU (vllm-project#11831)

Signed-off-by: jiang1.li <[email protected]>

* [Misc] Clean up debug code in Deepseek-V3 (vllm-project#11930)

Signed-off-by: Isotr0py <[email protected]>

* [Misc] Update benchmark_prefix_caching.py fixed example usage (vllm-project#11920)

Signed-off-by: Ren MinMin <[email protected]>
Co-authored-by: Ren MinMin <[email protected]>

* [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (vllm-project#11939)

Signed-off-by: Travis Johnson <[email protected]>

* [mypy] Fix mypy warnings in api_server.py (vllm-project#11941)

Signed-off-by: Fred Reiss <[email protected]>

* [ci] fix broken distributed-tests-4-gpus (vllm-project#11937)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design (vllm-project#11672)

Signed-off-by: Sungjae Lee <[email protected]>

* [Bugfix] fused_experts_impl wrong compute type for float32 (vllm-project#11921)

Signed-off-by: shaochangxu.scx <[email protected]>
Co-authored-by: shaochangxu.scx <[email protected]>

* [CI/Build] Move model-specific multi-modal processing tests (vllm-project#11934)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Basic guide for writing unit tests for new models (vllm-project#11951)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix RobertaModel loading (vllm-project#11940)

Signed-off-by: NickLucche <[email protected]>

* [Model] Add cogagent model support vLLM (vllm-project#11742)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [V1] Avoid sending text prompt to core engine (vllm-project#11963)

Signed-off-by: Roger Wang <[email protected]>

* [CI/Build] Add markdown linter (vllm-project#11857)

Signed-off-by: Rafael Vasquez <[email protected]>

* [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100)

Signed-off-by: Akshat Tripathi <[email protected]>
Signed-off-by: Oleg Mosalov <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Oleg Mosalov <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764)

* [V1][Core][1/n] Logging and Metrics (vllm-project#11962)

Signed-off-by: [email protected] <[email protected]>

* [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973)

Signed-off-by: [email protected] <[email protected]>

* [MISC] fix typo in kv transfer send recv test (vllm-project#11983)

* [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979)

* [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972)

Signed-off-by: Sungjae Lee <[email protected]>

* [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947)

Signed-off-by: Yida Wu <[email protected]>

* [Misc]Minor Changes about Worker (vllm-project#11555)

Signed-off-by: Chenguang Li <[email protected]>

* [platform] add ray_device_key (vllm-project#11948)

Signed-off-by: youkaichao <[email protected]>

* Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980)

Signed-off-by: Alex-Brooks <[email protected]>

* [Kernel] unified_attention for Attention.forward (vllm-project#11967)

Signed-off-by: Chen Zhang <[email protected]>

* [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Doc] Organise installation documentation into categories and tabs (vllm-project#11935)

Signed-off-by: Harry Mellor <[email protected]>

* [platform] add device_control env var (vllm-project#12009)

Signed-off-by: youkaichao <[email protected]>

* [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516)

Signed-off-by: Shanshan Shen <[email protected]>

* bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982)

Signed-off-by: elijah <[email protected]>

* Using list

* Revert "[misc] improve memory profiling (vllm-project#11809)"

This reverts commit 889e662.

* Trying to make scales work with compileable attention

* Docs lint

---------

Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: yisheng <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Liangfu Chen <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Yuan Zhou <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Sourashis Roy <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Ilya Lavrenov <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: yan ma <[email protected]>
Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
Signed-off-by: Ye Qi <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Kuntai Du <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Ren MinMin <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Fred Reiss <[email protected]>
Signed-off-by: Sungjae Lee <[email protected]>
Signed-off-by: shaochangxu.scx <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Akshat Tripathi <[email protected]>
Signed-off-by: Oleg Mosalov <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Yida Wu <[email protected]>
Signed-off-by: Chenguang Li <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
Signed-off-by: elijah <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>
Co-authored-by: YiSheng5 <[email protected]>
Co-authored-by: Zhonghua Deng <[email protected]>
Co-authored-by: Liangfu Chen <[email protected]>
Co-authored-by: XiaobingZhang <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: jiangjiadi <[email protected]>
Co-authored-by: jiadi.jjd <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: Jie Fu (傅杰) <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: WangErXiao <[email protected]>
Co-authored-by: Nishidha <[email protected]>
Co-authored-by: Ilya Lavrenov <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Wallas Henrique <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Yan Ma <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: Maxime Fournioux <[email protected]>
Co-authored-by: Guspan Tanadi <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: yeq <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>
Co-authored-by: Charles Frye <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: cennn <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: minmin <[email protected]>
Co-authored-by: Ren MinMin <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Fred Reiss <[email protected]>
Co-authored-by: Sungjae Lee <[email protected]>
Co-authored-by: shaochangxu <[email protected]>
Co-authored-by: shaochangxu.scx <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: sixgod <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Rafael Vasquez <[email protected]>
Co-authored-by: Akshat Tripathi <[email protected]>
Co-authored-by: Oleg Mosalov <[email protected]>
Co-authored-by: Avshalom Manevich <[email protected]>
Co-authored-by: Yangcheng Li <[email protected]>
Co-authored-by: Siyuan Li <[email protected]>
Co-authored-by: Concurrensee <[email protected]>
Co-authored-by: Chenguang Li <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Shanshan Shen <[email protected]>
Co-authored-by: elijah <[email protected]>
hongxiayang pushed a commit to ROCm/vllm that referenced this pull request Jan 15, 2025
* [Misc] Move weights mapper (vllm-project#11443)

Signed-off-by: Jee Jee Li <[email protected]>

* [Bugfix] Fix issues in CPU build Dockerfile. Fixes vllm-project#9182 (vllm-project#11435)

Signed-off-by: Yuan Tang <[email protected]>

* [Model] Automatic conversion of classification and reward models (vllm-project#11469)

Signed-off-by: DarkLight1337 <[email protected]>

* [V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor (vllm-project#11472)

* [Misc] Update disaggregation benchmark scripts and test logs (vllm-project#11456)

Signed-off-by: Jiaxin Shan <[email protected]>

* [Frontend] Enable decord to load video from base64 (vllm-project#11492)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Improve GitHub links (vllm-project#11491)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Move some multimodal utils to modality-specific modules (vllm-project#11494)

Signed-off-by: DarkLight1337 <[email protected]>

* Mypy checking for vllm/compilation (vllm-project#11496)

Signed-off-by: lucast2021 <[email protected]>
Co-authored-by: lucast2021 <[email protected]>

* [Misc][LoRA] Fix LoRA weight mapper (vllm-project#11495)

Signed-off-by: Jee Jee Li <[email protected]>

* [Doc] Add `QVQ` and `QwQ` to the list of supported models (vllm-project#11509)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [V1] Adding min tokens/repetition/presence/frequence penalties to V1 sampler (vllm-project#10681)

Signed-off-by: Sourashis Roy <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>

* [Model]  Modify MolmoForCausalLM MLP  (vllm-project#11510)

Signed-off-by: Jee Jee Li <[email protected]>

* [Misc] Add placeholder module (vllm-project#11501)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Add video example to openai client for multimodal (vllm-project#11521)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [1/N] API Server  (Remove Proxy) (vllm-project#11529)

* [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (vllm-project#11523)

Signed-off-by: mgoin <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Co-authored-by: simon-mo <[email protected]>
Co-authored-by: simon-mo <[email protected]>
Co-authored-by: HandH1998 <[email protected]>

* [2/N] API Server: Avoid ulimit footgun (vllm-project#11530)

* Deepseek v3 (vllm-project#11502)

Signed-off-by: mgoin <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: robertgshaw2-neuralmagic <[email protected]>

* [Docs] Document Deepseek V3 support (vllm-project#11535)

Signed-off-by: simon-mo <[email protected]>

* Update openai_compatible_server.md (vllm-project#11536)

Co-authored-by: Simon Mo <[email protected]>

* [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling (vllm-project#11394)

Signed-off-by: Woosuk Kwon <[email protected]>

* [V1] Fix yapf (vllm-project#11538)

Signed-off-by: Woosuk Kwon <[email protected]>

* [CI] Fix broken CI (vllm-project#11543)

* [misc] fix typing (vllm-project#11540)

Signed-off-by: youkaichao <[email protected]>

* [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly (vllm-project#11534)

* [BugFix] Fix quantization for all other methods (vllm-project#11547)

* [Platform] Move model arch check to platform (vllm-project#11503)

Signed-off-by: Mengqing Cao <[email protected]>

* Update deploying_with_k8s.md with AMD ROCm GPU example (vllm-project#11465)

Signed-off-by: Alex He <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Bugfix] Fix TeleChat2ForCausalLM weights mapper (vllm-project#11546)

Signed-off-by: Jee Jee Li <[email protected]>

* [Misc] Abstract the logic for reading and writing media content (vllm-project#11527)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc]  Add xgrammar in doc (vllm-project#11549)

Signed-off-by: ccjincong <[email protected]>

* [VLM] Support caching in merged multi-modal processor (vllm-project#11396)

Signed-off-by: DarkLight1337 <[email protected]>

* [MODEL] LoRA support for Jamba model (vllm-project#11209)

Signed-off-by: Erez Schwartz <[email protected]>

* [Misc]Add BNB quantization for MolmoForCausalLM  (vllm-project#11551)

Signed-off-by: Jee Jee Li <[email protected]>

* [Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix (vllm-project#11566)

Signed-off-by: Isotr0py <[email protected]>

* [Bugfix] Fix for ROCM compressed tensor support (vllm-project#11561)

* [Doc] Update mllama example based on official doc (vllm-project#11567)

Signed-off-by: Chen Zhang <[email protected]>

* [V1] [4/N] API Server: ZMQ/MP Utilities (vllm-project#11541)

* [Bugfix] Last token measurement fix (vllm-project#11376)

Signed-off-by: rajveerb <[email protected]>
Co-authored-by: Roger Wang <[email protected]>

* [Model] Support InternLM2 Reward models (vllm-project#11571)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Model] Remove hardcoded image tokens ids from Pixtral (vllm-project#11582)

Signed-off-by: Roger Wang <[email protected]>

* [Hardware][AMD]: Replace HIPCC version with more precise ROCm version (vllm-project#11515)

Signed-off-by: hjwei <[email protected]>

* [V1][Minor] Set pin_memory=False for token_ids_cpu tensor (vllm-project#11581)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Doc] Minor documentation fixes (vllm-project#11580)

Signed-off-by: DarkLight1337 <[email protected]>

* [bugfix] interleaving sliding window for cohere2 model (vllm-project#11583)

Signed-off-by: youkaichao <[email protected]>

* [V1] [5/N] API Server: unify `Detokenizer` and  `EngineCore` input (vllm-project#11545)

Signed-off-by: [email protected] <[email protected]>

* [Doc] Convert list tables to MyST (vllm-project#11594)

Signed-off-by: DarkLight1337 <[email protected]>

* [v1][bugfix] fix cudagraph with inplace buffer assignment (vllm-project#11596)

Signed-off-by: youkaichao <[email protected]>

* [Misc] KV cache transfer connector registry (vllm-project#11481)

Signed-off-by: KuntaiDu <[email protected]>

* Remove print statement in DeepseekScalingRotaryEmbedding (vllm-project#11604)

* [v1] fix compilation cache (vllm-project#11598)

Signed-off-by: youkaichao <[email protected]>

* [Docker] bump up neuron sdk v2.21 (vllm-project#11593)

Signed-off-by: Liangfu Chen <[email protected]>

* [Build][Kernel] Update CUTLASS to v3.6.0 (vllm-project#11607)

Signed-off-by: Tyler Michael Smith <[email protected]>

* [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (vllm-project#11618)

Signed-off-by: jiang1.li <[email protected]>

* [platforms] enable platform plugins (vllm-project#11602)

Signed-off-by: youkaichao <[email protected]>

* [VLM] Abstract out multi-modal data parsing in merged processor (vllm-project#11620)

Signed-off-by: DarkLight1337 <[email protected]>

* [V1] [6/N] API Server: Better Shutdown (vllm-project#11586)

* [Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel (vllm-project#11631)

* [benchmark] Remove dependency for H100 benchmark step (vllm-project#11572)

* [Model][LoRA]LoRA support added for MolmoForCausalLM (vllm-project#11439)

Signed-off-by: Matthias Vogler <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Matthias Vogler <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [Bugfix] Fix OpenAI parallel sampling when using xgrammar (vllm-project#11637)

Signed-off-by: mgoin <[email protected]>

* [Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) (vllm-project#6909)

Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. (vllm-project#11565)

* [V1] Simpify vision block hash for prefix caching by removing offset from hash (vllm-project#11646)

* [V1][VLM] V1 support for selected single-image models. (vllm-project#11632)

Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [Benchmark] Add benchmark script for CPU offloading  (vllm-project#11533)

Signed-off-by: ApostaC <[email protected]>
Co-authored-by: KuntaiDu <[email protected]>

* [Bugfix][Refactor] Unify model management in frontend (vllm-project#11660)

Signed-off-by: Joe Runde <[email protected]>

* [VLM] Add max-count checking in data parser for single image models (vllm-project#11661)

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>

* [Misc] Optimize Qwen2-VL LoRA test (vllm-project#11663)

Signed-off-by: Jee Jee Li <[email protected]>

* [Misc] Replace space with - in the file names (vllm-project#11667)

Signed-off-by: Lu Fang <[email protected]>

* [Doc] Fix typo (vllm-project#11666)

Signed-off-by: Kazuhiro Serizawa <[email protected]>

* [V1] Implement Cascade Attention (vllm-project#11635)

Signed-off-by: Woosuk Kwon <[email protected]>

* [VLM] Move supported limits and max tokens to merged multi-modal processor (vllm-project#11669)

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [VLM][Bugfix] Multi-modal processor compatible with V1 multi-input (vllm-project#11674)

Signed-off-by: DarkLight1337 <[email protected]>

* [mypy] Pass type checking in vllm/inputs (vllm-project#11680)

Signed-off-by: Tobias Pitters <[email protected]>

* [VLM] Merged multi-modal processor for LLaVA-NeXT (vllm-project#11682)

Signed-off-by: DarkLight1337 <[email protected]>

* According to vllm.EngineArgs, the name should be distributed_executor_backend (vllm-project#11689)

* [Bugfix] Free cross attention block table for preempted-for-recompute sequence group. (vllm-project#10013)

Signed-off-by: Kathy Yu <[email protected]>

* [V1][Minor] Optimize token_ids_cpu copy (vllm-project#11692)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Bugfix] Change kv scaling factor by param json on nvidia gpu (vllm-project#11688)

Signed-off-by: bjmsong <[email protected]>
Co-authored-by: bjmsong <[email protected]>

* Resolve race conditions in Marlin kernel (vllm-project#11493)

Signed-off-by: wchen61 <[email protected]>

* [Misc] Minimum requirements for SageMaker compatibility (vllm-project#11576)

* Update default max_num_batch_tokens for chunked prefill (vllm-project#11694)

* [Bugfix] Check chain_speculative_sampling before calling it (vllm-project#11673)

Signed-off-by: Lu Fang <[email protected]>

* [perf-benchmark] Fix dependency for steps in benchmark pipeline (vllm-project#11710)

* [Model] Whisper model implementation (vllm-project#11280)

Co-authored-by: Aurick Qiao <[email protected]>

* [V1] Simplify Shutdown (vllm-project#11659)

* [Bugfix] Fix ColumnParallelLinearWithLoRA slice (vllm-project#11708)

Signed-off-by: ZincCat <[email protected]>

* [V1] Improve TP>1 Error Handling + Stack Trace (vllm-project#11721)

Co-authored-by: Tyler Michael Smith <[email protected]>

* [Misc]Add BNB quantization for Qwen2VL (vllm-project#11719)

Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* Update requirements-tpu.txt to support python 3.9 and 3.11 (vllm-project#11695)

Signed-off-by: mgoin <[email protected]>

* [V1] Chore: cruft removal (vllm-project#11724)

* [V1] log GPU blocks num for MultiprocExecutor (vllm-project#11656)

* Update tool_calling.md (vllm-project#11701)

* Update bnb.md with example for OpenAI (vllm-project#11718)

* [V1] Add `RayExecutor` support for `AsyncLLM` (api server) (vllm-project#11712)

* [V1] Add kv cache utils tests. (vllm-project#11513)

Signed-off-by: xcnick <[email protected]>

* [Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture (vllm-project#11233)

Signed-off-by: Yan Burman <[email protected]>
Signed-off-by: Ido Asraff <[email protected]>

* [VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision (vllm-project#11717)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix precision error in LLaVA-NeXT (vllm-project#11735)

Signed-off-by: DarkLight1337 <[email protected]>

* [Model] Remove unnecessary weight initialization logic (vllm-project#11736)

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [Bugfix][V1] Fix test_kv_cache_utils.py (vllm-project#11738)

Signed-off-by: Jee Jee Li <[email protected]>

* [MISC] Replace c10::optional with std::optional (vllm-project#11730)

Signed-off-by: Lu Fang <[email protected]>

* [distributed] remove pynccl's redundant stream (vllm-project#11744)

* fix: [doc] fix typo (vllm-project#11751)

Co-authored-by: Lancer <[email protected]>

* [Frontend] Improve `StreamingResponse` Exception Handling (vllm-project#11752)

* [distributed] remove pynccl's redundant change_state (vllm-project#11749)

* [Doc] [1/N] Reorganize Getting Started section (vllm-project#11645)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Remove block size constraint (vllm-project#11723)

* [V1] Add BlockTable class (vllm-project#11693)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Misc] Fix typo for valid_tool_parses  (vllm-project#11753)

Signed-off-by: Rui Qiao <[email protected]>

* [V1] Refactor get_executor_cls (vllm-project#11754)

* [mypy] Forward pass function type hints in lora (vllm-project#11740)

Signed-off-by: lucast2021 <[email protected]>
Co-authored-by: lucast2021 <[email protected]>

* k8s-config: Update the secret to use stringData (vllm-project#11679)

Signed-off-by: Suraj Deshmukh <[email protected]>

* [VLM] Separate out profiling-related logic (vllm-project#11746)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc][2/N] Reorganize Models and Usage sections (vllm-project#11755)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix max image size for LLaVA-Onevision (vllm-project#11769)

Signed-off-by: Roger Wang <[email protected]>

* [doc] explain how to add interleaving sliding window support (vllm-project#11771)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix][V1] Fix molmo text-only inputs (vllm-project#11676)

Signed-off-by: Jee Jee Li <[email protected]>

* [Kernel] Move attn_type to Attention.__init__() (vllm-project#11690)

Signed-off-by: Chen Zhang <[email protected]>

* format

* [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (vllm-project#11685)

Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>

* deepseek overflow fix (#349)

* [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (vllm-project#11772)

Signed-off-by: DarkLight1337 <[email protected]>

* [Model] Future-proof Qwen2-Audio multi-modal processor (vllm-project#11776)

Signed-off-by: DarkLight1337 <[email protected]>

* [XPU] Make pp group initilized for pipeline-parallelism (vllm-project#11648)

Signed-off-by: yisheng <[email protected]>

* [Doc][3/N] Reorganize Serving section (vllm-project#11766)

Signed-off-by: DarkLight1337 <[email protected]>

* [Kernel][LoRA]Punica prefill  kernels fusion (vllm-project#11234)

Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Abatom <[email protected]>
Co-authored-by: Zhonghua Deng <[email protected]>

* [Bugfix] Update attention interface in `Whisper` (vllm-project#11784)

Signed-off-by: Roger Wang <[email protected]>

* [CI] Fix neuron CI and run offline tests (vllm-project#11779)

Signed-off-by: Liangfu Chen <[email protected]>

* fix init error for MessageQueue when n_local_reader is zero (vllm-project#11768)

* [Doc] Create a vulnerability management team (vllm-project#9925)

Signed-off-by: Russell Bryant <[email protected]>

* [CI][CPU] adding build number to docker image name (vllm-project#11788)

Signed-off-by: Yuan Zhou <[email protected]>

* [V1][Doc] Update V1 support for `LLaVa-NeXT-Video` (vllm-project#11798)

Signed-off-by: Roger Wang <[email protected]>

* [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation (vllm-project#11800)

Signed-off-by: DarkLight1337 <[email protected]>

* [doc] add doc to explain how to use uv (vllm-project#11773)

Signed-off-by: youkaichao <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [V1] Support audio language models on V1 (vllm-project#11733)

Signed-off-by: Roger Wang <[email protected]>

* [doc] update how pip can install nightly wheels (vllm-project#11806)

Signed-off-by: youkaichao <[email protected]>

* [Doc] Add note to `gte-Qwen2` models (vllm-project#11808)

Signed-off-by: DarkLight1337 <[email protected]>

* [optimization] remove python function call for custom op (vllm-project#11750)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix] update the prefix for qwen2 (vllm-project#11795)

Co-authored-by: jiadi.jjd <[email protected]>

* [Doc]Add documentation for using EAGLE in vLLM (vllm-project#11417)

Signed-off-by: Sourashis Roy <[email protected]>

* [Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 (vllm-project#11794)

* [Doc] Group examples into categories (vllm-project#11782)

Signed-off-by: Harry Mellor <[email protected]>

* [Bugfix] Fix image input for Pixtral-HF (vllm-project#11741)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] sort torch profiler table by kernel timing (vllm-project#11813)

* Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… (vllm-project#11824)

* Fixed docker build for ppc64le (vllm-project#11518)

Signed-off-by: Nishidha Panpaliya <[email protected]>

* [OpenVINO] Fixed Docker.openvino build (vllm-project#11732)

Signed-off-by: Ilya Lavrenov <[email protected]>

* [Bugfix] Add checks for LoRA and CPU offload (vllm-project#11810)

Signed-off-by: Jee Jee Li <[email protected]>

* [Docs] reorganize sponsorship page (vllm-project#11639)

Signed-off-by: simon-mo <[email protected]>

* [Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used (vllm-project#11825)

Signed-off-by: DarkLight1337 <[email protected]>

* [misc] improve memory profiling (vllm-project#11809)

Signed-off-by: youkaichao <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [doc] update wheels url (vllm-project#11830)

Signed-off-by: youkaichao <[email protected]>

* [Docs] Update sponsor name: 'Novita' to 'Novita AI' (vllm-project#11833)

* [Hardware][Apple] Native support for macOS Apple Silicon (vllm-project#11696)

Signed-off-by: Wallas Santos <[email protected]>
Co-authored-by: Michael Goin <[email protected]>

* [torch.compile] consider relevant code in compilation cache (vllm-project#11614)

Signed-off-by: youkaichao <[email protected]>

* [VLM] Reorganize profiling/processing-related code (vllm-project#11812)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Move examples into categories (vllm-project#11840)

Signed-off-by: Harry Mellor <[email protected]>

* [Doc][4/N] Reorganize API Reference (vllm-project#11843)

Signed-off-by: DarkLight1337 <[email protected]>

* [CI/Build][Bugfix] Fix CPU CI image clean up (vllm-project#11836)

Signed-off-by: jiang1.li <[email protected]>

* [Bugfix][XPU] fix silu_and_mul (vllm-project#11823)

Signed-off-by: yan ma <[email protected]>

* [Misc] Move some model utils into vision file (vllm-project#11848)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Expand Multimodal API Reference (vllm-project#11852)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc]add some explanations for BlockHashType (vllm-project#11847)

* [TPU][Quantization] TPU `W8A8` (vllm-project#11785)

Co-authored-by: Woosuk Kwon <[email protected]>

* [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (vllm-project#11698)

Signed-off-by: Randall Smith <[email protected]>

* [Docs] Add Google Cloud Meetup (vllm-project#11864)

* Revert nccl changes (#351)

* Revert "[distributed] remove pynccl's redundant change_state (vllm-project#11749)"

This reverts commit 9e764e7.

* Revert "[distributed] remove pynccl's redundant stream (vllm-project#11744)"

This reverts commit 635b897.

* [CI] Turn on basic correctness tests for V1 (vllm-project#10864)

* treat do_lower_case in the same way as the sentence-transformers library (vllm-project#11815)

Signed-off-by: Max de Bayser <[email protected]>

* [Doc] Recommend uv and python 3.12 for quickstart guide (vllm-project#11849)

Signed-off-by: mgoin <[email protected]>

* [Misc] Move `print_*_once` from utils to logger (vllm-project#11298)

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
Co-authored-by: Maxime Fournioux <[email protected]>

* [Doc] Intended links Python multiprocessing library (vllm-project#11878)

* [perf]fix current stream (vllm-project#11870)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix] Override dunder methods of placeholder modules (vllm-project#11882)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] fix beam search input errors and latency benchmark script (vllm-project#11875)

Signed-off-by: Ye Qi <[email protected]>
Co-authored-by: yeq <[email protected]>

* [Doc] Add model development API Reference (vllm-project#11884)

Signed-off-by: DarkLight1337 <[email protected]>

* [platform] Allow platform specify attention backend (vllm-project#11609)

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>

* [ci]try to fix flaky multi-step tests (vllm-project#11894)

Signed-off-by: youkaichao <[email protected]>

* [Misc] Provide correct Pixtral-HF chat template (vllm-project#11891)

Signed-off-by: DarkLight1337 <[email protected]>

* fp8 support (#352)

Co-authored-by: Yida Wu <[email protected]>

* [Docs] Add Modal to deployment frameworks (vllm-project#11907)

* [Doc][5/N] Move Community and API Reference to the bottom (vllm-project#11896)

Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: Simon Mo <[email protected]>

* [VLM] Enable tokenized inputs for merged multi-modal processor (vllm-project#11900)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Show default pooling method in a table (vllm-project#11904)

Signed-off-by: DarkLight1337 <[email protected]>

* [torch.compile] Hide KV cache behind torch.compile boundary (vllm-project#11677)

Signed-off-by: Chen Zhang <[email protected]>

* [Bugfix] Validate lora adapters to avoid crashing server (vllm-project#11727)

Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [BUGFIX] Fix `UnspecifiedPlatform` package name (vllm-project#11916)

Signed-off-by: Kunshang Ji <[email protected]>

* [ci] fix gh200 tests (vllm-project#11919)

Signed-off-by: youkaichao <[email protected]>

* [misc] remove python function call for custom activation op (vllm-project#11885)

Co-authored-by: youkaichao <[email protected]>

* [platform] support pytorch custom op pluggable (vllm-project#11328)

Signed-off-by: wangxiyuan <[email protected]>

* Replace "online inference" with "online serving" (vllm-project#11923)

Signed-off-by: Harry Mellor <[email protected]>

* [ci] Fix sampler tests (vllm-project#11922)

Signed-off-by: youkaichao <[email protected]>

* [Doc] [1/N] Initial guide for merged multi-modal processor (vllm-project#11925)

Signed-off-by: DarkLight1337 <[email protected]>

* [platform] support custom torch.compile backend key (vllm-project#11318)

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Co-authored-by: youkaichao <[email protected]>

* [Doc] Rename offline inference examples (vllm-project#11927)

Signed-off-by: Harry Mellor <[email protected]>

* [Docs] Fix docstring in `get_ip` function (vllm-project#11932)

Signed-off-by: Kuntai Du <[email protected]>

* Doc fix in `benchmark_long_document_qa_throughput.py` (vllm-project#11933)

Signed-off-by: Kuntai Du <[email protected]>

* [Hardware][CPU] Support MOE models on x86 CPU (vllm-project#11831)

Signed-off-by: jiang1.li <[email protected]>

* [Misc] Clean up debug code in Deepseek-V3 (vllm-project#11930)

Signed-off-by: Isotr0py <[email protected]>

* [Misc] Update benchmark_prefix_caching.py fixed example usage (vllm-project#11920)

Signed-off-by: Ren MinMin <[email protected]>
Co-authored-by: Ren MinMin <[email protected]>

* [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (vllm-project#11939)

Signed-off-by: Travis Johnson <[email protected]>

* [mypy] Fix mypy warnings in api_server.py (vllm-project#11941)

Signed-off-by: Fred Reiss <[email protected]>

* [ci] fix broken distributed-tests-4-gpus (vllm-project#11937)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design (vllm-project#11672)

Signed-off-by: Sungjae Lee <[email protected]>

* [Bugfix] fused_experts_impl wrong compute type for float32 (vllm-project#11921)

Signed-off-by: shaochangxu.scx <[email protected]>
Co-authored-by: shaochangxu.scx <[email protected]>

* [CI/Build] Move model-specific multi-modal processing tests (vllm-project#11934)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Basic guide for writing unit tests for new models (vllm-project#11951)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix RobertaModel loading (vllm-project#11940)

Signed-off-by: NickLucche <[email protected]>

* [Model] Add cogagent model support vLLM (vllm-project#11742)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [V1] Avoid sending text prompt to core engine (vllm-project#11963)

Signed-off-by: Roger Wang <[email protected]>

* [CI/Build] Add markdown linter (vllm-project#11857)

Signed-off-by: Rafael Vasquez <[email protected]>

* [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100)

Signed-off-by: Akshat Tripathi <[email protected]>
Signed-off-by: Oleg Mosalov <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Oleg Mosalov <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764)

* [V1][Core][1/n] Logging and Metrics (vllm-project#11962)

Signed-off-by: [email protected] <[email protected]>

* [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973)

Signed-off-by: [email protected] <[email protected]>

* [MISC] fix typo in kv transfer send recv test (vllm-project#11983)

* [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979)

* [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972)

Signed-off-by: Sungjae Lee <[email protected]>

* [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947)

Signed-off-by: Yida Wu <[email protected]>

* [Misc]Minor Changes about Worker (vllm-project#11555)

Signed-off-by: Chenguang Li <[email protected]>

* [platform] add ray_device_key (vllm-project#11948)

Signed-off-by: youkaichao <[email protected]>

* Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980)

Signed-off-by: Alex-Brooks <[email protected]>

* [Kernel] unified_attention for Attention.forward (vllm-project#11967)

Signed-off-by: Chen Zhang <[email protected]>

* [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Doc] Organise installation documentation into categories and tabs (vllm-project#11935)

Signed-off-by: Harry Mellor <[email protected]>

* [platform] add device_control env var (vllm-project#12009)

Signed-off-by: youkaichao <[email protected]>

* [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516)

Signed-off-by: Shanshan Shen <[email protected]>

* bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982)

Signed-off-by: elijah <[email protected]>

* Using list

* Revert "[misc] improve memory profiling (vllm-project#11809)"

This reverts commit 889e662.

* Multi-lingual P3L (#356)

* Commiting the *multilingual* P3L test.

* Created a *multi-lingual* P3L test.

* Making ruff happy.

* .

* Added a reference to the language-scripture Confluence table.

* Typo fixing.

* Harmonizing naming.

* Fixing comments in the header.

---------

Co-authored-by: Alexei V. Ivanov <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>

* Trying to make scales work with compileable attention

* Docs lint

* linter formatting bug fixes

* inherit config file updates under fused_moe from main branch.

* match tests for the MOE layers with main.

---------

Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Yuan Tang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Jiaxin Shan <[email protected]>
Signed-off-by: lucast2021 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Sourashis Roy <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Signed-off-by: Alex He <[email protected]>
Signed-off-by: ccjincong <[email protected]>
Signed-off-by: Erez Schwartz <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: rajveerb <[email protected]>
Signed-off-by: hjwei <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: Liangfu Chen <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: Matthias Vogler <[email protected]>
Signed-off-by: ApostaC <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Kazuhiro Serizawa <[email protected]>
Signed-off-by: Tobias Pitters <[email protected]>
Signed-off-by: Kathy Yu <[email protected]>
Signed-off-by: bjmsong <[email protected]>
Signed-off-by: wchen61 <[email protected]>
Signed-off-by: ZincCat <[email protected]>
Signed-off-by: xcnick <[email protected]>
Signed-off-by: Yan Burman <[email protected]>
Signed-off-by: Ido Asraff <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Suraj Deshmukh <[email protected]>
Signed-off-by: yisheng <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Yuan Zhou <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Ilya Lavrenov <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: yan ma <[email protected]>
Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
Signed-off-by: Ye Qi <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Kuntai Du <[email protected]>
Signed-off-by: Ren MinMin <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Fred Reiss <[email protected]>
Signed-off-by: Sungjae Lee <[email protected]>
Signed-off-by: shaochangxu.scx <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Akshat Tripathi <[email protected]>
Signed-off-by: Oleg Mosalov <[email protected]>
Signed-off-by: Yida Wu <[email protected]>
Signed-off-by: Chenguang Li <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
Signed-off-by: elijah <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Rui Qiao <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Lucas Tucker <[email protected]>
Co-authored-by: lucast2021 <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: simon-mo <[email protected]>
Co-authored-by: simon-mo <[email protected]>
Co-authored-by: HandH1998 <[email protected]>
Co-authored-by: robertgshaw2-neuralmagic <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>
Co-authored-by: AlexHe99 <[email protected]>
Co-authored-by: Chen1022 <[email protected]>
Co-authored-by: ErezSC42 <[email protected]>
Co-authored-by: Selali <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Rajveer Bachkaniwala <[email protected]>
Co-authored-by: hj-wei <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Liangfu Chen <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: whyiug <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: Matthias Vogler <[email protected]>
Co-authored-by: Matthias Vogler <[email protected]>
Co-authored-by: John Giorgi <[email protected]>
Co-authored-by: sakunkun <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Yihua Cheng <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Kazuhiro Serizawa <[email protected]>
Co-authored-by: Tobias Pitters <[email protected]>
Co-authored-by: Chunyang Wen <[email protected]>
Co-authored-by: Kathy Yu <[email protected]>
Co-authored-by: bjmsong <[email protected]>
Co-authored-by: bjmsong <[email protected]>
Co-authored-by: wchen61 <[email protected]>
Co-authored-by: Nathan Azrak <[email protected]>
Co-authored-by: Sachin Varghese <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: ZincCat <[email protected]>
Co-authored-by: WangErXiao <[email protected]>
Co-authored-by: Hust_YangXian <[email protected]>
Co-authored-by: Alberto Ferrer <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: xcnick <[email protected]>
Co-authored-by: Yan Burman <[email protected]>
Co-authored-by: cennn <[email protected]>
Co-authored-by: Lancer <[email protected]>
Co-authored-by: Lancer <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Suraj Deshmukh <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Concurrensee <[email protected]>
Co-authored-by: YiSheng5 <[email protected]>
Co-authored-by: Zhonghua Deng <[email protected]>
Co-authored-by: XiaobingZhang <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: jiangjiadi <[email protected]>
Co-authored-by: jiadi.jjd <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Jie Fu (傅杰) <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Nishidha <[email protected]>
Co-authored-by: Ilya Lavrenov <[email protected]>
Co-authored-by: Wallas Henrique <[email protected]>
Co-authored-by: Yan Ma <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: Maxime Fournioux <[email protected]>
Co-authored-by: Guspan Tanadi <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: yeq <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Co-authored-by: Yida Wu <[email protected]>
Co-authored-by: Charles Frye <[email protected]>
Co-authored-by: minmin <[email protected]>
Co-authored-by: Ren MinMin <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Fred Reiss <[email protected]>
Co-authored-by: Sungjae Lee <[email protected]>
Co-authored-by: shaochangxu <[email protected]>
Co-authored-by: shaochangxu.scx <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: sixgod <[email protected]>
Co-authored-by: Rafael Vasquez <[email protected]>
Co-authored-by: Akshat Tripathi <[email protected]>
Co-authored-by: Oleg Mosalov <[email protected]>
Co-authored-by: Avshalom Manevich <[email protected]>
Co-authored-by: Yangcheng Li <[email protected]>
Co-authored-by: Siyuan Li <[email protected]>
Co-authored-by: Chenguang Li <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Shanshan Shen <[email protected]>
Co-authored-by: elijah <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: Alexei V. Ivanov <[email protected]>
Co-authored-by: vllmellm <[email protected]>
rasmith pushed a commit to rasmith/vllm that referenced this pull request Jan 30, 2025
Isotr0py pushed a commit to Isotr0py/vllm that referenced this pull request Feb 2, 2025
Co-authored-by: Woosuk Kwon <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
mzusman pushed a commit to mzusman/vllm that referenced this pull request Mar 12, 2025
Labels: ci/build; documentation (Improvements or additions to documentation); ready (ONLY add when PR is ready to merge/full CI is needed); tpu (Related to Google TPUs)
4 participants