We summarize the issues we have received and our planned features in this issue, which will be kept up to date.
Latest issue tracked: #677
Software Quality
- Code formatter Add code formatting script & Add CI to check code format #57
- Tests for model correctness Add tests for models #101
- Tests for samplers Add tests for sampler #108
- Pypi CD Add CD to PyPI #97
- CI
Installation
- CUDA version Build failure due to CUDA version mismatch #129
- Pre-built CUDA Wheels Publish wheels with pre-built CUDA binaries #139 Request for creation of a wheel for vllm #695
- Support ROCM Installing with ROCM #621
- Windows/WSL installation Bug: Windows installation #179 WSL Ubuntu installation issue #192
- H100 Add support for H100 #199 RuntimeError: attn_bias is not correctly aligned #407
- Support CUDA 12 cuda 12 #385
- Dockerfile feature request: Dockerfile #390
- All other issues with the `Installation` label
Documentation
- Documentation CD
- Documentation on LLMEngine and AsyncLLMEngine
- Documentation on user interfaces and the APIs How to set ParallelConfig and SchedulerConfig? #361 Where is the API reference? #395
- Documentation on distributed execution Documentation on distributed execution #206 When can I support multi graphics cards? #228 model parallelism #243 Multi-GPU inference and Specify which GPUs to be used during inference #250 How to use multiple GPUs? #581
- More detailed guide on adding a new model (possibly with simplification in code), especially how to modify the `forward` function. How integrate with hf with minial modification? #242
- Include latency benchmark results.
- Documentation on memory usage. Question regarding the nearly double GPU memory consumption. #241 GPU consumption #550
- How to specify which GPU to use How to specify which gpu to use? #691 os.environ['CUDA_VISIBLE_DEVICES'] = '1' does not work in jupyter #571
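As context for the GPU-selection questions above, the workaround suggested in those threads is to restrict visible devices before anything initializes CUDA. A minimal sketch, assuming a single-GPU run and using a placeholder model name:

```python
import os

# Must be set before torch/vLLM initialize CUDA; setting it after imports
# (e.g. mid-way through a Jupyter session) is why it appears not to work.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only GPU 1 to this process

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```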
New Models
Decoder-only models
- BLOOM Support BLOOM #61
- Falcon Support for Falcon-7B / 40B models #195 Any plans to support Falcon? #197 Anyone adapting falcon 40B&7B models now? #356
- GPT-J Add support for GPTJ #198
- MPT Support for MPT-7B and MPT-30B #218 feature request: support mpt-30b #332
- LongChat Support for longchat-7b-16k #358
- Baichuan-7B why not support baichuan-7b? #303 baichuan-7b return value of apiserver is garbled #400 Support for baichuan models #428
- Baichuan-13B
- LLaMA-2 Support LLaMA-2 #501
Encoder-decoder models
- Whisper Whisper support #180
- T5 Adding support for encoder-decoder models, like T5 or BART #187 Support for fastchat-t5-3b-v1.0 #223 T5 model support #404 Finetuned Flan-T5 #434 T5 like encoder-decoder model support #668
- BART Adding support for encoder-decoder models, like T5 or BART #187
- GLM Support for chatglm-6b #231 when to support chatglm2-6b? #247
Other techniques:
- Quantized models: see Kernels/Quantized PagedAttention
- LoRA: Would it be possible to support LoRA fine-tuned models? #182
- Multi-modal models: [Question] Usage with Multimodal LLM #307
Frontend Features
vLLM demo frontends:
- List of inputs as OpenAI input Langchain passes `prompt` as a `list` instead of `str` #186 Possibility of Passing Prompts as List[str] to AsyncEngine.generate() #279 (see the sketch after this list)
- Echo Implementing Echo in OpenAI endpoint #201
- Support `ChatCompletion` Endpoint Support `ChatCompletion` Endpoint in OpenAI demo server #311
- Use soft embeddings as input does vicuna support embedding input? #369
- Support `logit_bias` [Feature] Add support for `logit_bias` #379 I want use the function prefix_allowed_tokens_fn of huggingface model.generate(), where of vllm's source code shall I modify? #415
- User-defined conversation template feature request: Support user-defined conversation template #408
- Specify GPU to run on How to specify which GPU the model inference on? #352 Specify GPUs bug (torch.distributed.all_reduce(torch.zeros(1).cuda())) #470
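For reference, the offline `LLM.generate` interface already accepts a batch of prompts as `List[str]`; the list-of-inputs item above tracks exposing the same behaviour through the OpenAI-compatible frontend. A minimal sketch with a placeholder model:

```python
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "The largest planet in the solar system is",
]
params = SamplingParams(temperature=0.0, max_tokens=32)  # greedy decoding

llm = LLM(model="facebook/opt-125m")  # placeholder model
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```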
Integration with other frontends:
- FastChat (merged)
- Ray Serve (merged)
- NVIDIA Triton NVIDIA Triton support #541
- SkyPilot
- LangChain (Support from LangChain) LangChain and LlamaIndex support #233 Langchain passes `prompt` as a `list` instead of `str` #186 Langchain/LLAMA_INDEX #553
Engine Optimization and New Features
- Streamline the process of adding a new model Support custom models #112 Require a "Wrapper" feature #258 Best effort support for all Hugging Face transformers models #616
- User-specified tokenizer Support custom tokenizer #111 Why vllm does not support Chinese input #246 How to mannually Set use_fast for tokenizer to False? #259 The hf-internal-testing/llama-tokenizer do not support Chinese prompt #270 garbage output from h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-13b #281
- Implement models in C++ to reduce Python overhead Modify the current PyTorch model to C++ #42 Tensor Parallelism vs Data Parallelism #367
- Pipeline parallel support pipeline parallel support in the future? #387
- Prefix sharing support Question about efficient memory sharing (prefix sharing) #227
- Classifier-Free Guidance Is there a way to add classifier free guidance (CFG) to vllm while maintaining super fast inference? #620
- Speculative decoding Scope for assisted generation? #439
- Distributed inference with other frameworks Remove Ray for the dependency #208 question: Is it possible to avoid ray in single machine multiple GPUs serving? #391 Support Kuberenetes for Distributed Serving #457 (current multi-GPU usage is sketched after this list)
- Better model loading Faster model loading #474 Increase code robustness #519 Llama2 answers is noise #615
- More flexible stop criteria Support custom stop function? #551
- Random Python overheads Consider optimizing the API server #580
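For the pipeline-parallel and Ray-related items above: the current multi-GPU path is tensor parallelism via the `tensor_parallel_size` argument, which relies on Ray for worker management. A minimal sketch, assuming a single host with two visible GPUs and a placeholder model:

```python
from vllm import LLM, SamplingParams

# Shards the model across 2 GPUs on this host; with tensor_parallel_size > 1
# the engine currently spins up Ray workers under the hood.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # placeholder model
    tensor_parallel_size=2,
)
outputs = llm.generate(
    ["Explain paged attention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```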
Kernels
- Multi-query attention How does this compare to MQA (multi-query attention)? #169
- PagedAttention kernel with multiple query positions. Fix the rushed out multi-query kernel #44
- Quantized PagedAttention GPTQ / Quantization support? #174 What is the correct way to use quantized versions of vicuna or guanco? #210 8-bit quantization support #214 Not able to used qlora models with vllm #252 8bit support #295 support for quantized models? #316 Loading quantized models #392
- Sampling kernels Implement custom kernels for top-k and top-p sampling #125 Question about sampler. It takes too much time #249 (a usage sketch follows this list)
- Condensed RotaryEmbeddings Support for Condensed RotaryEmbeddings #333 supporting superhot models? #388 RoPE scaling support? #464 Request: NTK rope support #479 Does vllm support vicuna-13b-v1.5-16k ? #674 Add AliBi context scaling into vllm for Baichuan13B #686
- Flash Attention V2 Flash Attention V2 #485
- FP8 Kernel TE FP8 support? #448
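The sampling-kernel item above concerns the GPU implementation behind the existing top-k / top-p options, which are already exposed per request through `SamplingParams`. A usage sketch with a placeholder model:

```python
from vllm import LLM, SamplingParams

params = SamplingParams(
    temperature=0.8,
    top_p=0.95,    # nucleus sampling: keep tokens covering 95% cumulative probability
    top_k=50,      # additionally restrict to the 50 most likely tokens
    max_tokens=64,
)

llm = LLM(model="facebook/opt-125m")  # placeholder model
print(llm.generate(["Once upon a time"], params)[0].outputs[0].text)
```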
Bugs
- Floating point comparison Dangerous floating point comparison #71
- Check input length Check whether the input request is too long #113 Prompt size limits? It keeps hanging with prompts longer than 120 tokens #276 Long context will cause the vLLM stop #286 scheduler max-length #447 (a client-side length check is sketched after this list)
- Do not init process groups when using a single GPU Do not initialize process group when using a single GPU #117 How to initialize two LLMs in one service? #565 Running two different models on the same machine #654
- Ray tensor parallel bugs ray OOM in tensor parallel #322 Stuck while inferring with WizardCoder model #366 [MPT-30B] OutOfMemoryError: CUDA out of memory #372 Cuda failure 'peer access is not supported between these two devices' #406
- Performance comparison with TGI TGI performance is better than vllm on A800 #262 higher latency than TGI #335 Outdated benchmarks #381
- All other issues with the `Bug` label
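For the input-length item above, a hypothetical client-side guard (not part of vLLM) can reject over-long requests before they reach the server by counting tokens with the served model's Hugging Face tokenizer; the model name and context length are placeholders:

```python
from transformers import AutoTokenizer

MODEL = "facebook/opt-125m"  # placeholder model
MAX_MODEL_LEN = 2048         # placeholder context window for that model

tokenizer = AutoTokenizer.from_pretrained(MODEL)

def check_prompt(prompt: str, max_new_tokens: int) -> None:
    # Reject requests whose prompt plus requested completion cannot fit
    # into the model's context window.
    n_prompt = len(tokenizer(prompt).input_ids)
    if n_prompt + max_new_tokens > MAX_MODEL_LEN:
        raise ValueError(
            f"Request needs {n_prompt + max_new_tokens} tokens, "
            f"but the context window is {MAX_MODEL_LEN}."
        )

check_prompt("Hello, world!", max_new_tokens=128)
```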