We summarize the issues we have received and our planned features in this issue, which will be kept up to date.
Latest issue tracked: #677
Software Quality
- Code formatter Add code formatting script & Add CI to check code format #57
- Tests for model correctness Add tests for models #101
- Tests for samplers Add tests for sampler #108
- Pypi CD Add CD to PyPI #97
- CI
Installation
- CUDA version Build failure due to CUDA version mismatch #129
- Pre-built CUDA Wheels Publish wheels with pre-built CUDA binaries #139 Request for creation of a wheel for vllm #695
- Support ROCM Installing with ROCM #621
- Windows/WSL installation Bug: Windows installation #179 WSL Ubuntu installation issue #192
- H100 Add support for H100 #199 RuntimeError: attn_bias is not correctly aligned #407
- Support CUDA 12 cuda 12 #385
- Dockerfile feature request: Dockerfile #390
- All other issues with the `Installation` label
Documentation
- Documentation CD
- Documentation on LLMEngine and AsyncLLMEngine
- Documentation on user interfaces and the APIs How to set ParallelConfig and SchedulerConfig? #361 Where is the API reference? #395
- Documentation on distributed execution Documentation on distributed execution #206 When can I support multi graphics cards? #228 model parallelism #243 Multi-GPU inference and Specify which GPUs to be used during inference #250 How to use multiple GPUs? #581
- More detailed guide on adding a new model (possibly with simplification in code), especially how to modify the `forward` function. How integrate with hf with minial modification? #242
- Include latency benchmark results.
- Documentation on memory usage. Question regarding the nearly double GPU memory consumption. #241 GPU consumption #550
- How to specify which GPU to use How to specify which gpu to use? #691 os.environ['CUDA_VISIBLE_DEVICES'] = '1' does not work in jupyter #571
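As context for the GPU-selection questions above, the workaround suggested in those threads is to restrict visible devices before anything initializes CUDA. A minimal sketch, assuming a single-GPU run and using a placeholder model name:

```python
import os

# Must be set before torch/vLLM initialize CUDA; setting it after imports
# (e.g. mid-way through a Jupyter session) is why it appears not to work.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only GPU 1 to this process

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```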
New Models
Decoder-only models
- BLOOM Support BLOOM #61
- Falcon Support for Falcon-7B / 40B models #195 Any plans to support Falcon? #197 Anyone adapting falcon 40B&7B models now? #356
- GPT-J Add support for GPTJ #198
- MPT Support for MPT-7B and MPT-30B #218 feature request: support mpt-30b #332
- LongChat Support for longchat-7b-16k #358
- Baichuan-7B why not support baichuan-7b? #303 baichuan-7b return value of apiserver is garbled #400 Support for baichuan models #428
- Baichuan-13B
- LLaMA-2 Support LLaMA-2 #501
Encoder-decoder models
- Whisper Whisper support #180
- T5 Adding support for encoder-decoder models, like T5 or BART #187 Support for fastchat-t5-3b-v1.0 #223 T5 model support #404 Finetuned Flan-T5 #434 T5 like encoder-decoder model support #668
- BART Adding support for encoder-decoder models, like T5 or BART #187
- GLM Support for chatglm-6b #231 when to support chatglm2-6b? #247
Other techniques:
- Quantized models: see Kernels/Quantized PagedAttention
- LoRA: Would it be possible to support LoRA fine-tuned models? #182
- Multi-modal models: [Question] Usage with Multimodal LLM #307
Frontend Features
vLLM demo frontends:
- List of inputs as OpenAI input Langchain passes `prompt` as a `list` instead of `str` #186 Possibility of Passing Prompts as List[str] to AsyncEngine.generate() #279 (see the sketch after this list)
- Echo Implementing Echo in OpenAI endpoint #201
- Support `ChatCompletion` Endpoint Support `ChatCompletion` Endpoint in OpenAI demo server #311
- Use soft embeddings as input does vicuna support embedding input? #369
- Support `logit_bias` [Feature] Add support for `logit_bias` #379 I want use the function prefix_allowed_tokens_fn of huggingface model.generate(), where of vllm's source code shall I modify? #415
- User-defined conversation template feature request: Support user-defined conversation template #408
- Specify GPU to run on How to specify which GPU the model inference on? #352 Specify GPUs bug (torch.distributed.all_reduce(torch.zeros(1).cuda())) #470
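For reference, the offline `LLM.generate` interface already accepts a batch of prompts as `List[str]`; the list-of-inputs item above tracks exposing the same behaviour through the OpenAI-compatible frontend. A minimal sketch with a placeholder model:

```python
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "The largest planet in the solar system is",
]
params = SamplingParams(temperature=0.0, max_tokens=32)  # greedy decoding

llm = LLM(model="facebook/opt-125m")  # placeholder model
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```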
Integration with other frontends:
- FastChat (merged)
- Ray Serve (merged)
- NVIDIA Triton NVIDIA Triton support #541
- SkyPilot
- LangChain (Support from LangChain) LangChain and LlamaIndex support #233 Langchain passes `prompt` as a `list` instead of `str` #186 Langchain/LLAMA_INDEX #553
Engine Optimization and New Features
- Streamline the process of adding a new model Support custom models #112 Require a "Wrapper" feature #258 Best effort support for all Hugging Face transformers models #616
- User-specified tokenizer Support custom tokenizer #111 Why vllm does not support Chinese input #246 How to mannually Set use_fast for tokenizer to False? #259 The hf-internal-testing/llama-tokenizer do not support Chinese prompt #270 garbage output from h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-13b #281
- Implement models in C++ to reduce Python overhead Modify the current PyTorch model to C++ #42 Tensor Parallelism vs Data Parallelism #367
- Pipeline parallel support pipeline parallel support in the future? #387
- Prefix sharing support Question about efficient memory sharing (prefix sharing) #227
- Classifier-Free Guidance Is there a way to add classifier free guidance (CFG) to vllm while maintaining super fast inference? #620
- Speculative decoding Scope for assisted generation? #439
- Distributed inference with other frameworks Remove Ray for the dependency #208 question: Is it possible to avoid ray in single machine multiple GPUs serving? #391 Support Kuberenetes for Distributed Serving #457 (current multi-GPU usage is sketched after this list)
- Better model loading Faster model loading #474 Increase code robustness #519 Llama2 answers is noise #615
- More flexible stop criteria Support custom stop function? #551
- Random Python overheads Consider optimizing the API server #580
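For the pipeline-parallel and Ray-related items above: the current multi-GPU path is tensor parallelism via the `tensor_parallel_size` argument, which relies on Ray for worker management. A minimal sketch, assuming a single host with two visible GPUs and a placeholder model:

```python
from vllm import LLM, SamplingParams

# Shards the model across 2 GPUs on this host; with tensor_parallel_size > 1
# the engine currently spins up Ray workers under the hood.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # placeholder model
    tensor_parallel_size=2,
)
outputs = llm.generate(
    ["Explain paged attention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```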
Kernels
- Multi-query attention How does this compare to MQA (multi-query attention)? #169
- PagedAttention kernel with multiple query positions. Fix the rushed out multi-query kernel #44
- Quantized PagedAttention GPTQ / Quantization support? #174 What is the correct way to use quantized versions of vicuna or guanco? #210 8-bit quantization support #214 Not able to used qlora models with vllm #252 8bit support #295 support for quantized models? #316 Loading quantized models #392
- Sampling kernels Implement custom kernels for top-k and top-p sampling #125 Question about sampler. It takes too much time #249 (a usage sketch follows this list)
- Condensed RotaryEmbeddings Support for Condensed RotaryEmbeddings #333 supporting superhot models? #388 RoPE scaling support? #464 Request: NTK rope support #479 Does vllm support vicuna-13b-v1.5-16k ? #674 Add AliBi context scaling into vllm for Baichuan13B #686
- Flash Attention V2 Flash Attention V2 #485
- FP8 Kernel TE FP8 support? #448
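The sampling-kernel item above concerns the GPU implementation behind the existing top-k / top-p options, which are already exposed per request through `SamplingParams`. A usage sketch with a placeholder model:

```python
from vllm import LLM, SamplingParams

params = SamplingParams(
    temperature=0.8,
    top_p=0.95,    # nucleus sampling: keep tokens covering 95% cumulative probability
    top_k=50,      # additionally restrict to the 50 most likely tokens
    max_tokens=64,
)

llm = LLM(model="facebook/opt-125m")  # placeholder model
print(llm.generate(["Once upon a time"], params)[0].outputs[0].text)
```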
Bugs
- Floating point comparison Dangerous floating point comparison #71
- Check input length Check whether the input request is too long #113 Prompt size limits? It keeps hanging with prompts longer than 120 tokens #276 Long context will cause the vLLM stop #286 scheduler max-length #447 (a client-side length check is sketched after this list)
- Do not init process groups when using a single GPU Do not initialize process group when using a single GPU #117 How to initialize two LLMs in one service? #565 Running two different models on the same machine #654
- Ray tensor parallel bugs ray OOM in tensor parallel #322 Stuck while inferring with WizardCoder model #366 [MPT-30B] OutOfMemoryError: CUDA out of memory #372 Cuda failure 'peer access is not supported between these two devices' #406
- Performance comparison with TGI TGI performance is better than vllm on A800 #262 higher latency than TGI #335 Outdated benchmarks #381
- All other issues with the `Bug` label
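For the input-length item above, a hypothetical client-side guard (not part of vLLM) can reject over-long requests before they reach the server by counting tokens with the served model's Hugging Face tokenizer; the model name and context length are placeholders:

```python
from transformers import AutoTokenizer

MODEL = "facebook/opt-125m"  # placeholder model
MAX_MODEL_LEN = 2048         # placeholder context window for that model

tokenizer = AutoTokenizer.from_pretrained(MODEL)

def check_prompt(prompt: str, max_new_tokens: int) -> None:
    # Reject requests whose prompt plus requested completion cannot fit
    # into the model's context window.
    n_prompt = len(tokenizer(prompt).input_ids)
    if n_prompt + max_new_tokens > MAX_MODEL_LEN:
        raise ValueError(
            f"Request needs {n_prompt + max_new_tokens} tokens, "
            f"but the context window is {MAX_MODEL_LEN}."
        )

check_prompt("Hello, world!", max_new_tokens=128)
```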