Closed as not planned
Labels
performance (Performance-related issues)
Description
Expected gain: for 13B models, we should see a 20%-30% latency gain on a single GPU and a 2-3x gain on 4 GPUs. For smaller models, the gain should be even higher.
Having a single iteration's computation run entirely in C++ should be enough for high performance. This way, we can keep most of the complicated scheduling logic in Python, including weight loading.
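A minimal sketch of this split, assuming a hypothetical `run_iteration` function exposed to the Python scheduler as a PyTorch C++ extension (all names here are illustrative, not CacheFlow's actual API):

```cpp
// iteration_step.cpp -- illustrative sketch, not CacheFlow's actual API.
#include <torch/extension.h>

// Run one decoding iteration entirely in C++. The Python scheduler crosses
// the Python/C++ boundary once per iteration instead of once per operator.
torch::Tensor run_iteration(torch::Tensor hidden_states,
                            torch::Tensor weight) {
  // Placeholder computation; a real version would run the full transformer
  // forward pass (attention, MLP, sampling, etc.) here.
  return torch::relu(torch::matmul(hidden_states, weight));
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("run_iteration", &run_iteration,
        "Run a single decoding iteration in C++");
}
```

The Python side would keep the scheduling loop and weight loading, calling `run_iteration` once per step.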
Potential sources of overhead (a rough timing sketch follows this list):
- Overhead 1: Python vs. C++.
- Overhead 2: PyTorch (even in C++) vs. FasterTransformer.
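One rough way to probe these overheads: time a chain of small unfused PyTorch ops from C++ with LibTorch, as below; running the same loop from Python and comparing exposes overhead 1, while the per-op dispatch cost that remains even inside C++ is part of overhead 2. This is only an illustrative micro-benchmark with arbitrary tensor sizes, not a rigorous measurement:

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
  torch::NoGradGuard no_grad;
  // Deliberately small tensors so that per-op dispatch cost, rather than
  // raw compute, dominates the measurement.
  auto x = torch::randn({1, 256});
  auto w = torch::randn({256, 256});

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 10000; ++i) {
    x = torch::tanh(torch::matmul(x, w));  // two op dispatches per step
  }
  auto end = std::chrono::steady_clock::now();

  std::cout << "10000 iterations: "
            << std::chrono::duration<double, std::milli>(end - start).count()
            << " ms\n";
}
```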
How to implement a C++ version:
- ("Fake" C++) Torch compiler (torch.jit).
- LibTorch, the C++ API of PyTorch (easier to implement and extend, but can only address overhead 1); see the sketch after this list.
- Prune the useful single-model code out of FasterTransformer into CacheFlow. This addresses both overheads but is harder to implement.
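A minimal LibTorch sketch of the second option: export the model to TorchScript from Python with torch.jit, then load and run it from C++. The file name, input shape, and output type here are assumptions for illustration:

```cpp
#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
  // "decoder.pt" is a placeholder for a model exported from Python with
  // torch.jit.script or torch.jit.trace; it is not a real CacheFlow artifact.
  torch::jit::script::Module module = torch::jit::load("decoder.pt");
  module.eval();

  // Hypothetical input: one token ID for a single sequence.
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 1}, torch::kLong));

  // The forward pass now runs with no Python in the loop (removing
  // overhead 1), but still uses stock PyTorch kernels (overhead 2 remains).
  torch::Tensor logits = module.forward(inputs).toTensor();
  std::cout << logits.sizes() << "\n";
}
```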