
high throughput inference #663


Description

@msaroufim (Member)

I was chatting with @Chillee about our plans in AO today, and he mentioned we should be focusing on a few concrete problems like:

  1. Demonstrate compelling perf for fp8 gemm at a variety of batch sizes.
  2. Demonstrate compelling perf for weight only int8 gemm at a variety of batch sizes.
  3. Demonstrate compelling perf for weight only intX gemm at low batch sizes.
  4. Demonstrate compelling perf for weight intX, activation fp8 at a variety of batch sizes.

As a baseline, we could extend gpt-fast to work with bs=n without doing any KV cache management work and measure perf there; a rough benchmark sketch along these lines is included below. Copying the feedback as is, open to discussing more and adding more details as time progresses.

EDIT: gpt-fast already has a batched generation branch by Horace https://github.com/pytorch-labs/gpt-fast/tree/batched_generation
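
For concreteness, a minimal batch-size sweep for one of these comparisons (bf16 vs. weight-only int8 on a single linear layer) could look like the sketch below. It assumes a CUDA device and that torchao exposes `quantize_` / `int8_weight_only` (exact API names may differ by version); real numbers would also want `torch.compile` and end-to-end model measurement rather than a toy layer.

```python
# Minimal sketch: sweep batch sizes for a single linear layer and compare
# bf16 vs. int8 weight-only quantization. Assumes a CUDA device and that
# torchao exposes quantize_ / int8_weight_only (API may differ by version).
import copy
import torch
from torch.utils.benchmark import Timer
from torchao.quantization import quantize_, int8_weight_only

K, N = 4096, 4096                           # hidden dims of the toy linear
base = torch.nn.Linear(K, N, bias=False).cuda().to(torch.bfloat16)

quantized = copy.deepcopy(base)
quantize_(quantized, int8_weight_only())    # weight-only int8

def bench(model, m):
    x = torch.randn(m, K, device="cuda", dtype=torch.bfloat16)
    t = Timer("model(x)", globals={"model": model, "x": x})
    return t.blocked_autorange(min_run_time=1.0).median * 1e6  # microseconds

for m in (1, 4, 16, 64, 256, 1024):
    t_bf16 = bench(base, m)
    t_int8 = bench(quantized, m)
    print(f"bs={m:5d}  bf16={t_bf16:8.1f}us  int8wo={t_int8:8.1f}us  "
          f"speedup={t_bf16 / t_int8:4.2f}x")
```

The same harness could be pointed at the gpt-fast batched_generation branch to report end-to-end tokens/sec instead of per-layer latency.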

Activity

msaroufim changed the title from "chilli feedback" to "high throughput inference" on Aug 12, 2024
msaroufim (Member, Author) commented on Aug 13, 2024

@HDCharles on the int8 work
@vkuzo on fp8
@vayuda and @jerryzh168 on intX

jeromeku (Contributor) commented on Aug 13, 2024

@msaroufim

It would be interesting to benchmark against something like QoQ, which implements W4A8KV4 (int8 GEMM) using a nested quantization scheme and neat kernel-level optimizations.
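
For illustration only (not QoQ's actual algorithm or kernels), a generic two-level "nested" weight quantization can be sketched in a few lines: int4 values per group, with the per-group scales themselves quantized to int8 against a single fp16 scale per output channel. The group size and tensor shapes below are arbitrary placeholders.

```python
# Generic two-level ("nested") weight quantization sketch, loosely in the
# spirit of W4A8-style schemes: int4 values per group, with per-group scales
# stored as int8 relative to one fp16 scale per output channel.
# Illustration only -- not QoQ's exact algorithm or kernel implementation.
import torch

def nested_quantize(w: torch.Tensor, group_size: int = 128):
    out_ch, in_ch = w.shape
    wg = w.reshape(out_ch, in_ch // group_size, group_size)

    # Level 1: symmetric int4 per group (values in [-8, 7]).
    group_scale = wg.abs().amax(dim=-1, keepdim=True) / 7.0
    q4 = torch.clamp(torch.round(wg / group_scale), -8, 7).to(torch.int8)

    # Level 2: quantize the per-group scales to int8 against a
    # per-output-channel fp16 scale, shrinking scale storage.
    chan_scale = group_scale.amax(dim=1, keepdim=True) / 127.0
    q_scale = torch.clamp(torch.round(group_scale / chan_scale), 1, 127).to(torch.int8)

    return q4, q_scale, chan_scale.to(torch.float16)

def nested_dequantize(q4, q_scale, chan_scale):
    group_scale = q_scale.to(torch.float32) * chan_scale.to(torch.float32)
    return (q4.to(torch.float32) * group_scale).reshape(q4.shape[0], -1)

w = torch.randn(4096, 4096)
q4, q_scale, chan_scale = nested_quantize(w)
err = (nested_dequantize(q4, q_scale, chan_scale) - w).abs().mean()
print(f"mean abs reconstruction error: {err:.4f}")
```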

vkuzo (Contributor) commented on Aug 13, 2024

> Demonstrate compelling perf for fp8 gemm at a variety of batch sizes.

Note that I'm putting up a PR soon for a quick roofline estimator for float8 gemm plus the training-specific overhead, to see for which M, K, N float8 is faster than bfloat16; it would be easily extendable to inference at a later time.
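
For a rough sense of what such an estimator looks like, here is a back-of-the-envelope roofline sketch (not the actual PR's code): each gemm is modeled as max(compute time, memory time), and scaling/casting overhead is ignored. The peak-throughput and bandwidth numbers are placeholder assumptions for an H100-class GPU.

```python
# Back-of-the-envelope roofline sketch: time ~= max(compute_time, memory_time)
# for a single M x K x N gemm. Hardware numbers are rough placeholders
# (roughly H100 SXM class) and are not tied to the PR mentioned above.
BF16_FLOPS = 990e12      # assumed peak bf16 tensor-core throughput, FLOP/s
FP8_FLOPS  = 1979e12     # assumed peak fp8 throughput, FLOP/s
HBM_BW     = 3.35e12     # assumed memory bandwidth, bytes/s

def gemm_time(m, k, n, flops_peak, bytes_per_input_elem):
    flops = 2 * m * k * n
    # read A (m*k) and B (k*n) at the input dtype, write C (m*n) in bf16
    traffic = bytes_per_input_elem * (m * k + k * n) + 2 * m * n
    return max(flops / flops_peak, traffic / HBM_BW)

for m in (1, 16, 256, 4096):
    k = n = 8192
    t_bf16 = gemm_time(m, k, n, BF16_FLOPS, 2)
    t_fp8 = gemm_time(m, k, n, FP8_FLOPS, 1)
    print(f"M={m:5d} K=N={k}: est. fp8 speedup ~{t_bf16 / t_fp8:4.2f}x")
```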

> Demonstrate compelling perf for weight intX, activation fp8 at a variety of batch sizes.

While this is technically possible, I'm not sure I understand the value; I would be interested to learn more.
