Live LLM

An experimental LLM runtime that supports live KV cache streaming for conversational AI.
Instead of re-sending the entire prompt and conversation history at every step, the client only sends incremental state updates that keep the model’s KV cache hot.

This allows the LLM to respond faster and more interactively, at the expense of eventually forgetting the oldest parts of a conversation. Think of it as a rolling window of conversational memory that trades completeness for latency.

Live LLM Demo

This is an actual demo of Gemma 3 270M running on CPU, with instantaneous, stateful responses because the KV cache is prepopulated with the prompt and context.


Use Cases

  1. Real-time Speech / TTS Agents

    • When paired with ASR + TTS, the hot cache enables fast, incremental responses
    • Useful for low-latency voice assistants on edge devices
    • Can reset or swap KV cache to "forget" or switch to new conversational context
  2. Task-oriented, assistant-type LLMs

    • Chatbots and assistants that need snappy replies within a given task without extensive context
    • Interactive storytelling or character responses
  3. Prototyping & Research

    • Great for experimenting with streaming inference and incremental context

Why Live KV Cache?

Traditional LLM runtimes rebuild the whole context on every request, which is expensive and slow.
With a persistent key-value cache, each new message only needs to process the delta (the new tokens), as the sketch after this list illustrates. This gives:

  • Low Latency Responses — No need to re-encode all prior messages
  • Lower Compute Cost — Especially helpful on CPU-bound inference
  • Interactive Streaming — Feels more like a real conversation than a batch job
  • Composable — Clients can decide how much history to keep vs. reset
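
Here is a minimal sketch of that delta-only loop, using Hugging Face transformers with greedy decoding and a persistent past_key_values object. The checkpoint name is illustrative and chat templating is omitted for brevity; the actual server wiring in this repo may differ.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m-it"  # illustrative; substitute the repo's actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

past = None  # the hot KV cache, kept alive across turns

def step(new_text: str, max_new_tokens: int = 64) -> str:
    """Feed only the delta; everything earlier is already in the cache."""
    global past
    # NB: real code should handle BOS / chat-template tokens per turn
    input_ids = tokenizer(new_text, return_tensors="pt").input_ids
    out_ids = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(input_ids=input_ids, past_key_values=past, use_cache=True)
            past = out.past_key_values  # cache stays hot for the next call
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            if next_id.item() == tokenizer.eos_token_id:
                break
            out_ids.append(next_id.item())
            input_ids = next_id  # the next pass processes one token only
    return tokenizer.decode(out_ids, skip_special_tokens=True)

The first turn pays the full prompt cost once; every later turn costs O(new tokens) instead of O(entire history).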

Advantages

  • Speed: Incremental generation is much faster on constrained hardware
  • Memory Efficiency: Keeps just enough conversation history, instead of rebuilding context
  • Simple Integration: WebSocket-based clients, easy to drop into other projects
  • Small-Model Friendly: Runs with less than 1 s to first token with Gemma 3 270M on CPU (no GPU required)
  • Resettable: Cache can be cleared mid-conversation for a fresh state
  • Context switchable: Cache can be swapped mid-conversation for a new conversational goal (see the sketch below)
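
As a rough sketch of the last two points: resetting is just dropping the cache, and context switching is keeping named snapshots of it. The snapshot dictionary and deep copies below are illustrative choices built on the step/past sketch above, not necessarily how this repo implements it.

import copy

snapshots = {}

def reset_cache():
    """Drop all cached state; the next turn starts from an empty context."""
    global past
    past = None

def save_cache(name: str):
    # Deep copy so future generation doesn't mutate the stored snapshot
    snapshots[name] = copy.deepcopy(past)

def swap_cache(name: str):
    """Resume a previously saved conversational context."""
    global past
    past = copy.deepcopy(snapshots[name])

The same reset primitive also gives the simplest rolling-window policy: once the cache grows past a token budget, reset it and start a fresh context.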

Disadvantages / Trade-offs

  • Single-Tenant: Designed for one conversation at a time (hence the small models, which keep per-conversation cost low enough to scale with each ongoing conversation)
  • Forgets Old Context: Since the cache is finite, earlier conversation turns are eventually dropped
  • Cache Invalidation Bugs: Care must be taken to avoid mismatched inputs vs. cached state (see the guard sketch after this list)
  • Less Accurate Long-Form Reasoning: Without full history, answers to very long threads may drift
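
One cheap defense against the invalidation problem is to keep a token-count expectation next to the cache and fail safe when they diverge. A minimal sketch, building on the step/past/reset_cache sketches above and assuming the cache is a transformers Cache object (recent versions return one, and get_seq_length() is its method):

expected_len = 0

def checked_step(new_text: str) -> str:
    global expected_len
    cached = 0 if past is None else past.get_seq_length()
    if cached != expected_len:
        # Something dropped or mutated the cache behind our back; resetting
        # is safer than generating from a context we can't account for.
        reset_cache()
        expected_len = 0
    reply = step(new_text)
    expected_len = past.get_seq_length()  # resync bookkeeping after the step
    return reply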

Why This Repo Is Useful

  • Demonstrates how to wire up a live KV cache in practice
  • Provides a minimal, working FastAPI + WebSocket server for streaming inference
  • Showcases running Gemma 3 270M in a real-time loop on CPU
  • Serves as a reference architecture for larger projects that want to experiment with:
    • Edge inference (Jetson, Raspberry Pi, etc.)
    • Realtime agents (voice, chat, embedded systems)
    • LLM-backed multiplayer / multi-agent systems

If you’ve ever wanted to see how incremental LLM state can be managed outside of “batch mode,” this project is a solid starting point.


Features

  • Hot KV Cache — Maintains conversation state for fast response times
  • Live Streaming — Real-time token streaming via WebSockets
  • Modular Architecture — Separate server, input client, and output client
  • Gemma 3 270M Support — Small, efficient model for CPU inference
  • WebSocket API — Clean separation between input and output (see the client sketch below)
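
As a concrete picture of that separation, here is a minimal client sketch using the websockets package. The port, endpoint paths, and end-of-reply marker are assumptions for illustration; check the server code for the real routes and protocol.

import asyncio
import websockets

async def main():
    # Hypothetical routes: one socket to send user input, one to stream tokens
    async with websockets.connect("ws://localhost:8000/input") as tx, \
               websockets.connect("ws://localhost:8000/output") as rx:
        await tx.send("Hello!")
        while True:
            token = await rx.recv()
            if token == "<eos>":  # hypothetical end-of-reply marker
                break
            print(token, end="", flush=True)

asyncio.run(main())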

Environment Setup

🤗 Hugging Face token

You will need a 🤗 Hugging Face token that allows read access to gemma:270m before proceeding.

  • Log into Hugging Face with your account
  • Access gemma:270m and accept any agreements
  • Click on your profile | Access Tokens | Create new token
  • Create a fine-grained access token with "Read access to contents of all public gated repos you can access" or any other permission you choose
  • Copy the generated token and save it as a .env file with the following contents:
HUGGINGFACE_TOKEN=hf_mytokenhere
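
If you want to sanity-check the token from Python, something like the following works, assuming python-dotenv and huggingface_hub are installed; whether this repo loads the token the same way is not guaranteed.

import os
from dotenv import load_dotenv  # pip install python-dotenv
from huggingface_hub import login

load_dotenv()  # reads .env from the working directory
login(token=os.environ["HUGGINGFACE_TOKEN"])  # authenticates gated downloads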

You are now ready to use VS Code to develop with this repo.

Dev Container (Recommended)

We use VS Code Dev Containers to ensure a consistent, isolated development environment.

  1. Install prerequisites

  2. Open in Dev Container

    • Open this project folder in VS Code
    • Press F1 (or Ctrl/Cmd+Shift+P) and run Dev Containers: Reopen in Container
    • VS Code will detect the .devcontainer/ config, build the container, and reopen inside it
  3. Rebuild if Needed

    • If you make changes to .devcontainer/devcontainer.json or the Docker setup, press Ctrl/Cmd+Shift+P and run Dev Containers: Rebuild Container

Debugging and running

Use the VS Code Run and Debug panel to start the compound configuration Live LLM (web). This will launch:

  1. The Live-LLM server - This may take some time to start the first time, as it may need to download the models we use. When ready, you will see:

INFO: Application startup complete.

  2. The web client - This will launch in a new browser window. Once the server is running, you can refresh this window to connect to the server.

At this point you can:

  • Create breakpoints
  • Modify the server (it will auto-reload)
  • Build the client (Ctrl/Cmd+Shift+B) and refresh it (Shift+F5) to reload changes, etc.
