
Conversation

mostlygeek (Owner) commented Sep 6, 2025

Add support for llama.cpp's new prompt metrics (ggml-org/llama.cpp#15827).


Summary by CodeRabbit

  • New Features

    • Added “Cached” token metric across the system; recorded when available and shown in Activity.
    • New Activity columns: Cached, Prompt, and Generated with explanatory tooltips.
  • Improvements

    • Activity timestamps now display as relative time (e.g., “5m ago”) for easier scanning.
    • Header labels refined (e.g., “ID”, “Time”) and column alignments adjusted for readability.
    • Conditional display for cached tokens (“-” when not available) to reduce noise.
  • Chores

    • Minor whitespace cleanup with no functional impact.


coderabbitai bot commented Sep 6, 2025

Walkthrough

Adds a new cached token metric across proxy and UI: parses cache_n from response timings, stores it in TokenMetrics as CachedTokens/cache_tokens, records it in middleware, and displays it in Activity with new columns and relative time formatting. No control-flow or error-handling changes.
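For context, here is a minimal, self-contained sketch of the parsing pattern described above. It assumes the tidwall/gjson library (whose `Get`/`Exists`/`Int` calls match the diffs later in this review); the function and variable names are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"

	"github.com/tidwall/gjson"
)

// parseCachedTokens extracts timings.cache_n from a llama.cpp response
// body. It returns -1 when the field is absent, so "unknown" stays
// distinguishable from "zero cache hits".
func parseCachedTokens(body []byte) int {
	cachedTokens := -1 // unknown or missing data
	if v := gjson.GetBytes(body, "timings.cache_n"); v.Exists() {
		cachedTokens = int(v.Int())
	}
	return cachedTokens
}

func main() {
	resp := []byte(`{"timings":{"cache_n":128,"prompt_n":512,"predicted_n":64}}`)
	fmt.Println(parseCachedTokens(resp))                     // 128
	fmt.Println(parseCachedTokens([]byte(`{"timings":{}}`))) // -1
}
```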

Changes

| Cohort / File(s) | Summary of changes |
| --- | --- |
| **Proxy metrics parsing & model**<br>`proxy/metrics_middleware.go`, `proxy/metrics_monitor.go` | Parse `timings.cache_n` into a local `cachedTokens` (default `-1`) and include it when recording `TokenMetrics`. Added public field `CachedTokens int` (JSON `cache_tokens`) to `TokenMetrics`. Minor whitespace reflow. |
| **UI types**<br>`ui/src/contexts/APIProvider.tsx` | Extended the `Metrics` interface with `cache_tokens: number`. No runtime logic changes. |
| **UI Activity page**<br>`ui/src/pages/Activity.tsx` | Replaced absolute timestamps with relative time. Added columns: Cached (`cache_tokens` or "-"), Prompt (`input_tokens`), Generated (`output_tokens`). Updated headers (ID, Time) and tooltips. Introduced a local `Tooltip` component. Adjusted metric labels/alignments. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant UI as Activity Page (UI)
  participant API as Proxy
  participant MM as Metrics Middleware
  participant Mon as Metrics Monitor

  User->>UI: Open Activity
  UI->>API: GET /metrics
  API->>MM: Handle request
  Note over MM: Parse response timings<br/>Read cache_n → cachedTokens (-1 if absent)
  MM->>Mon: Record TokenMetrics{..., CachedTokens}
  Mon-->>API: Metrics snapshot (includes cache_tokens)
  API-->>UI: JSON metrics list
  UI->>UI: Render table: Cached, Prompt, Generated, Speed, Duration, Time (relative)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
proxy/metrics_monitor.go (1)

16-16: Document sentinel semantics for CachedTokens

Clarify that -1 means “unknown” to keep API consumers aligned and avoid misinterpretation.

 type TokenMetrics struct {
   ID              int       `json:"id"`
   Timestamp       time.Time `json:"timestamp"`
   Model           string    `json:"model"`
-  CachedTokens    int       `json:"cache_tokens"`
+  // CachedTokens is the number of prompt tokens served from the KV cache. -1 = unknown.
+  CachedTokens    int       `json:"cache_tokens"`
   InputTokens     int       `json:"input_tokens"`
   OutputTokens    int       `json:"output_tokens"`
   PromptPerSecond float64   `json:"prompt_per_second"`
   TokensPerSecond float64   `json:"tokens_per_second"`
   DurationMs      int       `json:"duration_ms"`
 }
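
As a side note, marshaling the struct above shows the wire shape the UI's `Metrics` interface consumes. A hedged sketch with made-up values (the model name and numbers are purely illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Local mirror of TokenMetrics from proxy/metrics_monitor.go, reproduced
// only to show the JSON wire format of one Activity entry.
type TokenMetrics struct {
	ID              int       `json:"id"`
	Timestamp       time.Time `json:"timestamp"`
	Model           string    `json:"model"`
	CachedTokens    int       `json:"cache_tokens"` // -1 = unknown
	InputTokens     int       `json:"input_tokens"`
	OutputTokens    int       `json:"output_tokens"`
	PromptPerSecond float64   `json:"prompt_per_second"`
	TokensPerSecond float64   `json:"tokens_per_second"`
	DurationMs      int       `json:"duration_ms"`
}

func main() {
	m := TokenMetrics{
		ID: 1, Timestamp: time.Now(), Model: "example-model", // illustrative values
		CachedTokens: 128, InputTokens: 512, OutputTokens: 64,
		PromptPerSecond: 2048.0, TokensPerSecond: 35.0, DurationMs: 2100,
	}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out)) // cache_tokens appears alongside input/output tokens
}
```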
ui/src/contexts/APIProvider.tsx (1)

31-31: Annotate unknown sentinel on cache_tokens

Backend uses -1 for unknown; add a short note so callers don’t assume non-negative only.

   model: string;
-  cache_tokens: number;
+  // -1 => unknown (timings not provided by upstream)
+  cache_tokens: number;
proxy/metrics_middleware.go (1)

75-76: Also accept upstream’s legacy tokens_cached (back-compat) and keep cache_n

Several llama.cpp server responses expose a top-level tokens_cached instead of timings.cache_n. Parse both to cover older and newer servers. Examples with tokens_cached are present in real server outputs.

   // default values
-  cachedTokens := -1 // unknown or missing data
+  cachedTokens := -1 // unknown or missing data
@@
-    if cachedValue := jsonData.Get("timings.cache_n"); cachedValue.Exists() {
-      cachedTokens = int(cachedValue.Int())
-    }
+    // newer: timings.cache_n ; older: top-level tokens_cached
+    if cachedValue := jsonData.Get("timings.cache_n"); cachedValue.Exists() {
+      cachedTokens = int(cachedValue.Int())
+    } else if cachedValue := jsonData.Get("tokens_cached"); cachedValue.Exists() {
+      cachedTokens = int(cachedValue.Int())
+    }
   }
+
+  // if timings block was absent, still try legacy top-level field
+  if cachedTokens < 0 {
+    if cachedValue := jsonData.Get("tokens_cached"); cachedValue.Exists() {
+      cachedTokens = int(cachedValue.Int())
+    }
+  }
@@
   rec.metricsMonitor.addMetrics(TokenMetrics{
     Timestamp:       time.Now(),
     Model:           rec.realModelName,
     CachedTokens:    cachedTokens,
     InputTokens:     inputTokens,

Also applies to: 97-100, 102-111
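
The suggested diff checks tokens_cached both inside and after the timings branch; the same fallback can be expressed once. A sketch of the consolidated logic, assuming a gjson-parsed body (names are illustrative):

```go
package main

import (
	"fmt"

	"github.com/tidwall/gjson"
)

// readCachedTokens prefers the newer timings.cache_n and falls back to
// the legacy top-level tokens_cached; -1 means neither field was present.
func readCachedTokens(body []byte) int {
	for _, path := range []string{"timings.cache_n", "tokens_cached"} {
		if v := gjson.GetBytes(body, path); v.Exists() {
			return int(v.Int())
		}
	}
	return -1
}

func main() {
	fmt.Println(readCachedTokens([]byte(`{"tokens_cached":96}`))) // 96 (legacy server)
	fmt.Println(readCachedTokens([]byte(`{}`)))                   // -1 (unknown)
}
```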

ui/src/pages/Activity.tsx (3)

18-20: Align copy with intent (“just now”)

The comment says “just now”, but the returned value is “now”. Low-risk UX consistency fix.

-  if (diffInSeconds < 5) {
-    return "now";
-  }
+  if (diffInSeconds < 5) {
+    return "just now";
+  }

79-81: Show 0 cached tokens explicitly; reserve “-” for unknown

Distinguishes “no cache hits” (0) from “not reported” (-1).

-    {metric.cache_tokens > 0 ? metric.cache_tokens.toLocaleString() : "-"}
+    {metric.cache_tokens >= 0 ? metric.cache_tokens.toLocaleString() : "-"}

101-119: Make Tooltip keyboard-accessible and screen-reader friendly

Add focus handling and ARIA to improve a11y without changing visuals.

-const Tooltip: React.FC<TooltipProps> = ({ content }) => {
+const Tooltip: React.FC<TooltipProps> = ({ content }) => {
   return (
-    <div className="relative group inline-block">
+    <div className="relative group inline-block" tabIndex={0} aria-label={content}>
       ⓘ
       <div
-        className="absolute top-full left-1/2 transform -translate-x-1/2 mt-2
-                     px-3 py-2 bg-gray-900 text-white text-sm rounded-md
-                     opacity-0 group-hover:opacity-100 transition-opacity
-                     duration-200 pointer-events-none whitespace-nowrap z-50 normal-case"
+        className="absolute top-full left-1/2 transform -translate-x-1/2 mt-2
+                     px-3 py-2 bg-gray-900 text-white text-sm rounded-md
+                     opacity-0 group-hover:opacity-100 group-focus:opacity-100 transition-opacity
+                     duration-200 pointer-events-none whitespace-nowrap z-50 normal-case"
       >
         {content}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 954e2de and a023624.

📒 Files selected for processing (4)
  • proxy/metrics_middleware.go (2 hunks)
  • proxy/metrics_monitor.go (1 hunks)
  • ui/src/contexts/APIProvider.tsx (1 hunks)
  • ui/src/pages/Activity.tsx (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
ui/src/contexts/APIProvider.tsx (1)
ui/src/pages/Models.tsx (5)
  • totalRequests (136-145)
  • StatsPanel (133-173)
  • sum (143-143)
  • sum (141-141)
  • sum (142-142)
proxy/metrics_middleware.go (1)
proxy/metrics_monitor.go (1)
  • TokenMetrics (12-22)
ui/src/pages/Activity.tsx (1)
ui/src/pages/Models.tsx (3)
  • StatsPanel (133-173)
  • totalRequests (136-145)
  • sum (143-143)

mostlygeek merged commit f58c8c8 into main on Sep 6, 2025
3 checks passed
mostlygeek deleted the support-llama-cpp-cache-metrics branch on September 6, 2025 at 20:58
mcowger pushed a commit to mcowger/llama-swap that referenced this pull request Sep 8, 2025
Capture prompt cache metrics and surface them on Activities page in UI
mcowger added a commit to mcowger/llama-swap that referenced this pull request Sep 9, 2025
* Add a config editor page

* Support llama.cpp's cache_n in timings info (mostlygeek#287)

Capture prompt cache metrics and surface them on Activities page in UI

* Fix mostlygeek#288 Vite hot module reloading creating multiple SSE connections (mostlygeek#290)

- move SSE (EventSource) connection to module level
- manage EventSource as a singleton, closing open connection before
  reopening a new one

* Add model name copy button to Models UI

---------

Co-authored-by: Benson Wong <[email protected]>