
Conversation

mostlygeek (Owner) commented Sep 6, 2025

Add support for llama.cpp's new prompt metrics (ggml-org/llama.cpp#15827).


Summary by CodeRabbit

  • New Features

    • Added “Cached” token metric across the system; recorded when available and shown in Activity.
    • New Activity columns: Cached, Prompt, and Generated with explanatory tooltips.
  • Improvements

    • Activity timestamps now display as relative time (e.g., “5m ago”) for easier scanning.
    • Header labels refined (e.g., “ID”, “Time”) and column alignments adjusted for readability.
    • Conditional display for cached tokens (“-” when not available) to reduce noise.
  • Chores

    • Minor whitespace cleanup with no functional impact.


coderabbitai bot commented Sep 6, 2025

Walkthrough

Adds a new cached token metric across proxy and UI: parses cache_n from response timings, stores it in TokenMetrics as CachedTokens/cache_tokens, records it in middleware, and displays it in Activity with new columns and relative time formatting. No control-flow or error-handling changes.
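For context, here is a minimal, self-contained sketch of the parsing pattern described above. It assumes the tidwall/gjson library (whose `Get`/`Exists`/`Int` calls match the diffs later in this review); the function and variable names are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"

	"github.com/tidwall/gjson"
)

// parseCachedTokens extracts timings.cache_n from a llama.cpp response
// body. It returns -1 when the field is absent, so "unknown" stays
// distinguishable from "zero cache hits".
func parseCachedTokens(body []byte) int {
	cachedTokens := -1 // unknown or missing data
	if v := gjson.GetBytes(body, "timings.cache_n"); v.Exists() {
		cachedTokens = int(v.Int())
	}
	return cachedTokens
}

func main() {
	resp := []byte(`{"timings":{"cache_n":128,"prompt_n":512,"predicted_n":64}}`)
	fmt.Println(parseCachedTokens(resp))                     // 128
	fmt.Println(parseCachedTokens([]byte(`{"timings":{}}`))) // -1
}
```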

Changes

| Cohort / File(s) | Summary of changes |
| --- | --- |
| **Proxy metrics parsing & model**<br>`proxy/metrics_middleware.go`, `proxy/metrics_monitor.go` | Parse `timings.cache_n` into a local `cachedTokens` (default `-1`) and include it when recording `TokenMetrics`. Added public field `CachedTokens int` (JSON `cache_tokens`) to `TokenMetrics`. Minor whitespace reflow. |
| **UI types**<br>`ui/src/contexts/APIProvider.tsx` | Extended the `Metrics` interface with `cache_tokens: number`. No runtime logic changes. |
| **UI Activity page**<br>`ui/src/pages/Activity.tsx` | Replaced absolute timestamps with relative time. Added columns: Cached (`cache_tokens` or "-"), Prompt (`input_tokens`), Generated (`output_tokens`). Updated headers (ID, Time) and tooltips. Introduced a local `Tooltip` component. Adjusted metric labels/alignments. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant UI as Activity Page (UI)
  participant API as Proxy
  participant MM as Metrics Middleware
  participant Mon as Metrics Monitor

  User->>UI: Open Activity
  UI->>API: GET /metrics
  API->>MM: Handle request
  Note over MM: Parse response timings<br/>Read cache_n → cachedTokens (-1 if absent)
  MM->>Mon: Record TokenMetrics{..., CachedTokens}
  Mon-->>API: Metrics snapshot (includes cache_tokens)
  API-->>UI: JSON metrics list
  UI->>UI: Render table: Cached, Prompt, Generated, Speed, Duration, Time (relative)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
proxy/metrics_monitor.go (1)

16-16: Document sentinel semantics for CachedTokens

Clarify that -1 means “unknown” to keep API consumers aligned and avoid misinterpretation.

 type TokenMetrics struct {
   ID              int       `json:"id"`
   Timestamp       time.Time `json:"timestamp"`
   Model           string    `json:"model"`
-  CachedTokens    int       `json:"cache_tokens"`
+  // CachedTokens is the number of prompt tokens served from the KV cache. -1 = unknown.
+  CachedTokens    int       `json:"cache_tokens"`
   InputTokens     int       `json:"input_tokens"`
   OutputTokens    int       `json:"output_tokens"`
   PromptPerSecond float64   `json:"prompt_per_second"`
   TokensPerSecond float64   `json:"tokens_per_second"`
   DurationMs      int       `json:"duration_ms"`
 }
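
As a side note, marshaling the struct above shows the wire shape the UI's `Metrics` interface consumes. A hedged sketch with made-up values (the model name and numbers are purely illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Local mirror of TokenMetrics from proxy/metrics_monitor.go, reproduced
// only to show the JSON wire format of one Activity entry.
type TokenMetrics struct {
	ID              int       `json:"id"`
	Timestamp       time.Time `json:"timestamp"`
	Model           string    `json:"model"`
	CachedTokens    int       `json:"cache_tokens"` // -1 = unknown
	InputTokens     int       `json:"input_tokens"`
	OutputTokens    int       `json:"output_tokens"`
	PromptPerSecond float64   `json:"prompt_per_second"`
	TokensPerSecond float64   `json:"tokens_per_second"`
	DurationMs      int       `json:"duration_ms"`
}

func main() {
	m := TokenMetrics{
		ID: 1, Timestamp: time.Now(), Model: "example-model", // illustrative values
		CachedTokens: 128, InputTokens: 512, OutputTokens: 64,
		PromptPerSecond: 2048.0, TokensPerSecond: 35.0, DurationMs: 2100,
	}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out)) // cache_tokens appears alongside input/output tokens
}
```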
ui/src/contexts/APIProvider.tsx (1)

31-31: Annotate unknown sentinel on cache_tokens

Backend uses -1 for unknown; add a short note so callers don’t assume non-negative only.

   model: string;
-  cache_tokens: number;
+  // -1 => unknown (timings not provided by upstream)
+  cache_tokens: number;
proxy/metrics_middleware.go (1)

75-76: Also accept upstream’s legacy tokens_cached (back-compat) and keep cache_n

Several llama.cpp server responses expose a top-level tokens_cached instead of timings.cache_n. Parse both to cover older and newer servers. Examples with tokens_cached are present in real server outputs.

   // default values
-  cachedTokens := -1 // unknown or missing data
+  cachedTokens := -1 // unknown or missing data
@@
-    if cachedValue := jsonData.Get("timings.cache_n"); cachedValue.Exists() {
-      cachedTokens = int(cachedValue.Int())
-    }
+    // newer: timings.cache_n ; older: top-level tokens_cached
+    if cachedValue := jsonData.Get("timings.cache_n"); cachedValue.Exists() {
+      cachedTokens = int(cachedValue.Int())
+    } else if cachedValue := jsonData.Get("tokens_cached"); cachedValue.Exists() {
+      cachedTokens = int(cachedValue.Int())
+    }
   }
+
+  // if timings block was absent, still try legacy top-level field
+  if cachedTokens < 0 {
+    if cachedValue := jsonData.Get("tokens_cached"); cachedValue.Exists() {
+      cachedTokens = int(cachedValue.Int())
+    }
+  }
@@
   rec.metricsMonitor.addMetrics(TokenMetrics{
     Timestamp:       time.Now(),
     Model:           rec.realModelName,
     CachedTokens:    cachedTokens,
     InputTokens:     inputTokens,

Also applies to: 97-100, 102-111
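
The suggested diff checks tokens_cached both inside and after the timings branch; the same fallback can be expressed once. A sketch of the consolidated logic, assuming a gjson-parsed body (names are illustrative):

```go
package main

import (
	"fmt"

	"github.com/tidwall/gjson"
)

// readCachedTokens prefers the newer timings.cache_n and falls back to
// the legacy top-level tokens_cached; -1 means neither field was present.
func readCachedTokens(body []byte) int {
	for _, path := range []string{"timings.cache_n", "tokens_cached"} {
		if v := gjson.GetBytes(body, path); v.Exists() {
			return int(v.Int())
		}
	}
	return -1
}

func main() {
	fmt.Println(readCachedTokens([]byte(`{"tokens_cached":96}`))) // 96 (legacy server)
	fmt.Println(readCachedTokens([]byte(`{}`)))                   // -1 (unknown)
}
```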

ui/src/pages/Activity.tsx (3)

18-20: Align copy with intent (“just now”)

The comment says “just now”, but the returned value is “now”. Low-risk UX consistency fix.

-  if (diffInSeconds < 5) {
-    return "now";
-  }
+  if (diffInSeconds < 5) {
+    return "just now";
+  }

79-81: Show 0 cached tokens explicitly; reserve “-” for unknown

Distinguishes “no cache hits” (0) from “not reported” (-1).

-    {metric.cache_tokens > 0 ? metric.cache_tokens.toLocaleString() : "-"}
+    {metric.cache_tokens >= 0 ? metric.cache_tokens.toLocaleString() : "-"}

101-119: Make Tooltip keyboard-accessible and screen-reader friendly

Add focus handling and ARIA to improve a11y without changing visuals.

-const Tooltip: React.FC<TooltipProps> = ({ content }) => {
+const Tooltip: React.FC<TooltipProps> = ({ content }) => {
   return (
-    <div className="relative group inline-block">
+    <div className="relative group inline-block" tabIndex={0} aria-label={content}>
       ⓘ
       <div
-        className="absolute top-full left-1/2 transform -translate-x-1/2 mt-2
-                     px-3 py-2 bg-gray-900 text-white text-sm rounded-md
-                     opacity-0 group-hover:opacity-100 transition-opacity
-                     duration-200 pointer-events-none whitespace-nowrap z-50 normal-case"
+        className="absolute top-full left-1/2 transform -translate-x-1/2 mt-2
+                     px-3 py-2 bg-gray-900 text-white text-sm rounded-md
+                     opacity-0 group-hover:opacity-100 group-focus:opacity-100 transition-opacity
+                     duration-200 pointer-events-none whitespace-nowrap z-50 normal-case"
       >
         {content}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 954e2de and a023624.

📒 Files selected for processing (4)
  • proxy/metrics_middleware.go (2 hunks)
  • proxy/metrics_monitor.go (1 hunks)
  • ui/src/contexts/APIProvider.tsx (1 hunks)
  • ui/src/pages/Activity.tsx (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
ui/src/contexts/APIProvider.tsx (1)
ui/src/pages/Models.tsx (5)
  • totalRequests (136-145)
  • StatsPanel (133-173)
  • sum (143-143)
  • sum (141-141)
  • sum (142-142)
proxy/metrics_middleware.go (1)
proxy/metrics_monitor.go (1)
  • TokenMetrics (12-22)
ui/src/pages/Activity.tsx (1)
ui/src/pages/Models.tsx (3)
  • StatsPanel (133-173)
  • totalRequests (136-145)
  • sum (143-143)

mostlygeek merged commit f58c8c8 into main on Sep 6, 2025
3 checks passed
mostlygeek deleted the support-llama-cpp-cache-metrics branch on September 6, 2025 at 20:58
mcowger pushed a commit to mcowger/llama-swap that referenced this pull request Sep 8, 2025
Capture prompt cache metrics and surface them on Activities page in UI
mcowger added a commit to mcowger/llama-swap that referenced this pull request Sep 9, 2025
* Add a config editor page

* Support llama.cpp's cache_n in timings info (mostlygeek#287)

Capture prompt cache metrics and surface them on Activities page in UI

* Fix mostlygeek#288 Vite hot module reloading creating multiple SSE connections (mostlygeek#290)

- move SSE (EventSource) connection to module level
- manage EventSource as a singleton, closing open connection before
  reopening a new one

* Add model name copy button to Models UI

---------

Co-authored-by: Benson Wong <[email protected]>