Response API Error 500 when telemetry enabled and using gemini models #3420

@mhdawson

Description

System Info

  • LlamaStack Version: 0.2.18 (distribution-starter)

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Summary

LlamaStack's responses API fails with 'ModelResponseStream' object has no attribute 'usage' when using Gemini models, rendering the endpoint unusable.

Environment

  • LlamaStack Version: 0.2.18 (distribution-starter)
  • Affected Models: All Gemini models (tested with gemini-2.0-flash, gemini-2.5-flash, gemini-2.5-pro)
  • Provider: remote::vertexai
  • API Endpoint: /v1/openai/v1/responses
  • Deployment: Kubernetes/OpenShift

Steps to Reproduce

  1. Configure LlamaStack with a Gemini model using the vertexai provider
  2. Enable telemetry (default configuration)
  3. Make a request to the responses API:
import openai

# Point the client at the LlamaStack server. The OpenAI client requires a key
# to be set even though a default LlamaStack deployment does not validate it.
client = openai.OpenAI(base_url="http://llamastack:8321/v1/openai/v1", api_key="dummy")
response = client.responses.create(
    input=[{"role": "user", "content": "Hello", "type": "message"}],
    model="appeng-ai-quickstarts-vertexai/vertex_ai/gemini-2.0-flash",
    stream=False,
)
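
Note that stream=False does not sidestep the failure: the error logs below show the responses implementation consuming an internal async generator (InferenceRouter.stream_tokens_and_compute_metrics_openai_chat), so the unguarded usage access is reached either way. For completeness, a streaming variant of the same call (same model ID as above) would be expected to fail identically:

# Streaming variant of the repro above; the AttributeError is raised
# server-side while the telemetry wrapper iterates the chunks.
stream = client.responses.create(
    input=[{"role": "user", "content": "Hello", "type": "message"}],
    model="appeng-ai-quickstarts-vertexai/vertex_ai/gemini-2.0-flash",
    stream=True,
)
for event in stream:
    print(event)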

Expected Behavior

  • Responses API should return a successful response object
  • Telemetry should handle missing usage attributes gracefully

Actual Behavior

  • HTTP 500 Internal Server Error
  • Server logs show: 'ModelResponseStream' object has no attribute 'usage'
  • API request fails completely

Root Cause Analysis

The issue occurs in llama_stack/core/routers/inference.py in the openai_chat_completion method (lines ~532-536):

if self.telemetry:
    metrics = self._construct_metrics(
        prompt_tokens=response.usage.prompt_tokens,      # ← FAILS HERE
        completion_tokens=response.usage.completion_tokens,
        total_tokens=response.usage.total_tokens,
        model=model_obj,
    )

Problem: The code unconditionally accesses response.usage attributes for telemetry logging, but Gemini's ModelResponseStream objects do not provide a usage attribute.

Additional locations with the same issue:

  • Lines ~428-430 in the openai_completion method
  • Lines ~801-805 in the streaming section (see the per-chunk sketch below)
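
In the streaming section the guard has to be applied per chunk, because at most the final chunk carries usage data. A minimal sketch of the pattern, using illustrative names rather than the actual router code (stream_with_metrics and construct_metrics are hypothetical):

# Sketch only: guard chunk.usage inside a streaming wrapper.
# Gemini's ModelResponseStream chunks expose no `usage` attribute at all.
async def stream_with_metrics(stream, model_obj, construct_metrics):
    async for chunk in stream:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            metrics = construct_metrics(
                prompt_tokens=usage.prompt_tokens,
                completion_tokens=usage.completion_tokens,
                total_tokens=usage.total_tokens,
                model=model_obj,
            )
            # ... emit metrics ...
        yield chunk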

Impact

  • Severity: High - Completely blocks responses API usage with Gemini models
  • Scope: Affects all Gemini model variants when telemetry is enabled
  • Workaround: Disable telemetry entirely (removes observability)

Proposed Fix

Add defensive checks before accessing usage attributes:

if self.telemetry and hasattr(response, 'usage') and response.usage is not None:
    metrics = self._construct_metrics(
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        total_tokens=response.usage.total_tokens,
        model=model_obj,
    )
    # ... rest of telemetry logic

Apply this pattern to all locations that access response.usage or chunk.usage.
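
Alternatively, a small helper could centralize the guard so each call site stays compact. This is a sketch, not existing LlamaStack code (_usage_or_none is a hypothetical name):

def _usage_or_none(obj):
    """Return obj.usage when present and non-None, else None."""
    return getattr(obj, "usage", None)

# At each call site:
usage = _usage_or_none(response)
if self.telemetry and usage is not None:
    metrics = self._construct_metrics(
        prompt_tokens=usage.prompt_tokens,
        completion_tokens=usage.completion_tokens,
        total_tokens=usage.total_tokens,
        model=model_obj,
    )

Using getattr with a default covers both the missing attribute (Gemini) and an explicit usage=None in a single check.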

Additional Context

  • OpenAI chat completions API works fine with the same Gemini models
  • Issue is specific to responses API internal implementation
  • Both LlamaStack and OpenAI clients hit the same server-side error
  • Telemetry disabling workaround confirmed - removing telemetry from config resolves the issue

Test Case

# This should work without throwing AttributeError
response = client.responses.create(
    input=[{"role": "user", "content": "test", "type": "message"}],
    model="appeng-ai-quickstarts-vertexai/vertex_ai/gemini-2.0-flash"
)
assert response.status == "completed"
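
A companion check for the streaming path (hypothetical; event names follow the OpenAI responses streaming interface, so adjust if LlamaStack's surface differs):

# Iterating the stream should complete without a server-side AttributeError.
stream = client.responses.create(
    input=[{"role": "user", "content": "test", "type": "message"}],
    model="appeng-ai-quickstarts-vertexai/vertex_ai/gemini-2.0-flash",
    stream=True,
)
events = list(stream)
assert any(e.type == "response.completed" for e in events)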

Files to Modify

  • llama_stack/core/routers/inference.py (primary fix location)
  • Any other files that unconditionally access .usage attributes

Error logs

INFO     2025-09-11 16:52:02,421 console_span_processor:62 telemetry: 16:52:02.365 [INFO] LiteLLM completion() model= gemini-2.0-flash; provider = vertex_ai
INFO     2025-09-11 16:52:02,429 console_span_processor:39 telemetry: 16:52:02.422 [END] InferenceRouter.openai_chat_completion [StatusCode.OK] (55.76ms)
INFO     2025-09-11 16:52:02,430 console_span_processor:48 telemetry: output: <async_generator object InferenceRouter.stream_tokens_and_compute_metrics_openai_chat at 0x7f0c3c29f4c0>
ERROR    2025-09-11 16:52:02,991 __main__:253 server: Error executing endpoint route='/v1/openai/v1/responses' method='post': 'ModelResponseStream' object has no attribute 'usage'
INFO     2025-09-11 16:52:02,992 uvicorn.access:473 uncategorized: 10.131.0.115:50470 - "POST /v1/openai/v1/responses HTTP/1.1" 500
INFO     2025-09-11 16:52:03,001 console_span_processor:39 telemetry: 16:52:02.994 [END] InferenceRouter.stream_tokens_and_compute_metrics_openai_chat [StatusCode.OK] (561.84ms)
INFO     2025-09-11 16:52:03,002 console_span_processor:48 telemetry: chunk_count: 4
INFO     2025-09-11 16:52:03,009 console_span_processor:39 telemetry: 16:52:03.004 [END] /v1/openai/v1/responses [StatusCode.OK] (689.96ms)
INFO     2025-09-11 16:52:03,010 console_span_processor:48 telemetry: raw_path: /v1/openai/v1/responses
INFO     2025-09-11 16:52:03,011 console_span_processor:62 telemetry: 16:52:02.992 [ERROR] Error executing endpoint route='/v1/openai/v1/responses' method='post': 'ModelResponseStream' object has no attribute 'usage'
INFO     2025-09-11 16:52:03,012 console_span_processor:62 telemetry: 16:52:02.993 [INFO] 10.131.0.115:50470 - "POST /v1/openai/v1/responses HTTP/1.1" 500
