System Info
- LlamaStack Version: 0.2.18 (distribution-starter)
Information
- The official example scripts
- My own modified scripts
🐛 Describe the bug
Summary
LlamaStack's responses API fails with 'ModelResponseStream' object has no attribute 'usage' when using Gemini models, making the /v1/openai/v1/responses endpoint unusable with those models.
Environment
- LlamaStack Version: 0.2.18 (distribution-starter)
- Affected Models: All Gemini models (tested with gemini-2.0-flash, gemini-2.5-flash, gemini-2.5-pro)
- Provider: remote::vertexai
- API Endpoint: /v1/openai/v1/responses
- Deployment: Kubernetes/OpenShift
Steps to Reproduce
- Configure LlamaStack with a Gemini model using the vertexai provider
- Enable telemetry (default configuration)
- Make a request to the responses API:
```python
import openai

client = openai.OpenAI(base_url="http://llamastack:8321/v1/openai/v1")
response = client.responses.create(
    input=[{"role": "user", "content": "Hello", "type": "message"}],
    model="appeng-ai-quickstarts-vertexai/vertex_ai/gemini-2.0-flash",
    stream=False,
)
```
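With the bug present, this call never returns a response object. A hedged illustration of what the client sees (openai-python v1 raises `InternalServerError` for HTTP 500s; the error message comes from the server logs below):

```python
# Hedged illustration: the request above currently fails server-side with an
# HTTP 500, which openai-python v1 surfaces as InternalServerError.
try:
    client.responses.create(
        input=[{"role": "user", "content": "Hello", "type": "message"}],
        model="appeng-ai-quickstarts-vertexai/vertex_ai/gemini-2.0-flash",
    )
except openai.InternalServerError as exc:
    # Server log: 'ModelResponseStream' object has no attribute 'usage'
    print(exc)
```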
Expected Behavior
- Responses API should return a successful response object
- Telemetry should handle missing usage attributes gracefully
Actual Behavior
- HTTP 500 Internal Server Error
- Server logs show: 'ModelResponseStream' object has no attribute 'usage'
- API request fails completely
Root Cause Analysis
The issue occurs in llama_stack/core/routers/inference.py in the openai_chat_completion method (lines ~532-536):
```python
if self.telemetry:
    metrics = self._construct_metrics(
        prompt_tokens=response.usage.prompt_tokens,  # ← FAILS HERE
        completion_tokens=response.usage.completion_tokens,
        total_tokens=response.usage.total_tokens,
        model=model_obj,
    )
```
Problem: The code unconditionally accesses response.usage attributes for telemetry logging, but the ModelResponseStream chunks that LiteLLM returns for Gemini models do not carry a usage attribute.
Additional locations with the same issue (a guarded sketch of the streaming case follows this list):
- Line ~428-430 in openai_completion method
- Line ~801-805 in streaming section
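For the streaming section in particular, at most the final chunk carries usage. A minimal sketch of a guarded version, assuming `_construct_metrics` and `model_obj` match the router code; `response_stream` and everything else here are illustrative, not the actual implementation:

```python
# Hedged sketch of the streaming path: remember usage if a chunk carries it,
# and emit metrics once at the end instead of assuming every chunk has it.
final_usage = None
async for chunk in response_stream:
    if getattr(chunk, "usage", None) is not None:
        final_usage = chunk.usage
    yield chunk

if self.telemetry and final_usage is not None:
    metrics = self._construct_metrics(
        prompt_tokens=final_usage.prompt_tokens,
        completion_tokens=final_usage.completion_tokens,
        total_tokens=final_usage.total_tokens,
        model=model_obj,
    )
```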
Impact
- Severity: High - Completely blocks responses API usage with Gemini models
- Scope: Affects all Gemini model variants when telemetry is enabled
- Workaround: Disable telemetry entirely (removes observability)
Proposed Fix
Add defensive checks before accessing usage attributes:
```python
if self.telemetry and hasattr(response, 'usage') and response.usage is not None:
    metrics = self._construct_metrics(
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        total_tokens=response.usage.total_tokens,
        model=model_obj,
    )
    # ... rest of telemetry logic
```
Apply this pattern to all locations that access response.usage or chunk.usage.
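To avoid repeating the guard at every call site, one option is a small helper. This is a sketch only: `safe_usage_metrics` is a hypothetical name, and only `_construct_metrics` is from the router code.

```python
# Hypothetical helper (name and placement are illustrative): centralizes the
# defensive check so every call site treats a missing or None usage the same way.
def safe_usage_metrics(self, response_or_chunk, model_obj):
    usage = getattr(response_or_chunk, "usage", None)
    if usage is None:
        return None
    return self._construct_metrics(
        prompt_tokens=usage.prompt_tokens,
        completion_tokens=usage.completion_tokens,
        total_tokens=usage.total_tokens,
        model=model_obj,
    )
```

Call sites would then reduce to `metrics = self.safe_usage_metrics(response, model_obj)` guarded by `if self.telemetry and metrics is not None:`.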
Additional Context
- OpenAI chat completions API works fine with the same Gemini models
- Issue is specific to responses API internal implementation
- Both LlamaStack and OpenAI clients hit the same server-side error
- Telemetry disabling workaround confirmed - removing telemetry from config resolves the issue
Test Case
```python
# This should work without throwing AttributeError
response = client.responses.create(
    input=[{"role": "user", "content": "test", "type": "message"}],
    model="appeng-ai-quickstarts-vertexai/vertex_ai/gemini-2.0-flash",
)
assert response.status == "completed"
```
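Since the failure surfaces in the server's streaming helper (see the logs below), a streaming variant is worth covering too. A hedged sketch: the `response.completed` event type is from the OpenAI responses API and is assumed, not verified against LlamaStack's emitted events.

```python
# Hedged streaming variant: exercises the server-side streaming path directly.
stream = client.responses.create(
    input=[{"role": "user", "content": "test", "type": "message"}],
    model="appeng-ai-quickstarts-vertexai/vertex_ai/gemini-2.0-flash",
    stream=True,
)
events = list(stream)
assert any(event.type == "response.completed" for event in events)
```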
Files to Modify
- llama_stack/core/routers/inference.py (primary fix location)
- Any other files that unconditionally access .usage attributes
Error logs
```
INFO 2025-09-11 16:52:02,421 console_span_processor:62 telemetry: 16:52:02.365 [INFO] LiteLLM completion() model= gemini-2.0-flash; provider = vertex_ai
INFO 2025-09-11 16:52:02,429 console_span_processor:39 telemetry: 16:52:02.422 [END] InferenceRouter.openai_chat_completion [StatusCode.OK] (55.76ms)
INFO 2025-09-11 16:52:02,430 console_span_processor:48 telemetry: output: <async_generator object InferenceRouter.stream_tokens_and_compute_metrics_openai_chat at 0x7f0c3c29f4c0>
ERROR 2025-09-11 16:52:02,991 __main__:253 server: Error executing endpoint route='/v1/openai/v1/responses' method='post': 'ModelResponseStream' object has no attribute 'usage'
INFO 2025-09-11 16:52:02,992 uvicorn.access:473 uncategorized: 10.131.0.115:50470 - "POST /v1/openai/v1/responses HTTP/1.1" 500
INFO 2025-09-11 16:52:03,001 console_span_processor:39 telemetry: 16:52:02.994 [END] InferenceRouter.stream_tokens_and_compute_metrics_openai_chat [StatusCode.OK] (561.84ms)
INFO 2025-09-11 16:52:03,002 console_span_processor:48 telemetry: chunk_count: 4
INFO 2025-09-11 16:52:03,009 console_span_processor:39 telemetry: 16:52:03.004 [END] /v1/openai/v1/responses [StatusCode.OK] (689.96ms)
INFO 2025-09-11 16:52:03,010 console_span_processor:48 telemetry: raw_path: /v1/openai/v1/responses
INFO 2025-09-11 16:52:03,011 console_span_processor:62 telemetry: 16:52:02.992 [ERROR] Error executing endpoint route='/v1/openai/v1/responses' method='post': 'ModelResponseStream' object has no attribute 'usage'
INFO 2025-09-11 16:52:03,012 console_span_processor:62 telemetry: 16:52:02.993 [INFO] 10.131.0.115:50470 - "POST /v1/openai/v1/responses HTTP/1.1" 500
```