Conversation

justinwlin (Contributor) commented:

What does this PR do?

Sorry to @mattf: I thought I could close the other PR and reopen it, but I no longer have the option to reopen it. I just didn't want it to keep notifying maintainers whenever I pushed test commits.

Continuation of: #3641

This PR fixes the RunPod adapter: #3517

What I fixed from before:

  1. Migrated the adapter fully to the OpenAI-compatible APIs.
  2. Fixed up the class, since OpenAIMixin had a couple of changes around the Pydantic base-model rework.
  3. Tested that models are discovered dynamically and that the resulting identifier works for requests:

```bash
curl -X GET \
  -H "Content-Type: application/json" \
  "http://localhost:8321/v1/models"
```

Test Plan

# RunPod Provider Quick Start

## Prerequisites
- Python 3.10+
- Git
- RunPod API token

## Setup for Development

```bash
# 1. Clone and enter the repository
git clone https://github.com/llamastack/llama-stack.git
cd llama-stack

# 2. Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 3. Remove any existing llama-stack installation
pip uninstall llama-stack llama-stack-client -y

# 4. Install llama-stack in development mode
pip install -e .

# 5. Build using local development code
# (found this through the Discord)
LLAMA_STACK_DIR=. llama stack build

# When prompted during build:
# - Name: runpod-dev
# - Image type: venv
# - Inference provider: remote::runpod
# - Safety provider: "llama-guard" 
# - Other providers: take the first (default) options
```

## Configure the Stack

The RunPod adapter automatically discovers models from your endpoint via the `/v1/models` API. No manual model configuration is required; just set your environment variables.
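For intuition, the discovery step amounts to an authenticated GET against the endpoint's OpenAI-style `/models` route. Here is a minimal illustrative sketch (not the adapter's actual code), assuming the `requests` package and the `RUNPOD_URL`/`RUNPOD_API_TOKEN` variables set in the next step:

```python
import os

import requests

# Illustrative only: what dynamic model discovery boils down to.
base_url = os.environ["RUNPOD_URL"]  # e.g. https://api.runpod.ai/v2/<endpoint>/openai/v1
token = os.environ["RUNPOD_API_TOKEN"]

resp = requests.get(
    f"{base_url}/models",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # standard OpenAI models-list shape assumed
```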

## Run the Server

Important: use the build-created virtual environment, not the development `.venv`.

```bash
# Exit the development venv if you're in it
deactivate

# Activate the build-created venv (NOT .venv) from the llama-stack repo folder
cd llama-stack
source llamastack-runpod-dev/bin/activate
```

### For the Qwen3-32B-AWQ Public Endpoint (Recommended)

```bash
# Set environment variables
export RUNPOD_URL="https://api.runpod.ai/v2/qwen3-32b-awq/openai/v1"
export RUNPOD_API_TOKEN="your_runpod_api_key"

# Start the server
llama stack run ~/.llama/distributions/llamastack-runpod-dev/llamastack-runpod-dev-run.yaml
```
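The server can take a few seconds to come up. If you want to wait for readiness before testing, here is a small optional sketch (assumes the `requests` package and the default port 8321 used throughout this test plan):

```python
import time

import requests

# Poll the llama-stack server until /v1/models answers.
url = "http://localhost:8321/v1/models"
for _ in range(30):
    try:
        if requests.get(url, timeout=2).status_code == 200:
            print("server is up")
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)
else:
    raise SystemExit("server did not come up in time")
```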

## Quick Test

### 1. List Available Models (Dynamic Discovery)

First, check which models are available on your RunPod endpoint:

```bash
curl -X GET \
  -H "Content-Type: application/json" \
  "http://localhost:8321/v1/models"
```

Example Response:

```json
{
  "data": [
    {
      "identifier": "qwen3-32b-awq",
      "provider_resource_id": "Qwen/Qwen3-32B-AWQ",
      "provider_id": "runpod",
      "type": "model",
      "metadata": {},
      "model_type": "llm"
    }
  ]
}
```

Note: Use the `identifier` value from the response above in your requests below.
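To grab the identifier programmatically instead, a small sketch against the response shape shown above (assumes the `requests` package):

```python
import requests

# Pull model identifiers out of the llama-stack /v1/models response.
resp = requests.get("http://localhost:8321/v1/models", timeout=30)
resp.raise_for_status()
identifiers = [m["identifier"] for m in resp.json()["data"]]
print(identifiers)  # e.g. ['qwen3-32b-awq']
```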

### 2. Chat Completion (Non-streaming)

Replace `qwen3-32b-awq` with your model identifier from step 1:

```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Hello, count to 3"}],
    "stream": false
  }'
```
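The same request from Python, assuming the `openai` client package is installed and that the route is OpenAI-compatible, as the curl above suggests; the `api_key` value is a placeholder (the curl above sends no auth header):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local llama-stack server.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3-32b-awq",  # your identifier from step 1
    messages=[{"role": "user", "content": "Hello, count to 3"}],
    stream=False,
)
print(resp.choices[0].message.content)
```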

### 3. Chat Completion (Streaming)

```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
```

Clean streaming output:

```bash
curl -N -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-32b-awq", "messages": [{"role": "user", "content": "Count to 5"}], "stream": true}' \
  2>/dev/null | while read -r line; do
    echo "$line" | grep "^data: " | sed 's/^data: //' | jq -r '.choices[0].delta.content // empty' 2>/dev/null
  done
```

Expected Output:

```
1
2
3
4
5
```
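The streaming variant in Python does the same job as the shell pipeline above, but the client handles the SSE `data:` frames for you (same assumptions as the non-streaming sketch):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="qwen3-32b-awq",  # your identifier from step 1
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```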

We can just use the default; RunPod's embedding endpoint for vLLM is nothing special and simply passes through to vLLM.
mattf (Collaborator) left a comment:
looking good to me

mattf changed the title from "Runpod adapter fix" to "feat: enable Runpod inference adapter" on Oct 7, 2025
leseb merged commit 509ac4a into llamastack:main on Oct 7, 2025
21 of 22 checks passed