
Conversation

@Jan-Kazlouski-elastic (Contributor) commented Jun 26, 2025

Creates a new Llama inference provider integration that allows text_embedding, completion (both streaming and non-streaming), and chat_completion (streaming only) to be executed through the inference API.

Changes were tested locally against the following models:

  • all-MiniLM-L6-v2 (text embedding)
  • llama3.2:3b (completion & chat_completion)

The Ollama service was used for testing.
Quickstart for setting up and running the Llama Stack service locally: https://llama-stack.readthedocs.io/en/latest/getting_started/index.html

Setup

Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

Download and run Ollama

https://ollama.com/download

Clone the llama stack repo: git clone git@github.com:meta-llama/llama-stack.git, then follow the detailed instructions in the docs above.

Running `all-minilm:l6-v2`

Download the model:

ollama pull all-minilm:l6-v2
INFERENCE_MODEL=all-minilm:l6-v2 uv run --with llama-stack llama stack build --template starter --image-type venv --run
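
To confirm the stack is actually serving before wiring it into Elasticsearch, it can be queried directly. A minimal sanity check, assuming the default llama-stack port 8321 and a model-listing route at /v1/models (both assumptions; adjust to your llama-stack version):

# List the models the stack has registered; the output should include all-minilm:l6-v2
curl -s http://localhost:8321/v1/models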

Examples of requests (RQ) and responses (RS) from local testing:

Create Embedding Endpoint

No URL:

RQ:
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "service": "llama",
    "service_settings": {
        "api_key": "{{mistral-api-key}}",
        "model_id": "all-MiniLM-L6-v2"
    }
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "validation_exception",
                "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
            }
        ],
        "type": "validation_exception",
        "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
    },
    "status": 400
}

No API key (success):

RQ:
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/inference/embeddings",
        "model_id": "all-MiniLM-L6-v2"
    }
}
RS:
{
    "inference_id": "llama-text-embedding",
    "task_type": "text_embedding",
    "service": "llama",
    "service_settings": {
        "model_id": "all-MiniLM-L6-v2",
        "url": "http://localhost:8321/v1/inference/embeddings",
        "dimensions": 384,
        "similarity": "cosine",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}

Not Found:

RQ:
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/inference/embeddings1",
        "api_key": "{{mistral-api-key}}",
        "model_id": "all-MiniLM-L6-v2"
    }
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "Resource not found at [http://localhost:8321/v1/inference/embeddings1] for request from inference entity id [llama-text-embedding] status [404]. Error message: [{\"detail\":\"Not Found\"}]"
            }
        ],
        "type": "status_exception",
        "reason": "Could not complete inference endpoint creation as validation call to service threw an exception.",
        "caused_by": {
            "type": "status_exception",
            "reason": "Resource not found at [http://localhost:8321/v1/inference/embeddings1] for request from inference entity id [llama-text-embedding] status [404]. Error message: [{\"detail\":\"Not Found\"}]"
        }
    },
    "status": 400
}

Success:

RQ:
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/inference/embeddings",
        "api_key": "{{mistral-api-key}}",
        "model_id": "all-MiniLM-L6-v2"
    }
}
RS:
{
    "inference_id": "llama-text-embedding",
    "task_type": "text_embedding",
    "service": "llama",
    "service_settings": {
        "model_id": "all-MiniLM-L6-v2",
        "url": "http://localhost:8321/v1/inference/embeddings",
        "dimensions": 384,
        "similarity": "cosine",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}
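
For debugging, the same route the endpoint targets can be exercised without Elasticsearch. A sketch only: the model_id and contents field names are assumptions based on the llama-stack inference API and may differ between versions:

# Hypothetical direct call to the embeddings route configured above
curl -s http://localhost:8321/v1/inference/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model_id": "all-MiniLM-L6-v2", "contents": ["The sky above the port"]}'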
Perform Embedding

Bad Request:

RQ:
POST {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "query": "string",
    "task_settings": {}
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "action_request_validation_exception",
                "reason": "Validation Failed: 1: Field [input] cannot be null;"
            }
        ],
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: Field [input] cannot be null;"
    },
    "status": 400
}

Success:

RQ:
POST {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}
RS (embedding vector truncated for brevity; the full response has 384 values):
{
    "text_embedding": [
        {
            "embedding": [
                0.055843446,
                0.01615099
            ]
        }
    ]
}
Create Completion Endpoint

No URL:

RQ:
PUT {{base-url}}/_inference/completion/llama-completion
{
    "service": "llama",
    "service_settings": {
        "api_key": "{{api-key}}",
        "model_id": "llama3.2:3b"
    }
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "validation_exception",
                "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
            }
        ],
        "type": "validation_exception",
        "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
    },
    "status": 400
}

Success:

RQ:
PUT {{base-url}}/_inference/completion/llama-completion
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "model_id": "ollama/llama3.2:3b"
    }
}
RS:
{
    "inference_id": "llama-completion",
    "task_type": "completion",
    "service": "llama",
    "service_settings": {
        "model_id": "llama3.2:3b",
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
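
Because the configured URL is llama-stack's OpenAI-compatible route, it can also be smoke-tested directly with an OpenAI-style payload. A sketch assuming the route follows the standard OpenAI chat completions schema (implied by its path, not separately verified):

# Direct call to the same URL the inference endpoint targets; useful for
# checking that the ollama/llama3.2:3b model id is recognized
curl -s http://localhost:8321/v1/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ollama/llama3.2:3b", "messages": [{"role": "user", "content": "Say hello"}]}'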
Perform Completion

Success (Non-Streaming):

RQ:
POST {{base-url}}/_inference/completion/llama-completion
{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}
RS:
{
    "completion": [
        {
            "result": "You're quoting Joseph Heller's classic novel \"Catch-22\". The famous line from Chapter 14 reads:\n\n\"The sky above the port was the color of television set left on at high heat, which caught the sun in its glassy eye like a garnish on a prawns cocktail.\"\n\nThe phrase has since become a metaphor for a sense of desolation and hopelessness, often used to describe the feeling of being stuck or trapped in a situation."
        }
    ]
}

Success (Streaming):

RQ:
POST {{base-url}}/_inference/completion/llama-completion/_stream
{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}
RS:
event: message
data: {"completion":[{"delta":"That"}]}

event: message
data: {"completion":[{"delta":"'s"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" great"}]}

event: message
data: {"completion":[{"delta":" quote"}]}

event: message
data: [DONE]
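
The streamed events above can be reproduced from a terminal; curl's -N flag disables output buffering so each SSE chunk prints as it arrives (sketch assuming a local cluster with security disabled, so no auth header):

# Stream the completion as server-sent events
curl -N -X POST "{{base-url}}/_inference/completion/llama-completion/_stream" \
  -H "Content-Type: application/json" \
  -d '{"input": "The sky above the port was the color of television tuned to a dead channel."}'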

Bad Request (Non-Streaming):

RQ:
POST {{base-url}}/_inference/completion/llama-completion
{
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "action_request_validation_exception",
                "reason": "Validation Failed: 1: Field [input] cannot be null;"
            }
        ],
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: Field [input] cannot be null;"
    },
    "status": 400
}

Bad Request (Streaming):

RQ:
POST {{base-url}}/_inference/completion/llama-completion/_stream
{
}
RS:
event: error
data: {"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: Field [input] cannot be null;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: Field [input] cannot be null;"},"status":400}

Create Chat Completion Endpoint

No URL:

RQ:
PUT {{base-url}}/_inference/chat_completion/llama-chat-completion
{
    "service": "llama",
    "service_settings": {
        "api_key": "{{mistral-api-key}}",
        "model_id": "ollama/llama3.2:3b"
    }
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "validation_exception",
                "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
            }
        ],
        "type": "validation_exception",
        "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
    },
    "status": 400
}

Success:

RQ:
PUT {{base-url}}/_inference/chat_completion/llama-chat-completion
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "api_key": "{{mistral-api-key}}",
        "model_id": "ollama/llama3.2:3b"
    }
}
RS:
{
    "inference_id": "llama-chat-completion",
    "task_type": "chat_completion",
    "service": "llama",
    "service_settings": {
        "model_id": "ollama/llama3.2:3b",
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform Chat Completion

Success (Basic):

RQ:
POST {{base-url}}/_inference/chat_completion/llama-chat-completion/_stream
{
    "messages": [
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ],
    "max_completion_tokens": 10
}
RS:
event: message
data: {"id":"chatcmpl-bc589b74-a744-418b-a856-fa11abd98c8c","choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":"length","index":0}],"model":"llama3.2:3b","object":"chat.completion.chunk"}

event: message
data: {"id":"chatcmpl-bc589b74-a744-418b-a856-fa11abd98c8c","choices":[],"model":"llama3.2:3b","object":"chat.completion.chunk","usage":{"completion_tokens":10,"prompt_tokens":30,"total_tokens":40}}

event: message
data: [DONE]

Success (Complex):

RQ:
POST {{base-url}}/_inference/chat_completion/llama-chat-completion/_stream
{
    "model": "llama3.2:3b",
    "messages": [{
            "role": "user",
            "content": [{
                    "type": "text",
                    "text": "What's the price of a scarf?"
                }
            ]
        }
    ],
    "tools": [{
            "type": "function",
            "function": {
                "name": "get_current_price",
                "description": "Get the current price of a item",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "item": {
                            "id": "123"
                        }
                    }
                }
            }
        }
    ],
    "tool_choice": {
        "type": "function",
        "function": {
            "name": "get_current_price"
        }
    }
}

RS:
event: message
data: {"id":"chatcmpl-387a656b-2c0a-4fa7-9929-71e89c999c7e","choices":[{"delta":{"content":"","role":"assistant","tool_calls":[{"index":0,"id":"call_4qiq7n2n","function":{"arguments":"{\"item\":\"scarf\"}","name":"get_current_price"},"type":"function"}]},"index":0}],"model":"llama3.2:3b","object":"chat.completion.chunk"}

event: message
data: {"id":"chatcmpl-387a656b-2c0a-4fa7-9929-71e89c999c7e","choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":"tool_calls","index":0}],"model":"llama3.2:3b","object":"chat.completion.chunk"}

event: message
data: {"id":"chatcmpl-387a656b-2c0a-4fa7-9929-71e89c999c7e","choices":[],"model":"llama3.2:3b","object":"chat.completion.chunk","usage":{"completion_tokens":15,"prompt_tokens":160,"total_tokens":175}}

event: message
data: [DONE]

Invalid Model:

RQ:
POST {{base-url}}/_inference/chat_completion/llama-chat-completion/_stream
{
    "model": "ggg",
    "messages": [
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ],
    "max_completion_tokens": 10
}
RS:
event: error
data: {"error":{"code":"stream_error","message":"Received an error response for request from inference entity id [llama-chat-completion]. Error message: [400: Invalid value: Model 'ggg' not found]","type":"llama_error"}}


  • Have you signed the contributor license agreement?
  • Have you followed the contributor guidelines?
  • If submitting code, have you built your formula locally prior to submission with gradle check?
  • If submitting code, is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed.
  • If submitting code, have you checked that your submission is for an OS and architecture that we support?
  • If you are submitting this code for a class then read our policy for that.

@elasticsearchmachine added the v9.2.0 and external-contributor (Pull request authored by a developer outside the Elasticsearch team) labels on Jun 26, 2025
Jan-Kazlouski-elastic and others added 24 commits June 26, 2025 21:14
…r handling and improve error response parsing
…g-completion

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
@Jan-Kazlouski-elastic Jan-Kazlouski-elastic marked this pull request as ready for review July 4, 2025 14:02
@Jan-Kazlouski-elastic Jan-Kazlouski-elastic requested a review from a team as a code owner July 4, 2025 14:02
…g-completion

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
@Jan-Kazlouski-elastic (Contributor Author):

@jonathan-buttner Thank you for your comments. They are addressed and the PR is ready to be re-reviewed.

@jonathan-buttner left a comment:

Thanks for the changes, left a few more suggestions.

public static final TransportVersion ML_INFERENCE_COHERE_API_VERSION_8_19 = def(8_841_0_60);
public static final TransportVersion ESQL_DOCUMENTS_FOUND_AND_VALUES_LOADED_8_19 = def(8_841_0_61);
public static final TransportVersion ESQL_PROFILE_INCLUDE_PLAN_8_19 = def(8_841_0_62);
public static final TransportVersion ESQL_SPLIT_ON_BIG_VALUES_8_19 = def(8_841_0_63);
Contributor:

Sorry I forgot to mention this in the previous review, we won't be backporting this to 8.x so we can remove this transport version.

Contributor Author:

Removed.


@Override
public TransportVersion getMinimalSupportedVersion() {
assert false : "should never be called when supportsVersion is used";
Contributor:

I believe we can remove this line now because we won't need to backport to 8.x

Contributor Author:

Removed.

}

@Override
public boolean supportsVersion(TransportVersion version) {
Contributor:

Let's remove this override.

Contributor Author:

Removed.


@Override
public TransportVersion getMinimalSupportedVersion() {
assert false : "should never be called when supportsVersion is used";
Contributor:

Let's remove this.

return TransportVersions.ML_INFERENCE_LLAMA_ADDED;
}

@Override
Contributor:

Let's remove this method override.

Contributor Author:

Removed.


@Override
public int rateLimitGroupingHash() {
return 0;
Contributor:

Good catch. In the future, let's add these bug fix changes to a separate PR.

Contributor Author:

Sure thing!

}

protected abstract CustomModel createEmbeddingModel(@Nullable SimilarityMeasure similarityMeasure);
protected abstract Model createEmbeddingModel(@Nullable SimilarityMeasure similarityMeasure);
Contributor:

Thanks for these

Contributor Author:

No problem.

}
}

public void testParseRequestConfig_CreatesChatCompletionsModel() throws IOException {
Contributor:

I believe the base class covers this test, can you check and see if this test covers anything additional, if not, let's remove it from here.

Contributor Author:

You are correct. Added a check for model_id to the common model assertion method, since it was missed before.

@jonathan-buttner (Contributor):

@Jan-Kazlouski-elastic I did some testing, things are looking good. I think there's one scenario we should add better validation error handling for.

I was struggling to get llama3.2:3b to be recognized and finally realized that I needed to prepend ollama/ to the model_id field. If you use a model string like llama3.2:3bbbb, the inference endpoint will still be created even though the test request our inference plugin makes receives:

data: {"error": {"message": "400: Invalid value: Model 'llama3.2:3b' not found"}}
PUT _inference/chat_completion/chat
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "model_id": "llama3.2:3bbbb"
    }
}

I think a better experience would be for the PUT request to fail and report back the error it received. This is probably a larger change unrelated to this implementation, though. I'll create an issue to improve the validation.

@jonathan-buttner (Contributor):

Could you fix the merge conflicts and then I'll approve and merge on Monday 👍

…g-completion

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferencePlugin.java
…g-completion

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
@Jan-Kazlouski-elastic (Contributor Author):

Conflicts are resolved. Adopted the error-handling and service-constructor changes from master.
FYI @jonathan-buttner

@jonathan-buttner jonathan-buttner merged commit beb18a8 into elastic:main Jul 18, 2025
35 checks passed

Labels: >enhancement · external-contributor (Pull request authored by a developer outside the Elasticsearch team) · :ml (Machine learning) · Team:ML (Meta label for the ML team) · v9.2.0
