mtmd: server: Support multimodal data prompt in /completions endpoint of server #15108


Open · wants to merge 1 commit into master

Conversation

@65a (Contributor) commented Aug 6, 2025

Implement a basic way to include multimodal data in the completions endpoint. For now, this supports only directly embedded data, base64-encoded and provided as an array of strings under the JSON key multimodal_data. Documentation is updated to match. Local testing shows no regression for the no-media case, and successful image processing with media provided by a custom client.
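
For illustration, a minimal request sketch under the format described above; the host/port, image path, and placement of the <__media__> marker in the prompt (mirroring the test later in this thread) are assumptions, not part of this PR:

import base64
import requests  # third-party HTTP client, used here for brevity

# Read an image and base64-encode it, as expected by multimodal_data.
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    # Raw text prompt; the <__media__> marker placement follows the Gemma
    # example later in this thread and may depend on the model/template.
    "prompt": "<__media__>\nWhat do you see here?",
    "multimodal_data": [image_b64],  # array of base64 strings, per this PR
    "n_predict": 128,
}

resp = requests.post("http://localhost:8080/completions", json=payload)
print(resp.json()["content"])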

Similar to #14016, but avoids importing the URL-fetching logic entirely. That could be added later once it is factored out of the OpenAI-emulation code; this approach is simpler for now and avoids the need for URL parsing and remote fetch capabilities.

The originally referenced issue was #13872.

@ngxson ptal, hopefully this works for you.

@ngxson (Collaborator) commented Aug 7, 2025

The proposal looks OK, but there are some edge cases to consider:

  • What if the user enters a list of tokens, or a mixed list of tokens and strings, instead of a raw text prompt?
  • What if the user enters a list of multiple prompts?

I think proper test cases are required for this PR, similar to /chat/completions.

@65a (Contributor, Author) commented Aug 7, 2025

Let me look at those.

The token case is interesting; I'd like your thoughts there. The current server uses null tokens but already knows how many to insert, which seems hard on the client (it would have to know the multimodal embedding size before sending raw tokens to the completion endpoint). A magic token could work similarly to <__media__>, but I suspect that would require behavior changes. I need to poke at the code to understand that path better; test_completions.py has no tests for token prompt requests. An easy option is to just document that token + multimodal is not yet supported and throw an error if it happens.

The multiple text prompt part is also interesting from a usability perspective. I'll think about these and come back. The multi-prompt case should be straightforward to add tests for, though I'm not sure how it ought to work yet.

@65a (Contributor, Author) commented Aug 7, 2025

I have an idea that might be usable, namely that prompt can now contain an array of JSON objects, like "prompt": [{ "prompt": "foo", "multimodal_data": ["<base64>"] }, { "prompt": "bar", "multimodal_data": ["<base64>"] }]. This would be a specific documented option for the prompt field in /completions, and would only allow string multimodal prompts (for now). It's a bit verbose and requires more code refactoring, but as a user it makes sense to pair my prompts with their respective data. This would extend the currently supported prompt field options without changing the top-level request JSON object. It's also easy to add additional tests for. I'll try it out locally, but does that idea work for you, @ngxson?

@65a (Contributor, Author) commented Aug 7, 2025

Rough draft for the idea here (it compiles and passes existing tests): https://pastebin.com/8zek7ium

It's not complete or properly indented, but the idea is to use server_tokens in more places, so that the input tokenizer can branch and use MTMD tokenization where it makes sense to do so. As a side effect, embeddings probably gained multimodal support. Infill needs more work, and I think rerank would work if I can get push_back of one server_tokens onto another to work properly.

There are probably better ways to do some of this than I did, feedback welcome.

@65a (Contributor, Author) commented Aug 7, 2025

Improved version of the rough draft that actually works, ignore indentation: https://pastebin.com/R6NdKQPP

This works locally for my use case, and I've started adding tests. There are a few TODOs to make document reranking and embeddings support multimodal use cases, and I think the OAI case can also be streamlined.

The general approach is as described previously: use server_tokens in more places, break out mtmd prompt parsing into a function, and change various input tokenization calls to handle server_tokens instead of llama_tokens.

The request format for multiple prompts would be like this:

{
    ...
    "prompt": [
        "Prompt 1",
        [1, 2, 3, 4],
        { "prompt": "What is a tomato?", "multimodal_data": ["<base64>"] }
    ],
    ...
}

When multimodal_data is present, the JSON entry only supports what mtmd_tokenize supports. A single JSON object can also be provided instead of the array of prompts shown above, and a JSON object containing only a prompt (in either location) is handled the same way as a normal prompt string.
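
A rough Python sketch of sending that mixed-prompt request follows; the server address, image path, and response handling are assumptions for illustration:

import base64
import requests

with open("tomato.jpg", "rb") as f:
    tomato_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "prompt": [
        "Prompt 1",                      # plain string sub-prompt
        [1, 2, 3, 4],                    # raw token sub-prompt
        {"prompt": "What is a tomato?",  # multimodal sub-prompt
         "multimodal_data": [tomato_b64]},
    ],
    "n_predict": 64,
}

resp = requests.post("http://localhost:8080/completions", json=payload)

# With several sub-prompts the server may return a list of results,
# so handle both a single object and a list here.
data = resp.json()
for result in data if isinstance(data, list) else [data]:
    print(result["content"])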

The github-actions bot added the python (python script changes) label on Aug 8, 2025.
@65a force-pushed the master branch 2 times, most recently from 744d758 to 62f3bae on August 8, 2025 03:23.
@65a (Contributor, Author) commented Aug 8, 2025

Added tests, including a vision test. This should be good for a review pass. There is some potential future work, including supporting multimodal prompts in document rerank and infill. Embeddings may already work (existing tests pass), but I didn't try it and I'm not sure whether it's expected to produce a stable embedding. Further refactoring is possible to streamline the OAI chat path into the rest, but that's probably a follow-up. @ngxson, let me know what you think.
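
For reference, a sketch of what a vision test against /completions could look like, written with plain HTTP rather than the repository's actual test harness; the server address, fixture path, and use of the media marker are assumptions:

import base64
import requests

SERVER = "http://localhost:8080"  # assumed local llama-server with an mtmd projector loaded

def test_completion_with_image():
    # Tiny test image; the fixture path here is hypothetical.
    with open("tests/fixtures/tiny.png", "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")

    res = requests.post(f"{SERVER}/completions", json={
        # A single JSON-object prompt, per the format described above.
        "prompt": {"prompt": "What is in this image?\n<__media__>",
                   "multimodal_data": [img_b64]},
        "n_predict": 16,
    })
    assert res.status_code == 200
    assert len(res.json()["content"]) > 0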

@65a changed the title from "mtmd: server: Support basic multimodal data in /completions endpoint of server" to "mtmd: server: Support multimodal data prompt in /completions endpoint of server" on Aug 8, 2025.
@65a (Contributor, Author) commented Aug 9, 2025

Cleaned up the code quite a bit, and fixed the TODO around server_tokens.push_back(server_tokens). Now the tokenize_inputs handling reads a lot cleaner, which is nice.

- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into utility function
- Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types
- Add tests
@oobabooga (Contributor) commented:

I have tested this PR and it worked perfectly ✅

Here is a simple test with google_gemma-3-4b-it-Q4_K_S.gguf:

(screenshot)

The prompt was:

<bos><start_of_turn>user
<__media__>

What do you see here?<end_of_turn>
<start_of_turn>model

The details of my UI integration are here: oobabooga/text-generation-webui#7027
