mtmd: server: Support multimodal data prompt in /completions endpoint of server #15108


Open · wants to merge 1 commit into master

Conversation

@65a (Contributor) commented Aug 6, 2025

Implement a basic way to include multimodal data in the completions endpoint. For now, this supports only directly embedded data, base64-encoded and provided as an array of strings under the JSON key multimodal_data. Documentation is updated to match. Local testing shows no regression for the no-media case, and successful image processing with media provided by a custom client.
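
For illustration, a minimal request sketch under the format described above; the host/port, image path, and placement of the <__media__> marker in the prompt (mirroring the test later in this thread) are assumptions, not part of this PR:

import base64
import requests  # third-party HTTP client, used here for brevity

# Read an image and base64-encode it, as expected by multimodal_data.
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    # Raw text prompt; the <__media__> marker placement follows the Gemma
    # example later in this thread and may depend on the model/template.
    "prompt": "<__media__>\nWhat do you see here?",
    "multimodal_data": [image_b64],  # array of base64 strings, per this PR
    "n_predict": 128,
}

resp = requests.post("http://localhost:8080/completions", json=payload)
print(resp.json()["content"])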

Similar to #14016, but avoids importing the URL-fetching logic entirely. That could be added later once it is factored out of the OpenAI-emulation code; this approach is simpler for now and avoids the need for URL parsing and remote fetch capabilities.

The originally referenced issue was #13872.

@ngxson ptal, hopefully this works for you.

@ngxson (Collaborator) commented Aug 7, 2025

The proposal looks OK, but there are some edge cases to consider:

  • What if the user enters a list of tokens, or a mixed list of tokens and strings, instead of a raw text prompt?
  • What if the user enters a list of multiple prompts?

I think proper test cases are required for this PR, similar to /chat/completions.

@65a (Contributor, Author) commented Aug 7, 2025

Let me look at those.

The token case is interesting; I'd like your thoughts there. The current server uses null tokens but already knows how many to insert, which seems hard on the client (it would have to know the multimodal embedding size before sending raw tokens to the completion endpoint). A magic token could work similarly to <__media__>, but I suspect that would require behavior changes. I need to poke at the code to understand that path better; test_completions.py has no tests for token prompt requests. An easy option is to just document that token + multimodal is not yet supported and throw an error if it happens.

The multiple text prompt part is also interesting from a usability perspective. I'll think about these and come back. The multi-prompt case should be straightforward to add tests for, though I'm not sure how it ought to work yet.

@65a (Contributor, Author) commented Aug 7, 2025

I have an idea that might be usable, namely that prompt can now contain an array of JSON objects, like "prompt": [{ "prompt": "foo", "multimodal_data": ["<base64>"] }, { "prompt": "bar", "multimodal_data": ["<base64>"] }]. This would be a specific documented option for the prompt field in /completions, and would only allow string multimodal prompts (for now). It's a bit verbose and requires more code refactoring, but as a user it makes sense to pair my prompts with their respective data. This would extend the currently supported prompt field options without changing the top-level request JSON object. It's also easy to add additional tests for. I'll try it out locally, but does that idea work for you, @ngxson?

@65a (Contributor, Author) commented Aug 7, 2025

Rough draft for the idea here (it compiles and passes existing tests): https://pastebin.com/8zek7ium

It's not complete or properly indented, but the idea is to use server_tokens in more places, so that the input tokenizer can branch and use MTMD tokenization where it makes sense to do so. As a side effect, embeddings probably gained multimodal support. Infill needs more work, and I think rerank would work if I can get push_back of one server_tokens onto another to work properly.

There are probably better ways to do some of this than I did, feedback welcome.

@65a (Contributor, Author) commented Aug 7, 2025

Improved version of the rough draft that actually works, ignore indentation: https://pastebin.com/R6NdKQPP

This works locally for my use case, and I've started adding tests. There are a few TODOs to make document reranking and embeddings support multimodal use cases, and I think the OAI case can also be streamlined.

The general approach is as described previously: use server_tokens in more places, break out mtmd prompt parsing into a function, and change various input tokenization calls to handle server_tokens instead of llama_tokens.

The request format for multiple prompts would be like this:

{
    ...
    "prompt": [
        "Prompt 1",
        [1, 2, 3, 4],
        { "prompt": "What is a tomato?", "multimodal_data": ["<base64>"] }
    ],
    ...
}

When multimodal_data is present, the JSON entry only supports what mtmd_tokenize supports. A single JSON object can also be provided instead of the array of prompts shown above, and a JSON object containing only a prompt (in either location) is handled the same way as a normal prompt string.
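
A rough Python sketch of sending that mixed-prompt request follows; the server address, image path, and response handling are assumptions for illustration:

import base64
import requests

with open("tomato.jpg", "rb") as f:
    tomato_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "prompt": [
        "Prompt 1",                      # plain string sub-prompt
        [1, 2, 3, 4],                    # raw token sub-prompt
        {"prompt": "What is a tomato?",  # multimodal sub-prompt
         "multimodal_data": [tomato_b64]},
    ],
    "n_predict": 64,
}

resp = requests.post("http://localhost:8080/completions", json=payload)

# With several sub-prompts the server may return a list of results,
# so handle both a single object and a list here.
data = resp.json()
for result in data if isinstance(data, list) else [data]:
    print(result["content"])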

The github-actions bot added the python (python script changes) label on Aug 8, 2025.
@65a force-pushed the master branch 2 times, most recently from 744d758 to 62f3bae on August 8, 2025 03:23.
@65a (Contributor, Author) commented Aug 8, 2025

Added tests, including a vision test. This should be good for a review pass. There is some potential future work, including supporting multimodal prompts in document rerank and infill. Embeddings may already work (existing tests pass), but I didn't try it and I'm not sure whether it's expected to produce a stable embedding. Further refactoring is possible to streamline the OAI chat path into the rest, but that's probably a follow-up. @ngxson, let me know what you think.
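
For reference, a sketch of what a vision test against /completions could look like, written with plain HTTP rather than the repository's actual test harness; the server address, fixture path, and use of the media marker are assumptions:

import base64
import requests

SERVER = "http://localhost:8080"  # assumed local llama-server with an mtmd projector loaded

def test_completion_with_image():
    # Tiny test image; the fixture path here is hypothetical.
    with open("tests/fixtures/tiny.png", "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")

    res = requests.post(f"{SERVER}/completions", json={
        # A single JSON-object prompt, per the format described above.
        "prompt": {"prompt": "What is in this image?\n<__media__>",
                   "multimodal_data": [img_b64]},
        "n_predict": 16,
    })
    assert res.status_code == 200
    assert len(res.json()["content"]) > 0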

@65a changed the title from "mtmd: server: Support basic multimodal data in /completions endpoint of server" to "mtmd: server: Support multimodal data prompt in /completions endpoint of server" on Aug 8, 2025.
@65a (Contributor, Author) commented Aug 9, 2025

Cleaned up the code quite a bit, and fixed the TODO around server_tokens.push_back(server_tokens). Now the tokenize_inputs handling reads a lot cleaner, which is nice.

- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into utility function
- Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types
- Add tests
@oobabooga (Contributor) commented:

I have tested this PR and it worked perfectly ✅

Here is a simple test with google_gemma-3-4b-it-Q4_K_S.gguf:

(screenshot)

The prompt was:

<bos><start_of_turn>user
<__media__>

What do you see here?<end_of_turn>
<start_of_turn>model

The details of my UI integration are here: oobabooga/text-generation-webui#7027
