mtmd: server: Support multimodal data prompt in /completions endpoint of server #15108
base: master
Conversation
The proposal looks ok but there will be some edge cases:
I think proper test cases are required for this PR, similar to the existing ones.
Let me look at those. The token case is interesting; I'm interested in your thoughts there. The current server uses null tokens and already knows how many to insert, which seems hard to replicate from the client side (the client would have to know the multimodal embedding size before sending raw tokens to the completion endpoint). A magic token could work similarly. The multiple-text-prompt part is also interesting from a usability perspective. I'll think about these and come back. The multi-prompt case should be straightforward to add tests for, though I'm not sure how it ought to work yet.
I have an idea that might be usable, namely that `prompt` can now contain an array of JSON objects.
Rough draft for the idea here (it compiles and passes the existing tests): https://pastebin.com/8zek7ium. It's not complete or properly indented, but the idea is to use server_tokens in more places, so that the input tokenizer can branch and use MTMD tokenization where it makes sense to do so. As a side effect, this probably adds multimodal support to embeddings. Infill needs more work, and rerank would work if I can get push_back(server_tokens) for server_tokens to work properly, I think. There are probably better ways to do some of this than I did; feedback welcome.
Improved version of the rough draft that actually works: https://pastebin.com/R6NdKQPP (ignore the indentation). This works locally for my use case, and I've started adding tests. There are a few TODOs to make doc ranking and embeddings support multimodal use cases, and I think the OAI case can also be streamlined. The general approach is as described previously: use server_tokens in more places, break out MTMD prompt parsing into a function, and change various input tokenization calls to handle server_tokens instead of llama_tokens. The request format for multiple prompts would look something like this:
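For illustration, here is a rough Python sketch of what such a multi-prompt request could look like. Only the `multimodal_data` key (an array of base64-encoded strings) comes from this PR's description; the exact layout of the JSON-object subprompt, the `<__media__>` marker placement, and the server URL are assumptions.

```python
import base64
import requests  # hypothetical client; assumes a multimodal llama-server on localhost:8080

with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "prompt": [
        # Plain-text subprompt, handled as before.
        "Describe the weather in one sentence.",
        # JSON-object subprompt carrying inline multimodal data (layout assumed).
        {
            "prompt": "What is in this image? <__media__>",
            "multimodal_data": [image_b64],
        },
    ],
    "n_predict": 64,
}

resp = requests.post("http://localhost:8080/completions", json=payload)
resp.raise_for_status()
print(resp.json())
```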
The JSON entry only supports what
744d758 to 62f3bae
Added tests, including a vision test. Should be good for a review pass. There is some potential future work, including supporting multimodal prompts in document rerank and infill. Embeddings may already work (the existing tests pass), but I didn't try it and I'm not sure whether it's expected to provide a stable embedding. Further refactoring is possible to streamline the OAI chat path into the rest, but that's probably a follow-up. @ngxson let me know what you think.
Cleaned up the code quite a bit, and fixed the TODO around server_tokens.push_back(server_tokens). Now the tokenize_inputs handling reads a lot cleaner, which is nice.
- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify the input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into a utility function
- Support JSON objects with multimodal_data arrays for MTMD prompts, along with other existing types
- Add tests
I have tested this PR and it worked perfectly ✅. Here is a simple test [screenshot]. The prompt was:
The details of my UI integration are here: oobabooga/text-generation-webui#7027
Implement a basic way to include multimodal data in the completions endpoint. For now, this just supports directly included data, base64-encoded and provided as an array of strings under the JSON key `multimodal_data`. Documentation updated to match. Local testing shows no regression for the without-media case, and successful image processing with media provided from a custom client.

Similar to #14016, but avoids importing the URL-fetching logic at all. It could be added later when factored out of the OpenAI-emulation code, but this is simpler for now and avoids the need for URL parsing and remote fetch capabilities.
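As a rough client-side sketch of that simplest form (assumptions: the `multimodal_data` array sits at the top level of the request body next to `prompt`, and a multimodal-capable llama-server is listening on localhost:8080):

```python
import base64
import requests  # hypothetical client for the /completions endpoint

# Base64-encode the raw image bytes -- the directly-included format described
# above (no URL fetching).
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "prompt": "Describe this image.",
    "multimodal_data": [image_b64],  # array of base64 strings, per the description
    "n_predict": 128,
}

r = requests.post("http://localhost:8080/completions", json=payload)
r.raise_for_status()
print(r.json().get("content"))
```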
The originally referenced issue was #13872.
@ngxson ptal, hopefully this works for you.