mtmd: server: Support multimodal data prompt in /completions and /embeddings endpoint of server #15108

Open · wants to merge 1 commit into `master`
17 changes: 11 additions & 6 deletions tools/server/README.md
@@ -226,6 +226,10 @@ services:
### Multimodal support

Multimodal support was added in [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) and is currently an experimental feature.
It is available in the following endpoints:
- The OAI-compatible chat endpoint.
- The non-OAI-compatible completions endpoint.
- The non-OAI-compatible embeddings endpoint.

For more details, please refer to the [multimodal documentation](../../docs/multimodal.md).

@@ -400,12 +404,15 @@ These input shapes and data types are allowed for `prompt`:
- Single string: `"string"`
- Single sequence of tokens: `[12, 34, 56]`
- Mixed tokens and strings: `[12, 34, "string", 56, 78]`
- A JSON object which optionally contains multimodal data: `{ "prompt_string": "string", "multimodal_data": ["base64"] }`

Multiple prompts are also supported. In this case, the completion result will be an array.

- Only strings: `["string1", "string2"]`
- Strings and sequences of tokens: `["string1", [12, 34, 56]]`
- Mixed types: `[[12, 34, "string", 56, 78], [12, 34, 56], "string"]`
- Strings, JSON objects, and sequences of tokens: `["string1", [12, 34, 56], { "prompt_string": "string", "multimodal_data": ["base64"]}]`
- Mixed types: `[[12, 34, "string", 56, 78], [12, 34, 56], "string", { "prompt_string": "string" }]`

Note on `multimodal_data` in JSON object prompts: this field must be an array of strings, each containing base64-encoded multimodal data such as an image or audio clip. The `prompt_string` must contain an identical number of MTMD media markers, which act as placeholders for the data provided in this field; the multimodal data files are substituted in order. The marker string (e.g. `<__media__>`) can be obtained by calling `mtmd_default_marker()`, defined in [the MTMD C API](https://github.com/ggml-org/llama.cpp/blob/5fd160bbd9d70b94b5b11b0001fd7f477005e4a0/tools/mtmd/mtmd.h#L87). A client *must not* set this field unless the server has multimodal capability; clients should check `/models` or `/v1/models` for the `multimodal` capability before sending a multimodal request.
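
As an illustration, here is a minimal client-side sketch of such a request. It assumes a server listening on `http://localhost:8080` that was started with a multimodal projector, the default `<__media__>` marker, the third-party `requests` package, and a placeholder image file; none of these names come verbatim from this PR.

```python
import base64

import requests  # third-party HTTP client, assumed to be installed

# Encode a placeholder image; any media type supported by the loaded projector works.
with open("example.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

# Optional capability check; the response layout is assumed from this PR's /models handler.
models = requests.get("http://localhost:8080/models").json()
caps = models.get("models", [{}])[0].get("capabilities", [])
assert "multimodal" in caps, "server was not started with a multimodal projector"

# One <__media__> marker in prompt_string per entry in multimodal_data, substituted in order.
res = requests.post("http://localhost:8080/completions", json={
    "prompt": {
        "prompt_string": "Describe this image: <__media__>",
        "multimodal_data": [img_b64],
    },
    "n_predict": 64,
})
print(res.json()["content"])
```

The test suite in this PR expects such a request to fail with a non-200 status when the loaded model has no multimodal support.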

`temperature`: Adjust the randomness of the generated text. Default: `0.8`

@@ -477,8 +484,6 @@ These words will not be included in the completion, so make sure to add them to

`t_max_predict_ms`: Set a time limit in milliseconds for the prediction (a.k.a. text-generation) phase. The timeout will trigger if the generation takes more than the specified time (measured since the first token was generated) and if a new-line character has already been generated. Useful for FIM applications. Default: `0`, which is disabled.

`image_data`: An array of objects to hold base64-encoded image `data` and its `id`s to be reference in `prompt`. You can determine the place of the image in the prompt as in the following: `USER:[img-12]Describe the image in detail.\nASSISTANT:`. In this case, `[img-12]` will be replaced by the embeddings of the image with id `12` in the following `image_data` array: `{..., "image_data": [{"data": "<BASE64_STRING>", "id": 12}]}`. Use `image_data` only with multimodal models, e.g., LLaVA.

`id_slot`: Assign the completion task to a specific slot. If set to `-1`, the task will be assigned to an idle slot. Default: `-1`

`cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation) enabling this option can cause nondeterministic results. Default: `true`
@@ -638,12 +643,12 @@ Returns a JSON object with a field `prompt` containing a string of the input mes

The same as [the embedding example](../embedding) does.

This endpoint also supports multimodal embeddings. See the documentation for the `/completions` endpoint for details on how to send a multimodal prompt.
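
A similarly hedged sketch for a multimodal embedding request follows, assuming the JSON-object prompt shape documented for `/completions` is accepted in this endpoint's `content` field; the server URL, file name, and marker are placeholders.

```python
import base64

import requests  # third-party HTTP client, assumed to be installed

with open("example.jpg", "rb") as f:  # placeholder media file
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

# The JSON-object prompt pairs one <__media__> marker with one base64 entry, in order.
res = requests.post("http://localhost:8080/embeddings", json={
    "content": {
        "prompt_string": "A photo to embed: <__media__>",
        "multimodal_data": [img_b64],
    },
})
print(res.json())  # pooled embedding(s), subject to `embd_normalize`
```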

*Options:*

`content`: Set the text to process.

`image_data`: An array of objects to hold base64-encoded image `data` and its `id`s to be reference in `content`. You can determine the place of the image in the content as in the following: `Image: [img-21].\nCaption: This is a picture of a house`. In this case, `[img-21]` will be replaced by the embeddings of the image with id `21` in the following `image_data` array: `{..., "image_data": [{"data": "<BASE64_STRING>", "id": 21}]}`. Use `image_data` only with multimodal models, e.g., LLaVA.

`embd_normalize`: Normalization for pooled embeddings. Can be one of the following values:
```
-1: No normalization
77 changes: 20 additions & 57 deletions tools/server/server.cpp
@@ -4181,6 +4181,7 @@ int main(int argc, char ** argv) {
};

const auto handle_api_show = [&ctx_server, &res_ok](const httplib::Request &, httplib::Response & res) {
bool has_mtmd = ctx_server.mctx != nullptr;
json data = {
{
"template", common_chat_templates_source(ctx_server.chat_templates.get()),
@@ -4202,7 +4203,7 @@ int main(int argc, char ** argv) {
{"quantization_level", ""}
}},
{"model_info", ""},
{"capabilities", {"completion"}}
{"capabilities", has_mtmd ? json({"completion","multimodal"}) : json({"completion"})}
};

res_ok(res, data);
@@ -4228,56 +4229,15 @@ int main(int argc, char ** argv) {
// TODO: this log can become very long, put it behind a flag or think about a more compact format
//SRV_DBG("Prompt: %s\n", prompt.is_string() ? prompt.get<std::string>().c_str() : prompt.dump(2).c_str());

// process files
mtmd::bitmaps bitmaps;
const bool has_mtmd = ctx_server.mctx != nullptr;
{
if (!has_mtmd && !files.empty()) {
throw std::runtime_error("This server does not support multimodal");
}
for (auto & file : files) {
mtmd::bitmap bmp(mtmd_helper_bitmap_init_from_buf(ctx_server.mctx, file.data(), file.size()));
if (!bmp.ptr) {
throw std::runtime_error("Failed to load image or audio file");
}
// calculate bitmap hash (for KV caching)
std::string hash = fnv_hash(bmp.data(), bmp.n_bytes());
bmp.set_id(hash.c_str());
bitmaps.entries.push_back(std::move(bmp));
}
}

// process prompt
std::vector<server_tokens> inputs;

if (oaicompat && has_mtmd) {
// multimodal
std::string prompt_str = prompt.get<std::string>();
mtmd_input_text inp_txt = {
prompt_str.c_str(),
/* add_special */ true,
/* parse_special */ true,
};
mtmd::input_chunks chunks(mtmd_input_chunks_init());
auto bitmaps_c_ptr = bitmaps.c_ptr();
int32_t tokenized = mtmd_tokenize(ctx_server.mctx,
chunks.ptr.get(),
&inp_txt,
bitmaps_c_ptr.data(),
bitmaps_c_ptr.size());
if (tokenized != 0) {
throw std::runtime_error("Failed to tokenize prompt");
}

server_tokens tmp(chunks, true);
inputs.push_back(std::move(tmp));
if (oaicompat && ctx_server.mctx != nullptr) {
// OAI-compatible chat path with MTMD. TODO: this could be merged into the generic path below.
inputs.push_back(process_mtmd_prompt(ctx_server.mctx, prompt.get<std::string>(), files));
} else {
// non-multimodal version
auto tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, true, true);
for (auto & p : tokenized_prompts) {
auto tmp = server_tokens(p, ctx_server.mctx != nullptr);
inputs.push_back(std::move(tmp));
}
// Everything else, including multimodal completions.
inputs = tokenize_input_prompts(ctx_server.vocab, ctx_server.mctx, prompt, true, true);
}

tasks.reserve(inputs.size());
@@ -4446,7 +4406,7 @@ int main(int argc, char ** argv) {
data["input_extra"] = input_extra; // default to empty array if it's not exist

std::string prompt = json_value(data, "prompt", std::string());
std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, false, true);
std::vector<server_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, ctx_server.mctx, prompt, false, true);
SRV_DBG("creating infill tasks, n_prompts = %d\n", (int) tokenized_prompts.size());
data["prompt"] = format_infill(
ctx_server.vocab,
@@ -4457,7 +4417,7 @@ int main(int argc, char ** argv) {
ctx_server.params_base.n_predict,
ctx_server.slots[0].n_ctx, // TODO: there should be a better way
ctx_server.params_base.spm_infill,
tokenized_prompts[0]
tokenized_prompts[0].get_text_tokens() // TODO: this could maybe be multimodal.
);

std::vector<raw_buffer> files; // dummy
@@ -4506,7 +4466,7 @@ int main(int argc, char ** argv) {
if (current_state == SERVER_STATE_READY) {
model_meta = ctx_server.model_meta();
}

bool has_mtmd = ctx_server.mctx != nullptr;
json models = {
{"models", {
{
@@ -4518,7 +4478,7 @@ int main(int argc, char ** argv) {
{"type", "model"},
{"description", ""},
{"tags", {""}},
{"capabilities", {"completion"}},
{"capabilities", has_mtmd ? json({"completion","multimodal"}) : json({"completion"})},
{"parameters", ""},
{"details", {
{"parent_model", ""},
@@ -4635,7 +4595,7 @@ int main(int argc, char ** argv) {
}
}

auto tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, true, true);
auto tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, ctx_server.mctx, prompt, true, true);
for (const auto & tokens : tokenized_prompts) {
// this check is necessary for models that do not add BOS token to the input
if (tokens.empty()) {
@@ -4663,7 +4623,7 @@ int main(int argc, char ** argv) {

task.id = ctx_server.queue_tasks.get_new_id();
task.index = i;
task.prompt_tokens = server_tokens(tokenized_prompts[i], ctx_server.mctx != nullptr);
task.prompt_tokens = std::move(tokenized_prompts[i]);

// OAI-compat
task.params.oaicompat = oaicompat;
@@ -4750,22 +4710,25 @@ int main(int argc, char ** argv) {
return;
}

llama_tokens tokenized_query = tokenize_input_prompts(ctx_server.vocab, query, /* add_special */ false, true)[0];
std::vector<server_tokens> tokenized_queries = tokenize_input_prompts(ctx_server.vocab, ctx_server.mctx, query, /* add_special */ false, true);
if (tokenized_queries.size() != 1) {
res_error(res, format_error_response("\"query\" must contain only a single prompt", ERROR_TYPE_INVALID_REQUEST));
return;
}

// create and queue the task
json responses = json::array();
bool error = false;
std::unordered_set<int> task_ids;
{
std::vector<server_task> tasks;
auto tokenized_docs = tokenize_input_prompts(ctx_server.vocab, documents, /* add_special */ false, true);
auto tokenized_docs = tokenize_input_prompts(ctx_server.vocab, ctx_server.mctx, documents, /* add_special */ false, true);
tasks.reserve(tokenized_docs.size());
for (size_t i = 0; i < tokenized_docs.size(); i++) {
auto tmp = format_rerank(ctx_server.vocab, tokenized_query, tokenized_docs[i]);
auto tmp = format_rerank(ctx_server.vocab, tokenized_queries[0], tokenized_docs[i]);
server_task task = server_task(SERVER_TASK_TYPE_RERANK);
task.id = ctx_server.queue_tasks.get_new_id();
task.index = i;
task.prompt_tokens = server_tokens(tmp, ctx_server.mctx != nullptr);
task.prompt_tokens = std::move(tmp);
tasks.push_back(std::move(task));
}

38 changes: 38 additions & 0 deletions tools/server/tests/unit/test_completion.py
@@ -6,6 +6,8 @@

server = ServerPreset.tinyllama2()

JSON_MULTIMODAL_KEY = "multimodal_data"
JSON_PROMPT_STRING_KEY = "prompt_string"

@pytest.fixture(scope="module", autouse=True)
def create_server():
@@ -231,6 +233,28 @@ def test_nocache_long_input_prompt():
})
assert res.status_code == 200

def test_json_prompt_no_mtmd():
global server
server.start()
res = server.make_request("POST", "/completion", data={
"prompt": { JSON_PROMPT_STRING_KEY: "I believe the meaning of life is" },
"seed": 42,
"temperature": 1.0,
"cache_prompt": False,
})
assert res.status_code == 200

def test_json_prompt_mtmd_error_when_not_supported():
global server
server.start()
res = server.make_request("POST", "/completion", data={
"prompt": { JSON_PROMPT_STRING_KEY: "I believe the meaning of life is <__media__>", JSON_MULTIMODAL_KEY: "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNk+A8AAQUBAScY42YAAAAASUVORK5CYII=" },
"seed": 42,
"temperature": 1.0,
"cache_prompt": False,
})
# MTMD is disabled on this model, so this should fail.
assert res.status_code != 200

def test_completion_with_tokens_input():
global server
@@ -269,6 +293,20 @@ def test_completion_with_tokens_input():
assert len(res.body) == 2
assert res.body[0]["content"] == res.body[1]["content"]

# mixed JSON and tokens
res = server.make_request("POST", "/completion", data={
"prompt": [
tokens,
{
JSON_PROMPT_STRING_KEY: "I believe the meaning of life is",
},
],
})
assert res.status_code == 200
assert type(res.body) == list
assert len(res.body) == 2
assert res.body[0]["content"] == res.body[1]["content"]

# mixed string and tokens in one sequence
res = server.make_request("POST", "/completion", data={
"prompt": [1, 2, 3, 4, 5, 6, prompt_str, 7, 8, 9, 10, prompt_str],