Conversation

@CISC
Collaborator

@CISC CISC commented May 21, 2025

Support for jina-embeddings-v3

Work checklist

  • Model conversion
  • Model inference
  • Implement correct vocab
  • Task LoRAs conversion
  • LoRA task selection
  • LoRA prompt prefix depending on task

Task selection (using/enabling the correct LoRA) and the prompt prefix are left to the user, but are made simpler by providing the task name and prompt prefix in the LoRA GGUF metadata (available from llama-server via the /lora-adapters endpoint).
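
As a rough end-to-end illustration, a client could list the adapters, enable the one matching the desired task, and prepend its prompt prefix. This is a sketch, not verified against the final server API; the `task_name` and `prompt_prefix` field names in the /lora-adapters response are assumptions:

```python
import requests

BASE_URL = "http://localhost:8080"  # assumed llama-server address

def embed_for_task(task: str, text: str) -> dict:
    # List the loaded adapters; task name and prompt prefix are stored in the
    # LoRA GGUF metadata, so they could be surfaced here (assumed field names).
    adapters = requests.get(f"{BASE_URL}/lora-adapters").json()

    # Enable only the adapter matching the requested task, disable the rest.
    requests.post(f"{BASE_URL}/lora-adapters", json=[
        {"id": a["id"], "scale": 1.0 if a.get("task_name") == task else 0.0}
        for a in adapters
    ])

    # Prepend the task's prompt prefix before embedding.
    prefix = next((a.get("prompt_prefix", "")
                   for a in adapters if a.get("task_name") == task), "")
    return requests.post(f"{BASE_URL}/embedding",
                         json={"content": prefix + text}).json()
```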

Fixes #12327
Fixes #9585

@github-actions github-actions bot added the python (python script changes) label May 21, 2025
@CISC
Collaborator Author

CISC commented May 22, 2025

Apart from a few minor differences (unsure why) in the tokenizer test, the tokenizer.json parsing seems to work perfectly:

  1135 '`', 164721 '``', 164721 '``', 164721 '``'

vs

164721 '``', 164721 '``', 164721 '``',   1135 '`'

and

   32 '?',  85908 '?????'

vs

85908 '?????',     32 '?'

Edit: Fixed in #13743

@CISC
Collaborator Author

CISC commented May 23, 2025

@ngxson @slaren When you have the time I would appreciate some feedback on how best to tackle the task LoRAs of this model.

I think the best user experience would probably be to keep them embedded and add extra metadata for their names so they can be easily chosen via an option. However, that increases the scope of this PR quite a bit, as new mechanisms would need to be added to load and apply the right LoRA tensors at runtime. This seems a little excessive for just one model, but maybe it could be useful for others as well, I don't know?

The less intrusive route would be to extract each LoRA into its own separate GGUF (albeit a more complicated conversion process) and make the user responsible for applying the correct one (and using the correct prompt), but that seems like fairly bad UX.

The PR as-is now works great and produces embeddings identical to the original using transformers with no task specified, but the main selling point of the model is using the tasks, so I think it's important to do this right.

@ngxson
Collaborator

ngxson commented May 23, 2025

I was thinking about supporting built-in LoRAs lately, as this is required for phi-4-multimodal. We can extend the current LoRA API to support this case, but eventually end users need a way to select this (for example via llama-server). For multimodal, it can be done easily via libmtmd.

Another approach could be to add an enum of pre-defined LoRA types, which user code can switch at runtime. This is based on an earlier suggestion from @ggerganov about having multiple models in the same GGUF.

If I have time this weekend, I can push a draft PR on how this can be done.

@ggerganov
Member

ggerganov commented May 23, 2025

Why can't we use the existing LoRA mechanism that is supported by llama-server?

Btw, did you resolve the tokenization differences?

@CISC
Collaborator Author

CISC commented May 23, 2025

Btw, did you resolve the tokenization differences?

No, it seems like a bug/difference in the UGM tokenizer...

@ngxson
Collaborator

ngxson commented May 23, 2025

The LoRA API on the server is quite low-level; downstream apps would also have to explicitly set the LoRA according to the use case, which may not be a good UX overall, especially when the LoRA provides commonly known tasks like embedding or reranking.

For tokenization, I think built-in LoRAs are not affected by this, as they use the same tokenizer as the base model.

Edit: Unrelated answer

@ngxson
Collaborator

ngxson commented May 23, 2025

Another option could be to consider it as an extension to the embedding pooling selection.

@ggerganov
Member

Why can't we use the existing LoRA mechanism that is supported by llama-server?

Ok I understand - the adapters are embedded inside the GGUF file together with the model and we don't have a mechanism to load them.

@ggerganov
Member

The less intrusive route would be to extract each LoRA into its own separate GGUF (albeit a more complicated conversion process) and make the user responsible for applying the correct one (and using the correct prompt), but that seems like fairly bad UX.

Even if the adapters were embedded, the user still has to use the correct prompt. So the UX seems to be the same regardless of how the LoRAs are stored?

@CISC
Collaborator Author

CISC commented May 23, 2025

Even if the adapters were embedded, the user still has to use the correct prompt. So the UX seems to be the same regardless of how the LoRAs are stored?

The thinking was that the prompt could be prefixed depending on task selection (easily stored as metadata).
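
Sketched with the `gguf` Python package, reading that metadata directly from a LoRA GGUF could look like this (the file name and metadata key names are assumptions for illustration):

```python
from gguf import GGUFReader  # pip install gguf

def read_str_field(reader: GGUFReader, key: str) -> str:
    # String metadata values live in the field's last data part as raw bytes.
    field = reader.fields[key]
    return bytes(field.parts[field.data[-1]]).decode("utf-8")

# Hypothetical file name; the metadata keys below are assumptions as well.
reader = GGUFReader("jina-embeddings-v3-task-retrieval.query.gguf")
task = read_str_field(reader, "adapter.lora.task_name")
prefix = read_str_field(reader, "adapter.lora.prompt_prefix")
prompt = prefix + "What is the weather today?"
```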

@CISC
Collaborator Author

CISC commented May 23, 2025

Btw, did you resolve the tokenization differences?

No, it seems like a bug/difference in the UGM tokenizer...

@ggerganov I can confirm that it's an issue with the UGM tokenizer; the same thing happens with nomic-embed-text-v2-moe, for example.

@ngxson
Collaborator

ngxson commented May 23, 2025

I looked deeper into the jina model. It is a bit confusing to me though:

  1. There is only a single LoRA, not one LoRA per task as I initially thought
  2. It's unclear in which use case we wouldn't want to use the LoRA

If the non-LoRA use case is not practical, maybe it's simpler to just merge the LoRA into the weights
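
For reference, merging a LoRA into the base weights just folds the low-rank update into the weight matrix, W' = W + scale * (B @ A); a minimal numpy sketch with toy shapes:

```python
import numpy as np

# Toy shapes; the real matrices come from the model checkpoint.
rank, n_in, n_out = 4, 1024, 1024
W = np.zeros((n_out, n_in))        # base weight
A = np.random.randn(rank, n_in)    # lora_A
B = np.random.randn(n_out, rank)   # lora_B
scale = 1.0                        # often alpha / rank

# After merging, no adapter logic is needed at inference time.
W_merged = W + scale * (B @ A)
```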

@CISC
Collaborator Author

CISC commented May 23, 2025

1. There is only a single LoRA, not one LoRA per task as I initially thought

No, there are 5 LoRAs, but they are all in the same (3D) tensor.

2. It's unclear in which use case we wouldn't want to use the LoRA

Not sure, just for reference I guess?

@CISC
Collaborator Author

CISC commented May 23, 2025

See here for how the correct task LoRA is loaded in transformers:
https://huggingface.co/jinaai/jina-embeddings-v3/blob/main/custom_st.py#L130-L141

@ngxson
Collaborator

ngxson commented May 23, 2025

Ok I see, I hadn't looked at the tensor shapes. So if I understand correctly, it seems like the first 2 tasks retrieval.query and retrieval.passage are merged into 1 adapter, and the 2 tasks are switched using the prompt. That's why we have 5 tasks but only 4 adapters.

@CISC
Collaborator Author

CISC commented May 23, 2025

Ok I see, I hadn't looked at the tensor shapes. So if I understand correctly, it seems like the first 2 tasks retrieval.query and retrieval.passage are merged into 1 adapter, and the 2 tasks are switched using the prompt. That's why we have 5 tasks but only 4 adapters.

No, there are 5 adapters; the tensors are shaped like this: [tasks (5), rank (4), N]
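
In other words, each task's adapter is a slice along the first axis of the stacked tensor; a toy numpy sketch with placeholder data (the task names follow the upstream model's usage and are illustrative here):

```python
import numpy as np

# The five task names as used by the upstream model.
TASKS = ["retrieval.query", "retrieval.passage", "separation",
         "classification", "text-matching"]

# Stand-in for one stacked lora_A tensor of shape [tasks (5), rank (4), N].
stacked_a = np.zeros((len(TASKS), 4, 1024))

# Each task's adapter is simply a slice along the first axis.
adapters = {task: stacked_a[i] for i, task in enumerate(TASKS)}
print(adapters["retrieval.query"].shape)  # (4, 1024)
```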

@CISC
Collaborator Author

CISC commented Jun 1, 2025

@ngxson Had time to look at built-in LoRAs?

@CISC
Collaborator Author

CISC commented Jun 28, 2025

For reference jina-embeddings-v4 was just released, and this time the LoRAs are in a separate file with separate weights for each task.

Edit: Oh, and it's multimodal (Qwen2.5 VL).

@CISC CISC removed the help wanted (Needs help from the community) label Jul 8, 2025
@CISC CISC requested a review from compilade July 9, 2025 09:00
@CISC
Collaborator Author

CISC commented Jul 20, 2025

@slaren @compilade Should be good to go now, no longer conflicts with convert_hf_to_lora.py.

@CISC
Collaborator Author

CISC commented Jul 28, 2025

@compilade gentle ping

@deiteris

Any progress on this?

@CISC CISC requested a review from compilade August 16, 2025 17:29
@CISC
Collaborator Author

CISC commented Aug 26, 2025

@compilade @slaren ping, would be nice to get this merged soon...

@CISC CISC requested review from slaren and ggerganov August 27, 2025 10:03
@CISC CISC added the model (Model specific) label Aug 28, 2025
@CISC CISC merged commit 84ab83c into master Aug 28, 2025
51 of 52 checks passed
@CISC CISC deleted the cisc/jina-embeddings-v3 branch August 28, 2025 13:49
@CISC
Collaborator Author

CISC commented Aug 28, 2025

Sigh, forgot to remove [no ci] from comments again...

@Binozo

Binozo commented Aug 28, 2025

Thank you so much! @CISC

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…nemotron-nano-15409

* origin/master:
ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625)
kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638)
model : jina-embeddings-v3 support (ggml-org#13693)
@gabe-l-hart gabe-l-hart mentioned this pull request Aug 28, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…upport

* origin/master:
ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625)
kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638)
model : jina-embeddings-v3 support (ggml-org#13693)

Signed-off-by: Gabe Goodhart <[email protected]>
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 29, 2025
* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* fix vocab parsing with only tokenizer.json

* set mask token lstrip attribute

* additional unk_token_id fallback just in case [no ci]

* revert vocab_size() change [no ci]

* merge tensor loading into general bert

* rope

* add lora embedding and loading (non-functional)

* export separate lora ggufs instead

* add adapter metadata api

* use std::string

* convert_hf_to_lora compatibility

* fix assert

* apply suggestions from review

* apply suggestion from review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment