Conversation

@CISC
Collaborator

@CISC CISC commented May 21, 2025

Support for jina-embeddings-v3

Work checklist

  • Model conversion
  • Model inference
  • Implement correct vocab
  • Task LoRAs conversion
  • LoRA task selection
  • LoRA prompt prefix depending on task

Task selection (using/enabling the correct LoRA) and the prompt prefix are left to the user, but are made simpler by providing the task name and prompt prefix in the LoRA GGUF metadata (available from llama-server via the /lora-adapters endpoint).
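
As a rough end-to-end illustration, a client could list the adapters, enable the one matching the desired task, and prepend its prompt prefix. This is a sketch, not verified against the final server API; the `task_name` and `prompt_prefix` field names in the /lora-adapters response are assumptions:

```python
import requests

BASE_URL = "http://localhost:8080"  # assumed llama-server address

def embed_for_task(task: str, text: str) -> dict:
    # List the loaded adapters; task name and prompt prefix are stored in the
    # LoRA GGUF metadata, so they could be surfaced here (assumed field names).
    adapters = requests.get(f"{BASE_URL}/lora-adapters").json()

    # Enable only the adapter matching the requested task, disable the rest.
    requests.post(f"{BASE_URL}/lora-adapters", json=[
        {"id": a["id"], "scale": 1.0 if a.get("task_name") == task else 0.0}
        for a in adapters
    ])

    # Prepend the task's prompt prefix before embedding.
    prefix = next((a.get("prompt_prefix", "")
                   for a in adapters if a.get("task_name") == task), "")
    return requests.post(f"{BASE_URL}/embedding",
                         json={"content": prefix + text}).json()
```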

Fixes #12327
Fixes #9585

@github-actions github-actions bot added the python (python script changes) label May 21, 2025
@CISC
Collaborator Author

CISC commented May 22, 2025

Apart from a few minor differences (unsure why) in the tokenizer test, the tokenizer.json parsing seems to work perfectly:

  1135 '`', 164721 '``', 164721 '``', 164721 '``'

vs

164721 '``', 164721 '``', 164721 '``',   1135 '`'

and

   32 '?',  85908 '?????'

vs

85908 '?????',     32 '?'

Edit: Fixed in #13743

@CISC
Collaborator Author

CISC commented May 23, 2025

@ngxson @slaren When you have the time I would appreciate some feedback on how best to tackle the task LoRAs of this model.

I think the best user experience would probably be to keep them embedded and add extra metadata for their names so they can be easily chosen via an option. However, that increases the scope of this PR quite a bit, as new mechanisms would need to be added to load and apply the right LoRA tensors at runtime. This seems a little excessive for just one model, but maybe it could be useful for others as well, I don't know?

The less intrusive route would be to extract each LoRA into its own separate GGUF (albeit a more complicated conversion process) and make the user responsible for applying the correct one (and using the correct prompt), but that seems like fairly bad UX.

The PR as-is now works great and produces embeddings identical to the original using transformers with no task specified, but the main selling point of the model is using the tasks, so I think it's important to do this right.

@ngxson
Collaborator

ngxson commented May 23, 2025

I was thinking about supporting built-in LoRAs lately, as this is required for phi-4-multimodal. We can extend the current LoRA API to support this case, but eventually end users need a way to select this (for example via llama-server). For multimodal, it can be done easily via libmtmd.

Another approach could be to add an enum of pre-defined LoRA types, which user code can switch at runtime. This is based on an earlier suggestion from @ggerganov about having multiple models in the same GGUF.

If I have time this weekend, I can push a draft PR on how this can be done.

@ggerganov
Member

ggerganov commented May 23, 2025

Why can't we use the existing LoRA mechanism that is supported by llama-server?

Btw, did you resolve the tokenization differences?

@CISC
Collaborator Author

CISC commented May 23, 2025

Btw, did you resolve the tokenization differences?

No, it seems like a bug/difference in the UGM tokenizer...

@ngxson
Collaborator

ngxson commented May 23, 2025

The LoRA API on the server is quite low-level; downstream apps would also have to explicitly set the LoRA according to the use case, which may not be a good UX overall, especially when the LoRA provides commonly known tasks like embedding or reranking.

For tokenization, I think built-in LoRAs are not affected by this, as they use the same tokenizer as the base model.

Edit: Unrelated answer

@ngxson
Collaborator

ngxson commented May 23, 2025

Another option could be to consider it as an extension to the embedding pooling selection.

@ggerganov
Member

Why can't we use the existing LoRA mechanism that is supported by llama-server?

Ok I understand - the adapters are embedded inside the GGUF file together with the model and we don't have a mechanism to load them.

@ggerganov
Member

The less intrusive route would be to extract each LoRA into its own separate GGUF (albeit a more complicated conversion process) and make the user responsible for applying the correct one (and using the correct prompt), but that seems like fairly bad UX.

Even if the adapters were embedded, the user still has to use the correct prompt. So the UX seems to be the same regardless of how the LoRAs are stored?

@CISC
Collaborator Author

CISC commented May 23, 2025

Even if the adapters were embedded, the user still has to use the correct prompt. So the UX seems to be the same regardless of how the LoRAs are stored?

The thinking was that the prompt could be prefixed depending on task selection (easily stored as metadata).
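
Sketched with the `gguf` Python package, reading that metadata directly from a LoRA GGUF could look like this (the file name and metadata key names are assumptions for illustration):

```python
from gguf import GGUFReader  # pip install gguf

def read_str_field(reader: GGUFReader, key: str) -> str:
    # String metadata values live in the field's last data part as raw bytes.
    field = reader.fields[key]
    return bytes(field.parts[field.data[-1]]).decode("utf-8")

# Hypothetical file name; the metadata keys below are assumptions as well.
reader = GGUFReader("jina-embeddings-v3-task-retrieval.query.gguf")
task = read_str_field(reader, "adapter.lora.task_name")
prefix = read_str_field(reader, "adapter.lora.prompt_prefix")
prompt = prefix + "What is the weather today?"
```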

@CISC
Collaborator Author

CISC commented May 23, 2025

Btw, did you resolve the tokenization differences?

No, it seems like a bug/difference in the UGM tokenizer...

@ggerganov I can confirm that it's an issue with the UGM tokenizer; the same thing happens with nomic-embed-text-v2-moe, for example.

@ngxson
Collaborator

ngxson commented May 23, 2025

I looked deeper into the jina model. It is a bit confusing to me though:

  1. There is only a single LoRA, not one LoRA per task as I initially thought
  2. It's unclear in which use case we wouldn't want to use the LoRA

If the non-LoRA use case is not practical, maybe it's simpler to just merge the LoRA into the weights
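
For reference, merging a LoRA into the base weights just folds the low-rank update into the weight matrix, W' = W + scale * (B @ A); a minimal numpy sketch with toy shapes:

```python
import numpy as np

# Toy shapes; the real matrices come from the model checkpoint.
rank, n_in, n_out = 4, 1024, 1024
W = np.zeros((n_out, n_in))        # base weight
A = np.random.randn(rank, n_in)    # lora_A
B = np.random.randn(n_out, rank)   # lora_B
scale = 1.0                        # often alpha / rank

# After merging, no adapter logic is needed at inference time.
W_merged = W + scale * (B @ A)
```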

@CISC
Collaborator Author

CISC commented May 23, 2025

1. There is only a single LoRA, not one LoRA per task as I initially thought

No, there are 5 LoRAs, but they are all in the same (3D) tensor.

2. It's unclear in which use case we wouldn't want to use the LoRA

Not sure, just for reference I guess?

@CISC
Collaborator Author

CISC commented May 23, 2025

See here for how the correct task LoRA is loaded in transformers:
https://huggingface.co/jinaai/jina-embeddings-v3/blob/main/custom_st.py#L130-L141

@ngxson
Collaborator

ngxson commented May 23, 2025

Ok I see, I hadn't looked at the tensor shapes. So if I understand correctly, it seems like the first 2 tasks retrieval.query and retrieval.passage are merged into 1 adapter, and the 2 tasks are switched using the prompt. That's why we have 5 tasks but only 4 adapters.

@CISC
Collaborator Author

CISC commented May 23, 2025

Ok I see, I hadn't looked at the tensor shapes. So if I understand correctly, it seems like the first 2 tasks retrieval.query and retrieval.passage are merged into 1 adapter, and the 2 tasks are switched using the prompt. That's why we have 5 tasks but only 4 adapters.

No, there are 5 adapters; the tensors are shaped like this: [tasks (5), rank (4), N]
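
In other words, each task's adapter is a slice along the first axis of the stacked tensor; a toy numpy sketch with placeholder data (the task names follow the upstream model's usage and are illustrative here):

```python
import numpy as np

# The five task names as used by the upstream model.
TASKS = ["retrieval.query", "retrieval.passage", "separation",
         "classification", "text-matching"]

# Stand-in for one stacked lora_A tensor of shape [tasks (5), rank (4), N].
stacked_a = np.zeros((len(TASKS), 4, 1024))

# Each task's adapter is simply a slice along the first axis.
adapters = {task: stacked_a[i] for i, task in enumerate(TASKS)}
print(adapters["retrieval.query"].shape)  # (4, 1024)
```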

@CISC
Collaborator Author

CISC commented Jun 1, 2025

@ngxson Had time to look at built-in LoRAs?

@CISC
Collaborator Author

CISC commented Jun 28, 2025

For reference jina-embeddings-v4 was just released, and this time the LoRAs are in a separate file with separate weights for each task.

Edit: Oh, and it's multimodal (Qwen2.5 VL).

@CISC CISC removed the help wanted (Needs help from the community) label Jul 8, 2025
@CISC CISC requested a review from compilade July 9, 2025 09:00
@CISC
Collaborator Author

CISC commented Jul 20, 2025

@slaren @compilade Should be good to go now, no longer conflicts with convert_hf_to_lora.py.

@CISC
Collaborator Author

CISC commented Jul 28, 2025

@compilade gentle ping

@deiteris

Any progress on this?

@CISC CISC requested a review from compilade August 16, 2025 17:29
@CISC
Collaborator Author

CISC commented Aug 26, 2025

@compilade @slaren ping, would be nice to get this merged soon...

@CISC CISC requested review from slaren and ggerganov August 27, 2025 10:03
@CISC CISC added the model (Model specific) label Aug 28, 2025
@CISC CISC merged commit 84ab83c into master Aug 28, 2025
51 of 52 checks passed
@CISC CISC deleted the cisc/jina-embeddings-v3 branch August 28, 2025 13:49
@CISC
Collaborator Author

CISC commented Aug 28, 2025

Sigh, forgot to remove [no ci] from comments again...

@Binozo

Binozo commented Aug 28, 2025

Thank you so much! @CISC

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…nemotron-nano-15409

* origin/master:
ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625)
kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638)
model : jina-embeddings-v3 support (ggml-org#13693)
@gabe-l-hart gabe-l-hart mentioned this pull request Aug 28, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…upport

* origin/master:
ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625)
kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638)
model : jina-embeddings-v3 support (ggml-org#13693)

Signed-off-by: Gabe Goodhart <[email protected]>
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 29, 2025
* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* fix vocab parsing with only tokenizer.json

* set mask token lstrip attribute

* additional unk_token_id fallback just in case [no ci]

* revert vocab_size() change [no ci]

* merge tensor loading into general bert

* rope

* add lora embedding and loading (non-functional)

* export separate lora ggufs instead

* add adapter metadata api

* use std::string

* convert_hf_to_lora compatibility

* fix assert

* apply suggestions from review

* apply suggestion from review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment