granite embedding small support (ModernBert arch) #15641

ryan-mangeno · 2025-08-28T17:03:43Z

Adding support for running Granite Embedding Small, which primarily pulls in the ModernBERT architecture: https://huggingface.co/ibm-granite/granite-embedding-small-english-r2. Still working on it; I haven't figured out the pre-tokenizer type or whether I need to implement it. Also, for the ubatch size, an assert fails in llama-graph.cpp. I hacked it to accept a ubatch size of 1 for testing, but it keeps failing there and I'm not sure why.

If I comment out this line in llama-graph.cpp

assert(!ubatch.equal_seqs());

then it works

ryan-mangeno · 2025-08-28T17:12:46Z

@gabe-l-hart thanks in advance :)

ryan-mangeno · 2025-08-28T17:14:13Z

Also realizing this a little late haha, but should I be changing all of the ModernBERT stuff to a Granite embedding macro like LLM_ARCH_GRANITE_EMBD, or keep it as is?

CISC · 2025-08-28T17:14:43Z

You may want to check out an earlier attempt at ModernBert in #14014

gabe-l-hart · 2025-08-28T17:19:26Z

Thanks for getting this together @ryan-mangeno and thanks for pointing out the previous work @CISC. Ryan, let me know if/when you've looked over that PR and found anything to fix and I'll take a pass at review.

gabe-l-hart · 2025-08-28T17:21:42Z

> Also realizing this a little late haha, but should I be changing all of the ModernBERT stuff to a Granite embedding macro like LLM_ARCH_GRANITE_EMBD, or keep it as is?

In general, we want to keep things as generic as possible, so since this uses the ModernBertModel architecture from transformers, it's best to keep the implementation here similarly robust unless there's a concrete reason to subset the transformers architecture to just work for granite (eg there's some non-trivial code path in the transformers version that would make sense as a separate architecture).

ryan-mangeno · 2025-08-28T19:15:45Z

> Thanks for getting this together @ryan-mangeno and thanks for pointing out the previous work @CISC. Ryan, let me know if/when you've looked over that PR and found anything to fix and I'll take a pass at review.

will do

ryan-mangeno · 2025-09-03T17:49:33Z

@gabe-l-hart I'm looking into the ModernBERT research paper. I can't find any mention of symmetric sliding window attention, only local sliding window attention, so I'm going to use LLAMA_SWA_TYPE_LOCAL rather than the LLAMA_SWA_TYPE_SYMMETRIC used in the previous attempt. The model also uses global attention every third layer, so I'm going to implement this and then it should be ready for review :)
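
To make the distinction concrete, here is a small Python sketch of two ways a sliding window could be turned into an attention mask: a one-sided window that only looks back, and a symmetric window centered on the query token. The function names and the exact boundary conventions are assumptions made for illustration; they are not taken from llama.cpp or from the ModernBERT paper.

```python
def one_sided_window_mask(n_tokens, window):
    """mask[i][j] is True when query i may attend to key j.

    Hypothetical convention: each query sees itself and the
    (window - 1) tokens before it, and nothing after it.
    """
    return [[0 <= i - j < window for j in range(n_tokens)]
            for i in range(n_tokens)]


def symmetric_window_mask(n_tokens, window):
    """mask[i][j] is True when query i may attend to key j.

    Hypothetical convention: each query sees window // 2 tokens
    on each side of itself, plus itself.
    """
    half = window // 2
    return [[abs(i - j) <= half for j in range(n_tokens)]
            for i in range(n_tokens)]


if __name__ == "__main__":
    # Print both masks for a tiny sequence so the shapes are easy to compare.
    for name, build in (("one-sided", one_sided_window_mask),
                        ("symmetric", symmetric_window_mask)):
        print(name)
        for row in build(6, 2):
            print("".join("x" if allowed else "." for allowed in row))
```

For an encoder without causal masking, the two conventions give different masks for the same window size, so whichever one the reference implementation uses has to be matched exactly for the outputs to agree.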

gabe-l-hart · 2025-09-03T18:12:37Z

@ryan-mangeno That sounds good! I haven't unpacked any of those mechanics myself, but can try to get into it if you get stuck.

ryan-mangeno · 2025-09-03T18:37:29Z

> @ryan-mangeno That sounds good! I haven't unpacked any of those mechanics myself, but can try to get into it if you get stuck.

OK 👍. I made some changes, but I'm not sure it's fully ready yet. I'll ping you when I think it's ready, if that's OK.

ryan-mangeno · 2025-09-04T22:24:42Z

Status update: I found out that ModernBERT uses an alternating RoPE scheme, per https://arxiv.org/pdf/2412.13663:

> In ModernBERT, every third layer employs global attention with a RoPE theta of 160,000 and the remaining layers use a 128 token, local sliding window attention with a RoPE theta of 10,000.

I am currently figuring out how to implement this
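
As a rough illustration of the quoted scheme, the per-layer settings could be derived like this in Python. The values 160,000, 10,000, and 128 come from the quote above; the choice that layer indices 0, 3, 6, ... are the global layers, and all names in the snippet, are assumptions for the sketch rather than the llama.cpp implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerAttention:
    layer: int
    is_global: bool
    rope_theta: float
    window: Optional[int]  # None means full (global) attention


def layer_attention(layer, global_every=3, global_theta=160_000.0,
                    local_theta=10_000.0, window=128):
    """Return the attention settings for one layer.

    Assumption: a layer is global when its index is a multiple of
    global_every, i.e. layers 0, 3, 6, ...
    """
    is_global = layer % global_every == 0
    return LayerAttention(
        layer=layer,
        is_global=is_global,
        rope_theta=global_theta if is_global else local_theta,
        window=None if is_global else window,
    )


if __name__ == "__main__":
    for i in range(6):
        print(layer_attention(i))
```

Keeping both the RoPE base and the window in one per-layer record makes it harder for the two to drift apart, for example by applying the local theta on a global layer.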

ryan-mangeno · 2025-09-12T16:03:31Z

@gabe-l-hart I believe this should be ready for review whenever you're available to check it out :)

gabe-l-hart · 2025-09-12T17:11:18Z

Awesome, thanks for your hard work on this @ryan-mangeno . I'll look it over soon!

gabe-l-hart · 2025-09-12T21:55:01Z

@ryan-mangeno Two requests:

1. Can you merge in master and resolve the conflicts? (I can help if you get stuck.)
2. Can you share what you've been doing to compare outputs between this version and transformers?

ryan-mangeno · 2025-09-13T18:25:51Z

> @ryan-mangeno Two requests:
>
> 1. Can you merge in master and resolve the conflicts? (I can help if you get stuck.)
> 2. Can you share what you've been doing to compare outputs between this version and transformers?

yes will get on that 👍

ryan-mangeno · 2025-09-13T19:02:10Z

here is the command I run on llama.cpp

./build/bin/llama-embedding \
    -m models/modernbert.gguf \
    -p "hello world" \
    --temp 0.0 \
    --repeat_penalty 1.0 \
    --top_k 0 \
    --top_p 1.0

and here is my script for hf

import torch
from transformers import AutoModel, AutoTokenizer

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)
model_path = "ibm-granite/granite-embedding-small-english-r2"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
model.eval()

input_queries = ["hello world"]

tokenized_queries = tokenizer(
    input_queries,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**tokenized_queries)
    embedding = outputs.last_hidden_state[:, 0, :]  # CLS token

print("Embedding shape:", embedding.shape)
print("Embedding vector:", embedding)

ryan-mangeno · 2025-09-13T20:48:21Z

I also have a script for the cosine similarity between the two resulting embeddings I get:

import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    
    norm_v1 = np.linalg.norm(vec1)
    norm_v2 = np.linalg.norm(vec2)
    
    if norm_v1 == 0 or norm_v2 == 0:
        return 0.0
    
    similarity = dot_product / (norm_v1 * norm_v2)
    
    return similarity

hf_embds = np.array(<copy and paste tensor from hf output>)
llama_data_string = "< llama prints embeddings without comma separators, so treat the output as a string and then split >"
llama_embds = np.array([float(i) for i in llama_data_string.split()])

print(cosine_similarity(llama_embds, hf_embds))

it currently prints

0.0502

So the similarity is pretty low at face value. Still working through it and hoping to get better results.
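
For anyone who wants to repeat the comparison without editing placeholders by hand, a standard-library-only variant might look like the sketch below. The helper name `parse_floats` and the sample numbers are made up for illustration; the idea is just to paste the numeric part of each output as a string and let the script handle brackets and commas.

```python
import math


def parse_floats(text):
    """Parse pasted numbers, ignoring brackets, parentheses, and commas.

    Paste only the numeric part of the output; any other word
    (such as a label) makes float() raise a ValueError.
    """
    for ch in "[](),":
        text = text.replace(ch, " ")
    return [float(tok) for tok in text.split()]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up sample values, standing in for the pasted outputs.
    hf_embds = parse_floats("[[0.10, -0.20, 0.30]]")
    llama_embds = parse_floats("0.10 -0.20 0.30")
    print(cosine_similarity(hf_embds, llama_embds))
```

The explicit length check is there because a silent shape mismatch between the two pasted vectors would otherwise produce a misleading number instead of an error.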

ryan-mangeno · 2025-09-26T15:48:46Z

Just an update: I think I might be getting bad results because I did not implement flash attention, which is outlined in the ModernBERT research paper. I will try to update this.

ryan-mangeno · 2025-09-26T16:07:39Z

> Just an update: I think I might be getting bad results because I did not implement flash attention, which is outlined in the ModernBERT research paper. I will try to update this.

Found out flash attention is a flag you can pass when running the model. Results are still not great, so I'll keep hacking at it.

ryan-mangeno · 2025-10-01T20:16:17Z

To my knowledge, since ModernBERT is an encoder, I shouldn't be using a KV cache and should instead use

auto * inp_attn = build_attn_inp_no_cache();

during the graph build. But since ModernBERT uses SWA, when the input is set in

void llm_graph_input_attn_no_cache::set_input(const llama_ubatch * ubatch)

the assert below fails. I'm not sure how long this would take to implement, or whether it is a crucial step for the current ModernBERT implementation:

    GGML_ASSERT(hparams.swa_type == LLAMA_SWA_TYPE_NONE && "TODO: implement");

ggerganov · 2025-10-02T07:18:20Z

SWA support for cache-less context is not ready yet. For now use a SWA cache similar to llm_build_gemma_embedding_iswa and add a TODO to be fixed later.

ryan-mangeno · 2025-10-04T15:49:19Z

> SWA support for cache-less context is not ready yet. For now use a SWA cache similar to llm_build_gemma_embedding_iswa and add a TODO to be fixed later.

ok will do, thank you so much!!

ryan-mangeno added 14 commits August 21, 2025 12:38

constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only

6151592

conversion now working, hf -> gguf

6643c5a

working on support, now working on building graph

ac67fc6

some cleanup

cc40378

cleanup

41b6864

continuing

cc3d7ab

correct tensor shape for qkv

4ceb828

fixed tensor mappings and working on buildin graph

18c0c23

tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this

bffe3c9

cleanup

8f32843

cleanup

9805635

cleanup

40249dd

more cleanup

853f344

ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more

2a1c750

ryan-mangeno marked this pull request as draft August 28, 2025 17:05

github-actions bot added the python python script changes label Aug 28, 2025

ryan-mangeno added 2 commits August 29, 2025 12:15

added cls token per previous modern bert attempt, still working on checking out the rest

c73eb68

fixed pre tokenizer and still working through previous pr

ca353d3

ryan-mangeno added 2 commits September 3, 2025 14:32

working through previous attemp, implimented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer

6d86944

fixed pre tokenizer

39c0291

ryan-mangeno added 3 commits September 11, 2025 16:41

fixed asser for equal ubatch seq

4e7c879

cleanup

20d448a

added mask check in vocab

db4f565

gabe-l-hart marked this pull request as ready for review September 12, 2025 17:25

gabe-l-hart self-requested a review September 12, 2025 17:27

fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values

da0604a

ryan-mangeno added 2 commits September 13, 2025 14:28

reuse variable

43a2980

fixed merge conflicts and added print debug check for swa type

e368442

removed repeat

7036cc8

ryan-mangeno and others added 3 commits September 14, 2025 14:47

merge fixes

2522ce8

Merge branch 'master' into modern-bert-support

e043815

Merge branch 'master' into modern-bert-support

35667f2

standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL

3cdd650

ryan-mangeno requested a review from CISC as a code owner September 26, 2025 18:12

ryan-mangeno added 4 commits October 1, 2025 14:07

merge

86adde6

merge

46f2182

correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ...

33eed31

more modular hparam setting

61a0b03

replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion

3bbf671
