MODEL: Falcon-H1 support #14238
Conversation
* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
The max index is 31, so trimming the arguments is necessary.
Whoops, this is needed for the offset in the concatenated output.
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM. This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
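A minimal, self-contained sketch of the indexing fix that commit message describes; `rs_cell`, `used_cell_ids`, and `populate_s_copy` are illustrative stand-ins rather than the real llama.cpp types, and the self-copy fallback is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-in for a recurrent-state cell in the kv cache.
struct rs_cell {
    int32_t src = -1; // cell to copy the recurrent state from (-1 = keep own state)
};

// Populate the s_copy input for the cells used by the current ubatch.
// The key point: look up the cell through its real cell id, not through the
// position `i` within the used-cell list.
static void populate_s_copy(const std::vector<rs_cell>  & cells,
                            const std::vector<uint32_t> & used_cell_ids,
                            std::vector<int32_t>        & s_copy) {
    s_copy.resize(used_cell_ids.size());
    for (size_t i = 0; i < used_cell_ids.size(); ++i) {
        const uint32_t cell_id = used_cell_ids[i];
        // indexing cells[i] here instead of cells[cell_id] is the multi-user
        // bug the commit fixes: used cells are not necessarily the first N cells
        s_copy[i] = cells[cell_id].src >= 0 ? cells[cell_id].src : (int32_t) cell_id;
    }
}
```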
…empt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
No longer needed now that unified isn't also supporting recurrent ggml-org#13979 (comment) Branch: HybridRecurrentCache
Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
This is no longer needed now that there are separate implementations ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
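A tiny, self-contained illustration of the overlap the commit refers to; the types are made up for the example, and the claim that every Falcon-H1 layer carries both an attention block and an SSM block is an assumption on my part, not taken from this PR's diff.

```cpp
#include <cstdint>
#include <vector>

int main() {
    const int32_t n_layer = 4;

    std::vector<int32_t> attn_layers;
    std::vector<int32_t> recr_layers;

    for (int32_t il = 0; il < n_layer; ++il) {
        const bool uses_attn = true; // assumption: every Falcon-H1 layer has an attention block
        const bool uses_recr = true; // assumption: every Falcon-H1 layer has an SSM block

        if (uses_attn) { attn_layers.push_back(il); }
        if (uses_recr) { recr_layers.push_back(il); }
    }

    // Both lists now contain every layer index: the attention cache and the
    // recurrent cache overlap, which a single exclusive "is recurrent" flag
    // per layer cannot express.
    return 0;
}
```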
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
…y state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
…ntion pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per-layer components are created for attention layers and recurrent layers. The main changes are:
- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs as the first input
- Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified
- Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent as the first input
This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
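To make the renamed pieces above easier to track, here is a schematic sketch of how the input classes and builder methods line up. The class and method names come from the commit message; the struct bodies and signatures are illustrative placeholders, not the actual llama.cpp declarations.

```cpp
// Schematic declarations only; bodies and signatures are illustrative.
struct llm_graph_input_rs                       { /* holds the s_copy input tensor */ };
struct llm_graph_input_rs_hybrid_recurrent      { /* same, for the hybrid cache    */ };
struct llm_graph_input_attn_kv_unified          { /* holds the KQ mask, etc.       */ };
struct llm_graph_input_attn_kv_hybrid_recurrent { /* same, for the hybrid cache    */ };

struct llm_graph_context_sketch {
    // recurrent path: build the input object once, then pass it to each layer
    llm_graph_input_rs                  * build_rs_inp_recurrent();
    llm_graph_input_rs_hybrid_recurrent * build_rs_inp_hybrid_recurrent();
    void build_rs  (llm_graph_input_rs                  * inp /*, per-layer tensors... */);
    void build_rs  (llm_graph_input_rs_hybrid_recurrent * inp /*, per-layer tensors... */);

    // attention path mirrors the same input-dispatched pattern
    void build_attn(llm_graph_input_attn_kv_unified          * inp /*, q, k, v... */);
    void build_attn(llm_graph_input_attn_kv_hybrid_recurrent * inp /*, q, k, v... */);
};
```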
https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]> Co-Authored-By: @younesbelkada
Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
…ostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <[email protected]>
* origin/compilade/mamba2: (27 commits)
  ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
  ggml : fix mamba2 ssm scan when compiled with SVE
  graph : fix recurrent state copies when avoiding copies
  kv-cache : allow context shift for recurrent models
  convert : avoid AutoConfig for Mamba and Mamba2 hparams
  kv-cache : remove const_cast when setting inputs for s_copy
  metal : single-user mamba2 inference works
  metal : add missing args for nb references in ssm_scan_f32_group
  metal : fix confusion between ; and ,
  convert : fix flake8 lint
  ggml : avoid multiply by D in GGML_OP_SSM_SCAN
  ggml : remove unused fast broadcast path in GGML_MUL
  metal : fix wrong number of tokens per sequence in SSM_SCAN
  metal : fix SSM_SCAN state head offset
  metal : add back n_seqs to SSM_SCAN args
  metal : remove unused arguments for SSM_SCAN
  metal : use log and exp instead of log1pf and expf in SSM_SCAN
  metal : fix SSM_SCAN pipeline scope
  metal : attempt to adapt SSM_SCAN for Mamba-2
  llama : avoid redundant state copy for Mamba 1 and 2
  ...
…inference running
kv_attn(new llama_kv_cache_unified(
    model,
    attn_filter == nullptr ?
        [&](int32_t il) { return model.hparams.recurrent_layer(il); }
This should be changed to something else (right now this condition is a no-op).
The idea of this was that in create_memory, you can implement a custom construction of llama_kv_cache_hybrid_recurrent that passes custom filters, which will hit this case and take precedence over the defaults that filter by hparams.recurrent_layer.
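A hedged, self-contained sketch of that intent; the types, the make_hybrid_cache helper, and the choice of defaults (in particular negating recurrent_layer for the attention side) are assumptions for illustration, not the actual constructor.

```cpp
#include <cstdint>
#include <functional>

// Illustrative stand-ins, not the real llama.cpp types.
using layer_filter = std::function<bool(int32_t)>;

struct hparams_sketch {
    layer_filter recurrent_layer; // per-layer flag derived from the model hparams
};

struct hybrid_cache_sketch {
    layer_filter attn_filter;
    layer_filter recr_filter;
};

// create_memory can pass custom filters; only when a filter is empty does the
// cache fall back to a default derived from hparams.recurrent_layer.
static hybrid_cache_sketch make_hybrid_cache(const hparams_sketch & hp,
                                             layer_filter attn_filter,
                                             layer_filter recr_filter) {
    return {
        attn_filter ? attn_filter
                    : layer_filter([hp](int32_t il) { return !hp.recurrent_layer(il); }),
        recr_filter ? recr_filter
                    : layer_filter([hp](int32_t il) { return  hp.recurrent_layer(il); }),
    };
}
```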
std::fill(
    hparams.recurrent_layer_arr.begin(),
    hparams.recurrent_layer_arr.end(),
    true);
Logic to change here
Great to see this @younesbelkada! I've added a few comments to the draft targeting my sync branch (gabe-l-hart#1 (review)) for anyone watching this PR.
What does this PR do?
Fixes: #13681
Built on top of #13979 from @gabe-l-hart and #9126 from @compilade and on top of https://github.com/tiiuae/llama.cpp-Falcon-H1 from @HDElectronics @IbrahimFarhat & @HamzaYousLM
This PR adds support for the Falcon-H1 architecture to llama.cpp. It is marked as a draft for now since #13979 and #9126 need to be merged first. A few minor things also need to be addressed before merging this PR, which I will leave as comments here.
@ggerganov @compilade @gabe-l-hart