MODEL: Falcon-H1 support #14238

Draft · wants to merge 105 commits into master
Conversation

Contributor

@younesbelkada younesbelkada commented Jun 17, 2025

What does this PR do?

Fixes: #13681

Built on top of #13979 from @gabe-l-hart, #9126 from @compilade, and https://github.com/tiiuae/llama.cpp-Falcon-H1 from @HDElectronics, @IbrahimFarhat & @HamzaYousLM.

This PR adds support for the Falcon-H1 architecture to llama.cpp. It is marked as a draft for now, since #13979 and #9126 need to be merged first. A few minor issues also need to be addressed before merging; I will leave them as comments here.

@ggerganov @compilade @gabe-l-hart

* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
The max index is 31, so trimming the arguments is necessary.
Whoops, this is needed for the offset in the concatenated output.
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
gabe-l-hart and others added 22 commits June 16, 2025 15:18
…empt

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
No longer needed now that unified isn't also supporting recurrent

ggml-org#13979 (comment)

Branch: HybridRecurrentCache
Now that it's not used at all in the unified cache, we don't need to use
the layer index to zero it out for attention layers.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
This is no longer needed now that there are separate implementations

ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
This should help support architectures like Falcon H1 where there is
overlap between layers that need attention and recurrent caches.

ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
…y state

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
…ntion pattern

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-
layer components are created for attention layers and recurrent layers. The
main changes are:

- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/
  llm_graph_input_rs as the first input
- Add a corresponding overload of build_rs w/
  llm_graph_input_rs_hybrid_recurrent as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to
  llm_graph_input_attn_kv_unified
- Add a build_attn override that takes
  llm_graph_input_attn_kv_hybrid_recurrent as the first input

This makes the two paradigms fully consistent. The main drawback is the
code duplication in the build_attn and build_rs implementations where the
only difference between implementations is how they cast the memory state.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
Since initially writing this PR, the logic in the child state types changed
such that using the "init full" signature and keeping the ubatches on the
parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
…ostic

This reduces the code duplication between the different build_rs impls and
also retains a similar signature to the previous build_recurrent_state
method while standardizing on the input-dispatched build_rs implementation.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
* origin/compilade/mamba2: (27 commits)
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
ggml : fix mamba2 ssm scan when compiled with SVE
graph : fix recurrent state copies when avoiding copies
kv-cache : allow context shift for recurrent models
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : remove const_cast when setting inputs for s_copy
metal : single-user mamba2 inference works
metal : add missing args for nb references in ssm_scan_f32_group
metal : fix confusion between ; and ,
convert : fix flake8 lint
ggml : avoid multiply by D in GGML_OP_SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
metal : fix wrong number of tokens per sequence in SSM_SCAN
metal : fix SSM_SCAN state head offset
metal : add back n_seqs to SSM_SCAN args
metal : remove unused arguments for SSM_SCAN
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : fix SSM_SCAN pipeline scope
metal : attempt to adapt SSM_SCAN for Mamba-2
llama : avoid redundant state copy for Mamba 1 and 2
...
kv_attn(new llama_kv_cache_unified(
model,
attn_filter == nullptr ?
[&](int32_t il) { return model.hparams.recurrent_layer(il); }
Contributor Author

This should be changed to something else (right now this condition is a no-op)

Contributor

The idea was that in create_memory you can implement a custom construction of llama_kv_cache_hybrid_recurrent that passes custom filters; those will hit this case and take precedence over the defaults that filter by hparams.recurrent_layer.

@github-actions github-actions bot added the labels testing, python, ggml, and Apple Metal on Jun 17, 2025
std::fill(
hparams.recurrent_layer_arr.begin(),
hparams.recurrent_layer_arr.end(),
true);
Contributor Author

Logic to change here

@gabe-l-hart
Contributor

Great to see this @younesbelkada! I've added a few comments to the draft targeting my sync branch (gabe-l-hart#1 (review)) for anyone watching this PR.


Successfully merging this pull request may close these issues.

Feature Request: Falcon-H1
6 participants