Feature Request: Support Falcon Mamba 7B #9009


Closed
4 tasks done
devlux76 opened this issue Aug 12, 2024 · 8 comments · Fixed by #9074
Labels
enhancement New feature or request

Comments

@devlux76

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Please support Falcon Mamba 7B from TII (Technology Innovation Institute, UAE).

Motivation

Support for all models is helpful.

My acid test for whether a model will run is to try to make a quant using "GGUF-my-repo".

Admittedly the model is hot off the presses, but it ought to run at least in theory, and it doesn't.

Error: Error converting to fp16: b'INFO:hf-to-gguf:Loading model: falcon-mamba-7b\nERROR:hf-to-gguf:Model FalconMambaForCausalLM is not supported\n'

Possible Implementation

They discuss an implementation here: https://falconllm.tii.ae/tii-releases-first-sslm-with-falcon-mamba-7b.html

Support for any functional Mamba or Mamba 2 models would be great, but this one is slightly modified from the original Mamba.

devlux76 added the enhancement (New feature or request) label Aug 12, 2024
@Zibri

Zibri commented Aug 14, 2024

I agree. Please support it. https://huggingface.co/tiiuae/falcon-mamba-7b-instruct/tree/main

@compilade
Collaborator

This is a Mamba model (not Mamba 2), so it should be relatively easy to adapt the existing Mamba support. The only difference I see is that FalconMambaForCausalLM seems to use RMS norms for dt, B and C while the original Mamba does not.

Since these norms don't use any learned parameters, this will require either a new metadata field to signal that norms are applied to dt, B and C, or a new architecture (e.g. falcon_mamba). I would prefer to re-use the existing mamba architecture, because everything else is identical between the two apart from the extra norms.
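
To illustrate what "no learned parameters" means here: a parameter-free RMS norm is just a rescale by the root mean square. The sketch below is illustrative C++, not code from any repo, and the eps value would really come from the model's norm_rms_eps rather than being hard-coded:

// Sketch only: the parameter-free RMS norm applied to dt, B and C.
// Since there is no learned per-channel weight, no new tensors are needed
// in the GGUF file, only a metadata flag saying the norm is applied.
#include <cmath>
#include <vector>

static void rms_norm_inplace(std::vector<float> & x, float eps) {
    float sum_sq = 0.0f;
    for (float v : x) {
        sum_sq += v * v;
    }
    const float scale = 1.0f / std::sqrt(sum_sq / (float) x.size() + eps);
    for (float & v : x) {
        v *= scale; // no learned weight is multiplied in here
    }
}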

@younesbelkada
Contributor

Converted models can be found here: https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a

@MoonRide303
Contributor

Conversion (using convert_hf_to_gguf.py --outtype q8_0 ..\falcon-mamba-7b-instruct\ --outfile falcon-mamba-7b-instruct-Q8_0.gguf from b3616) worked fine.

Trying to run the server (also b3616) with llama-server.exe -v -ngl 99 -m falcon-mamba-7b-instruct-Q8_0.gguf -c 8192 fails with llama.cpp\ggml\src\ggml-cuda\norm.cu:212: GGML_ASSERT(ggml_is_contiguous(src0)) failed.

Same problem when trying to use official GGUF from https://huggingface.co/tiiuae/falcon-mamba-7b-instruct-Q8_0-GGUF repo.
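
For what it's worth, a CPU-only run of the same command (adding -ngl 0 so nothing is offloaded) might help confirm whether this is specific to the CUDA norm kernel; I haven't verified that with this model, though.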

@compilade
Collaborator

compilade commented Aug 22, 2024

@MoonRide303

Oh, right. When split, dt, B and C are not contiguous.

I think the fix might be something like this:

diff --git a/src/llama.cpp b/src/llama.cpp
index bd7f1508..57fa5450 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -9118,9 +9118,9 @@ static struct ggml_tensor * llm_build_mamba(
 
         // Some Mamba variants (e.g. FalconMamba) apply RMS norm in B, C & Dt layers
         if (ssm_dt_b_c_rms) {
-            dt = ggml_rms_norm(ctx, dt, norm_rms_eps);
-            B = ggml_rms_norm(ctx, B, norm_rms_eps);
-            C = ggml_rms_norm(ctx, C, norm_rms_eps);
+            dt = ggml_rms_norm(ctx, ggml_cont(ctx, dt), norm_rms_eps);
+            B = ggml_rms_norm(ctx, ggml_cont(ctx, B), norm_rms_eps);
+            C = ggml_rms_norm(ctx, ggml_cont(ctx, C), norm_rms_eps);
         }
 
         // {dt_rank, d_inner} @ {dt_rank, n_seq_tokens, n_seqs} => {d_inner, n_seq_tokens, n_seqs}

Either that, or making the CUDA norm kernels support tensors that are non-contiguous along ne[1]. (By the way, GPU support for the Mamba-specific ggml operators is not yet implemented, and I don't know whether it works regardless; it might, or it might not.)

The CPU-only build does not have this problem, because the CPU-only implementation of ggml_rms_norm supports non-contiguous tensors. (this is why this problem was not noticed earlier)
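
If it helps, the same idea as the diff above can be written as a tiny helper. This is just a sketch; rms_norm_cont is a made-up name, and the ggml calls are the ones already used in the diff:

#include "ggml.h"

// Hypothetical helper: only copies when the tensor is actually non-contiguous,
// which is exactly what the CUDA kernel asserts on.
static struct ggml_tensor * rms_norm_cont(struct ggml_context * ctx, struct ggml_tensor * x, float eps) {
    if (!ggml_is_contiguous(x)) {
        x = ggml_cont(ctx, x); // materialize a contiguous copy for the CUDA path
    }
    return ggml_rms_norm(ctx, x, eps);
}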

@MoonRide303
Contributor

(quoting @compilade's reply and suggested patch above)

With this change it ends with llama.cpp\ggml\src\ggml-cuda\norm.cu:156: GGML_ASSERT(ncols % WARP_SIZE == 0) failed.

If GPU support for the Mamba architecture is still not implemented, then maybe this issue shouldn't be closed yet? Most people would like to use the GPU rather than the CPU, I think.

@compilade
Collaborator

@MoonRide303, there is already #6758 which tracks GPU support for Mamba.

@stereoplegic

Not Falcon Mamba, but there is also https://huggingface.co/TRI-ML/mamba-7b-rw/ (a 7B Mamba model trained on RefinedWeb, using the same config, GPT-NeoX tokenizer, and vocab as the original Mamba).
