Feature Request: Support Falcon Mamba 7B #9009


Closed
4 tasks done
devlux76 opened this issue Aug 12, 2024 · 8 comments · Fixed by #9074
Labels
enhancement New feature or request

Comments

@devlux76

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Please support Falcon Mamba 7B from TII (Technology Innovation Institute, UAE).

Motivation

Support for all models is helpful.

My acid test for whether a model will run is to try to make a quant using "GGUF-my-repo".

Admittedly the model is hot off the presses, but it ought to run at least in theory, and it doesn't.

Error: Error converting to fp16: b'INFO:hf-to-gguf:Loading model: falcon-mamba-7b\nERROR:hf-to-gguf:Model FalconMambaForCausalLM is not supported\n'

Possible Implementation

They discuss an implementation here: https://falconllm.tii.ae/tii-releases-first-sslm-with-falcon-mamba-7b.html

Support for any functional Mamba or Mamba 2 models would be great, but this one is slightly modified from the original Mamba.

devlux76 added the enhancement (New feature or request) label Aug 12, 2024
@Zibri

Zibri commented Aug 14, 2024

I agree. Please support it. https://huggingface.co/tiiuae/falcon-mamba-7b-instruct/tree/main

@compilade
Collaborator

This is a Mamba model (not Mamba 2), so it should be relatively easy to adapt the existing Mamba support. The only difference I see is that FalconMambaForCausalLM seems to use RMS norms for dt, B and C while the original Mamba does not.

Since these norms don't use any learned parameters, this will require either a new metadata field to signal that norms are applied to dt, B and C, or a new architecture (e.g. falcon_mamba). I would prefer to re-use the existing mamba architecture, because everything else is identical between the two apart from the extra norms.
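
To illustrate what "no learned parameters" means here: a parameter-free RMS norm is just a rescale by the root mean square. The sketch below is illustrative C++, not code from any repo, and the eps value would really come from the model's norm_rms_eps rather than being hard-coded:

// Sketch only: the parameter-free RMS norm applied to dt, B and C.
// Since there is no learned per-channel weight, no new tensors are needed
// in the GGUF file, only a metadata flag saying the norm is applied.
#include <cmath>
#include <vector>

static void rms_norm_inplace(std::vector<float> & x, float eps) {
    float sum_sq = 0.0f;
    for (float v : x) {
        sum_sq += v * v;
    }
    const float scale = 1.0f / std::sqrt(sum_sq / (float) x.size() + eps);
    for (float & v : x) {
        v *= scale; // no learned weight is multiplied in here
    }
}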

@younesbelkada
Contributor

Converted models can be found here: https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a

@MoonRide303
Contributor

Conversion (using convert_hf_to_gguf.py --outtype q8_0 ..\falcon-mamba-7b-instruct\ --outfile falcon-mamba-7b-instruct-Q8_0.gguf from b3616) worked fine.

Trying to run the server (also b3616) with llama-server.exe -v -ngl 99 -m falcon-mamba-7b-instruct-Q8_0.gguf -c 8192 fails with llama.cpp\ggml\src\ggml-cuda\norm.cu:212: GGML_ASSERT(ggml_is_contiguous(src0)) failed.

Same problem when trying to use official GGUF from https://huggingface.co/tiiuae/falcon-mamba-7b-instruct-Q8_0-GGUF repo.
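
For what it's worth, a CPU-only run of the same command (adding -ngl 0 so nothing is offloaded) might help confirm whether this is specific to the CUDA norm kernel; I haven't verified that with this model, though.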

@compilade
Collaborator

compilade commented Aug 22, 2024

@MoonRide303

Oh, right. When split, dt, B and C are not contiguous.

I think the fix might be something like this:

diff --git a/src/llama.cpp b/src/llama.cpp
index bd7f1508..57fa5450 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -9118,9 +9118,9 @@ static struct ggml_tensor * llm_build_mamba(
 
         // Some Mamba variants (e.g. FalconMamba) apply RMS norm in B, C & Dt layers
         if (ssm_dt_b_c_rms) {
-            dt = ggml_rms_norm(ctx, dt, norm_rms_eps);
-            B = ggml_rms_norm(ctx, B, norm_rms_eps);
-            C = ggml_rms_norm(ctx, C, norm_rms_eps);
+            dt = ggml_rms_norm(ctx, ggml_cont(ctx, dt), norm_rms_eps);
+            B = ggml_rms_norm(ctx, ggml_cont(ctx, B), norm_rms_eps);
+            C = ggml_rms_norm(ctx, ggml_cont(ctx, C), norm_rms_eps);
         }
 
         // {dt_rank, d_inner} @ {dt_rank, n_seq_tokens, n_seqs} => {d_inner, n_seq_tokens, n_seqs}

Either that, or making the CUDA norm kernels support tensors that are non-contiguous along ne[1]. (By the way, GPU support for the Mamba-specific ggml operators is not yet implemented, and I don't know whether it works regardless; it might, or it might not.)

The CPU-only build does not have this problem, because the CPU-only implementation of ggml_rms_norm supports non-contiguous tensors. (this is why this problem was not noticed earlier)
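
If it helps, the same idea as the diff above can be written as a tiny helper. This is just a sketch; rms_norm_cont is a made-up name, and the ggml calls are the ones already used in the diff:

#include "ggml.h"

// Hypothetical helper: only copies when the tensor is actually non-contiguous,
// which is exactly what the CUDA kernel asserts on.
static struct ggml_tensor * rms_norm_cont(struct ggml_context * ctx, struct ggml_tensor * x, float eps) {
    if (!ggml_is_contiguous(x)) {
        x = ggml_cont(ctx, x); // materialize a contiguous copy for the CUDA path
    }
    return ggml_rms_norm(ctx, x, eps);
}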

@MoonRide303
Contributor

(quoting @compilade's reply and suggested patch above)

With this change it ends with llama.cpp\ggml\src\ggml-cuda\norm.cu:156: GGML_ASSERT(ncols % WARP_SIZE == 0) failed.

If GPU support for the Mamba architecture is still not implemented, then maybe this issue shouldn't be closed yet? Most people would like to use the GPU rather than the CPU, I think.

@compilade
Collaborator

@MoonRide303, there is already #6758 which tracks GPU support for Mamba.

@stereoplegic

Not Falcon Mamba, but there is also https://huggingface.co/TRI-ML/mamba-7b-rw/ (a 7B Mamba model trained on RefinedWeb, using the same config, GPT-NeoX tokenizer, and vocab as the original Mamba).
