mtmd : Support Pixtral 12B #13065
Conversation
(force-pushed from e86b7ea to 783de6e)
examples/llava/clip.cpp
Outdated
// for example, if we have a list of inv_freq: 1e-0, 1e-1, 1e-2, 1e-3
// first half will use 1e-0, 1e-2 (even)
// second half will use 1e-1, 1e-3 (odd)
// the trick here is to rotate just half of n_dim, so inv_freq will automatically be even
// ^ don't ask me why, it's math! -2(2i) / n_dim == -2i / (n_dim/2)
// then for the second half, we use freq_scale to shift the inv_freq
// ^ why? replace (2i) with (2i+1) in the above equation
const float freq_scale = std::pow(freq_base, (float)-2/n_dim);
In the python implementation, this is done by first creating the full list of inv_freq, then selecting the even/odd positions with freqs[::2] and freqs[1::2].
Here, with a bit of math, we can achieve the same effect without even materializing the list of inv_freq in the first place.
For even frequencies (index 2i): freq_base^(-2(2i)/n_dim) == freq_base^(-2i/(n_dim/2)), which is just a normal RoPE over n_dim/2 dims.
And for odd (index 2i+1): freq_base^(-2(2i+1)/n_dim) == freq_base^(-2/n_dim) * freq_base^(-2(2i)/n_dim), i.e. the even frequencies shifted by freq_scale = freq_base^(-2/n_dim).
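To make that equivalence concrete, here is a minimal standalone check (not part of the PR; freq_base and n_dim are illustrative values) comparing the even/odd slices of the full inv_freq list against the half-dim RoPE plus freq_scale trick:

// sanity check: even/odd slices of the full inv_freq list vs the half-dim + freq_scale trick
#include <cmath>
#include <cstdio>

int main() {
    const float freq_base = 10000.0f; // illustrative value
    const int   n_dim     = 8;        // illustrative head dim
    const float freq_scale_odd = std::pow(freq_base, (float)-2/n_dim);

    for (int i = 0; i < n_dim/4; i++) {
        // full-dim inv_freq at even/odd slots: base^(-2*(2i)/n_dim) and base^(-2*(2i+1)/n_dim)
        float even_full = std::pow(freq_base, (float)-2*(2*i)   / n_dim);
        float odd_full  = std::pow(freq_base, (float)-2*(2*i+1) / n_dim);
        // half-dim RoPE: base^(-2i/(n_dim/2)), optionally shifted by freq_scale_odd
        float even_half = std::pow(freq_base, (float)-2*i / (n_dim/2));
        float odd_half  = freq_scale_odd * even_half;
        printf("i=%d  even: %g vs %g   odd: %g vs %g\n",
               i, even_full, even_half, odd_full, odd_half);
    }
    return 0;
}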
@ggerganov @HimariO while working on this, I realized that maybe M-RoPE used by Qwen2VL can also be implemented using ggml_view and ggml_rope_ext_inplace.
Please correct me if I'm wrong, but AFAIU the main idea of M-RoPE is to apply the same set of inv_freq to different sections of the embedding vector (so it's not even using odd/even positions like what pixtral does, it's even easier!)
For example, with an embedding vector of 8 elements aabbccdd and mrope_sections=[1, 1, 2], the vector will be split into 3 sections: aa, bb, ccdd. If I have positions [x, y, z], then they will be applied to aa, bb, ccdd respectively.
So, this can be re-implemented using ggml_view to view the specific section. The view will be a non-contiguous tensor, so as long as ggml_rope_ext_inplace works on non-contiguous tensors, we're still OK here.
The problem with ggml_rope_multi is that it is not supported on all backends, so reusing ggml_rope_ext_inplace would allow this to work on all existing backends without additional work.
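For illustration only, a rough ggml-style sketch of that idea, applying a plain rope to each section through a non-contiguous view (sec, pos, and the rope mode are placeholders, not PR code, and whether the per-section frequencies end up matching ggml_rope_multi is exactly the question raised in the next comment):

// schematic only: apply a plain 1D RoPE to each M-RoPE section via non-contiguous views
#include "ggml.h"

static ggml_tensor * mrope_via_views(
        ggml_context * ctx0, ggml_cgraph * gf, ggml_tensor * cur,
        ggml_tensor * pos[3],   // one position tensor per section ([x], [y], [z])
        const int sec[3],       // section sizes in rotation pairs, e.g. {1, 1, 2}
        int n_dim, int n_head, int n_pos,
        float freq_base) {
    int offset = 0; // element offset inside the head dimension
    for (int k = 0; k < 3; k++) {
        const int sec_dims = 2*sec[k]; // elements covered by this section
        ggml_tensor * view = ggml_view_3d(ctx0, cur,
            sec_dims, n_head, n_pos,
            ggml_row_size(cur->type, n_dim),        // stride between heads
            ggml_row_size(cur->type, n_dim*n_head), // stride between positions
            offset * ggml_element_size(cur));
        view = ggml_rope_ext_inplace(ctx0, view,
            pos[k],   // positions for this section
            nullptr,  // freq factors
            sec_dims, // n_dims rotated inside the view
            0, 0, freq_base,
            1.0f, 0.0f, 1.0f, 0.0f, 0.0f);
        ggml_build_forward_expand(gf, view); // modify cur in place
        offset += sec_dims;
    }
    return cur;
}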
Yes, I was looking at this as well today and I think you are correct. I think for now your approach of using ropes + views is better until we gather some experience with vision models and make sure that everything works correctly. Eventually, we would want to move to a more tightly integrated implementation such as ggml_rope_multi that is supported by all backends for better performance, but this can be done at a later stage.
I think the approach you mentioned would work when GGML_ROPE_TYPE_VISION is used. However, applying ggml_rope_ext_inplace separately to each vector section will produce different results compared to ggml_rope_multi when using the GGML_ROPE_TYPE_MROPE mode, due to how theta is computed.
That said, it's been quite a while since I last looked into RoPE-related implementations, so it would be better to just write a simple script to verify this.
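Not the actual kernels, but a tiny sketch of the kind of check that could expose the difference: if each section restarts its own frequency ladder (what a per-section ggml_rope_ext_inplace would do), the theta values differ from taking a slice of one global inv_freq ladder (one reading of how the multi-section mode indexes theta; both indexings here are assumptions for illustration):

// compare inv_freq if each section restarts its own frequency ladder (per-section rope)
// vs. taking a slice of the global ladder -- both indexings are assumptions, not kernel code
#include <cmath>
#include <cstdio>

int main() {
    const float base   = 10000.0f;   // illustrative
    const int   n_dim  = 8;          // head dim from the aabbccdd example
    const int   sec[3] = {1, 1, 2};  // mrope_sections, in rotation pairs

    int global_pair = 0;
    for (int k = 0; k < 3; k++) {
        for (int j = 0; j < sec[k]; j++, global_pair++) {
            float per_section = std::pow(base, (float)-2*j / (2*sec[k])); // restarts at 1.0 each section
            float global      = std::pow(base, (float)-2*global_pair / n_dim);
            printf("section %d pair %d: per-section %g vs global %g\n", k, j, per_section, global);
        }
    }
    return 0;
}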
Tests also pass:
@bartowski1182 Time to make some quants 🚀
HUGE thank you for your changes to converting mmproj files... makes my life so much easier: https://huggingface.co/bartowski/mistral-community_pixtral-12b-GGUF
I'm getting strange responses with references to a grid-like pattern; it's as if images aren't being correctly tokenized. I'm using the quantizations from bartowski and the latest commit merging the pull request.
I've rebuilt llama.cpp with:
I get the exact same kind of glitched responses with the Vulkan backend; sometimes it's "It appears that the image you provided is highly pixelated..." or "A mosaic of various shades...".
I'm using the CUDA backend:
It works very well on the CPU backend. Seems like there is an issue with the image encoding on GPU backends.
It does seem to work if I prepend the command with
It's not seeing that as the Eiffel Tower, but it's seemingly recognizing general image features correctly.
Nevermind, it only kinda works on the CPU backend. Example failure case:
It's possible that my ggml_rope_ext_inplace hack to support 2D RoPE does not work on CUDA, hence the spatial positions get messed up and the model perceives a bunch of weird patterns. Testing more today to confirm. @slaren I suspect that maybe the tensor is copied somewhere instead of being kept in place, do you think so?
Hi @ngxson, is Pixtral supposed to be this token heavy? Looking at https://huggingface.co/ggml-org/pixtral-12b-GGUF/tree/main?show_file_info=mmproj-pixtral-12b-f16.gguf, I see and from the newly added code:
Loading a 1024x1024px image, we get So we need 4159 tokens just to process a single 1024x1024 image? Trying with the updated mtmd-cli, this seems to be what is used... Or are the images supposed to be downscaled first?
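For what it's worth, the 4159 figure is consistent with a 16-pixel patch size and one [IMG_BREAK] token between patch rows; the numbers below are back-of-the-envelope assumptions rather than values read from the gguf:

// rough accounting for a 1024x1024 image, assuming patch_size = 16 and an [IMG_BREAK] between rows
#include <cstdio>

int main() {
    const int image_size = 1024;
    const int patch_size = 16;                      // assumed
    const int per_side   = image_size / patch_size; // 64 patches per side
    const int patches    = per_side * per_side;     // 4096 patch embeddings -> 4096 tokens
    const int breaks     = per_side - 1;            // 63 [IMG_BREAK] tokens between rows (assumed)
    printf("%d tokens\n", patches + breaks);        // 4159, matching the number above
    return 0;
}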
I did some more testing and it seems that there are issues also with the Metal and even the CPU-only implementation. Resizing and cropping a problematic image can make it work correctly, but I don't see a specific pattern. Here is a simple repro using Lenna:
# use original 512x512 image (wrong answer)
./bin/llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF:Q4_K_M -b 8192 -c 8192 --image ~/lenna.png -p "How many people do you see in the image?" --top-k 1
The image shows two people wearing large straw hats with feathers. The hats are prominently featured, and the individuals appear to be posing for the photograph. The background is blurred, which helps to emphasize the subjects in the foreground. The overall atmosphere of the image is warm and vibrant, possibly suggesting a sunny or tropical setting. The individuals expressions and postures are not clearly visible due to the angle and focus of the image.
# resize to 1024x1024
convert ~/lenna.png -resize 200% ~/lenna-1024.png
# use resized 1024x1024 image (correct answer)
./bin/llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF:Q4_K_M -b 8192 -c 8192 --image ~/lenna-1024.png -p "How many people do you see in the image?" --top-k 1
In the image, I see one person. The person is wearing a hat with a wide brim and decorative feathers. The background appears to be indoors with a warm, reddish tone. The person has long hair and is looking directly at the camera.
@LostRuins 1024x1024 image = 1048576 pixels. So, how many tokens do you expect it to use?
I was mostly comparing it to gemma3, which uses 256 tokens for an 896x896 image - and likewise, most of the other multimodal examples use at most 1k-2k tokens at that resolution. Just an observation.
gemma 3 and some other models use a conv2d to compress the output tokens; that works because the image size is fixed. pixtral and qwen2vl support dynamic image sizes, so the N embeddings from the vision encoder are projected as N tokens in the text model.
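For a rough sense of where the 256-token figure comes from (the 14-pixel patch size and the 4x4 downsampling factor below are assumptions for illustration, not values taken from this PR):

// why a fixed-size model with a downsampling step ends up at ~256 tokens
#include <cstdio>

int main() {
    const int image_size = 896;                      // fixed input size
    const int patch_size = 14;                       // assumed
    const int per_side   = image_size / patch_size;  // 64 patches per side
    const int pool       = 4;                        // assumed 4x4 downsampling (conv2d / pooling)
    const int tokens     = (per_side/pool) * (per_side/pool); // 16*16 = 256 tokens
    printf("%d tokens\n", tokens);
    return 0;
}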
@ggerganov Ok, thanks for testing. I confirm that I get the same output as yours in my test. One super weird thing though: if I use a Q8_0 mmproj file, the problem disappears:
The Q8_0 can be generated via:
{
    tmp = ggml_view_3d(ctx0, cur,
        n_dim/2, n_head, n_pos,
        ggml_row_size(cur->type, n_dim),
        ggml_row_size(cur->type, n_dim*n_head),
        n_dim/2 * ggml_element_size(cur));
    tmp = ggml_rope_ext_inplace(
        ctx0,
        tmp,
        pos_w,   // positions
        nullptr, // freq factors
        n_dim/2, // n_dims
        0, 0, freq_base,
        freq_scale_odd,
        0.0f, 1.0f, 0.0f, 0.0f
    );
    // calculate inplace (modify cur directly)
    ggml_build_forward_expand(gf, tmp);
}
Ok, after more tinkering, I discovered that ggml_rope_ext_inplace does not work very well on a non-contiguous tensor. I'm thinking of another hack, or maybe reusing ggml_rope_multi.
@ggerganov the reason why it responds that there are 2 people side-by-side is that the position embedding is incorrect, hence it perceives something like this (this is an illustration I made using Affinity Photo):

On the CUDA backend, this code block makes no changes to the cur tensor, which is why it says that it sees textures.
* add pixtral text model (vision is wip)
* cgraph ok, just missing 2D RoPE
* fix bad rebase
* first working version
* fix problem with img_break token
* support dynamic image size
* update docs
* update test script
Pre-quantized GGUF: https://huggingface.co/ggml-org/pixtral-12b-GGUF
To convert the mmproj yourself, use convert_hf_to_gguf.py with the --mmproj flag.
Demo
$ llama-mtmd-cli -m ../models/pixtral-12b/model.gguf \
    --mmproj ../models/pixtral-12b/mmproj-model.gguf \
    --image ../models/eiffel-tower.jpg -p "what do you see" -c 8000
Result:
The image showcases the Eiffel Tower, a renowned landmark in Paris, France. The tower stands tall and prominent against a backdrop of a partly cloudy sky, with the sun casting a warm glow, suggesting it might be either early morning or late afternoon. The structure of the Eiffel Tower is clearly visible, with its intricate iron latticework and distinct design.
Checklist (done)
* clip.cpp
* [IMG_BREAK] and [IMG_END] tokens