
mtmd : Support Pixtral 12B #13065


Merged

merged 8 commits into from Apr 23, 2025
Conversation

ngxson
Collaborator

@ngxson ngxson commented Apr 22, 2025

Pre-quantized GGUF: https://huggingface.co/ggml-org/pixtral-12b-GGUF

llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF

To convert the mmproj yourself, use convert_hf_to_gguf.py with --mmproj flag
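For example, reusing the command shown later in this thread (a local model directory in place of --remote should also work):

python convert_hf_to_gguf.py --remote mistral-community/pixtral-12b --mmproj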

Demo

$ llama-mtmd-cli -m ../models/pixtral-12b/model.gguf \
  --mmproj ../models/pixtral-12b/mmproj-model.gguf \
  --image ../models/eiffel-tower.jpg -p "what do you see" -c 8000

Result:

The image showcases the Eiffel Tower, a renowned landmark in Paris, France. The tower stands tall and prominent against a backdrop of a partly cloudy sky, with the sun casting a warm glow, suggesting it might be either early morning or late afternoon. The structure of the Eiffel Tower is clearly visible, with its intricate iron latticework and distinct design.

Checklist (done)

@github-actions github-actions bot added the examples and python (python script changes) labels Apr 22, 2025
@ngxson ngxson force-pushed the xsn/mtmd_pixtral branch from e86b7ea to 783de6e on April 22, 2025 14:32
@ngxson ngxson changed the title from "mtmd : Support Pixtral 12B (help needed - 2D RoPE)" to "mtmd : Support Pixtral 12B" Apr 22, 2025
Comment on lines 572 to 579
// for example, if we have a list of inv_freq: 1e-0, 1e-1, 1e-2, 1e-3
// first half will use 1e-0, 1e-2 (even)
// second half will use 1e-1, 1e-3 (odd)
// the trick here is to rotate just half of n_dim, so inv_freq will automatically be even
// ^ don't ask me why, it's math! -2(2i) / n_dim == -2i / (n_dim/2))
// then for the second half, we use freq_scale to shift the inv_freq
// ^ why? replace (2i) with (2i+1) in the above equation
const float freq_scale = std::pow(freq_base, (float)-2/n_dim);
Collaborator Author

@ngxson ngxson Apr 22, 2025

In the Python implementation, this is done by first creating the list of inv_freq, then selecting the even/odd positions via freqs[::2] and freqs[1::2].

Here, with a bit of math, we can achieve the same effect without even materializing the list of inv_freq in the first place.

For the even frequencies:

$$\text{freq\_base}^{\frac{-2 \cdot 2i}{n_{\text{dim}}}} = \text{freq\_base}^{\frac{-2i}{n_{\text{dim}}/2}}$$

and for the odd ones:

$$\begin{align*} \text{freq\_base}^{\frac{-2(2i+1)}{n_{\text{dim}}}} &= \text{freq\_base}^{\frac{-2 \cdot 2i - 2}{n_{\text{dim}}}} \\ &= \text{freq\_base}^{\frac{-2i}{n_{\text{dim}}/2} + \frac{-2}{n_{\text{dim}}}} \\ &= \text{freq\_base}^{\frac{-2i}{n_{\text{dim}}/2}} \cdot \text{freq\_base}^{\frac{-2}{n_{\text{dim}}}} \end{align*}$$
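A quick numeric sanity check of this equivalence (a standalone sketch, not part of the PR; the freq_base and n_dim values are arbitrary):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const float freq_base = 10000.0f;
    const int   n_dim     = 8; // illustrative rotary dimension

    // reference: the full list of inv_freq, as the Python implementation builds it
    std::vector<float> freqs;
    for (int j = 0; j < n_dim / 2; ++j) {
        freqs.push_back(std::pow(freq_base, -2.0f * j / n_dim));
    }

    // the trick: rotating only n_dim/2 dims yields the even entries automatically,
    // and multiplying by freq_scale shifts them onto the odd entries
    const float freq_scale = std::pow(freq_base, -2.0f / n_dim);
    for (int i = 0; i < n_dim / 4; ++i) {
        const float even = std::pow(freq_base, -2.0f * i / (n_dim / 2));
        const float odd  = even * freq_scale;
        std::printf("i=%d  even %.6f vs freqs[2i] %.6f  |  odd %.6f vs freqs[2i+1] %.6f\n",
                    i, even, freqs[2 * i], odd, freqs[2 * i + 1]);
    }
    return 0;
}
```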

Collaborator Author

@ngxson ngxson Apr 23, 2025

@ggerganov @HimariO while working on this, I realized that maybe M-RoPE used by Qwen2VL can also be implemented using ggml_view and ggml_rope_ext_inplace

Please correct me if I'm wrong; AFAIU the main idea of M-RoPE is to apply the same set of inv_freq to different sections of the embedding vector (so it's not even using odd/even positions like Pixtral does, but even easier!)

For example, with an embedding vector of 8 elements, aabbccdd, and mrope_sections=[1, 1, 2]:
The vector will be split into 3 sections: aa, bb, ccdd
If I have positions [x, y, z], then they will be applied to aa, bb, ccdd respectively
So, this can be re-implemented using ggml_view to view the specific section. The view will be a non-contiguous tensor, so as long as ggml_rope_ext_inplace works on non-contiguous tensors, we're still OK here (see the sketch below).


The problem with ggml_rope_multi is that it is not supported on all backends, so re-using ggml_rope_ext_inplace allows this to be supported on all existing backends without additional work.
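For illustration, a minimal scalar sketch of that per-section idea (this is not the PR's ggml code; it assumes the interleaved pair convention and that mrope_sections counts dim-pairs, and it keeps the regular RoPE inv_freq at the global pair index while only switching the position component per section — whether that matches ggml_rope_multi exactly is the theta question raised further down):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Rotate the vector pair by pair. Pair i uses inv_freq = freq_base^(-2i/n_dim)
// as in regular RoPE, but the position component depends on which section the
// pair belongs to (sections given in pair counts, e.g. {1, 1, 2} for "aabbccdd").
static void mrope_sketch(std::vector<float> & v,
                         const std::vector<int> & sections,  // pair count per section
                         const std::vector<float> & pos,     // one position per section
                         float freq_base) {
    const int n_dim = (int) v.size();
    int section = 0;
    int pairs_left = sections[0];
    for (int i = 0; i < n_dim / 2; ++i) {
        while (pairs_left == 0) {
            pairs_left = sections[++section];
        }
        const float inv_freq = std::pow(freq_base, -2.0f * i / n_dim);
        const float theta    = pos[section] * inv_freq;
        const float x0 = v[2 * i];
        const float x1 = v[2 * i + 1];
        v[2 * i]     = x0 * std::cos(theta) - x1 * std::sin(theta);
        v[2 * i + 1] = x0 * std::sin(theta) + x1 * std::cos(theta);
        --pairs_left;
    }
}

int main() {
    std::vector<float> v = {1, 0, 1, 0, 1, 0, 1, 0}; // "aabbccdd"
    // positions [x, y, z] applied to sections aa, bb, ccdd respectively
    mrope_sketch(v, {1, 1, 2}, {3.0f, 5.0f, 7.0f}, 10000.0f);
    for (float x : v) {
        std::printf("%.4f ", x);
    }
    std::printf("\n");
    return 0;
}
```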

Member

Yes, I was looking at this as well today and I think you are correct. I think for now your approach of using ropes + views is better until we gather some experience with vision models and make sure that everything works correctly. Eventually, we would want a more tightly integrated implementation such as ggml_rope_multi that is supported by all backends for better performance, but this can be done at a later stage.

Contributor

I think the approach you mentioned would work when GGML_ROPE_TYPE_VISION is used. However, applying ggml_rope_ext_inplace separately to each vector section will produce different results compared to ggml_rope_multi when using the GGML_ROPE_TYPE_MROPE mode, due to how theta is computed.

That said, it's been quite a while since I last looked into RoPE-related implementations, so it would be better to just write a simple script to verify this.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 23, 2025
@ngxson ngxson marked this pull request as ready for review April 23, 2025 15:08
@ngxson ngxson requested a review from ggerganov April 23, 2025 15:09
@ngxson
Collaborator Author

ngxson commented Apr 23, 2025

Tests also pass:

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli guinmoon/MobileVLM-3B-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-qwen2vl-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M

@ngxson ngxson merged commit ecda2ec into ggml-org:master Apr 23, 2025
51 checks passed
@ngxson
Collaborator Author

ngxson commented Apr 23, 2025

@bartowski1182 Time to make some quants 🚀

@bartowski1182
Contributor

HUGE thank you for your changes to converting mmproj files... makes my life so much easier

https://huggingface.co/bartowski/mistral-community_pixtral-12b-GGUF

@BugReporterZ

I'm getting strange responses with references to a grid-like pattern; it's as if images aren't being correctly tokenized. I'm using bartowski's quantizations and the latest master commit that merges this pull request.

./build/bin/llama-mtmd-cli -m mistral-community_pixtral-12b-Q8_0.gguf --mmproj mmproj-mistral-community_pixtral-12b-f16.gguf -ngl 99 --image tour-eiffel.png -p "What do you see?" -c 8000

It seems like you're displaying a collection of images with a grid layout. The images are of various sizes and shapes, and they appear to be scattered randomly across the grid. The background color of the grid is a dark blue, which contrasts with the lighter colors of the images.

(image attached)

I've rebuilt llama.cpp with:

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON
cmake --build build --config Release -j 20

@stduhpf
Contributor

stduhpf commented Apr 23, 2025

> It seems like you're displaying a collection of images with a grid layout. The images are of various sizes and shapes, and they appear to be scattered randomly across the grid. The background color of the grid is a dark blue, which contrasts with the lighter colors of the images.

I get the exact same kind of glitched responses with the Vulkan backend; sometimes it's "It appears that the image you provided is highly pixelated..." or "A mosaic of various shades...".

@BugReporterZ

I'm using the CUDA backend:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 5174 (56304069) with x86_64-conda-linux-gnu-cc (Anaconda gcc) 11.2.0 for x86_64-conda-linux-gnu

@stduhpf
Contributor

stduhpf commented Apr 23, 2025

It works very well on the CPU backend. Seems like there is an issue with the image encoding on GPU backends.

@BugReporterZ

It does seem to work if I prepend the command with CUDA_VISIBLE_DEVICES=-1, disabling the GPU.

The image depicts a nighttime scene of a large, intricate structure illuminated against a dark sky. The structure appears to be a complex network of interconnected geometric shapes, possibly made of wood or another similar material. The lighting highlights the various angles and patterns of the structure, creating a visually striking display.

In the foreground, there is a crowd of people gathered, suggesting that this might be an event or festival. The people are standing and observing the illuminated structure, indicating that it is likely a point of interest or attraction.

The overall atmosphere of the image is festive and communal, with the illuminated structure serving as a focal point for the gathering. The intricate design and lighting of the structure suggest that it could be part of an art installation or a temporary architectural piece designed for a specific event.

It's not seeing that as the Eiffel Tower, but it's seemingly recognizing general image features correctly.

@stduhpf
Contributor

stduhpf commented Apr 24, 2025

> It works very well on the CPU backend. Seems like there is an issue with the image encoding on GPU backends.

Never mind, it only kinda works on the CPU backend.

Example failure case
./buildCPU/bin/Release/llama-mtmd-cli.exe -m ./models/pixtral-12b-Q4_K_M.gguf --mmproj ./models/mmproj-pixtral-12b-f16.gguf --image ../images/output.png -p "Describe this image in detail." -c 8000

(image attached: output - Copy (8))

The image appears to be a digital art piece or a manipulated photograph that creates a distorted and abstract representation of a group of people. Here is a detailed description:

  1. Subjects: The image features numerous figures, primarily of people. The figures are depicted from the back or side, showing their upper bodies and heads. They are wearing casual clothing, including t-shirts and shorts.

  2. Color and Style: The color palette is muted and desaturated, with a predominant use of grays, blacks, and whites. The figures have a somewhat monochromatic appearance, with subtle hints of color in their hair and clothing.

  3. Distortion: The image is heavily distorted, with the figures appearing fragmented and overlapping. This gives the impression of a kaleidoscopic or mirrored effect, where the figures repeat and blend into each other.

  4. Background: The background is slightly blurred, showing what appears to be an outdoor setting with buildings and possibly a street. The buildings have a European architectural style, suggesting an urban environment.

  5. Composition: The composition is dynamic, with the figures arranged in a way that creates a sense of movement and depth. The overlapping and repetition of the figures create a layered effect, adding to the complexity of the image.

  6. Artistic Elements: The overall style of the image suggests an artistic or conceptual approach, possibly aiming to convey a sense of crowd, anonymity, or the human experience in an urban setting.

This detailed description should help you understand the content and style of the image.

@ngxson
Collaborator Author

ngxson commented Apr 24, 2025

It's possible that my ggml_rope_ext_inplace hack to support 2D RoPE does not work on CUDA, hence the spatial positions get messed up and the model perceives a bunch of weird patterns.

Testing more today to confirm.

@slaren I suspect that maybe the tensor is copied somewhere instead of being kept in place, do you think so?

@LostRuins
Collaborator

Hi @ngxson, is Pixtral supposed to be so token heavy?

Looking at https://huggingface.co/ggml-org/pixtral-12b-GGUF/tree/main?show_file_info=mmproj-pixtral-12b-f16.gguf, I see
clip.vision.patch_size = 16

and from the newly added code:

 if (ctx->proj_type == PROJECTOR_TYPE_PIXTRAL) {
        int n_patches_x = img->nx / params.patch_size;
        int n_patches_y = img->ny / params.patch_size;
        n_patches = n_patches_y*n_patches_x + n_patches_y - 1; // + one [IMG_BREAK] per row, except the last row

Loading a 1024x1024px image, we get
n_patches_x = (1024/16) = 64
n_patches_y = (1024/16) = 64
n_patches = 64*64 + 64-1 = 4159

So we need 4159 tokens just to process a single 1024x1024 image? Trying with the updated mtmd-cli, and this seems to be what is used... Or are the images supposed to be downscaled first?
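For reference, a tiny standalone sketch that reproduces this count from the formula quoted above (image sizes are just examples):

```cpp
#include <cstdio>

// Token count per the snippet above: one token per patch,
// plus one [IMG_BREAK] token per row except the last.
static int pixtral_n_tokens(int nx, int ny, int patch_size) {
    const int n_patches_x = nx / patch_size;
    const int n_patches_y = ny / patch_size;
    return n_patches_y * n_patches_x + n_patches_y - 1;
}

int main() {
    std::printf("512x512   -> %d tokens\n", pixtral_n_tokens(512, 512, 16));    // 1055
    std::printf("1024x1024 -> %d tokens\n", pixtral_n_tokens(1024, 1024, 16));  // 4159
    return 0;
}
```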

@ggerganov
Member

I did some more testing and it seems that there are issues with the Metal and even the CPU-only implementations as well. Resizing and cropping a problematic image can make it work correctly, but I don't see a specific pattern.

Here is a simple repro using Lenna:

# use original 512x512 image, (wrong answer)
./bin/llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF:Q4_K_M -b 8192 -c 8192 --image ~/lenna.png -p "How many people do you see in the image?" --top-k 1 

The image shows two people wearing large straw hats with feathers. The hats are prominently featured, and the individuals appear to be posing for the photograph. The background is blurred, which helps to emphasize the subjects in the foreground. The overall atmosphere of the image is warm and vibrant, possibly suggesting a sunny or tropical setting. The individuals expressions and postures are not clearly visible due to the angle and focus of the image.

# resize to 1024x1024
convert ~/lenna.png -resize 200% ~/lenna-1024.png

# use resized 1024x1024 image, (correct answer)
./bin/llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF:Q4_K_M -b 8192 -c 8192 --image ~/lenna-1024.png -p "How many people do you see in the image?" --top-k 1 

In the image, I see one person. The person is wearing a hat with a wide brim and decorative feathers. The background appears to be indoors with a warm, reddish tone. The person has long hair and is looking directly at the camera.

@ngxson
Collaborator Author

ngxson commented Apr 24, 2025

> So we need 4159 tokens just to process a single 1024x1024 image? Trying with the updated mtmd-cli, and this seems to be what is used... Or are the images supposed to be downscaled first?

@LostRuins A 1024x1024 image = 1,048,576 pixels. So, how many tokens would you expect it to use?

@LostRuins
Collaborator

I was mostly comparing it to gemma3, which uses 256 tokens for a 896x896 image - and likewise, most of the other multimodal examples use at most between 1k-2k tokens at that resolution. Just an observation.

@ngxson
Collaborator Author

ngxson commented Apr 24, 2025

Gemma 3 and some other models use a conv2d to compress the output tokens; this works because the image size is fixed.

Pixtral and Qwen2VL support dynamic image sizes, so N embeddings from the vision encoder are projected as N tokens in the text model.

@ngxson
Collaborator Author

ngxson commented Apr 24, 2025

@ggerganov Ok, thanks for testing. I confirm that I get the same output as you in my test.

One super weird thing though: if I use a Q8_0 mmproj file, the problem disappears:

The image shows a woman wearing a wide-brimmed hat with feathers. The hat appears to be made of straw or a similar material. The woman has long, dark hair that is partially visible beneath the hat. The background is blurred, but it seems to be an outdoor setting with some vertical structures, possibly wooden poles or beams. The overall tone of the image has a warm, vintage feel, possibly due to the color grading or filter applied.

The Q8_0 can be generated via: python convert_hf_to_gguf.py --outtype q8_0 --remote mistral-community/pixtral-12b --mmproj

Comment on lines +597 to +615
{
tmp = ggml_view_3d(ctx0, cur,
n_dim/2, n_head, n_pos,
ggml_row_size(cur->type, n_dim),
ggml_row_size(cur->type, n_dim*n_head),
n_dim/2 * ggml_element_size(cur));
tmp = ggml_rope_ext_inplace(
ctx0,
tmp,
pos_w, // positions
nullptr, // freq factors
n_dim/2, // n_dims
0, 0, freq_base,
freq_scale_odd,
0.0f, 1.0f, 0.0f, 0.0f
);
// calculate inplace (modify cur directly)
ggml_build_forward_expand(gf, tmp);
}
Collaborator Author

@ngxson ngxson Apr 24, 2025

Ok, after more tinkering, I discovered that ggml_rope_ext_inplace does not work very well on a non-contiguous tensor. I'm thinking of another hack, or maybe reusing ggml_rope_multi.

@ggerganov the reason why it responds that there are 2 people side-by-side is that the position embedding is incorrect, hence it perceives something like this (this is an illustration I made using Affinity Photo):

(illustration attached)

On the CUDA backend, this code block makes no changes to the cur tensor, which is why it says it sees a texture.

pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
* add pixtral text model (vision is wip)

* cgraph ok, just missing 2D RoPE

* fix bad rebase

* first working version

* fix problem with img_break token

* support dynamic image size

* update docs

* update test script
Labels
documentation (Improvements or additions to documentation), examples, python (python script changes)

7 participants