mtmd : Support Pixtral 12B #13065
Conversation
(force-pushed from e86b7ea to 783de6e)
examples/llava/clip.cpp
Outdated
// for example, if we have a list of inv_freq: 1e-0, 1e-1, 1e-2, 1e-3
// first half will use 1e-0, 1e-2 (even)
// second half will use 1e-1, 1e-3 (odd)
// the trick here is to rotate just half of n_dim, so inv_freq will automatically be even
// ^ don't ask me why, it's math! -2(2i) / n_dim == -2i / (n_dim/2)
// then for the second half, we use freq_scale to shift the inv_freq
// ^ why? replace (2i) with (2i+1) in the above equation
const float freq_scale = std::pow(freq_base, (float)-2/n_dim);
In the python implementation, this is done by first creating the full list of inv_freq, then selecting the even/odd positions with freqs[::2] and freqs[1::2].
Here, with a bit of math, we can achieve the same effect without even materializing the list of inv_freq in the first place.
For even frequencies (index 2i): freq_base^(-2(2i)/n_dim) == freq_base^(-2i/(n_dim/2)), which is just a normal RoPE over n_dim/2 dims.
And for odd (index 2i+1): freq_base^(-2(2i+1)/n_dim) == freq_base^(-2/n_dim) * freq_base^(-2(2i)/n_dim), i.e. the even frequencies shifted by freq_scale = freq_base^(-2/n_dim).
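To make that equivalence concrete, here is a minimal standalone check (not part of the PR; freq_base and n_dim are illustrative values) comparing the even/odd slices of the full inv_freq list against the half-dim RoPE plus freq_scale trick:

// sanity check: even/odd slices of the full inv_freq list vs the half-dim + freq_scale trick
#include <cmath>
#include <cstdio>

int main() {
    const float freq_base = 10000.0f; // illustrative value
    const int   n_dim     = 8;        // illustrative head dim
    const float freq_scale_odd = std::pow(freq_base, (float)-2/n_dim);

    for (int i = 0; i < n_dim/4; i++) {
        // full-dim inv_freq at even/odd slots: base^(-2*(2i)/n_dim) and base^(-2*(2i+1)/n_dim)
        float even_full = std::pow(freq_base, (float)-2*(2*i)   / n_dim);
        float odd_full  = std::pow(freq_base, (float)-2*(2*i+1) / n_dim);
        // half-dim RoPE: base^(-2i/(n_dim/2)), optionally shifted by freq_scale_odd
        float even_half = std::pow(freq_base, (float)-2*i / (n_dim/2));
        float odd_half  = freq_scale_odd * even_half;
        printf("i=%d  even: %g vs %g   odd: %g vs %g\n",
               i, even_full, even_half, odd_full, odd_half);
    }
    return 0;
}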
@ggerganov @HimariO while working on this, I realized that maybe M-RoPE used by Qwen2VL can also be implemented using ggml_view and ggml_rope_ext_inplace.
Please correct me if I'm wrong, but AFAIU the main idea of M-RoPE is to apply the same set of inv_freq to different sections of the embedding vector (so it's not even using odd/even positions like what pixtral does, it's even easier!)
For example, with an embedding vector of 8 elements aabbccdd and mrope_sections=[1, 1, 2], the vector will be split into 3 sections: aa, bb, ccdd. If I have positions [x, y, z], then they will be applied to aa, bb, ccdd respectively.
So, this can be re-implemented using ggml_view to view the specific section. The view will be a non-contiguous tensor, so as long as ggml_rope_ext_inplace works on non-contiguous tensors, we're still OK here.
The problem with ggml_rope_multi is that it is not supported on all backends, so reusing ggml_rope_ext_inplace would allow this to work on all existing backends without additional work.
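For illustration only, a rough ggml-style sketch of that idea, applying a plain rope to each section through a non-contiguous view (sec, pos, and the rope mode are placeholders, not PR code, and whether the per-section frequencies end up matching ggml_rope_multi is exactly the question raised in the next comment):

// schematic only: apply a plain 1D RoPE to each M-RoPE section via non-contiguous views
#include "ggml.h"

static ggml_tensor * mrope_via_views(
        ggml_context * ctx0, ggml_cgraph * gf, ggml_tensor * cur,
        ggml_tensor * pos[3],   // one position tensor per section ([x], [y], [z])
        const int sec[3],       // section sizes in rotation pairs, e.g. {1, 1, 2}
        int n_dim, int n_head, int n_pos,
        float freq_base) {
    int offset = 0; // element offset inside the head dimension
    for (int k = 0; k < 3; k++) {
        const int sec_dims = 2*sec[k]; // elements covered by this section
        ggml_tensor * view = ggml_view_3d(ctx0, cur,
            sec_dims, n_head, n_pos,
            ggml_row_size(cur->type, n_dim),        // stride between heads
            ggml_row_size(cur->type, n_dim*n_head), // stride between positions
            offset * ggml_element_size(cur));
        view = ggml_rope_ext_inplace(ctx0, view,
            pos[k],   // positions for this section
            nullptr,  // freq factors
            sec_dims, // n_dims rotated inside the view
            0, 0, freq_base,
            1.0f, 0.0f, 1.0f, 0.0f, 0.0f);
        ggml_build_forward_expand(gf, view); // modify cur in place
        offset += sec_dims;
    }
    return cur;
}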
Yes, I was looking at this as well today and I think you are correct. I think for now your approach of using ropes + views is better until we gather some experience with vision models and make sure that everything works correctly. Eventually, we would want to move to a more tightly integrated implementation such as ggml_rope_multi that is supported by all backends for better performance, but this can be done at a later stage.
I think the approach you mentioned would work when GGML_ROPE_TYPE_VISION is used. However, applying ggml_rope_ext_inplace separately to each vector section will produce different results compared to ggml_rope_multi when using the GGML_ROPE_TYPE_MROPE mode, due to how theta is computed.
That said, it's been quite a while since I last looked into RoPE-related implementations, so it would be better to just write a simple script to verify this.
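Not the actual kernels, but a tiny sketch of the kind of check that could expose the difference: if each section restarts its own frequency ladder (what a per-section ggml_rope_ext_inplace would do), the theta values differ from taking a slice of one global inv_freq ladder (one reading of how the multi-section mode indexes theta; both indexings here are assumptions for illustration):

// compare inv_freq if each section restarts its own frequency ladder (per-section rope)
// vs. taking a slice of the global ladder -- both indexings are assumptions, not kernel code
#include <cmath>
#include <cstdio>

int main() {
    const float base   = 10000.0f;   // illustrative
    const int   n_dim  = 8;          // head dim from the aabbccdd example
    const int   sec[3] = {1, 1, 2};  // mrope_sections, in rotation pairs

    int global_pair = 0;
    for (int k = 0; k < 3; k++) {
        for (int j = 0; j < sec[k]; j++, global_pair++) {
            float per_section = std::pow(base, (float)-2*j / (2*sec[k])); // restarts at 1.0 each section
            float global      = std::pow(base, (float)-2*global_pair / n_dim);
            printf("section %d pair %d: per-section %g vs global %g\n", k, j, per_section, global);
        }
    }
    return 0;
}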
Tests also pass:
@bartowski1182 Time to make some quants 🚀
HUGE thank you for your changes to converting mmproj files... makes my life so much easier: https://huggingface.co/bartowski/mistral-community_pixtral-12b-GGUF
I'm getting strange responses with references to a grid-like pattern; it's as if images aren't being correctly tokenized. I'm using the quantizations from bartowski and the latest commit merging the pull request.
I've rebuilt llama.cpp with:
I get the exact same kind of glitched responses with the Vulkan backend; sometimes it's "It appears that the image you provided is highly pixelated..." or "A mosaic of various shades...".
I'm using the CUDA backend:
It works very well on the CPU backend. Seems like there is an issue with the image encoding on GPU backends.
It does seem to work if I prepend the command with
It's not seeing that as the Eiffel Tower, but it's seemingly recognizing general image features correctly.
Nevermind, it only kinda works on the CPU backend. Example failure case:
It's possible that my ggml_rope_ext_inplace hack to support 2D RoPE does not work on CUDA, hence the spatial positions get messed up and the model perceives a bunch of weird patterns. Testing more today to confirm. @slaren I suspect that maybe the tensor is copied somewhere instead of being kept in place, do you think so?
Hi @ngxson, is Pixtral supposed to be this token heavy? Looking at https://huggingface.co/ggml-org/pixtral-12b-GGUF/tree/main?show_file_info=mmproj-pixtral-12b-f16.gguf, I see and from the newly added code:
Loading a 1024x1024px image, we get So we need 4159 tokens just to process a single 1024x1024 image? Trying with the updated mtmd-cli, this seems to be what is used... Or are the images supposed to be downscaled first?
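For what it's worth, the 4159 figure is consistent with a 16-pixel patch size and one [IMG_BREAK] token between patch rows; the numbers below are back-of-the-envelope assumptions rather than values read from the gguf:

// rough accounting for a 1024x1024 image, assuming patch_size = 16 and an [IMG_BREAK] between rows
#include <cstdio>

int main() {
    const int image_size = 1024;
    const int patch_size = 16;                      // assumed
    const int per_side   = image_size / patch_size; // 64 patches per side
    const int patches    = per_side * per_side;     // 4096 patch embeddings -> 4096 tokens
    const int breaks     = per_side - 1;            // 63 [IMG_BREAK] tokens between rows (assumed)
    printf("%d tokens\n", patches + breaks);        // 4159, matching the number above
    return 0;
}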
I did some more testing and it seems that there are issues also with the Metal and even the CPU-only implementation. Resizing and cropping a problematic image can make it work correctly, but I don't see a specific pattern. Here is a simple repro using Lenna:
# use original 512x512 image (wrong answer)
./bin/llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF:Q4_K_M -b 8192 -c 8192 --image ~/lenna.png -p "How many people do you see in the image?" --top-k 1
The image shows two people wearing large straw hats with feathers. The hats are prominently featured, and the individuals appear to be posing for the photograph. The background is blurred, which helps to emphasize the subjects in the foreground. The overall atmosphere of the image is warm and vibrant, possibly suggesting a sunny or tropical setting. The individuals expressions and postures are not clearly visible due to the angle and focus of the image.
# resize to 1024x1024
convert ~/lenna.png -resize 200% ~/lenna-1024.png
# use resized 1024x1024 image (correct answer)
./bin/llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF:Q4_K_M -b 8192 -c 8192 --image ~/lenna-1024.png -p "How many people do you see in the image?" --top-k 1
In the image, I see one person. The person is wearing a hat with a wide brim and decorative feathers. The background appears to be indoors with a warm, reddish tone. The person has long hair and is looking directly at the camera.
@LostRuins 1024x1024 image = 1048576 pixels. So, how many tokens do you expect it to use?
I was mostly comparing it to gemma3, which uses 256 tokens for an 896x896 image - and likewise, most of the other multimodal examples use at most 1k-2k tokens at that resolution. Just an observation.
gemma 3 and some other models use a conv2d to compress the output tokens; that works because the image size is fixed. pixtral and qwen2vl support dynamic image sizes, so the N embeddings from the vision encoder are projected as N tokens in the text model.
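For a rough sense of where the 256-token figure comes from (the 14-pixel patch size and the 4x4 downsampling factor below are assumptions for illustration, not values taken from this PR):

// why a fixed-size model with a downsampling step ends up at ~256 tokens
#include <cstdio>

int main() {
    const int image_size = 896;                      // fixed input size
    const int patch_size = 14;                       // assumed
    const int per_side   = image_size / patch_size;  // 64 patches per side
    const int pool       = 4;                        // assumed 4x4 downsampling (conv2d / pooling)
    const int tokens     = (per_side/pool) * (per_side/pool); // 16*16 = 256 tokens
    printf("%d tokens\n", tokens);
    return 0;
}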
@ggerganov Ok, thanks for testing. I confirm that I get the same output as yours in my test. One super weird thing though: if I use a Q8_0 mmproj file, the problem disappears:
The Q8_0 can be generated via:
{
    tmp = ggml_view_3d(ctx0, cur,
        n_dim/2, n_head, n_pos,
        ggml_row_size(cur->type, n_dim),
        ggml_row_size(cur->type, n_dim*n_head),
        n_dim/2 * ggml_element_size(cur));
    tmp = ggml_rope_ext_inplace(
        ctx0,
        tmp,
        pos_w,   // positions
        nullptr, // freq factors
        n_dim/2, // n_dims
        0, 0, freq_base,
        freq_scale_odd,
        0.0f, 1.0f, 0.0f, 0.0f
    );
    // calculate inplace (modify cur directly)
    ggml_build_forward_expand(gf, tmp);
}
Ok, after more tinkering, I discovered that ggml_rope_ext_inplace does not work very well on a non-contiguous tensor. I'm thinking of another hack, or maybe reusing ggml_rope_multi.
@ggerganov the reason why it responds that there are 2 people side-by-side is that the position embedding is incorrect, hence it perceives something like this (this is an illustration I made using Affinity Photo):

On the CUDA backend, this code block makes no changes to the cur tensor, which is why it says that it sees textures.
* add pixtral text model (vision is wip)
* cgraph ok, just missing 2D RoPE
* fix bad rebase
* first working version
* fix problem with img_break token
* support dynamic image size
* update docs
* update test script
Pre-quantized GGUF: https://huggingface.co/ggml-org/pixtral-12b-GGUF
To convert the mmproj yourself, use convert_hf_to_gguf.py with the --mmproj flag.
Demo
$ llama-mtmd-cli -m ../models/pixtral-12b/model.gguf \
    --mmproj ../models/pixtral-12b/mmproj-model.gguf \
    --image ../models/eiffel-tower.jpg -p "what do you see" -c 8000
Result:
The image showcases the Eiffel Tower, a renowned landmark in Paris, France. The tower stands tall and prominent against a backdrop of a partly cloudy sky, with the sun casting a warm glow, suggesting it might be either early morning or late afternoon. The structure of the Eiffel Tower is clearly visible, with its intricate iron latticework and distinct design.
Checklist (done)
* clip.cpp
* [IMG_BREAK] and [IMG_END] tokens