PHI3-vision gguf conversion #7705

Open
farris wants to merge 2 commits into master from farris-phi3v

Conversation

farris commented Jun 3, 2024

This PR adds functionality to convert Phi-3-vision-128k-instruct to gguf format.

  • This heavily relies on existing scripts and logic under the llava/ directory
  • The process is as follows (a hedged sketch of steps 3-4 is included right after this list):
  1. Strip the CLIP encoder from the phi3v base model using the llava/ scripts
  2. Convert the CLIP encoder to gguf
  3. Take the decoder weights (language-model weights only) from phi3v and assign them to a regular Phi-3-instruct model
  4. Convert the result into a .gguf
  5. This seems to work pretty well 🚀; see the example below
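For anyone who wants to follow along, here is a hedged sketch of steps 3-4. The model IDs, the `model.vision_embed_tokens.` prefix filter, and the output path are my assumptions based on the Hugging Face checkpoints, not code from this PR:

```python
# Hedged sketch: copy the language-model weights from phi3-vision into a plain
# Phi-3-instruct checkpoint, then convert that checkpoint to .gguf as usual.
from transformers import AutoModelForCausalLM

vision = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct", trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True)

# Keep only the decoder (language-model) tensors; drop the CLIP/projector ones.
lm_weights = {k: v for k, v in vision.state_dict().items()
              if not k.startswith("model.vision_embed_tokens.")}
base.load_state_dict(lm_weights, strict=False)

# Step 4: save the result and run the existing HF-to-gguf conversion on it.
base.save_pretrained("phi3-base-with-phi3v-weights")
```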

Eventually this should be cleaned up and potentially separated from the llava/ directory; even though these models are basically the same, the nomenclature might be a bit confusing. @ggerganov

Lastly, I have very little C++ knowledge; I was only able to get this to work by muting some ggml library assertions, so this needs to be examined more closely.

phi3k-vision-github.mov

github-actions bot added the examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 3, 2024
mofosyne added the Review Complexity : Low (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) label on Jun 3, 2024
ggml.c (outdated diff)
-GGML_ASSERT(ggml_can_mul_mat(a, b));
-GGML_ASSERT(!ggml_is_transposed(a));
+// GGML_ASSERT(ggml_can_mul_mat(a, b));
+// GGML_ASSERT(!ggml_is_transposed(a));
Member

These asserts should not be removed. If you hit them, then there is most likely something wrong with the input data

Author

I think I found the issue:
Screenshot 2024-06-03 at 12 02 51 PM
tensor b should go from [1024, 576] to [4096, 576].

For LLaVA these dimensions are right:
Screenshot 2024-06-03 at 12 04 29 PM

But for phi3v we need the mm_projector weight tensor to be [4096, 576]:
Screenshot 2024-06-03 at 12 05 11 PM

Any idea on how to fix this?
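For context, here is a minimal sketch (not the ggml API) of the core shape condition behind ggml_can_mul_mat that makes the muted assertion fire. The projector-weight second dimension of 3072 is my assumption; the other shapes are taken from the screenshots above:

```python
# Simplified shape check mirroring ggml_can_mul_mat: the first ("ne[0]")
# dimension of both tensors must match (broadcasting dims are ignored here).
def can_mul_mat(ne_a, ne_b):
    return ne_a[0] == ne_b[0]

proj_weight = (4096, 3072)   # assumed phi3v mm_projector weight, ne[0] = 4096
clip_output = (1024, 576)    # CLIP output as shown above, ne[0] = 1024

print(can_mul_mat(proj_weight, clip_output))   # False -> GGML_ASSERT fires
print(can_mul_mat(proj_weight, (4096, 576)))   # True once b is [4096, 576]
```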

github-actions bot commented Jun 3, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 550 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8524.42ms p(95)=21641.03ms fails=, finish reason: stop=495 truncated=55
  • Prompt processing (pp): avg=91.62tk/s p(95)=369.15tk/s
  • Token generation (tg): avg=33.04tk/s p(95)=48.15tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=farris-phi3v commit=efeaeaf79fe855312b18f68c3760d727d42c9bbf

[Benchmark charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 550 iterations)]

farris commented Jun 5, 2024

I added some projection handling for phi3v in clip.cpp. This is not the actual way the tensor from CLIP is meant to be handled (see link), but it performs decently enough to serve as a stop-gap until the proper handling is implemented. This way we also don't need to mute any assertions in the underlying tensor library.

See the README for instructions 👍
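For orientation only, here is a generic two-layer MLP projector in numpy. The shapes are assumptions, and this is neither the actual mlp_phi handling added to clip.cpp nor the reference Phi-3-vision projection:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as commonly used in CLIP-style MLPs
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

rng = np.random.default_rng(0)
patches = rng.standard_normal((576, 1024))     # 576 ViT patch embeddings
w0 = rng.standard_normal((1024, 3072)) * 0.02  # assumed projector shapes
w1 = rng.standard_normal((3072, 3072)) * 0.02

projected = gelu(patches @ w0) @ w1            # (576, 3072) tokens for the LLM
print(projected.shape)
```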

@@ -86,7 +86,7 @@ def bytes_to_unicode():
 ap.add_argument("--clip-model-is-openclip", action="store_true", required=False,
     help="The clip model is from openclip (for ViT-SO400M type))")
 ap.add_argument("--llava-projector", help="Path to llava.projector file. If specified, save an image encoder for LLaVA models.")
-ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2"], default="mlp")
+ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2", "mlp_phi"], default="mlp_phi")
Contributor

I'd not change the default

cmp-nct commented Jun 13, 2024

@farris I've just converted the phi3 model; that works nicely. The inference, however, is not working beyond the CLIP encoder. At first glance, something is not working as intended.

 .\build\bin\llava-cli.exe  -m Q:\models\llava\phi3\Phi-3-mini-128k-instruct\ggml-model-f16.gguf --mmproj Q:\models\llava\phi3\Phi-3-vision-128k-instruct\vit\mmproj-model-f16.gguf --image C:\temp\LICENSE_DEMO.jpg -c 4096  --temp .1 -p "Describe all visible content and respond in JSON" -ngl 99 -n 100 --ignore-eos
.....................................................................................
clip_model_load: model name:   vit-large336-custom
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    378
clip_model_load: n_kv:         25
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 25 key-value pairs and 378 tensors from Q:\models\llava\phi3\Phi-3-vision-128k-instruct\vit\mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = vit-large336-custom
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                        clip.projector_type str              = mlp_phi
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  16:           clip.vision.image_grid_pinpoints arr[i32,10]      = [336, 672, 672, 336, 672, 672, 1008, ...
clip_model_load: - kv  17:          clip.vision.image_crop_resolution u32              = 224
clip_model_load: - kv  18:             clip.vision.image_aspect_ratio str              = anyres
clip_model_load: - kv  19:         clip.vision.image_split_resolution u32              = 224
clip_model_load: - kv  20:            clip.vision.mm_patch_merge_type str              = spatial_unpad
clip_model_load: - kv  21:              clip.vision.mm_projector_type str              = mlp_phi
clip_model_load: - kv  22:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  23:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  24:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  236 tensors
clip_model_load: - type  f16:  142 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     597.49 MB
clip_model_load: metadata size:  0.14 MB
clip_model_load: params backend buffer size =  597.49 MB (378 tensors)
clip_model_load: compute allocated memory: 32.89 MB
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1536.00 MiB
llama_new_context_with_model: KV self size  = 1536.00 MiB, K (f16):  768.00 MiB, V (f16):  768.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   300.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: 5 segments encoded in   148.70 ms
encode_image_with_clip: image embedding created: 2880 tokens

encode_image_with_clip: image encoded in   164.38 ms by CLIP (    0.06 ms per image patch)



llama_print_timings:        load time =    3473.43 ms
llama_print_timings:      sample time =       0.03 ms /     1 runs   (    0.03 ms per token, 34482.76 tokens per second)
llama_print_timings: prompt eval time =     390.27 ms /  2928 tokens (    0.13 ms per token,  7502.44 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    3479.98 ms /  2929 tokens
At -b1, a ton of KQ tensor warnings come up (possibly unrelated):
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving

cmp-nct commented Jun 14, 2024

Update: it looks like an escape condition was hit in the llava-cli client; that's what caused the zero-token output. When using the phi3 template, it works.

I'm using the California driver's license image for an OCR test and this prompt:
<|user|>\n<image>\nProvide a complete list of all you can see, including all text (id, date, name, sig, etc)<|end|>\n<|assistant|>\n

I am now getting results that look quite promising; I would be even happier if the DL number in the license demo were not that flawed.
Web space:

The image shows a California driver's license. The visible text includes 'DL' followed by a number '11234568', 'EXP 08/31/2014', 'END NONE', 'LN CARDHOLDER', 'FNIMA', 'SAMPLE', '2570 24TH STREET ANYTOWN, CA 95818', 'DOB 08/31/1977', 'RSTR NONE', '08311977', 'DONOR', 'VETERAN', 'SEX F', 'HAIR BRN', 'EYES BRN', 'HGT 5'-05"', 'WGT 125 lb', 'ISS 08/31/2009', and 'DD 00/00/0000NNNNAN/ANFD/YY'. The license also features a signature at the bottom left corner that reads 'Ima Cordhollad'. The background of the license includes a graphic of a bear and a figure of a Native American.

llava-cli:

The image shows a driver's license from California. The license is partially visible with various pieces of information. The visible text includes 'California', 'DRIVER LICENSE', 'IMA CARDHOLDER', 'DL 1123434568', 'EXP 08/31/2014', 'END NONE', 'DL 08/31/1977', 'END NONE', '083111977', '08/31/2009', '08/31/2009', ...

From my POV we only need to remove the phi3 default in the Python script and, if possible, fix the graph issue (ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid); then this should be merged.

Contributor

Consider putting this entire logic into llava_surgery_v2.py.

@@ -86,7 +86,7 @@ def bytes_to_unicode():
 ap.add_argument("--clip-model-is-openclip", action="store_true", required=False,
     help="The clip model is from openclip (for ViT-SO400M type))")
 ap.add_argument("--llava-projector", help="Path to llava.projector file. If specified, save an image encoder for LLaVA models.")
-ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2"], default="mlp")
+ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2", "mlp_phi"], default="mlp_phi")
cmp-nct (Contributor) commented Jun 14, 2024

Change the default back to "mlp" and add the phi one to the "possible values" help text.
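A minimal sketch of that suggestion (keep "mlp" as the default and only extend the choices and help text with the phi variant); the parser setup here is added only to make the snippet self-contained:

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("--projector-type",
                help="Type of projector. Possible values: mlp, ldp, ldpv2, mlp_phi",
                choices=["mlp", "ldp", "ldpv2", "mlp_phi"], default="mlp")

print(ap.parse_args([]).projector_type)                               # mlp
print(ap.parse_args(["--projector-type", "mlp_phi"]).projector_type)  # mlp_phi
```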


mkdir phi3-vision
git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

Contributor

The directories won't match up at this point, as git clone creates its own subdirectory. Renaming them would be better, and the two mkdir calls could be removed.


5) Create the visual gguf model:
```console
python examples/llava/convert-image-encoder-to-gguf.py -m phi3-fun/phi3-vision/vit --llava-projector phi3-fun/phi3-vision/vit/llava.projector --output-dir phi3-fun/phi3-vision/vit --clip-model-is-vision
```
cmp-nct (Contributor) commented Jun 14, 2024

--projector-type mlp_phi
I don't think changing the config.json should still be required when specifying this; that would remove one necessary manual step from the list.


8) Invoke
```console
./llava-cli -m phi3-fun/phi3-base/ggml-model-f16.gguf --mmproj phi3-fun/phi3-vision/vit/mmproj-model-f16.gguf --image IMAGE -c 4096 --temp .1 -p "PROMPT"
```
Contributor

Templating should be recommended. The one below should be correct for phi3v:
<|user|>\n<image>\nPROMPT<|end|>\n<|assistant|>\n
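For reference, a tiny sketch of filling that template; the template string is quoted from the comment above, and the example prompt is the one used in the llava-cli invocation earlier in this thread:

```python
# Build a phi3v-style prompt from the recommended template.
TEMPLATE = "<|user|>\n<image>\n{prompt}<|end|>\n<|assistant|>\n"
print(TEMPLATE.format(prompt="Describe all visible content and respond in JSON"))
```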

farris commented Jun 16, 2024

@cmp-nct Thanks for taking a look, and glad that you got it to work.
I'll add the changes soon. I'm still experimenting with the projection layer, but perhaps we can merge this for now and open an issue for consistency between this and the HF implementation.

cmp-nct commented Jun 16, 2024

> @cmp-nct Thanks for taking a look, and glad that you got it to work. I'll add the changes soon. I'm still experimenting with the projection layer but perhaps we can merge this for now and open an issue for consistency between this and the hf implementation.

That was also my thought: let's get it merged, and if possible it would be great to fix the modelling issues from there on.
I have a test image I used for phi3 that is solved very well (not flawless, but impressive) on the HF space for phi3 vision, yet the fp16 ggml model (temp 0) fails at everything on the same picture. It barely sees anything correctly and mixes up a lot.
So something is wonky, maybe more than just the projector; our CLIP model itself might have an issue. I was not able to replicate the tensor input/output of our CLIP compared to the official reference.

https://i.ibb.co/zhh8wKn/calculation.png

cmp-nct commented Jun 18, 2024

@farris
I noticed another, likely quite serious, discrepancy.
Using the current method (due to the config used for the ViT) we have this set:
clip.vision.mm_patch_merge_type str = spatial_unpad
This enables llava-next/1.6 preprocessing (you can see it in the 1700 tokens generated).

I haven't installed the Python backend on my PC yet; I'm just looking at it:
https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/blob/main/image_processing_phi3_v.py

  1. padding_336() seems to pad with black instead of the MEAN color
  2. There is a hard-coded budget (in addition to config.json) of only 144 tokens per 336 pixels:
shapes = [[im.size[1], im.size[0]] for im in elems]
num_img_tokens = [int((h//336*w//336+1)*144 + 1 + (h//336+1)*12) for h, w in shapes]
  3. So the image appears to be reshaped to multiples of 336 in height/width, and each of those tiles gets 144 tokens (which for 4 such segments would give our usual 576 tokens, as in llava 1.5); see the worked example below
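
A worked example of the quoted token-count expression (the helper name is mine; the arithmetic is exactly the formula quoted above from image_processing_phi3_v.py):

```python
# Token budget per the quoted expression: 144 tokens per 336x336 tile,
# plus the extra "+1" and "*12" terms from the formula.
def num_img_tokens(h, w):
    return int((h // 336 * w // 336 + 1) * 144 + 1 + (h // 336 + 1) * 12)

print(num_img_tokens(672, 672))   # (2*2+1)*144 + 1 + (2+1)*12 = 757
print(num_img_tokens(336, 1008))  # (1*3+1)*144 + 1 + (1+1)*12 = 601
```

For comparison, the log above shows the current llava-next-style preprocessing producing a 2880-token image embedding for a single image.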

Based on that, I think our current handling of Microsoft's phi3-v is, sadly, completely wrong.

mann1x commented Jun 21, 2024

@cmp-nct As suggested by Lee Stott from the MS team, I tagged this thread on HF and kindly asked them to support the integration:

https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/discussions/40

Hopefully someone from the phi team will jump in to help and give some advice

chigkim commented Sep 1, 2024

+100! Any update from MS?
