PHI3-vision gguf conversion #7705

Open
farris wants to merge 2 commits into master from farris-phi3v

Conversation

farris commented Jun 3, 2024

This PR adds functionality to convert Phi-3-vision-128k-instruct to gguf format.

  • This heavily relies on existing scripts and logic under the llava/ directory
  • The process is as follows (a hedged sketch of steps 3-4 is included right after this list):
  1. Strip the CLIP encoder from the phi3v base model using the llava/ scripts
  2. Convert the CLIP encoder to gguf
  3. Take the decoder weights (language-model weights only) from phi3v and assign them to a regular Phi-3-instruct model
  4. Convert the result into a .gguf
  5. This seems to work pretty well 🚀; see the example below
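For anyone who wants to follow along, here is a hedged sketch of steps 3-4. The model IDs, the `model.vision_embed_tokens.` prefix filter, and the output path are my assumptions based on the Hugging Face checkpoints, not code from this PR:

```python
# Hedged sketch: copy the language-model weights from phi3-vision into a plain
# Phi-3-instruct checkpoint, then convert that checkpoint to .gguf as usual.
from transformers import AutoModelForCausalLM

vision = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct", trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True)

# Keep only the decoder (language-model) tensors; drop the CLIP/projector ones.
lm_weights = {k: v for k, v in vision.state_dict().items()
              if not k.startswith("model.vision_embed_tokens.")}
base.load_state_dict(lm_weights, strict=False)

# Step 4: save the result and run the existing HF-to-gguf conversion on it.
base.save_pretrained("phi3-base-with-phi3v-weights")
```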

Eventually this should be cleaned up and potentially separated from the llava/ directory; even though these models are basically the same, the nomenclature might be a bit confusing. @ggerganov

Lastly, I have very little C++ knowledge; I was only able to get this to work by muting some ggml library assertions, so this needs to be examined more closely.

phi3k-vision-github.mov

github-actions bot added the examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 3, 2024
mofosyne added the Review Complexity : Low (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) label on Jun 3, 2024
ggml.c (outdated diff)
-GGML_ASSERT(ggml_can_mul_mat(a, b));
-GGML_ASSERT(!ggml_is_transposed(a));
+// GGML_ASSERT(ggml_can_mul_mat(a, b));
+// GGML_ASSERT(!ggml_is_transposed(a));
Member

These asserts should not be removed. If you hit them, then there is most likely something wrong with the input data

Author

I think I found the issue:
Screenshot 2024-06-03 at 12 02 51 PM
tensor b should go from [1024, 576] to [4096, 576].

For LLaVA these dimensions are right:
Screenshot 2024-06-03 at 12 04 29 PM

But for phi3v we need the mm_projector weight tensor to be [4096, 576]:
Screenshot 2024-06-03 at 12 05 11 PM

Any idea on how to fix this?
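For context, here is a minimal sketch (not the ggml API) of the core shape condition behind ggml_can_mul_mat that makes the muted assertion fire. The projector-weight second dimension of 3072 is my assumption; the other shapes are taken from the screenshots above:

```python
# Simplified shape check mirroring ggml_can_mul_mat: the first ("ne[0]")
# dimension of both tensors must match (broadcasting dims are ignored here).
def can_mul_mat(ne_a, ne_b):
    return ne_a[0] == ne_b[0]

proj_weight = (4096, 3072)   # assumed phi3v mm_projector weight, ne[0] = 4096
clip_output = (1024, 576)    # CLIP output as shown above, ne[0] = 1024

print(can_mul_mat(proj_weight, clip_output))   # False -> GGML_ASSERT fires
print(can_mul_mat(proj_weight, (4096, 576)))   # True once b is [4096, 576]
```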

github-actions bot commented Jun 3, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 550 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8524.42ms p(95)=21641.03ms fails=, finish reason: stop=495 truncated=55
  • Prompt processing (pp): avg=91.62tk/s p(95)=369.15tk/s
  • Token generation (tg): avg=33.04tk/s p(95)=48.15tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=farris-phi3v commit=efeaeaf79fe855312b18f68c3760d727d42c9bbf

[Benchmark charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 550 iterations)]

farris commented Jun 5, 2024

I added some projection handling for phi3v in clip.cpp. This is not the actual way the tensor from CLIP is meant to be handled (see link), but it performs decently enough to serve as a stop-gap until the proper handling is implemented. This way we also don't need to mute any assertions in the underlying tensor library.

See the README for instructions 👍
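For orientation only, here is a generic two-layer MLP projector in numpy. The shapes are assumptions, and this is neither the actual mlp_phi handling added to clip.cpp nor the reference Phi-3-vision projection:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as commonly used in CLIP-style MLPs
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

rng = np.random.default_rng(0)
patches = rng.standard_normal((576, 1024))     # 576 ViT patch embeddings
w0 = rng.standard_normal((1024, 3072)) * 0.02  # assumed projector shapes
w1 = rng.standard_normal((3072, 3072)) * 0.02

projected = gelu(patches @ w0) @ w1            # (576, 3072) tokens for the LLM
print(projected.shape)
```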

@@ -86,7 +86,7 @@ def bytes_to_unicode():
 ap.add_argument("--clip-model-is-openclip", action="store_true", required=False,
     help="The clip model is from openclip (for ViT-SO400M type))")
 ap.add_argument("--llava-projector", help="Path to llava.projector file. If specified, save an image encoder for LLaVA models.")
-ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2"], default="mlp")
+ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2", "mlp_phi"], default="mlp_phi")
Contributor

I'd not change the default

cmp-nct commented Jun 13, 2024

@farris I've just converted the phi3 model; that works nicely. The inference, however, is not working beyond the CLIP encoder. At first glance, something is not working as intended.

 .\build\bin\llava-cli.exe  -m Q:\models\llava\phi3\Phi-3-mini-128k-instruct\ggml-model-f16.gguf --mmproj Q:\models\llava\phi3\Phi-3-vision-128k-instruct\vit\mmproj-model-f16.gguf --image C:\temp\LICENSE_DEMO.jpg -c 4096  --temp .1 -p "Describe all visible content and respond in JSON" -ngl 99 -n 100 --ignore-eos
.....................................................................................
clip_model_load: model name:   vit-large336-custom
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    378
clip_model_load: n_kv:         25
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 25 key-value pairs and 378 tensors from Q:\models\llava\phi3\Phi-3-vision-128k-instruct\vit\mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = vit-large336-custom
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                        clip.projector_type str              = mlp_phi
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  16:           clip.vision.image_grid_pinpoints arr[i32,10]      = [336, 672, 672, 336, 672, 672, 1008, ...
clip_model_load: - kv  17:          clip.vision.image_crop_resolution u32              = 224
clip_model_load: - kv  18:             clip.vision.image_aspect_ratio str              = anyres
clip_model_load: - kv  19:         clip.vision.image_split_resolution u32              = 224
clip_model_load: - kv  20:            clip.vision.mm_patch_merge_type str              = spatial_unpad
clip_model_load: - kv  21:              clip.vision.mm_projector_type str              = mlp_phi
clip_model_load: - kv  22:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  23:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  24:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  236 tensors
clip_model_load: - type  f16:  142 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     597.49 MB
clip_model_load: metadata size:  0.14 MB
clip_model_load: params backend buffer size =  597.49 MB (378 tensors)
clip_model_load: compute allocated memory: 32.89 MB
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1536.00 MiB
llama_new_context_with_model: KV self size  = 1536.00 MiB, K (f16):  768.00 MiB, V (f16):  768.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   300.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: 5 segments encoded in   148.70 ms
encode_image_with_clip: image embedding created: 2880 tokens

encode_image_with_clip: image encoded in   164.38 ms by CLIP (    0.06 ms per image patch)



llama_print_timings:        load time =    3473.43 ms
llama_print_timings:      sample time =       0.03 ms /     1 runs   (    0.03 ms per token, 34482.76 tokens per second)
llama_print_timings: prompt eval time =     390.27 ms /  2928 tokens (    0.13 ms per token,  7502.44 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    3479.98 ms /  2929 tokens
At -b1, a ton of KQ tensor warnings come up (possibly unrelated):
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving

cmp-nct commented Jun 14, 2024

Update: it looks like an escape condition was hit in the llava-cli client; that's what caused the zero-token output. When using the phi3 template, it works.

I'm using the California driver's license image for an OCR test and this prompt:
<|user|>\n<image>\nProvide a complete list of all you can see, including all text (id, date, name, sig, etc)<|end|>\n<|assistant|>\n

I am now getting results that look quite promising; I would be even happier if the DL number in the license demo were not that flawed.
Web space:

The image shows a California driver's license. The visible text includes 'DL' followed by a number '11234568', 'EXP 08/31/2014', 'END NONE', 'LN CARDHOLDER', 'FNIMA', 'SAMPLE', '2570 24TH STREET ANYTOWN, CA 95818', 'DOB 08/31/1977', 'RSTR NONE', '08311977', 'DONOR', 'VETERAN', 'SEX F', 'HAIR BRN', 'EYES BRN', 'HGT 5'-05"', 'WGT 125 lb', 'ISS 08/31/2009', and 'DD 00/00/0000NNNNAN/ANFD/YY'. The license also features a signature at the bottom left corner that reads 'Ima Cordhollad'. The background of the license includes a graphic of a bear and a figure of a Native American.

llava-cli:

The image shows a driver's license from California. The license is partially visible with various pieces of information. The visible text includes 'California', 'DRIVER LICENSE', 'IMA CARDHOLDER', 'DL 1123434568', 'EXP 08/31/2014', 'END NONE', 'DL 08/31/1977', 'END NONE', '083111977', '08/31/2009', '08/31/2009', ...

From my POV we only need to remove the phi3 default in the Python script and, if possible, fix the graph issue (ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid); then this should be merged.

Contributor

Consider putting this entire logic into llava_surgery_v2.py.

@@ -86,7 +86,7 @@ def bytes_to_unicode():
 ap.add_argument("--clip-model-is-openclip", action="store_true", required=False,
     help="The clip model is from openclip (for ViT-SO400M type))")
 ap.add_argument("--llava-projector", help="Path to llava.projector file. If specified, save an image encoder for LLaVA models.")
-ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2"], default="mlp")
+ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2", "mlp_phi"], default="mlp_phi")
cmp-nct (Contributor) commented Jun 14, 2024

Change the default back to "mlp" and add the phi one to the "possible values" help text.
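A minimal sketch of that suggestion (keep "mlp" as the default and only extend the choices and help text with the phi variant); the parser setup here is added only to make the snippet self-contained:

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("--projector-type",
                help="Type of projector. Possible values: mlp, ldp, ldpv2, mlp_phi",
                choices=["mlp", "ldp", "ldpv2", "mlp_phi"], default="mlp")

print(ap.parse_args([]).projector_type)                               # mlp
print(ap.parse_args(["--projector-type", "mlp_phi"]).projector_type)  # mlp_phi
```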


mkdir phi3-vision
git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

Contributor

The directories won't match up at this point, as git clone creates its own subdirectory. Renaming them would be better, and the two mkdir calls could be removed.


5) Create the visual gguf model:
```console
python examples/llava/convert-image-encoder-to-gguf.py -m phi3-fun/phi3-vision/vit --llava-projector phi3-fun/phi3-vision/vit/llava.projector --output-dir phi3-fun/phi3-vision/vit --clip-model-is-vision
```
cmp-nct (Contributor) commented Jun 14, 2024

--projector-type mlp_phi
I don't think changing the config.json should still be required when specifying this; that would remove one necessary manual step from the list.


8) Invoke
```console
./llava-cli -m phi3-fun/phi3-base/ggml-model-f16.gguf --mmproj phi3-fun/phi3-vision/vit/mmproj-model-f16.gguf --image IMAGE -c 4096 --temp .1 -p "PROMPT"
```
Contributor

Templating should be recommended. The one below should be correct for phi3v:
<|user|>\n<image>\nPROMPT<|end|>\n<|assistant|>\n
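For reference, a tiny sketch of filling that template; the template string is quoted from the comment above, and the example prompt is the one used in the llava-cli invocation earlier in this thread:

```python
# Build a phi3v-style prompt from the recommended template.
TEMPLATE = "<|user|>\n<image>\n{prompt}<|end|>\n<|assistant|>\n"
print(TEMPLATE.format(prompt="Describe all visible content and respond in JSON"))
```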

farris commented Jun 16, 2024

@cmp-nct Thanks for taking a look, and glad that you got it to work.
I'll add the changes soon. I'm still experimenting with the projection layer, but perhaps we can merge this for now and open an issue for consistency between this and the HF implementation.

cmp-nct commented Jun 16, 2024

> @cmp-nct Thanks for taking a look, and glad that you got it to work. I'll add the changes soon. I'm still experimenting with the projection layer but perhaps we can merge this for now and open an issue for consistency between this and the hf implementation.

That was also my thought: let's get it merged, and if possible it would be great to fix the modelling issues from there on.
I have a test image I used for phi3 that is solved very well (not flawless, but impressive) on the HF space for phi3 vision, yet the fp16 ggml model (temp 0) fails at everything on the same picture. It barely sees anything correctly and mixes up a lot.
So something is wonky, maybe more than just the projector; our CLIP model itself might have an issue. I was not able to replicate the tensor input/output of our CLIP compared to the official reference.

https://i.ibb.co/zhh8wKn/calculation.png

cmp-nct commented Jun 18, 2024

@farris
I noticed another, likely quite serious, discrepancy.
Using the current method (due to the config used for the ViT) we have this set:
clip.vision.mm_patch_merge_type str = spatial_unpad
This enables llava-next/1.6 preprocessing (you can see it in the 1700 tokens generated).

I haven't installed the Python backend on my PC yet; I'm just looking at it:
https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/blob/main/image_processing_phi3_v.py

  1. padding_336() seems to pad with black instead of the MEAN color
  2. There is a hard-coded budget (in addition to config.json) of only 144 tokens per 336 pixels:
shapes = [[im.size[1], im.size[0]] for im in elems]
num_img_tokens = [int((h//336*w//336+1)*144 + 1 + (h//336+1)*12) for h, w in shapes]
  3. So the image appears to be reshaped to multiples of 336 in height/width, and each of those tiles gets 144 tokens (which for 4 such segments would give our usual 576 tokens, as in llava 1.5); see the worked example below
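
A worked example of the quoted token-count expression (the helper name is mine; the arithmetic is exactly the formula quoted above from image_processing_phi3_v.py):

```python
# Token budget per the quoted expression: 144 tokens per 336x336 tile,
# plus the extra "+1" and "*12" terms from the formula.
def num_img_tokens(h, w):
    return int((h // 336 * w // 336 + 1) * 144 + 1 + (h // 336 + 1) * 12)

print(num_img_tokens(672, 672))   # (2*2+1)*144 + 1 + (2+1)*12 = 757
print(num_img_tokens(336, 1008))  # (1*3+1)*144 + 1 + (1+1)*12 = 601
```

For comparison, the log above shows the current llava-next-style preprocessing producing a 2880-token image embedding for a single image.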

Based on that, I think our current handling of Microsoft's phi3-v is, sadly, completely wrong.

mann1x commented Jun 21, 2024

@cmp-nct As suggested by Lee Stott from the MS team, I tagged this thread on HF and kindly asked them to support the integration:

https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/discussions/40

Hopefully someone from the phi team will jump in to help and give some advice

chigkim commented Sep 1, 2024

+100! Any update from MS?
