PHI3-vision gguf conversion #7705
base: master
Conversation
ggml.c (outdated diff)
-    GGML_ASSERT(ggml_can_mul_mat(a, b));
-    GGML_ASSERT(!ggml_is_transposed(a));
+    // GGML_ASSERT(ggml_can_mul_mat(a, b));
+    // GGML_ASSERT(!ggml_is_transposed(a));
These asserts should not be removed. If you hit them, then there is most likely something wrong with the input data
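For reference, here is a rough Python sketch of the conditions those two asserts encode, as I understand ggml's mul_mat requirements (the exact broadcasting rules are my assumption, not quoted from ggml.c):

```python
# Sketch of the shape checks the removed asserts stand for (assumed semantics).
# ggml stores a tensor's shape as ne[0..3]; mul_mat needs the shared inner
# dimension to match and the outer batch dims of b to broadcast over a.

def can_mul_mat(ne_a, ne_b):
    return (
        ne_a[0] == ne_b[0]           # shared inner dimension must match
        and ne_b[2] % ne_a[2] == 0   # batch dims must be broadcastable
        and ne_b[3] % ne_a[3] == 0
    )

# e.g. a 1024 -> 4096 projection applied to 576 image-patch embeddings of width 1024
assert can_mul_mat((1024, 4096, 1, 1), (1024, 576, 1, 1))

# If this check fails, the tensors feeding the mat-mul were built with the
# wrong shapes; commenting out the assert only hides that upstream problem.
```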
This might be it. I think it will have to be implemented in clip.cpp
https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/blob/main/image_embedding_phi3_v.py#L219:~:text=%23%20(num_crops%2C%2012%2C%202,num_img_tokens%2C%201024*4)
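If I'm reading the linked reference right, the HF code merges every 2x2 block of the 24x24 CLIP patch grid into one token of width 1024*4 before the projection MLP. Here is a hedged numpy sketch of that reshape; the shapes and the crop stitching are my reading of the reference, not verified against the checkpoint:

```python
import numpy as np

def hd_2x2_merge(feat, h_crop=1, w_crop=1):
    """feat: (num_crops, 576, 1024) CLIP patch features on a 24x24 grid.
    Returns (num_images, h_crop*12 * w_crop*12, 4096)."""
    n, l, c = feat.shape
    h = w = int(l ** 0.5)                       # 24x24 patch grid
    feat = feat.reshape(n, h // 2, 2, w // 2, 2, c)
    # concatenate each 2x2 block of patches along the channel dim -> width 4*c
    feat = feat.transpose(0, 1, 3, 2, 4, 5).reshape(n, h // 2, w // 2, 4 * c)
    # stitch the individual crops back into one global grid per image
    num_images = n // (h_crop * w_crop)
    feat = feat.reshape(num_images, h_crop, w_crop, h // 2, w // 2, 4 * c)
    feat = feat.transpose(0, 1, 3, 2, 4, 5)
    return feat.reshape(num_images, h_crop * (h // 2) * w_crop * (w // 2), 4 * c)

print(hd_2x2_merge(np.zeros((1, 576, 1024))).shape)  # (1, 144, 4096)
```

If that is what the reference does, clip.cpp would need an equivalent permutation before the projector rather than a plain flatten.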
I added some projection handling for phi3v; see the README for instructions 👍
@@ -86,7 +86,7 @@ def bytes_to_unicode():
 ap.add_argument("--clip-model-is-openclip", action="store_true", required=False,
                 help="The clip model is from openclip (for ViT-SO400M type))")
 ap.add_argument("--llava-projector", help="Path to llava.projector file. If specified, save an image encoder for LLaVA models.")
-ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2"], default="mlp")
+ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2", "mlp_phi"], default="mlp_phi")
I'd not change the default
@farris I've just converted the phi3 model; that works nicely.
Update: I'm using the California driver's license image for an OCR test and this prompt: I'm now getting results that look quite promising; I'd be really happy if the DL number in the license demo weren't so flawed.
llava-cli:
From my POV we only need to remove the phi3 default in the Python script and, if possible, fix the graph (
Consider putting this entire logic into llava_surgery_v2.py.
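Something along these lines could live there; this is only a hedged sketch, and the phi3-v key prefixes under model.vision_embed_tokens.* are my assumption from skimming the HF repo and would need checking against the actual checkpoint (which may also ship as safetensors shards rather than .bin files):

```python
import glob

import torch  # the surgery scripts already use torch

# Assumed phi3-v tensor name prefixes; verify against the real checkpoint.
VISION_PREFIX = "model.vision_embed_tokens.img_processor."
PROJ_PREFIX = "model.vision_embed_tokens.img_projection."

def split_phi3v_checkpoint(model_dir: str):
    """Collect the vision-tower and projector tensors so they can be saved
    as llava.clip / llava.projector, mirroring what the surgery script does
    for regular LLaVA checkpoints."""
    vision, projector = {}, {}
    for path in sorted(glob.glob(f"{model_dir}/pytorch_model-*.bin")):
        state_dict = torch.load(path, map_location="cpu")
        for name, tensor in state_dict.items():
            if name.startswith(PROJ_PREFIX):
                projector[name] = tensor
            elif name.startswith(VISION_PREFIX):
                vision[name] = tensor
    torch.save(projector, f"{model_dir}/llava.projector")
    torch.save(vision, f"{model_dir}/llava.clip")
    return vision, projector
```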
@@ -86,7 +86,7 @@ def bytes_to_unicode():
 ap.add_argument("--clip-model-is-openclip", action="store_true", required=False,
                 help="The clip model is from openclip (for ViT-SO400M type))")
 ap.add_argument("--llava-projector", help="Path to llava.projector file. If specified, save an image encoder for LLaVA models.")
-ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2"], default="mlp")
+ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2", "mlp_phi"], default="mlp_phi")
Change the default back to "mlp" and add the phi one to the "possible values" help text.
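In other words, something like this (a sketch of the suggested argparse change; mlp_phi is just the name the PR already introduces):

```python
import argparse

ap = argparse.ArgumentParser()
# ... other existing arguments stay unchanged ...
ap.add_argument(
    "--projector-type",
    help="Type of projector. Possible values: mlp, ldp, ldpv2, mlp_phi",
    choices=["mlp", "ldp", "ldpv2", "mlp_phi"],
    default="mlp",  # keep the existing default; phi3-v users pass --projector-type mlp_phi explicitly
)
print(ap.parse_args([]).projector_type)  # -> mlp
```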
mkdir phi3-vision
git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
The directories won't match up at this point, since `git clone` creates its own subdirectory. Renaming them would be better, and the two `mkdir` calls could be removed.
5) Create the visual gguf model:
```console
python examples/llava/convert-image-encoder-to-gguf.py -m phi3-fun/phi3-vision/vit --llava-projector phi3-fun/phi3-vision/vit/llava.projector --output-dir phi3-fun/phi3-vision/vit --clip-model-is-vision
```
Add `--projector-type mlp_phi` here. I don't think changing config.json should still be required when this is specified; that would remove one necessary manual step from the list.
8) Invoke:
```console
./llava-cli -m phi3-fun/phi3-base/ggml-model-f16.gguf --mmproj phi3-fun/phi3-vision/vit/mmproj-model-f16.gguf --image IMAGE -c 4096 --temp .1 -p "PROMPT"
```
Templating should be recommended. The one below should be correct for phi3v:
`<|user|>\n<image>\nPROMPT<|end|>\n<|assistant|>\n`
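For illustration, a tiny sketch of what the expanded prompt string would look like (plain string assembly only; it assumes the template above is the right phi3-v chat format and that llava-cli substitutes the image embeddings at the `<image>` marker):

```python
def phi3v_prompt(user_prompt: str) -> str:
    # Assumed phi3-v chat template; <image> marks where the image embeddings go.
    return f"<|user|>\n<image>\n{user_prompt}<|end|>\n<|assistant|>\n"

print(phi3v_prompt("What is the DL number on this license?"))
```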
@cmp-nct Thanks for taking a look, and glad that you got it to work.
That was also my thought. Let's get it merged; if possible, it would be great to fix the modelling issues from there on.
@farris I've not installed the Python backend on my PC yet; I'm just looking at it:
Based on that, I think our current handling of Microsoft's phi3-v is completely wrong, sadly.
@cmp-nct https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/discussions/40 Hopefully someone from the phi team will jump in to help and give some advice.
+100! Any update from MS?
This PR adds functionality to convert Phi-3-vision-128k-instruct to gguf format.
Eventually this should be cleaned up and potentially separated from the LLaVA directory; even though these models are basically the same, the nomenclature might be a bit confusing @ggerganov
Lastly, I have very little C++ knowledge. I was able to get this to work by muting some ggml library assertions, but this needs to be examined more closely.
phi3k-vision-github.mov