Performance bug: Android aarch64 Neon Performance Regression and i8mm Detection Issues in New Version of llama.cpp #10662
Comments
This shows that the build does not have support for I8MM. You would need to add it to the arch flags to enable it.
I was going to ask how you are building it and suggest including i8mm in CFLAGS.
Thanks a lot, @slaren! As always, your insights are spot on. Following the discussion in #402, I used the following architecture flags for my build:
Build Info:
The performance gains are impressive across all model types! However, IQ4_NL seems to underperform slightly, likely because it still relies only on NEON instructions. I also believe the Android build documentation could be enhanced to include detailed guidance on using architecture flags like -march=native+nosve. This would help others replicate and optimize performance on different platforms. If this sounds like a good idea, I'd be happy to open a PR to propose these changes.
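One quick way to verify that a feature like i8mm actually made it into the build is to check the ACLE feature-test macros the compiler defines. A minimal sketch in C++ (the macros are standard ACLE; the program itself is illustrative and not part of llama.cpp):

```cpp
// Prints which ARM features the *compiler* enabled for this translation
// unit. These macros are set by -march/-mcpu flags, so if i8mm is missing
// here, it will be missing from the llama.cpp build as well.
#include <cstdio>

int main() {
#ifdef __ARM_FEATURE_MATMUL_INT8
    std::puts("i8mm:    enabled at compile time");
#else
    std::puts("i8mm:    NOT enabled");
#endif
#ifdef __ARM_FEATURE_DOTPROD
    std::puts("dotprod: enabled at compile time");
#else
    std::puts("dotprod: NOT enabled");
#endif
#ifdef __ARM_FEATURE_SVE
    std::puts("sve:     enabled at compile time");
#else
    std::puts("sve:     NOT enabled");
#endif
    return 0;
}
```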
I would like to add support for ARM for building multiple variants and automatically loading the best one, similar to what is done for x86 in #10606 / #10626. This should address this problem and make it easier to distribute applications built with llama.cpp. I am not sure what variants would be good to include for ARM, though; I would appreciate the insights of people with more experience with ARM.
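A sketch of the scoring idea under discussion (the enum and function below are hypothetical, not llama.cpp's actual API): each build variant records the CPU features it was compiled for, any variant requiring a feature the running CPU lacks is rejected, and among the rest the variant using the most features wins.

```cpp
#include <cstdint>

// Hypothetical feature bitmask; names are illustrative only.
enum arm_feature : uint32_t {
    FEAT_DOTPROD = 1u << 0,
    FEAT_I8MM    = 1u << 1,
    FEAT_SVE     = 1u << 2,
};

// Returns -1 if the variant requires a feature the CPU lacks (it would
// raise SIGILL), otherwise the number of accelerated features it can use.
// The loader picks the variant with the highest score.
static int score_variant(uint32_t variant_features, uint32_t cpu_features) {
    if (variant_features & ~cpu_features) {
        return -1;
    }
    int score = 0;
    for (uint32_t f = variant_features; f != 0; f &= f - 1) {
        score++; // popcount of the features this variant actually uses
    }
    return score;
}
```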
There is an implementation that can use …
Hmm. I don't think …
The gcc manual page has a nice summary:
Updated: after looking at our CMake files, I'd probably pick the following most common flavors:
Hey @slaren, I'm keen to hear if you've had the chance to experiment with any ARM variants (e.g. the suggestions from @max-krasnyansky) and whether you've settled on specific ones that work well? For reference, …
This is also what we use under the hood in PocketPal.
The missing part for supporting building multiple ARM variants is the feature detection. Basically, we need to implement a backend score function that can determine the ARM features supported in the current system and compare them to the features included in the build. For x86, this is implemented in …
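On Linux and Android, the usual way around this (see the next comment on privileged instructions) is getauxval(): the kernel reports supported features through the auxiliary vector, already filtered by what the OS enables. A minimal detection sketch, assuming an aarch64 Linux/Android target (the HWCAP bit names come from <asm/hwcap.h>):

```cpp
#include <cstdio>
#include <sys/auxv.h>   // getauxval, AT_HWCAP, AT_HWCAP2
#include <asm/hwcap.h>  // HWCAP_* / HWCAP2_* bits (aarch64 Linux)

int main() {
    const unsigned long hwcap  = getauxval(AT_HWCAP);
    const unsigned long hwcap2 = getauxval(AT_HWCAP2);

    // These bits are set only when both the CPU and the kernel support
    // the feature, which sidesteps the privileged-register problem.
    std::printf("dotprod: %d\n", (hwcap  & HWCAP_ASIMDDP) != 0);
    std::printf("sve:     %d\n", (hwcap  & HWCAP_SVE)     != 0);
    std::printf("i8mm:    %d\n", (hwcap2 & HWCAP2_I8MM)   != 0);
    return 0;
}
```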
Yup, let's start with Apple and later we will organize contributions to support more platforms. Regarding the OS-dependent features, if I understand correctly, the CPU could report a feature as supported, but the OS could have it disabled for whatever reason. I wonder if we have encountered such a situation in practice? If it's something that does not happen often, we can probably ignore this for now?
What I meant when I said that it depends on the OS is that, unlike x86, the ARM instructions to obtain the supported features are privileged; they cannot be used from user code, so it is necessary to use system functions. Some features also need OS support, usually when new registers are added, since that's more data that needs to be saved in the thread context. SVE specifically seems like it needs to be explicitly enabled before it can be used.
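On Apple platforms the equivalent OS-mediated query is sysctlbyname(). A sketch, assuming a recent macOS where the hw.optional.arm.* keys are exposed (exact key availability varies by OS version):

```cpp
#include <cstdio>
#include <sys/sysctl.h>  // sysctlbyname (Apple platforms)

// Returns 1 if the OS reports the named optional feature,
// 0 if it is absent or the key does not exist on this OS version.
static int has_feature(const char * name) {
    int value = 0;
    size_t size = sizeof(value);
    if (sysctlbyname(name, &value, &size, nullptr, 0) != 0) {
        return 0;
    }
    return value != 0;
}

int main() {
    std::printf("dotprod: %d\n", has_feature("hw.optional.arm.FEAT_DotProd"));
    std::printf("i8mm:    %d\n", has_feature("hw.optional.arm.FEAT_I8MM"));
    return 0;
}
```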
Name and Version
version: 4248 (3b4f2e3) built with clang version 19.1.4 for aarch64-unknown-linux-android24
Operating systems
Linux
GGML backends
CPU
Hardware
Device: Zenfone 9 (Qualcomm® Snapdragon® 8+ Gen 1 Mobile Platform)
Models
bartowski/Llama-3.2-3B-Instruct-GGUF
https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/tree/main
Problem description & steps to reproduce
Performance in the Old Version:
For Q4_0 and IQ4_NL, performance was normal and as expected, given that repacking was not applied in these cases.
The Q4_0_4_4 prompt processing performance was exceptional in the old version, significantly surpassing other formats.
Performance in the New Version:
The Q4_0_4_4 format now shows drastically reduced performance, falling below the levels of Q4_0 and IQ4_NL. This is a notable regression from the old version's behavior.
Repacking for Q4_0 and IQ4_NL appears to be ineffective in the new version. Instead of improving performance, it is slightly slower than the old version. This does not match the expectation that repacking should offer at least similar performance to the previous Q4_0_4_4 implementation.
i8mm Support Issue:
Even though lscpu indicates support for i8mm, llama.cpp does not detect or leverage this feature in the new version.
First Bad Commit
I could not pinpoint the first commit, but I found that before the Neon changes [f2f5c3b (4105)] I still had the expected performance.
Relevant log output