
Performance bug: Android aarch64 Neon Performance Regression and i8mm Detection Issues in New Version of llama.cpp #10662


Closed
gustrd opened this issue Dec 4, 2024 · 11 comments

Comments

@gustrd
Contributor

gustrd commented Dec 4, 2024

Name and Version

version: 4248 (3b4f2e3) built with clang version 19.1.4 for aarch64-unknown-linux-android24

Operating systems

Linux

GGML backends

CPU

Hardware

Device: Zenfone 9 - Qualcomm® Snapdragon® 8+ Gen 1 Mobile Platform

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | AARCH64_REPACK = 1 |
lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              ARM
  Model name:           Cortex-A510
    Model:              3
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r0p3
    CPU(s) scaling MHz: 77%
    CPU max MHz:        2016.0000
    CPU min MHz:        307.2000
    BogoMIPS:           38.40
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
                        fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop
                        sha3 sm3 sm4 asimddp sha512 asimdfhm dit uscat
                        ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint
                        i8mm bf16 bti

Models

bartowski/Llama-3.2-3B-Instruct-GGUF
https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/tree/main

Problem description & steps to reproduce

Performance in the Old Version:
For Q4_0 and IQ4_NL, performance was normal and as expected, given that repacking was not applied in these cases.
The Q4_0_4_4 prompt processing performance was exceptional in the old version, significantly surpassing other formats.

Performance in the New Version:
The Q4_0_4_4 format now shows drastically reduced performance, falling below the levels of Q4_0 and IQ4_NL. This is a notable regression from the old version's behavior.
Repacking for Q4_0 and IQ4_NL appears to be ineffective in the new version. Instead of improving performance, it is slightly slower than the old version. This does not match the expectation that repacking should deliver at least the performance of the previous Q4_0_4_4 implementation.

i8mm Support Issue:
Even though lscpu indicates support for i8mm, llama.cpp does not detect or leverage this feature in the new version.

First Bad Commit

I could not pinpoint the first commit, but I found that before the Neon changes [f2f5c3b (4105)] I still had the expected performance.

Relevant log output

Previous version (3 weeks ago) - build: f2f5c3b6 (4105)
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          8.60 ± 0.95 |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          4.56 ± 0.79 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          9.21 ± 0.97 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          6.73 ± 1.10 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |         16.71 ± 0.96 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          5.67 ± 0.27 |

Current version (main yesterday) - build: 3b4f2e33 (4248)
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          6.42 ± 1.73 |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          2.59 ± 0.10 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          7.47 ± 0.88 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          4.12 ± 0.57 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          2.28 ± 0.32 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          1.12 ± 0.33 |
@slaren
Member

slaren commented Dec 4, 2024

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | AARCH64_REPACK = 1 |

This shows that the build does not have support for I8MM. You would need to add it to the arch flags to enable it.

@max-krasnyansky
Collaborator

I was going to ask how you are building it and suggest including i8mm in the CFLAGS.

@gustrd
Contributor Author

gustrd commented Dec 4, 2024

Thanks a lot, @slaren! As always, your insights are spot on.

Following the discussion in #402, I used the following architecture flags for my build:

Build Info:
Commit: 3b4f2e3 (4248)
Flags: -march=native+nosve

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |         55.83 ± 1.13 |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |         14.98 ± 0.17 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |         16.63 ± 1.07 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          7.19 ± 0.14 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |         51.25 ± 1.53 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |         14.52 ± 0.09 |

The performance gains are impressive across all quantization types! However, IQ4_NL still underperforms slightly, likely because it relies only on NEON instructions.

I also believe the Android build documentation could be enhanced to include detailed guidance on using architecture flags like -march=native+nosve. This would help others replicate and optimize performance on different platforms. If this sounds like a good idea, I’d be happy to open a PR to propose these changes.

@gustrd closed this as completed Dec 4, 2024
@slaren
Member

slaren commented Dec 4, 2024

I would like to add support for ARM for building multiple variants and loading automatically the best one, in a similar way this is done for x86 in #10606 / #10626. This should address this problem, and make it easier to distribute applications built with llama.cpp. I am not sure what variants would be good to include for ARM though, I would appreciate the insights of people with more experience with ARM.

@slaren
Member

slaren commented Dec 4, 2024

IQ4_NL seems to underperform slightly, likely because it still relies only on NEON instructions.

There is an implementation that can use __ARM_FEATURE_DOTPROD. If you enable that in your build, performance should be significantly better.
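
For context, these paths are selected at compile time via the ACLE feature macros, so the flag has to be present when the sources are built. A minimal sketch of that guard pattern, assuming aarch64 NEON intrinsics (not the actual llama.cpp kernel, and dot16_i8 is a hypothetical name):

    // Minimal sketch of the compile-time guard pattern (not the actual
    // llama.cpp kernel): dot product of 16 int8 pairs.
    #include <arm_neon.h>
    #include <stdint.h>

    int32_t dot16_i8(const int8_t * a, const int8_t * b) {
    #if defined(__ARM_FEATURE_DOTPROD)
        int8x16_t va  = vld1q_s8(a);
        int8x16_t vb  = vld1q_s8(b);
        int32x4_t acc = vdotq_s32(vdupq_n_s32(0), va, vb);  // 4 partial sums
        return vaddvq_s32(acc);                             // horizontal add
    #else
        int32_t sum = 0;                                    // scalar fallback
        for (int i = 0; i < 16; ++i) sum += (int32_t) a[i] * b[i];
        return sum;
    #endif
    }

Building with -march=armv8.2-a+dotprod (or any newer baseline that includes dotprod) defines __ARM_FEATURE_DOTPROD and enables the fast path.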

@max-krasnyansky
Collaborator

max-krasnyansky commented Dec 4, 2024

Build Info: Commit: 3b4f2e3 (4248) Flags: -march=native+nosve

Hmm. I don't think native makes sense for the cross-compiler builds. It means "auto-detect the host architecture".
You should use -march=armv8.6-a in your case.
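
For reference, the flag goes through the usual CMake variables; a rough sketch, assuming a CMake build either on-device (e.g. under Termux) or with the NDK toolchain, with paths and extra options depending on the setup:

    cmake -B build \
        -DCMAKE_C_FLAGS="-march=armv8.6-a" \
        -DCMAKE_CXX_FLAGS="-march=armv8.6-a"
    cmake --build build --config Release -j 4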

@max-krasnyansky
Collaborator

max-krasnyansky commented Dec 4, 2024

I would like to add support for ARM for building multiple variants and loading automatically the best one, in a similar way this is done for x86 in #10606 / #10626. This should address this problem, and make it easier to distribute applications built with llama.cpp. I am not sure what variants would be good to include for ARM though, I would appreciate the insights of people with more experience with ARM.

gcc manual page has a nice summary

      -march=name
           Specify the name of the target architecture and, optionally, one or more feature modifiers.  This option has the form -march=arch{+[no]feature}*.

           The table below summarizes the permissible values for arch and the features that they enable by default:

           arch value : Architecture : Includes by default
           armv8-a : Armv8-A : +fp, +simd
           armv8.1-a : Armv8.1-A : armv8-a, +crc, +lse, +rdma
           armv8.2-a : Armv8.2-A : armv8.1-a
           armv8.3-a : Armv8.3-A : armv8.2-a, +pauth
           armv8.4-a : Armv8.4-A : armv8.3-a, +flagm, +fp16fml, +dotprod
           armv8.5-a : Armv8.5-A : armv8.4-a, +sb, +ssbs, +predres
           armv8.6-a : Armv8.6-A : armv8.5-a, +bf16, +i8mm
           armv8.7-a : Armv8.7-A : armv8.6-a, +ls64
           armv8.8-a : Armv8.8-a : armv8.7-a, +mops
           armv9-a : Armv9-A : armv8.5-a, +sve, +sve2
           armv9.1-a : Armv9.1-A : armv9-a, +bf16, +i8mm
           armv9.2-a : Armv9.2-A : armv9.1-a, +ls64
           armv9.3-a : Armv9.3-A : armv9.2-a, +mops

Updated after looking at our CMake files

I'd probably pick the following most common flavors

  • armv6-m (raspberry pi 1)
  • armv7-m (raspberry pi 2)
  • armv7-a (older android phones)
  • armv8.2-a (older android phones)
  • armv8.4-a (dotprod)
  • armv8.6-a (i8mm)
  • armv9.1-a (sve & i8mm)

@a-ghorbani
Contributor

I would like to add support for ARM for building multiple variants and loading automatically the best one, in a similar way this is done for x86 in #10606 / #10626. This should address this problem, and make it easier to distribute applications built with llama.cpp. I am not sure what variants would be good to include for ARM though, I would appreciate the insights of people with more experience with ARM.

hey @slaren, I'm keen to hear whether you've had a chance to experiment with any ARM variants (e.g. the suggestions from @max-krasnyansky) and whether you've settled on specific ones that work well?

For reference, llama.rn builds these variants for different archs:

# ARM64 targets
    build_library("rnllama_v8_4_fp16_dotprod_sve" "-march=armv8.4-a+fp16+dotprod+sve")
    build_library("rnllama_v8_4_fp16_dotprod_i8mm_sve" "-march=armv8.4-a+fp16+dotprod+i8mm+sve")
    build_library("rnllama_v8_4_fp16_dotprod_i8mm" "-march=armv8.4-a+fp16+dotprod+i8mm")
    build_library("rnllama_v8_4_fp16_dotprod" "-march=armv8.4-a+fp16+dotprod")
    build_library("rnllama_v8_2_fp16_dotprod" "-march=armv8.2-a+fp16+dotprod")
    build_library("rnllama_v8_2_fp16" "-march=armv8.2-a+fp16")
    build_library("rnllama_v8" "-march=armv8-a")

This is also what we use under the hood in PocketPal.

@slaren
Member

slaren commented Jan 6, 2025

The missing part for supporting building multiple ARM variants is the feature detection. Basically, we need to implement a backend score function that can determine the ARM features supported in the current system, and compare them to the features included in the build. For x86, this is implemented in cpu-feats-x86.cpp. ARM seems to be a lot more complicated than x86 and strongly dependent on the operating system, and I don't really have the expertise or the hardware to test this. I will probably implement feature detection first for macOS Apple silicon since that's fairly straightforward and @ggerganov has access to several devices, but other platforms like Android will take more time and will likely depend on contributions from the community.
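
For illustration, a rough sketch of what the Linux/Android side of such a score function could look like, using the getauxval() HWCAP bits; the function name ggml_arm_cpu_score and the scoring scheme are hypothetical, not the actual llama.cpp code:

    // Rough sketch of runtime ARM feature detection on Linux/Android via
    // getauxval(); the bit values mirror the kernel's <asm/hwcap.h>.
    // ggml_arm_cpu_score() is a hypothetical name, not the llama.cpp API.
    #include <sys/auxv.h>

    #define HWCAP_ASIMDDP_BIT  (1UL << 20)   // FEAT_DotProd (AT_HWCAP)
    #define HWCAP_SVE_BIT      (1UL << 22)   // FEAT_SVE     (AT_HWCAP)
    #define HWCAP2_I8MM_BIT    (1UL << 13)   // FEAT_I8MM    (AT_HWCAP2)

    static int ggml_arm_cpu_score(void) {
        unsigned long hwcap  = getauxval(AT_HWCAP);
        unsigned long hwcap2 = getauxval(AT_HWCAP2);

        int score = 1;                                 // baseline: NEON
        if (hwcap  & HWCAP_ASIMDDP_BIT) score += 1;    // dotprod
        if (hwcap2 & HWCAP2_I8MM_BIT)   score += 1;    // i8mm
        if (hwcap  & HWCAP_SVE_BIT)     score += 1;    // sve
        return score;
    }

The build-time half would then compare the detected features against those each variant was compiled with, analogous to what cpu-feats-x86.cpp does for x86.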

@ggerganov
Member

Yup, let's start with Apple and later we will organize contributions to support more platforms.

Regarding the OS-dependent features, if I understand correctly, the CPU could report some feature as supported, but the OS could have it disabled for whatever reason. I wonder if we have encountered such a situation in practice? If it's something that does not happen often, we can probably ignore it for now?

@slaren
Member

slaren commented Jan 7, 2025

What I meant when I said that it depends on the OS is that, unlike x86, the ARM instructions for querying the supported features are privileged and cannot be used from user code, so it is necessary to go through system functions. Some features also need OS support, usually when new registers are added, since that is more data that needs to be saved in the thread context. SVE specifically seems to need to be explicitly enabled before it can be used.
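
On Apple silicon the same information is exposed through sysctl keys rather than readable ID registers; a small sketch, assuming the hw.optional.arm.* keys that macOS publishes:

    // Sketch of the macOS side: Apple silicon exposes feature bits as
    // sysctl keys instead of readable ID registers.
    #include <sys/sysctl.h>
    #include <stdio.h>

    static int has_feature(const char * name) {
        int value = 0;
        size_t size = sizeof(value);
        if (sysctlbyname(name, &value, &size, NULL, 0) != 0) {
            return 0;   // missing key -> treat the feature as unsupported
        }
        return value != 0;
    }

    int main(void) {
        printf("dotprod: %d\n", has_feature("hw.optional.arm.FEAT_DotProd"));
        printf("i8mm   : %d\n", has_feature("hw.optional.arm.FEAT_I8MM"));
        printf("bf16   : %d\n", has_feature("hw.optional.arm.FEAT_BF16"));
        return 0;
    }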
