
Performance bug: Android aarch64 Neon Performance Regression and i8mm Detection Issues in New Version of llama.cpp #10662


Closed
gustrd opened this issue Dec 4, 2024 · 11 comments

Comments

@gustrd
Contributor

gustrd commented Dec 4, 2024

Name and Version

version: 4248 (3b4f2e3) built with clang version 19.1.4 for aarch64-unknown-linux-android24

Operating systems

Linux

GGML backends

CPU

Hardware

Device: Zenfone 9 - Qualcomm® Snapdragon® 8+ Gen 1 Mobile Platform

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | AARCH64_REPACK = 1 |
lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              ARM
  Model name:           Cortex-A510
    Model:              3
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r0p3
    CPU(s) scaling MHz: 77%
    CPU max MHz:        2016.0000
    CPU min MHz:        307.2000
    BogoMIPS:           38.40
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
                        fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop
                        sha3 sm3 sm4 asimddp sha512 asimdfhm dit uscat
                        ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint
                        i8mm bf16 bti

Models

bartowski/Llama-3.2-3B-Instruct-GGUF
https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/tree/main

Problem description & steps to reproduce

Performance in the Old Version:
For Q4_0 and IQ4_NL, performance was normal and as expected, given that repacking was not applied in these cases.
The Q4_0_4_4 prompt processing performance was exceptional in the old version, significantly surpassing other formats.

Performance in the New Version:
The Q4_0_4_4 format now shows drastically reduced performance, falling below the levels of Q4_0 and IQ4_NL. This is a notable regression from the old version's behavior.
Repacking for Q4_0 and IQ4_NL appears to be ineffective in the new version. Instead of improving performance, it is slightly slower than the old version. This does not match the expectation that repacking should deliver at least the performance of the previous Q4_0_4_4 implementation.

i8mm Support Issue:
Even though lscpu indicates support for i8mm, llama.cpp does not detect or leverage this feature in the new version.

First Bad Commit

I could not pinpoint the first commit, but I found that before the Neon changes [f2f5c3b (4105)] I still had the expected performance.

Relevant log output

Previous version (3 weeks ago) - build: f2f5c3b6 (4105)
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          8.60 ± 0.95 |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          4.56 ± 0.79 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          9.21 ± 0.97 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          6.73 ± 1.10 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |         16.71 ± 0.96 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          5.67 ± 0.27 |

Current version (main yesterday) - build: 3b4f2e33 (4248)
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          6.42 ± 1.73 |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          2.59 ± 0.10 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          7.47 ± 0.88 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          4.12 ± 0.57 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          2.28 ± 0.32 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          1.12 ± 0.33 |
@slaren
Member

slaren commented Dec 4, 2024

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | AARCH64_REPACK = 1 |

This shows that the build does not have support for I8MM. You would need to add it to the arch flags to enable it.

@max-krasnyansky
Collaborator

I was going to ask how you are building it and suggest including i8mm in the CFLAGS.

@gustrd
Contributor Author

gustrd commented Dec 4, 2024

Thanks a lot, @slaren! As always, your insights are spot on.

Following the discussion in #402, I used the following architecture flags for my build:

Build Info:
Commit: 3b4f2e3 (4248)
Flags: -march=native+nosve

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |         55.83 ± 1.13 |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |         14.98 ± 0.17 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |         16.63 ± 1.07 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          7.19 ± 0.14 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |         51.25 ± 1.53 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |         14.52 ± 0.09 |

The performance gains are impressive across all quantization types! However, IQ4_NL still underperforms slightly, likely because it relies only on NEON instructions.

I also believe the Android build documentation could be enhanced to include detailed guidance on using architecture flags like -march=native+nosve. This would help others replicate and optimize performance on different platforms. If this sounds like a good idea, I’d be happy to open a PR to propose these changes.

@gustrd closed this as completed Dec 4, 2024
@slaren
Member

slaren commented Dec 4, 2024

I would like to add support for ARM for building multiple variants and loading automatically the best one, in a similar way this is done for x86 in #10606 / #10626. This should address this problem, and make it easier to distribute applications built with llama.cpp. I am not sure what variants would be good to include for ARM though, I would appreciate the insights of people with more experience with ARM.

@slaren
Member

slaren commented Dec 4, 2024

IQ4_NL seems to underperform slightly, likely because it still relies only on NEON instructions.

There is an implementation that can use __ARM_FEATURE_DOTPROD. If you enable that in your build, performance should be significantly better.
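
For context, these paths are selected at compile time via the ACLE feature macros, so the flag has to be present when the sources are built. A minimal sketch of that guard pattern, assuming aarch64 NEON intrinsics (not the actual llama.cpp kernel, and dot16_i8 is a hypothetical name):

    // Minimal sketch of the compile-time guard pattern (not the actual
    // llama.cpp kernel): dot product of 16 int8 pairs.
    #include <arm_neon.h>
    #include <stdint.h>

    int32_t dot16_i8(const int8_t * a, const int8_t * b) {
    #if defined(__ARM_FEATURE_DOTPROD)
        int8x16_t va  = vld1q_s8(a);
        int8x16_t vb  = vld1q_s8(b);
        int32x4_t acc = vdotq_s32(vdupq_n_s32(0), va, vb);  // 4 partial sums
        return vaddvq_s32(acc);                             // horizontal add
    #else
        int32_t sum = 0;                                    // scalar fallback
        for (int i = 0; i < 16; ++i) sum += (int32_t) a[i] * b[i];
        return sum;
    #endif
    }

Building with -march=armv8.2-a+dotprod (or any newer baseline that includes dotprod) defines __ARM_FEATURE_DOTPROD and enables the fast path.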

@max-krasnyansky
Collaborator

max-krasnyansky commented Dec 4, 2024

Build Info: Commit: 3b4f2e3 (4248) Flags: -march=native+nosve

Hmm. I don't think native makes sense for the cross-compiler builds. It means "auto-detect the host architecture".
You should use -march=armv8.6-a in your case.
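
For reference, the flag goes through the usual CMake variables; a rough sketch, assuming a CMake build either on-device (e.g. under Termux) or with the NDK toolchain, with paths and extra options depending on the setup:

    cmake -B build \
        -DCMAKE_C_FLAGS="-march=armv8.6-a" \
        -DCMAKE_CXX_FLAGS="-march=armv8.6-a"
    cmake --build build --config Release -j 4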

@max-krasnyansky
Collaborator

max-krasnyansky commented Dec 4, 2024

I would like to add support for ARM for building multiple variants and loading automatically the best one, in a similar way this is done for x86 in #10606 / #10626. This should address this problem, and make it easier to distribute applications built with llama.cpp. I am not sure what variants would be good to include for ARM though, I would appreciate the insights of people with more experience with ARM.

gcc manual page has a nice summary

      -march=name
           Specify the name of the target architecture and, optionally, one or more feature modifiers.  This option has the form -march=arch{+[no]feature}*.

           The table below summarizes the permissible values for arch and the features that they enable by default:

           arch value : Architecture : Includes by default
           armv8-a : Armv8-A : +fp, +simd
           armv8.1-a : Armv8.1-A : armv8-a, +crc, +lse, +rdma
           armv8.2-a : Armv8.2-A : armv8.1-a
           armv8.3-a : Armv8.3-A : armv8.2-a, +pauth
           armv8.4-a : Armv8.4-A : armv8.3-a, +flagm, +fp16fml, +dotprod
           armv8.5-a : Armv8.5-A : armv8.4-a, +sb, +ssbs, +predres
           armv8.6-a : Armv8.6-A : armv8.5-a, +bf16, +i8mm
           armv8.7-a : Armv8.7-A : armv8.6-a, +ls64
           armv8.8-a : Armv8.8-a : armv8.7-a, +mops
           armv9-a : Armv9-A : armv8.5-a, +sve, +sve2
           armv9.1-a : Armv9.1-A : armv9-a, +bf16, +i8mm
           armv9.2-a : Armv9.2-A : armv9.1-a, +ls64
           armv9.3-a : Armv9.3-A : armv9.2-a, +mops

Updated after looking at our CMake files

I'd probably pick the following most common flavors

  • armv6-m (raspberry pi 1)
  • armv7-m (raspberry pi 2)
  • armv7-a (older android phones)
  • armv8.2-a (older android phones)
  • armv8.4-a (dotprod)
  • armv8.6-a (i8mm)
  • armv9.1-a (sve & i8mm)

@a-ghorbani
Contributor

I would like to add support for ARM for building multiple variants and loading automatically the best one, in a similar way this is done for x86 in #10606 / #10626. This should address this problem, and make it easier to distribute applications built with llama.cpp. I am not sure what variants would be good to include for ARM though, I would appreciate the insights of people with more experience with ARM.

hey @slaren, I'm keen to hear whether you've had a chance to experiment with any ARM variants (e.g. the suggestions from @max-krasnyansky) and whether you've settled on specific ones that work well?

For reference, llama.rn builds these variants for different archs:

# ARM64 targets
    build_library("rnllama_v8_4_fp16_dotprod_sve" "-march=armv8.4-a+fp16+dotprod+sve")
    build_library("rnllama_v8_4_fp16_dotprod_i8mm_sve" "-march=armv8.4-a+fp16+dotprod+i8mm+sve")
    build_library("rnllama_v8_4_fp16_dotprod_i8mm" "-march=armv8.4-a+fp16+dotprod+i8mm")
    build_library("rnllama_v8_4_fp16_dotprod" "-march=armv8.4-a+fp16+dotprod")
    build_library("rnllama_v8_2_fp16_dotprod" "-march=armv8.2-a+fp16+dotprod")
    build_library("rnllama_v8_2_fp16" "-march=armv8.2-a+fp16")
    build_library("rnllama_v8" "-march=armv8-a")

This is also what we use under the hood in PocketPal.

@slaren
Member

slaren commented Jan 6, 2025

The missing part for supporting building multiple ARM variants is the feature detection. Basically, we need to implement a backend score function that can determine the ARM features supported in the current system, and compare them to the features included in the build. For x86, this is implemented in cpu-feats-x86.cpp. ARM seems to be a lot more complicated than x86 and strongly dependent on the operating system, and I don't really have the expertise or the hardware to test this. I will probably implement feature detection first for macOS Apple silicon since that's fairly straightforward and @ggerganov has access to several devices, but other platforms like Android will take more time and will likely depend on contributions from the community.
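
For illustration, a rough sketch of what the Linux/Android side of such a score function could look like, using the getauxval() HWCAP bits; the function name ggml_arm_cpu_score and the scoring scheme are hypothetical, not the actual llama.cpp code:

    // Rough sketch of runtime ARM feature detection on Linux/Android via
    // getauxval(); the bit values mirror the kernel's <asm/hwcap.h>.
    // ggml_arm_cpu_score() is a hypothetical name, not the llama.cpp API.
    #include <sys/auxv.h>

    #define HWCAP_ASIMDDP_BIT  (1UL << 20)   // FEAT_DotProd (AT_HWCAP)
    #define HWCAP_SVE_BIT      (1UL << 22)   // FEAT_SVE     (AT_HWCAP)
    #define HWCAP2_I8MM_BIT    (1UL << 13)   // FEAT_I8MM    (AT_HWCAP2)

    static int ggml_arm_cpu_score(void) {
        unsigned long hwcap  = getauxval(AT_HWCAP);
        unsigned long hwcap2 = getauxval(AT_HWCAP2);

        int score = 1;                                 // baseline: NEON
        if (hwcap  & HWCAP_ASIMDDP_BIT) score += 1;    // dotprod
        if (hwcap2 & HWCAP2_I8MM_BIT)   score += 1;    // i8mm
        if (hwcap  & HWCAP_SVE_BIT)     score += 1;    // sve
        return score;
    }

The build-time half would then compare the detected features against those each variant was compiled with, analogous to what cpu-feats-x86.cpp does for x86.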

@ggerganov
Member

Yup, let's start with Apple and later we will organize contributions to support more platforms.

Regarding the OS-dependent features, if I understand correctly, the CPU could report some feature as supported, but the OS could have it disabled for whatever reason. I wonder if we have encountered such a situation in practice? If it's something that does not happen often, we can probably ignore it for now?

@slaren
Member

slaren commented Jan 7, 2025

What I meant when I said that it depends on the OS is that, unlike x86, the ARM instructions for querying the supported features are privileged and cannot be used from user code, so it is necessary to go through system functions. Some features also need OS support, usually when new registers are added, since that is more data that needs to be saved in the thread context. SVE specifically seems to need to be explicitly enabled before it can be used.
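
On Apple silicon the same information is exposed through sysctl keys rather than readable ID registers; a small sketch, assuming the hw.optional.arm.* keys that macOS publishes:

    // Sketch of the macOS side: Apple silicon exposes feature bits as
    // sysctl keys instead of readable ID registers.
    #include <sys/sysctl.h>
    #include <stdio.h>

    static int has_feature(const char * name) {
        int value = 0;
        size_t size = sizeof(value);
        if (sysctlbyname(name, &value, &size, NULL, 0) != 0) {
            return 0;   // missing key -> treat the feature as unsupported
        }
        return value != 0;
    }

    int main(void) {
        printf("dotprod: %d\n", has_feature("hw.optional.arm.FEAT_DotProd"));
        printf("i8mm   : %d\n", has_feature("hw.optional.arm.FEAT_I8MM"));
        printf("bf16   : %d\n", has_feature("hw.optional.arm.FEAT_BF16"));
        return 0;
    }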
