## Description
There are a number of issues with the current AArch64 wheel build in https://github.com/pytorch/builder/blob/master/build_aarch64_wheel.py which appear to degrade the performance of the finished whl.
- OpenBLAS has not been built with `USE_OPENMP=1`. As a result, the finished PyTorch build does not use a multithreaded BLAS backend. This hurts performance, and a simple TorchVision ResNet50 inference example emits the following warning (`OMP_NUM_THREADS` times):
  `OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.`
- OpenBLAS is built for a Neoverse N1 target, but with a version of GCC that does not support `-mtune=neoverse-n1`. OpenBLAS correctly identifies the t6g (Neoverse N1) platform it is being built on, but GCC only supports `-mtune=neoverse-n1` from v9 onwards, so the build proceeds with `-march=armv8.2-a -mtune=cortex-a72` instead. Note: targeting the v8.2 ISA risks generating a binary that is not portable; a "generic" build would need to be provided for portability, although this would reduce performance.
- The build has `USE_EIGEN_FOR_BLAS` set. This can be seen in the output of `print(*torch.__config__.show().split("\n"), sep="\n")` (a runnable version of this check is sketched after this list). As I understand it, this should not be required when a BLAS library such as OpenBLAS is provided.
- `-march` and `-mtune` do not appear to be set for the PyTorch build. Building with `-mcpu=native` will choose the appropriate `-march` and `-mtune` for the host system (again, this has implications for portability).
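
For reference, a minimal sketch of how the BLAS backend and build flags of an installed wheel can be inspected (it assumes only that `torch` is importable; this is how the `USE_EIGEN_FOR_BLAS` setting above shows up):

```python
import torch

# Compile-time configuration recorded in the wheel; the BLAS lines and the
# presence of flags such as USE_EIGEN_FOR_BLAS and USE_OPENMP show which
# backend the build actually linked against.
print(*torch.__config__.show().split("\n"), sep="\n")

# Threads available for intra-op parallelism; without a multithreaded BLAS
# backend these will not be used for GEMM-heavy workloads.
print("intra-op threads:", torch.get_num_threads())
```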
Updating `build_aarch64_wheel.py` so that the OpenBLAS build uses:
```sh
LDFLAGS=-lgfortran make TARGET=NEOVERSEN1 USE_OPENMP=1 NO_SHARED=1 -j8
```
and the PyTorch build uses:
```python
build_vars += f"OpenBLAS_HOME='/opt/OpenBLAS' BLAS='OpenBLAS' USE_MKLDNN=0 USE_OPENMP=1 USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 CXXFLAGS='-mcpu=native -O3'"
```
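
For illustration, a minimal sketch of how these two changes could be wired up; `build_openblas` and `pytorch_build_env` are hypothetical names for this sketch, not the actual helpers in `build_aarch64_wheel.py`:

```python
import os
import subprocess

def build_openblas(src_dir: str = "OpenBLAS") -> None:
    # Build OpenBLAS with OpenMP enabled and the Neoverse N1 target,
    # mirroring the make invocation proposed above.
    subprocess.run(
        ["make", "TARGET=NEOVERSEN1", "USE_OPENMP=1", "NO_SHARED=1", "-j8"],
        cwd=src_dir,
        env={**os.environ, "LDFLAGS": "-lgfortran"},
        check=True,
    )

def pytorch_build_env() -> dict:
    # Environment for the PyTorch build: point at the freshly built OpenBLAS,
    # select it as the BLAS backend (so Eigen BLAS is not needed), and tune
    # for the build host (note the portability caveat of -mcpu=native).
    return {
        **os.environ,
        "OpenBLAS_HOME": "/opt/OpenBLAS",
        "BLAS": "OpenBLAS",
        "USE_MKLDNN": "0",
        "USE_OPENMP": "1",
        "USE_LAPACK": "1",
        "USE_CUDA": "0",
        "USE_FBGEMM": "0",
        "USE_DISTRIBUTED": "0",
        "CXXFLAGS": "-mcpu=native -O3",
    }
```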
Results in:

- the disappearance of the `OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.` warning
- a 30% speedup in a simple ResNet50 inference example
- a 70% fall in latency for a simple BERT example
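
For context, a minimal sketch of the kind of inference timing loop behind the ResNet50 number above (the model choice, input shape, and iteration counts are assumptions, not the exact benchmark used):

```python
import time
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    # Warm-up iterations so one-off allocation and initialisation costs
    # are excluded from the measurement.
    for _ in range(10):
        model(x)
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    elapsed = time.perf_counter() - start

print(f"mean inference latency: {elapsed / 50 * 1000:.1f} ms")
```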
Will it be possible to update the AArch64 build to support: multi-threaded OpenBLAS; disabling Eigen BLAS; and the correct Neoverse optimisations throughout? This will ensure the .whl gives better performance, consistent with what you would get when building from source.