## Description
There are a number of issues with the current AArch64 wheel build in https://github.com/pytorch/builder/blob/master/build_aarch64_wheel.py which appear to degrade the performance of the finished whl.
- OpenBLAS has not been built with `USE_OPENMP=1`. As a result, the finished PyTorch build does not use a multithreaded BLAS backend. This hurts performance, and a simple TorchVision ResNet50 inference example emits the following warning (`OMP_NUM_THREADS` times):
  `OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.`
- OpenBLAS is built for a Neoverse N1 target, but with a version of GCC that does not support `-mtune=neoverse-n1`. OpenBLAS correctly identifies the t6g (Neoverse N1) platform it is being built on, but GCC only supports `-mtune=neoverse-n1` from v9 onwards, so the build proceeds with `-march=armv8.2-a -mtune=cortex-a72` instead. Note: targeting the v8.2 ISA risks generating a binary that is not portable; a "generic" build would need to be provided for portability, although this would reduce performance.
- The build has `USE_EIGEN_FOR_BLAS` set. This can be seen in the output of `print(*torch.__config__.show().split("\n"), sep="\n")` (a runnable version of this check is sketched after this list). As I understand it, this should not be required when a BLAS library such as OpenBLAS is provided.
- `-march` and `-mtune` do not appear to be set for the PyTorch build. Building with `-mcpu=native` will choose the appropriate `-march` and `-mtune` for the host system (again, this has implications for portability).
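
For reference, a minimal sketch of how the BLAS backend and build flags of an installed wheel can be inspected (it assumes only that `torch` is importable; this is how the `USE_EIGEN_FOR_BLAS` setting above shows up):

```python
import torch

# Compile-time configuration recorded in the wheel; the BLAS lines and the
# presence of flags such as USE_EIGEN_FOR_BLAS and USE_OPENMP show which
# backend the build actually linked against.
print(*torch.__config__.show().split("\n"), sep="\n")

# Threads available for intra-op parallelism; without a multithreaded BLAS
# backend these will not be used for GEMM-heavy workloads.
print("intra-op threads:", torch.get_num_threads())
```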
Updating `build_aarch64_wheel.py` so that the OpenBLAS build uses:
```sh
LDFLAGS=-lgfortran make TARGET=NEOVERSEN1 USE_OPENMP=1 NO_SHARED=1 -j8
```
and the PyTorch build uses:
```python
build_vars += f"OpenBLAS_HOME='/opt/OpenBLAS' BLAS='OpenBLAS' USE_MKLDNN=0 USE_OPENMP=1 USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 CXXFLAGS='-mcpu=native -O3'"
```
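
For illustration, a minimal sketch of how these two changes could be wired up; `build_openblas` and `pytorch_build_env` are hypothetical names for this sketch, not the actual helpers in `build_aarch64_wheel.py`:

```python
import os
import subprocess

def build_openblas(src_dir: str = "OpenBLAS") -> None:
    # Build OpenBLAS with OpenMP enabled and the Neoverse N1 target,
    # mirroring the make invocation proposed above.
    subprocess.run(
        ["make", "TARGET=NEOVERSEN1", "USE_OPENMP=1", "NO_SHARED=1", "-j8"],
        cwd=src_dir,
        env={**os.environ, "LDFLAGS": "-lgfortran"},
        check=True,
    )

def pytorch_build_env() -> dict:
    # Environment for the PyTorch build: point at the freshly built OpenBLAS,
    # select it as the BLAS backend (so Eigen BLAS is not needed), and tune
    # for the build host (note the portability caveat of -mcpu=native).
    return {
        **os.environ,
        "OpenBLAS_HOME": "/opt/OpenBLAS",
        "BLAS": "OpenBLAS",
        "USE_MKLDNN": "0",
        "USE_OPENMP": "1",
        "USE_LAPACK": "1",
        "USE_CUDA": "0",
        "USE_FBGEMM": "0",
        "USE_DISTRIBUTED": "0",
        "CXXFLAGS": "-mcpu=native -O3",
    }
```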
Results in:

- the disappearance of the `OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.` warning
- a 30% speedup in a simple ResNet50 inference example
- a 70% fall in latency for a simple BERT example
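
For context, a minimal sketch of the kind of inference timing loop behind the ResNet50 number above (the model choice, input shape, and iteration counts are assumptions, not the exact benchmark used):

```python
import time
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    # Warm-up iterations so one-off allocation and initialisation costs
    # are excluded from the measurement.
    for _ in range(10):
        model(x)
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    elapsed = time.perf_counter() - start

print(f"mean inference latency: {elapsed / 50 * 1000:.1f} ms")
```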
Will it be possible to update the AArch64 build to support: multi-threaded OpenBLAS; disabling Eigen BLAS; and the correct Neoverse optimisations throughout? This will ensure the .whl gives better performance, consistent with what you would get when building from source.