
BLAS library build for AArch64 wheels #679

Closed
@nSircombe

Description


There are a number of issues with the current AArch64 wheel build in https://github.com/pytorch/builder/blob/master/build_aarch64_wheel.py which appear to impact the performance of the finished wheel.

  1. OpenBLAS has not been built with USE_OPENMP=1.
    As a result, the finished PyTorch build does not use a multithreaded BLAS backend. This impacts performance and produces the following warning (OMP_NUM_THREADS times) for a simple TorchVision ResNet50 inference example: OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.

  2. OpenBLAS is built for a Neoverse N1 target, but with a version of GCC that does not support -mtune=neoverse-n1.
    OpenBLAS correctly identifies the t6g (Neoverse N1) platform it is being built on, but GCC only supports -mtune=neoverse-n1 from v9 onwards, so the build proceeds with -march=armv8.2-a -mtune=cortex-a72 instead. Note: targeting the v8.2 ISA risks producing a binary that is not portable; a "generic" build would be needed for portability, although this would impact performance.

  3. The build has USE_EIGEN_FOR_BLAS set.
    This can be seen in the output of print(*torch.__config__.show().split("\n"), sep="\n"). As I understand it, this should not be required when a BLAS library such as OpenBLAS is provided.

  4. -march and -mtune do not appear to have been set for the PyTorch build.
    Building with -mcpu=native will choose the appropriate -march and -mtune for the host system (again, this has implications for portability).
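
The checks in items 1 and 3 can be automated by scanning the build configuration string. A minimal sketch (audit_torch_config and the exact marker substrings are assumptions; the precise format of torch.__config__.show() output varies between PyTorch versions):

```python
def audit_torch_config(config_text):
    """Scan the output of torch.__config__.show() for the problems above.

    Pure string matching; the markers checked here are assumptions and
    may differ between PyTorch versions.
    """
    return {
        # Item 3: Eigen should not be the BLAS backend when OpenBLAS is provided.
        "eigen_blas_enabled": "USE_EIGEN_FOR_BLAS" in config_text,
        # Item 1: OpenMP should be on so the BLAS backend is multithreaded.
        "openmp_enabled": "USE_OPENMP=ON" in config_text,
    }

# In practice: audit_torch_config(torch.__config__.show())
```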

Updating build_aarch64_wheel.py so that the OpenBLAS build uses:

LDFLAGS=-lgfortran make TARGET=NEOVERSEN1 USE_OPENMP=1 NO_SHARED=1 -j8
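
In build_aarch64_wheel.py terms, that invocation could be driven from Python along these lines (openblas_make_command and build_openblas are hypothetical helpers for illustration, not functions that exist in the script):

```python
import multiprocessing
import os
import subprocess


def openblas_make_command(jobs=None):
    """Return the proposed make invocation for OpenBLAS as an argv list."""
    jobs = jobs or multiprocessing.cpu_count()
    return [
        "make",
        "TARGET=NEOVERSEN1",  # Neoverse N1 kernels
        "USE_OPENMP=1",       # build a multithreaded BLAS backend
        "NO_SHARED=1",        # static library only
        f"-j{jobs}",
    ]


def build_openblas(src_dir, jobs=None):
    """Run the build in src_dir, linking libgfortran as in the command above."""
    env = dict(os.environ, LDFLAGS="-lgfortran")
    subprocess.run(openblas_make_command(jobs), cwd=src_dir, env=env, check=True)
```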

and the PyTorch build uses:

build_vars += f"OpenBLAS_HOME='/opt/OpenBLAS' BLAS='OpenBLAS' USE_MKLDNN=0 USE_OPENMP=1 USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 CXXFLAGS='-mcpu=native -O3'"
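
For readability, the same environment string could be assembled from a dict rather than one long f-string (a sketch; pytorch_build_vars is a hypothetical helper, not part of the current script):

```python
def pytorch_build_vars(mcpu="native"):
    """Build the environment-variable string for the PyTorch wheel build."""
    flags = {
        "OpenBLAS_HOME": "/opt/OpenBLAS",  # where the OpenBLAS build was installed
        "BLAS": "OpenBLAS",                # use OpenBLAS, not Eigen, for BLAS
        "USE_MKLDNN": "0",
        "USE_OPENMP": "1",
        "USE_LAPACK": "1",
        "USE_CUDA": "0",
        "USE_FBGEMM": "0",
        "USE_DISTRIBUTED": "0",
        "CXXFLAGS": f"-mcpu={mcpu} -O3",   # -mcpu=native picks -march/-mtune for the host
    }
    return " ".join(f"{key}='{value}'" for key, value in flags.items())
```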

This results in:

  • the disappearance of the OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option. warning;
  • a 30% speedup in a simple ResNet50 inference example;
  • a 70% reduction in latency for a simple BERT example.
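
The comparisons above can be reproduced with a simple timing harness along these lines (a generic sketch; for the ResNet50 case, fn would wrap model(input) under torch.no_grad()):

```python
import time
from statistics import median


def benchmark(fn, warmup=3, runs=10):
    """Return the median latency of fn in seconds, after warm-up iterations."""
    for _ in range(warmup):
        fn()  # warm-up: let thread pools and caches settle
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)

# latency_before / latency_after gives the speedup factor between two builds.
```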

Would it be possible to update the AArch64 build to support multi-threaded OpenBLAS, disable Eigen BLAS, and use the correct Neoverse optimisations throughout? This would ensure the .whl gives better performance, consistent with what you would get when building from source.
