-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
This report comes right from #1425 where the discussion drifted off from thread safety in openblas v. 0.2.20 to performance on AMD Ryzen and Threadripper processors (in this particular case a TR 1950X). I seems worthwhile to discuss this in a separate thread. Until now we had the following discussion
@tkswe88 : After plenty of initial tests with the AMD TR 1950X processor, it seems that openblas (tested 0.2.19, 0.2.20 and the development version on Ubuntu 17.10 with kernel 4.13, gcc and gfortran v. 7.2) operates roughly 20% slower on the TR1950X than on an i7-4770 (4 cores) when using 1, 2 and 4 threads. This is somewhat surprising given that both CPUs use at most AVX2 and thus should be comparable in terms of vectorisation potential. I have already adjusted the OpenMP thread affinity to exclude that the (hidden) NUMA architecture of the TR1950X causes its lower performance. Other measures I took were 1) BIOS upgrade, 2) Linux kernel upgrade to 4.15, 3) increased DIMM frequency from 2133 to 2666 MHz. Except for the latter, which gave a speedup of roughly 3%, these measures did not have any effect on execution speed. Do you have any idea where the degraded performance on the TR1950X comes from? Is this related to a current lack of optimization in openblas or do we just have to wait for the next major release of gcc/gfortran to fix the problem? Of course, I would be willing to run tests, if this was of help in developing openblas.
@brada4 : AMD has slightly slower AVX and AVX2 units per CPU, by no means slow in general, it still has heap of cores spare. Sometimes optimal AVX2 saturation means turning whole CPU cartridge to base, i.e non-turpo frequency.
@martin-frbg: Could also be that getarch is mis-detecting the cache sizes on TR, or the various hardcoded block sizes from param.h for loop unrolling are "more wrong" on TR than they were on the smaller Ryzen. Overall support for the Ryzen architecture is currently limited to treating it like Haswell, see #1133,1147. There may be other assembly instructions besides AVX2 that are slower on Ryzen (#1147 mentions movntp*).
@tkswe88: Are there any tests I could do to find out about cache size detection errors or more appropriate settings for loop unroling?
@brada4 What you asked to martin - copied parameters may need doubled or halved at least here: https://github.com/xianyi/OpenBLAS/pull/1133/files#diff-7a3ef0fabb9c6c40aac5ae459c3565f0
@martin-frbg: You could simply check the autodetected information in config.h against the specification. (As far as I can determine, L3 size seems to be ignored as a parameter).
As far as appropriate settings go, the benchmark directory contains a few tests that can be used for timing individual functions. Adjusting the values in param.h (see https://github.com/xianyi/OpenBLAS/pull/1157/files) is a bit of a black art though.