ENH, TST: improvements to CI testing #2108

tylerjereddy · 2019-05-02T22:15:28Z

There have been some discussions downstream of OpenBLAS about how we might help catch issues sooner, perhaps by contributing to your CI. I have a few thoughts here, which you may very well reject, but better to know that before I try to put time in or open a PR.

OpenBLAS currently has access to 10 free parallel CI slots it isn't using, since OpenBLAS doesn't currently use Azure pipelines on PRs. We've had an excellent experience using Azure downstream alongside Travis.
If you do adopt Azure and shift some of your jobs from your very-busy Travis matrix over there you may also benefit from freeing up a Travis slot for a ppc64le / POWER8 build, which we use downstream for native testing on that platform in CI.
I also noticed a number of emulated ARM jobs in your travis matrix. You may also benefit from running natively, which is now available for free in CI on ARMv8 via Shippable.
Adding an older OS test on Azure, like CentOS 6, to match i.e., manylinux1 / old build libs. This might need to be coordinated with R, Julia, and other downstream communities to see if there's some overlap in what is most suitable to test mutual interests.

Are you open to some of these ideas--could it be helpful? The worst part is probably that a maintainer would likely have to set up the app integrations for those CI services, but that's not too bad if we can submit the PRs to help build up the testing a bit more.

This may also help change the dialogue from us just showing up and asking for bug fixes to a way to help out a little bit so that both sides avoid rushing new releases for suddenly-discovered issues at or just after release time.

There's definitely some contentious debate about what the appropriate medium for the testing is. There's also the concern about being able to test newer archs like SkylakeX with the recent issues--that seems harder / not something more CI can easily solve unless the VMs will support the instructions somehow.

I think there's also a buildbot that tests OpenBLAS on some archs, though I'm not sure how regularly, and I believe i.e., Skylake may not be available there.

The text was updated successfully, but these errors were encountered:

isuruf · 2019-05-02T23:25:10Z

There's also the concern about being able to test newer archs like SkylakeX with the recent issues--that seems harder / not something more CI can easily solve unless the VMs will support the instructions somehow.

You can use Intel SDE. BLIS project uses SDE to check different arches. (https://github.com/flame/blis/blob/master/travis/do_sde.sh)

brada4 · 2019-05-03T05:05:41Z

If you could integrate your ideas - namely manylinux1 as oldest imaginable glibc - in current CI config.

martin-frbg · 2019-05-03T06:26:07Z

Sounds good, but I do not have full administrative control over OpenBLAS (yet?), as evidenced by the outdated openblas.net site. (I am not even aware if/how the buildbots are maintained - suspect these are staggering on with whatever configuration they received 2+ years ago)
I have run SDE to try and narrow down the SkylakeX DGEMM problem, ~~but I found it to be surprisingly slow.~~ EDIT:could not reproduce the slowness now, so probably did something silly like running full x86 instruction set emulation by mistake earlier (It also did not help that I did not receive any confirmation from the author of the problematic PR whether the SDE-derived reproducer also showed a problem on the actual hardware)
I have only limited time to spend on OpenBLAS. Sometimes I have to trust that contributors know what they are doing, in particular if I have no reason to doubt their affiliation with a pertinent business or research institution. (And I do not think the DGEMM issue would have been caught by the current set of build-time tests performed by CI - and neither did the earlier drama over USE_TLS that shares some similarity)

isuruf · 2019-05-03T16:38:19Z

With SDE I can reproduce the issue mentioned at numpy/numpy#13401 (On CoffeeLake emulating SkyLakeX) with 0.3.5.

isuruf · 2019-05-03T16:51:14Z

@tylerjereddy, was numpy/numpy#13401 fixed with 0.3.6? I still see it with SDE.

(numpy) isuru@isuru:~/projects/OpenBLAS$ OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=HASWELL sde-external-8.35.0-2019-03-11-lin/sde64 -cpuid_in sde_utils/cpuid/haswell.def -- python np13401.py 
Core: Haswell
-7.806378156741123e-17
(numpy) isuru@isuru:~/projects/OpenBLAS$ OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=SKYLAKEX sde-external-8.35.0-2019-03-11-lin/sde64 -cpuid_in sde_utils/cpuid/skx.def -- python np13401.py 
Core: SkylakeX
-4.917741857331878

tylerjereddy · 2019-05-03T17:52:15Z

@isuruf I believe the "fix" is actually a disabling of the kernel in OpenBLAS for now: #2061

I've just started playing with the SDE in CI on my fork. I suppose the fail before / succeed after scenario is actually for SDE set to skx and OpenBLAS not using the kernel by default?

I don't know if that "regression" test is easy to do from NumPy's perspective, and it will of course change when the instructions can be used some day.

brada4 · 2019-05-03T17:55:59Z

OPENBLAS_CORETYPE= selects one or other core, like SANDYBRIDGE ZEN HASWELL SKYLAKEX

isuruf · 2019-05-03T17:58:21Z

What I'm trying to say is SKYLAKEX is still problematic when used with SDE.

tylerjereddy · 2019-05-03T18:02:34Z

So, if I tell the SDE to use skx, but I don't tell OPENBLAS which coretype it should use, should the linked patch prevent OPENBLAS from using the problematic Skylake kernel by default? It would fall back to some other kernel?

End users can solve the problem by using OPENBLAS_CORETYPE= as noted above if they are on Skylake with 0.3.5 OpenBLAS, but we don't want to have to ask them to do that, so I want to test for running on Skylake but having OpenBLAS fall back to whatever it falls back to if it can't use the problematic instructions.

isuruf · 2019-05-03T18:06:25Z

So, if I tell the SDE to use skx, but I don't tell OPENBLAS which coretype it should use, should the linked patch prevent OPENBLAS from using the problematic Skylake kernel by default?

Yes, even if you tell OPENBLAS to use SKYLAKEX, problematic kernel will not be used in openblas 0.3.6. So, the numpy test should pass for any combination of SDE cpuid and OPENBLAS core type. (given that SDE cpuid is newer or equal to the OPENBLAS core type)

brada4 · 2019-05-03T18:33:58Z

It was noted with 0.3.5-dev which is some development version from this year. Most likely problem was introduced with 0.3.3 , certainly not earlier. @isuruf could you confirm 0.3.6 solves the issue over 0.3.5 ?

isuruf · 2019-05-03T18:35:50Z

0.3.6 doesn't. I tested with SDE, not actual hardware

martin-frbg · 2019-05-03T18:48:28Z

The 0.3.6 "fix" specifically disables the SkylakeX DGEMM kernel in response to reports in #1955 (a Julia issue) and #2029 (with a computational chemistry package) but none of the other SkylakeX kernels contributed by fenrus75. There is no guarantee that this will also solve any of the problems you appear to be seeing with numpy.

tylerjereddy · 2019-05-03T18:55:57Z

There is no guarantee that this will also solve any of the problems you appear to be seeing with numpy.

Noted, I'll likely end up running NumPy full linear algebra test suite with currently-distributed vs. latest develop version of OpenBLAS linked, to see if it helps once I'm setup with the emulation stuff.

tylerjereddy · 2019-05-04T00:02:52Z

Hmm, I'm still stuck getting Haswell reported from OpenBLAS 0.3.5.dev when I do this on Azure CI:

./sde-external-8.35.0-2019-03-11-lin/sde64 -cpuid_in ./sde-external-8.35.0-2019-03-11-lin/misc/cpuid/skx/cpuid.def -- python -c "import numpy, ctypes, os; dll = ctypes.CDLL(numpy.core._ multiarray_umath.__file__); get_config = dll.openblas_get_config; get_config.restype=ctypes.c_char_p; res = get_config(); print('OpenBLAS get_config returned', str(res)); assert b'OpenBLAS 0. 3.5' in res"

I must be doing something wrong since I've not used the SDE before today. Can I not use those /cpuid/skx/cpuid.def paths directly from the untarred SDE source? I don't see an error message.

isuruf · 2019-05-04T00:17:11Z

@tylerjereddy, did you build with DYNAMIC_ARCH=1? CAn you post the log from Azure?

tylerjereddy · 2019-05-04T00:22:02Z

Most recent log should be visible here: https://dev.azure.com/tylerjereddy/numpy-test/_build/results?buildId=821

My source for the CI runs in my fork: https://github.com/tylerjereddy/numpy/blob/intel-sde-azure/azure-pipelines.yml#L42

The OpenBLAS is pre-built from the MacPython ecosystem, but I believe we tend to use DYNAMIC_ARCH=1

isuruf · 2019-05-04T00:43:01Z

@tylerjereddy, manylinux1 gcc doesn't support AVX512, so the openblas library on numpy wheels can only support upto haswell even on newer cpus like skylakex.

tylerjereddy · 2019-05-04T00:50:25Z

@isuruf So the original issue was reported on Mac, using pip/ wheels & OpenBLAS built for Mac, presumably with clang--so I'll have to switch to Mac then I suppose.

Perhaps I was confused because you mentioned reproduction with the linux SDE of the issue above in your examples -- I assume you built NumPy locally with your own OpenBLAS builds in those cases?

isuruf · 2019-05-04T00:51:14Z

I assume you built NumPy locally with your own OpenBLAS builds in those cases?

No, used conda's openblas and numpy

brada4 · 2019-05-04T04:56:12Z

Most recent log

The log shows OpenBLAS 0.3.3 ? That includes quite experimental USE_TLS=1 in Makefile.rule that you must patch out from the file as it cannot be overridden with parameters later.

matthew-brett · 2019-05-04T11:40:20Z

@tylerjereddy - maybe we should start building manylinux2010 wheels to test with? It's a fairly easy set of fixes to our current setup:

https://github.com/matthew-brett/multibuild/issues/238

isuruf · 2019-05-04T17:02:16Z

maybe we should start building manylinux2010 wheels to test with?

Or build clang 8.0.0 on manylinux1, which I do for some wheels. (Can't do it on CI except Azure)

tylerjereddy · 2019-05-05T20:41:44Z

The log shows OpenBLAS 0.3.3

Yeah, that's not true, but admittedly confusing--the version numbers on OpenBLAS file names produced in the "MacPython" ecosystem are a little problematic. Luckily, the file names do usually contain a hash for the OpenBLAS develop commit used for the build if it is not a stable release.

So, that 0.3.3 file name is actually some commit on the 0.3.5.dev OpenBLAS. People don't believe me when I say we need better version detection & auditing infrastructure, but anyway we now have get_config for the version string in OpenBLAS that we can probe from ctypes, which is a great recent addition from the OpenBLAS side.

maybe we should start building manylinux2010 wheels

Good idea--for now, I'm just trying to put out the current fire for NumPy / SciPy wheels.

xianyi · 2019-05-08T02:01:06Z

Hi all,

I just set up Azure pipeline. How could I configure yaml? Is it similar to travis-ci?

tylerjereddy · 2019-05-08T02:07:35Z

@xianyi Awesome! You added the Azure pipelines app for the project?

The YML config is a little different, but pretty well documented now. Many downstream projects like NumPy & SciPy have pretty detailed examples, etc., so could be slowly built up as long as the integration / PR hooks are active now.

Usually there are one or two tricky steps in the setup to get it just right.

tylerjereddy · 2019-05-08T02:20:45Z

Yes! Azure pipelines now running in #2121.

It can also be useful to turn on boards in the admin config, since they're currently needed for showing some things like coverage reports:

xianyi · 2019-05-08T02:47:07Z

I enable the board. Please use the following URL.
https://dev.azure.com/xianyi/openblas/

tylerjereddy mentioned this issue May 3, 2019

BLD: use latest OpenBLAS MacPython/openblas-libs#2

Merged

tylerjereddy mentioned this issue May 4, 2019

TST: Reproducer for SkylakeX Issue gh-13401 numpy/numpy#13466

Closed

martin-frbg closed this as completed Aug 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH, TST: improvements to CI testing #2108

ENH, TST: improvements to CI testing #2108

tylerjereddy commented May 2, 2019

isuruf commented May 2, 2019

brada4 commented May 3, 2019

martin-frbg commented May 3, 2019 •

edited

Loading

isuruf commented May 3, 2019 •

edited

Loading

isuruf commented May 3, 2019

tylerjereddy commented May 3, 2019

brada4 commented May 3, 2019

isuruf commented May 3, 2019

tylerjereddy commented May 3, 2019

isuruf commented May 3, 2019

brada4 commented May 3, 2019 •

edited

Loading

isuruf commented May 3, 2019

martin-frbg commented May 3, 2019

tylerjereddy commented May 3, 2019

tylerjereddy commented May 4, 2019

isuruf commented May 4, 2019

tylerjereddy commented May 4, 2019

isuruf commented May 4, 2019

tylerjereddy commented May 4, 2019

isuruf commented May 4, 2019 •

edited

Loading

brada4 commented May 4, 2019 •

edited

Loading

matthew-brett commented May 4, 2019

isuruf commented May 4, 2019

tylerjereddy commented May 5, 2019

xianyi commented May 8, 2019

tylerjereddy commented May 8, 2019

tylerjereddy commented May 8, 2019

xianyi commented May 8, 2019

ENH, TST: improvements to CI testing #2108

ENH, TST: improvements to CI testing #2108

Comments

tylerjereddy commented May 2, 2019

isuruf commented May 2, 2019

brada4 commented May 3, 2019

martin-frbg commented May 3, 2019 • edited Loading

isuruf commented May 3, 2019 • edited Loading

isuruf commented May 3, 2019

tylerjereddy commented May 3, 2019

brada4 commented May 3, 2019

isuruf commented May 3, 2019

tylerjereddy commented May 3, 2019

isuruf commented May 3, 2019

brada4 commented May 3, 2019 • edited Loading

isuruf commented May 3, 2019

martin-frbg commented May 3, 2019

tylerjereddy commented May 3, 2019

tylerjereddy commented May 4, 2019

isuruf commented May 4, 2019

tylerjereddy commented May 4, 2019

isuruf commented May 4, 2019

tylerjereddy commented May 4, 2019

isuruf commented May 4, 2019 • edited Loading

brada4 commented May 4, 2019 • edited Loading

matthew-brett commented May 4, 2019

isuruf commented May 4, 2019

tylerjereddy commented May 5, 2019

xianyi commented May 8, 2019

tylerjereddy commented May 8, 2019

tylerjereddy commented May 8, 2019

xianyi commented May 8, 2019

martin-frbg commented May 3, 2019 •

edited

Loading

isuruf commented May 3, 2019 •

edited

Loading

brada4 commented May 3, 2019 •

edited

Loading

isuruf commented May 4, 2019 •

edited

Loading

brada4 commented May 4, 2019 •

edited

Loading