Skip to content

ENH, TST: improvements to CI testing #2108

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
tylerjereddy opened this issue May 2, 2019 · 28 comments
Closed

ENH, TST: improvements to CI testing #2108

tylerjereddy opened this issue May 2, 2019 · 28 comments

Comments

@tylerjereddy
Copy link
Contributor

There have been some discussions downstream of OpenBLAS about how we might help catch issues sooner, perhaps by contributing to your CI. I have a few thoughts here, which you may very well reject, but better to know that before I try to put time in or open a PR.

  1. OpenBLAS currently has access to 10 free parallel CI slots it isn't using, since OpenBLAS doesn't currently use Azure pipelines on PRs. We've had an excellent experience using Azure downstream alongside Travis.
  2. If you do adopt Azure and shift some of your jobs from your very-busy Travis matrix over there you may also benefit from freeing up a Travis slot for a ppc64le / POWER8 build, which we use downstream for native testing on that platform in CI.
  3. I also noticed a number of emulated ARM jobs in your travis matrix. You may also benefit from running natively, which is now available for free in CI on ARMv8 via Shippable.
  4. Adding an older OS test on Azure, like CentOS 6, to match i.e., manylinux1 / old build libs. This might need to be coordinated with R, Julia, and other downstream communities to see if there's some overlap in what is most suitable to test mutual interests.

Are you open to some of these ideas--could it be helpful? The worst part is probably that a maintainer would likely have to set up the app integrations for those CI services, but that's not too bad if we can submit the PRs to help build up the testing a bit more.

This may also help change the dialogue from us just showing up and asking for bug fixes to a way to help out a little bit so that both sides avoid rushing new releases for suddenly-discovered issues at or just after release time.

There's definitely some contentious debate about what the appropriate medium for the testing is. There's also the concern about being able to test newer archs like SkylakeX with the recent issues--that seems harder / not something more CI can easily solve unless the VMs will support the instructions somehow.

I think there's also a buildbot that tests OpenBLAS on some archs, though I'm not sure how regularly, and I believe i.e., Skylake may not be available there.

@isuruf
Copy link
Contributor

isuruf commented May 2, 2019

There's also the concern about being able to test newer archs like SkylakeX with the recent issues--that seems harder / not something more CI can easily solve unless the VMs will support the instructions somehow.

You can use Intel SDE. BLIS project uses SDE to check different arches. (https://github.com/flame/blis/blob/master/travis/do_sde.sh)

@brada4
Copy link
Contributor

brada4 commented May 3, 2019

If you could integrate your ideas - namely manylinux1 as oldest imaginable glibc - in current CI config.

@martin-frbg
Copy link
Collaborator

martin-frbg commented May 3, 2019

  1. Sounds good, but I do not have full administrative control over OpenBLAS (yet?), as evidenced by the outdated openblas.net site. (I am not even aware if/how the buildbots are maintained - suspect these are staggering on with whatever configuration they received 2+ years ago)
  2. I have run SDE to try and narrow down the SkylakeX DGEMM problem, but I found it to be surprisingly slow. EDIT:could not reproduce the slowness now, so probably did something silly like running full x86 instruction set emulation by mistake earlier (It also did not help that I did not receive any confirmation from the author of the problematic PR whether the SDE-derived reproducer also showed a problem on the actual hardware)
  3. I have only limited time to spend on OpenBLAS. Sometimes I have to trust that contributors know what they are doing, in particular if I have no reason to doubt their affiliation with a pertinent business or research institution. (And I do not think the DGEMM issue would have been caught by the current set of build-time tests performed by CI - and neither did the earlier drama over USE_TLS that shares some similarity)

@isuruf
Copy link
Contributor

isuruf commented May 3, 2019

With SDE I can reproduce the issue mentioned at numpy/numpy#13401 (On CoffeeLake emulating SkyLakeX) with 0.3.5.

@isuruf
Copy link
Contributor

isuruf commented May 3, 2019

@tylerjereddy, was numpy/numpy#13401 fixed with 0.3.6? I still see it with SDE.

(numpy) isuru@isuru:~/projects/OpenBLAS$ OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=HASWELL sde-external-8.35.0-2019-03-11-lin/sde64 -cpuid_in sde_utils/cpuid/haswell.def -- python np13401.py 
Core: Haswell
-7.806378156741123e-17
(numpy) isuru@isuru:~/projects/OpenBLAS$ OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=SKYLAKEX sde-external-8.35.0-2019-03-11-lin/sde64 -cpuid_in sde_utils/cpuid/skx.def -- python np13401.py 
Core: SkylakeX
-4.917741857331878

@tylerjereddy
Copy link
Contributor Author

@isuruf I believe the "fix" is actually a disabling of the kernel in OpenBLAS for now: #2061

I've just started playing with the SDE in CI on my fork. I suppose the fail before / succeed after scenario is actually for SDE set to skx and OpenBLAS not using the kernel by default?

I don't know if that "regression" test is easy to do from NumPy's perspective, and it will of course change when the instructions can be used some day.

@brada4
Copy link
Contributor

brada4 commented May 3, 2019

OPENBLAS_CORETYPE= selects one or other core, like SANDYBRIDGE ZEN HASWELL SKYLAKEX

@isuruf
Copy link
Contributor

isuruf commented May 3, 2019

What I'm trying to say is SKYLAKEX is still problematic when used with SDE.

@tylerjereddy
Copy link
Contributor Author

So, if I tell the SDE to use skx, but I don't tell OPENBLAS which coretype it should use, should the linked patch prevent OPENBLAS from using the problematic Skylake kernel by default? It would fall back to some other kernel?

End users can solve the problem by using OPENBLAS_CORETYPE= as noted above if they are on Skylake with 0.3.5 OpenBLAS, but we don't want to have to ask them to do that, so I want to test for running on Skylake but having OpenBLAS fall back to whatever it falls back to if it can't use the problematic instructions.

@isuruf
Copy link
Contributor

isuruf commented May 3, 2019

So, if I tell the SDE to use skx, but I don't tell OPENBLAS which coretype it should use, should the linked patch prevent OPENBLAS from using the problematic Skylake kernel by default?

Yes, even if you tell OPENBLAS to use SKYLAKEX, problematic kernel will not be used in openblas 0.3.6. So, the numpy test should pass for any combination of SDE cpuid and OPENBLAS core type. (given that SDE cpuid is newer or equal to the OPENBLAS core type)

@brada4
Copy link
Contributor

brada4 commented May 3, 2019

It was noted with 0.3.5-dev which is some development version from this year. Most likely problem was introduced with 0.3.3 , certainly not earlier. @isuruf could you confirm 0.3.6 solves the issue over 0.3.5 ?

@isuruf
Copy link
Contributor

isuruf commented May 3, 2019

0.3.6 doesn't. I tested with SDE, not actual hardware

@martin-frbg
Copy link
Collaborator

The 0.3.6 "fix" specifically disables the SkylakeX DGEMM kernel in response to reports in #1955 (a Julia issue) and #2029 (with a computational chemistry package) but none of the other SkylakeX kernels contributed by fenrus75. There is no guarantee that this will also solve any of the problems you appear to be seeing with numpy.

@tylerjereddy
Copy link
Contributor Author

There is no guarantee that this will also solve any of the problems you appear to be seeing with numpy.

Noted, I'll likely end up running NumPy full linear algebra test suite with currently-distributed vs. latest develop version of OpenBLAS linked, to see if it helps once I'm setup with the emulation stuff.

@tylerjereddy
Copy link
Contributor Author

Hmm, I'm still stuck getting Haswell reported from OpenBLAS 0.3.5.dev when I do this on Azure CI:

./sde-external-8.35.0-2019-03-11-lin/sde64 -cpuid_in ./sde-external-8.35.0-2019-03-11-lin/misc/cpuid/skx/cpuid.def -- python -c "import numpy, ctypes, os; dll = ctypes.CDLL(numpy.core._ multiarray_umath.__file__); get_config = dll.openblas_get_config; get_config.restype=ctypes.c_char_p; res = get_config(); print('OpenBLAS get_config returned', str(res)); assert b'OpenBLAS 0. 3.5' in res"

I must be doing something wrong since I've not used the SDE before today. Can I not use those /cpuid/skx/cpuid.def paths directly from the untarred SDE source? I don't see an error message.

@isuruf
Copy link
Contributor

isuruf commented May 4, 2019

@tylerjereddy, did you build with DYNAMIC_ARCH=1? CAn you post the log from Azure?

@tylerjereddy
Copy link
Contributor Author

Most recent log should be visible here: https://dev.azure.com/tylerjereddy/numpy-test/_build/results?buildId=821

My source for the CI runs in my fork: https://github.com/tylerjereddy/numpy/blob/intel-sde-azure/azure-pipelines.yml#L42

The OpenBLAS is pre-built from the MacPython ecosystem, but I believe we tend to use DYNAMIC_ARCH=1

@isuruf
Copy link
Contributor

isuruf commented May 4, 2019

@tylerjereddy, manylinux1 gcc doesn't support AVX512, so the openblas library on numpy wheels can only support upto haswell even on newer cpus like skylakex.

@tylerjereddy
Copy link
Contributor Author

@isuruf So the original issue was reported on Mac, using pip/ wheels & OpenBLAS built for Mac, presumably with clang--so I'll have to switch to Mac then I suppose.

Perhaps I was confused because you mentioned reproduction with the linux SDE of the issue above in your examples -- I assume you built NumPy locally with your own OpenBLAS builds in those cases?

@isuruf
Copy link
Contributor

isuruf commented May 4, 2019

I assume you built NumPy locally with your own OpenBLAS builds in those cases?

No, used conda's openblas and numpy

@brada4
Copy link
Contributor

brada4 commented May 4, 2019

Most recent log

The log shows OpenBLAS 0.3.3 ? That includes quite experimental USE_TLS=1 in Makefile.rule that you must patch out from the file as it cannot be overridden with parameters later.

@matthew-brett
Copy link
Contributor

@tylerjereddy - maybe we should start building manylinux2010 wheels to test with? It's a fairly easy set of fixes to our current setup:

https://github.com/matthew-brett/multibuild/issues/238

@isuruf
Copy link
Contributor

isuruf commented May 4, 2019

maybe we should start building manylinux2010 wheels to test with?

Or build clang 8.0.0 on manylinux1, which I do for some wheels. (Can't do it on CI except Azure)

@tylerjereddy
Copy link
Contributor Author

The log shows OpenBLAS 0.3.3

Yeah, that's not true, but admittedly confusing--the version numbers on OpenBLAS file names produced in the "MacPython" ecosystem are a little problematic. Luckily, the file names do usually contain a hash for the OpenBLAS develop commit used for the build if it is not a stable release.

So, that 0.3.3 file name is actually some commit on the 0.3.5.dev OpenBLAS. People don't believe me when I say we need better version detection & auditing infrastructure, but anyway we now have get_config for the version string in OpenBLAS that we can probe from ctypes, which is a great recent addition from the OpenBLAS side.

maybe we should start building manylinux2010 wheels

Good idea--for now, I'm just trying to put out the current fire for NumPy / SciPy wheels.

@xianyi
Copy link
Collaborator

xianyi commented May 8, 2019

Hi all,

I just set up Azure pipeline. How could I configure yaml? Is it similar to travis-ci?

@tylerjereddy
Copy link
Contributor Author

@xianyi Awesome! You added the Azure pipelines app for the project?

The YML config is a little different, but pretty well documented now. Many downstream projects like NumPy & SciPy have pretty detailed examples, etc., so could be slowly built up as long as the integration / PR hooks are active now.

Usually there are one or two tricky steps in the setup to get it just right.

@tylerjereddy
Copy link
Contributor Author

Yes! Azure pipelines now running in #2121.

It can also be useful to turn on boards in the admin config, since they're currently needed for showing some things like coverage reports:

image

@xianyi
Copy link
Collaborator

xianyi commented May 8, 2019

I enable the board. Please use the following URL.
https://dev.azure.com/xianyi/openblas/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants