Regression with current pre-0.2.9 git and the elk code #329

Closed
martin-frbg opened this issue Dec 15, 2013 · 13 comments
@martin-frbg
Collaborator

Just a quick heads up - I will try to pinpoint the problem if possible later:
Using the ELK "computational chemistry" code from elk.sourceforge.net I see lots of failures in the test problems distributed with the code when I build it against the current git version of openBLAS optimized for either Haswell or Sandybridge.
Using 0.2.8 built for Sandybridge, all is well on this i7-4770 (though openblas gives no measurable speedup on these small problems).
To reproduce:

  1. get elk-2.2.10.tgz from sourceforge
  2. unpack it and run the included "setup" script to create make.inc
  3. in make.inc, specify the openblas library instead of the included lapack&blas
  4. run make, and then "make test"
  5. With 0.2.8, all 17 tests pass within a total runtime of around 155 seconds. With
    0.2.9, runtime more than doubles (because the test cases fail to converge)
    and most tests report failure.
    One hint: in test-002, valgrind emits lots of "use of uninitialized value" warnings in calls to dgemv_t, dgemm_kernel, zscal_k etc.
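
The steps above could be scripted roughly as follows. This is only a sketch: the name of the unpacked directory, the location of the OpenBLAS library, and the `run`/`reproduce` helpers are assumptions made for illustration, and step 3 still has to be done by hand because the exact make.inc edit is not spelled out here.

```shell
#!/bin/sh
# Sketch only: file names, directory names and the library path are assumptions.
ELK_VERSION="2.2.10"
OPENBLAS_LIB="${OPENBLAS_LIB:-/opt/OpenBLAS/lib/libopenblas.a}"  # hypothetical path

# Echo a command; execute it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

reproduce() {
    run tar xzf "elk-${ELK_VERSION}.tgz"   # step 1 assumes the tarball is already downloaded
    run cd "elk-${ELK_VERSION}"            # assumed name of the unpacked directory
    run ./setup                            # step 2: interactive, writes make.inc
    # Step 3 is manual: point make.inc at OpenBLAS instead of the bundled lapack and blas.
    echo "# edit make.inc to link ${OPENBLAS_LIB}"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        printf 'Press Enter once make.inc is edited: '
        read -r reply
    fi
    run make                               # step 4
    run make test
}
```

Calling `DRY_RUN=1 reproduce` only prints the commands, which is a cheap way to check the sequence before running it for real with `reproduce`.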
@ghost ghost assigned wernsaar Dec 16, 2013
@xianyi
Collaborator

xianyi commented Dec 16, 2013

@wernsaar , could you look at this issue?
Thank you

@wernsaar
Contributor

Hi,

if you want to build for haswell, piledriver or bulldozer, you need
recent versions of gcc and binutils, and you need valgrind-3.9.0
or newer.

Werner

@martin-frbg
Collaborator Author

Build system in question is opensuse 12.3, so fairly recent (binutils-2.23, gcc472, using gcc482 does not solve the problem). Will do further analysis with valgrind 3.9 instead of the 3.8.1 used for the above. Please note that a sandybridge build of 0.2.8 works without errors, while 0.2.9 sandybridge is unusable on the same system.

@martin-frbg
Collaborator Author

Just for the record, updating binutils to 2.24 did not change anything. Neither did updating valgrind change anything fundamental about the slew of warnings, though I have to concede that it reports a similarly high number of complaints for 0.2.8, even though that version yields the correct results.

@martin-frbg
Collaborator Author

Finally got around to taking another look. It turned out the problem with 0.2.9-rc1 is specific to OpenMP: when Elk is compiled without the "-fopenmp" from its default make.inc settings, all its tests pass on Haswell with 0.2.9-rc1. Conversely, a -fopenmp build linked against 0.2.9-rc1 fails even on the Nehalem architecture, where 0.2.8 works well (provided that it was built with USE_THREAD=0, USE_OPENMP=1).
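
For reference, the OpenBLAS build settings quoted above can be written out as make invocations. This is a minimal sketch: `openblas_make_cmd` is a hypothetical helper that only prints the command line, and the TARGET values are assumed to match the architectures named in this thread.

```shell
#!/bin/sh
# Print the OpenBLAS make invocation for a given TARGET, using the
# USE_THREAD=0 / USE_OPENMP=1 settings quoted in the comment above.
openblas_make_cmd() {
    echo "make TARGET=$1 USE_THREAD=0 USE_OPENMP=1"
}

# Targets assumed from the architectures mentioned for the 0.2.8 builds.
for target in NEHALEM SANDYBRIDGE; do
    openblas_make_cmd "$target"
done
```

Building each OpenBLAS version with the same settings and varying only whether Elk gets "-fopenmp" keeps the comparison down to one variable.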

@martin-frbg
Collaborator Author

The problem apparently was introduced well before the Haswell branch was merged. Bisecting now.

@martin-frbg
Collaborator Author

dfd1064 is the first bad commit
commit dfd1064
Author: Zhang Xianyi
Date: Sat Nov 2 15:09:33 2013 +0800

refs #287. Don't enable OpenMP for netlib LAPACK sequential Fortran codes.
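
A bisection like this can be automated with `git bisect run`. The sketch below is not the exact procedure used here: the `v0.2.8` tag name is assumed for the last known-good release, and `check-elk.sh` is a hypothetical script that rebuilds OpenBLAS, relinks Elk, runs its tests, and exits 0 when they pass (non-zero marks the commit bad, 125 skips it).

```shell
#!/bin/sh
# Sketch of an automated bisection; ref names and the check script are assumptions.
GOOD_REF="${GOOD_REF:-v0.2.8}"     # assumed tag of the last known-good release
BAD_REF="${BAD_REF:-HEAD}"         # current git head, known to be bad
CHECK="${CHECK:-./check-elk.sh}"   # hypothetical: exit 0 = good, 1..124 = bad, 125 = skip

# Echo a command; execute it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

bisect_regression() {
    run git bisect start "$BAD_REF" "$GOOD_REF"   # bad revision first, then good
    run git bisect run "$CHECK"                   # reports the first bad commit
    run git bisect reset                          # return to the original checkout
}
```

`DRY_RUN=1 bisect_regression` prints the three git commands without touching the repository.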

@martin-frbg
Collaborator Author

Have confirmed now that removing the distinction between F(P)FLAGS and LAPACK_F(P)FLAGS introduced by the above change to Makefile.system fixes my problem also in current git head.

@xianyi
Copy link
Collaborator

xianyi commented Jan 23, 2014

@martin-frbg , Thank you for the investigation.

I added dfd1064 to fix the SEGFAULT with OpenMP on Windows.

@martin-frbg
Collaborator Author

Yes, I saw that but it was not clear to me if that was a real fix, and not just papering over a different problem.
If "SEGFAULT on Windows" trumps "wrong result on (at least) Linux", can the change be made "#ifdef WINDOWS", please?

@xianyi
Collaborator

xianyi commented Jan 24, 2014

Please try develop branch.
Thank you again.

@martin-frbg
Collaborator Author

Thank you. (Might it make sense to revisit #287 now that 0.2.9 contains a newer LAPACK ?)

@martin-frbg
Collaborator Author

Will see if I can get openblas&elk built on a windows/mingw system in the near future for additional insight.
