SIGFPE deep in MPI_INIT() #13290

Closed
mankoff opened this issue Jun 4, 2025 · 2 comments

mankoff commented Jun 4, 2025

I've just set up a new machine with Debian 13 Trixie, which includes OpenMPI 5.0.7-1 and gfortran 14.2.0. Hardware is an AMD Ryzen 7840U.

I'm developing the NASA GISS ModelE GCM, and the line with call MPI_INIT(rc) is causing a floating-point exception. The backtrace is below; frame 18, at model/MPI_Support/dist_grid_mod.F90:277, is the call MPI_INIT(rc) line.

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x1555550232ba in ???
#1  0x155555022375 in ???
#2  0x155554d59def in ???
#3  0x155554408b43 in ???
#4  0x1555543c1892 in ???
#5  0x15555439cec4 in ???
#6  0x1555554fcc61 in ???
#7  0x1555542bd03b in ???
#8  0x1555542af328 in ???
#9  0x155553850789 in ???
#10  0x155553851163 in ???
#11  0x15555385b2d9 in ???
#12  0x15555469741b in ???
#13  0x15555469b429 in ???
#14  0x15555469c187 in ???
#15  0x15555469371f in ???
#16  0x1555546c449e in ???
#17  0x15555542bbc9 in ???
#18  0x5555564fb94d in __dist_grid_mod_MOD_init_app
        at model/MPI_Support/dist_grid_mod.F90:277
#19  0x5555557cd185 in initializemodele
        at model/MODELE.f:588
#20  0x5555557ca97c in giss_modele_
        at model/MODELE.f:234
#21  0x5555557c4971 in modele_maindriver_
        at model/MODELE_DRV.f:27
#22  0x555555560b46 in MAIN__
        at model/main.F90:2
#23  0x555555560b96 in main
        at model/main.F90:3
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 500573 on node fw13 exited on
signal 8 (Floating point exception).
--------------------------------------------------------------------------
[Thread 0x1555542ff6c0 (LWP 500572) exited]
[Thread 0x1555545006c0 (LWP 500571) exited]
[Inferior 1 (process 500568) exited with code 0210]
(gdb) 

If I turn off FPE trapping around that line, the model runs:

+     use ieee_exceptions, only: ieee_divide_by_zero, ieee_invalid, ieee_overflow, ieee_set_halting_mode

...

+   ! Disable FPE trapping before MPI_Init
+   call ieee_set_halting_mode(ieee_divide_by_zero, .false.)
+   call ieee_set_halting_mode(ieee_invalid, .false.)
+   call ieee_set_halting_mode(ieee_overflow, .false.)
+
    call MPI_INIT(rc)
    call setCommunicator(MPI_COMM_WORLD)
    call MPI_COMM_SIZE(COMMUNICATOR, NPES_WORLD, rc)
    call MPI_COMM_RANK(COMMUNICATOR, rank, rc)
+
+   ! Re-enable FPE trapping
+   call ieee_set_halting_mode(ieee_divide_by_zero, .true.)
+   call ieee_set_halting_mode(ieee_invalid, .true.)
+   call ieee_set_halting_mode(ieee_overflow, .true.)
+
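
For anyone attempting a reproducer, the relevant pattern is just arming the IEEE halting modes before MPI_Init. Below is an untested standalone sketch of that pattern (the file and program names are mine, not from the model):

    ! mpi_fpe_trap.f90 -- hypothetical sketch: arm IEEE halting modes,
    ! then call MPI_Init, where the SIGFPE reportedly fires.
    program mpi_fpe_trap
      use mpi
      use, intrinsic :: ieee_exceptions, only: ieee_divide_by_zero, &
           ieee_invalid, ieee_overflow, ieee_set_halting_mode
      implicit none
      integer :: rc
      ! Halt on the same exception classes trapped in the model
      call ieee_set_halting_mode(ieee_divide_by_zero, .true.)
      call ieee_set_halting_mode(ieee_invalid, .true.)
      call ieee_set_halting_mode(ieee_overflow, .true.)
      call mpi_init(rc)     ! trap would fire here if the pattern reproduces
      call mpi_finalize(rc)
    end program mpi_fpe_trap

Built and run with something like mpifort -g mpi_fpe_trap.f90 -o mpi_fpe_trap && mpirun -n 1 ./mpi_fpe_trap.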

This system combines new hardware, a new OS, and updated gfortran and OpenMPI. To isolate the issue, I've done the following:

  • Tested on this new hardware with the old dev environment in Docker (Debian 12 Bookworm, gfortran 12, OpenMPI 4.something). No issue -> not hardware.
  • Tested with gfortran-12 on this OS, installed with apt install gfortran-12 and the Makefile adjusted accordingly. I assume this uses the same system-installed OpenMPI 5.x (the check sketched after this list would confirm). Issue exists -> not gfortran.
  • Disabling FPE trapping just for the MPI_INIT line above suggests this may be related to the new OpenMPI.
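
To verify the assumption about which MPI stack is actually in use, the standard Open MPI commands below should suffice (the binary name here is illustrative, not the model's actual executable name):

    $ mpirun --version               # Open MPI version used at launch
    $ mpifort --showme:link          # libmpi the Fortran wrapper links against
    $ ldd ./modelexe | grep -i mpi   # MPI library the built binary resolves to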

I'm sorry, but I am unable to create an MWE. If someone does want to test this, though, I can help set up a dev environment. The latest GCM code is the last link at https://simplex.giss.nasa.gov/snapshots/, and the system can run in Docker (see https://github.com/nasa-giss/docker).

mankoff commented Jun 4, 2025

Found MWE. Opening new issue.

mankoff closed this as completed Jun 4, 2025
mankoff commented Jun 4, 2025

See #13291
