
v5+ corrupts data when using MPI IO indexed view #11917


Closed
chhu opened this issue Sep 8, 2023 · 7 comments


@chhu

chhu commented Sep 8, 2023

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Reproduced with v5.0.x branch and main. Works in v4.x and 3.x.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Build:

ompi_info
                 Package: Open MPI huettig@alkione Distribution
                Open MPI: 5.1.0a1
  Open MPI repo revision: v2.x-dev-10918-g9f82d1cfcb
   Open MPI release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 5.1.0a1
                  Prefix: /plp_scr1/utils/mpi_test/ompi_build
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: huettig
           Configured on: Mon Sep  4 14:04:42 UTC 2023
          Configure host: alkione
  Configure command line: '--prefix=/plp_scr1/utils/mpi_test/ompi_build'
                          '--with-cuda=/plp_scr1/utils/cuda'
                          '--with-flex=/flex'
                          '--with-ucx=/plp_scr1/utils/mpi_test/ucx_build'
                          '--enable-mca-no-build=btl-uct'
                          '--disable-man-pages'
                Built by: huettig
                Built on: Mo 4. Sep 14:16:36 UTC 2023
              Built host: alkione
              C bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the gfortran compiler and/or Open
                          MPI, does not support the following: array
                          subsections, direct passthru (where possible) to
                          underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /bin/gcc
  C compiler family name: GNU
      C compiler version: 13.2.0
            C++ compiler: g++
   C++ compiler absolute: /bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
          MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
 Fault Tolerance support: yes
          FT MPI support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
         MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.1.0)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.1.0)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.1.0)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.1.0)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.1.0)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.1.0)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.1.0)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.1.0)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.1.0)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.1.0)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v5.1.0)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.1.0)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.1.0)
             MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.1.0)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                 MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: cuda (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
                          v5.1.0)
                MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
                          v5.1.0)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                  MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.1.0)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.1.0)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v5.1.0)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.1.0)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.1.0)
                MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.1.0)
                 MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                 MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
                          v5.1.0)
                 MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                 MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                 MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.1.0)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.1.0)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v5.1.0)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)

Subs:

 22fe51cb7a961b6060fc5c48e659237cbe162566 3rd-party/openpmix (v1.1.3-3872-g22fe51cb)
 ece4f3c45a07a069e5b8f9c5e641613dfcaeffc3 3rd-party/prrte (psrvr-v2.0.0rc1-4638-gece4f3c45a)
 c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: RedHat 8
  • Computer hardware: AMD x64
  • Network type: Infiniband

Details of the problem

Our simulation uses two big indexed file types, one for scalar, one for vector. There are write and read operations for multiple scalars/vectors in a single file. Here are some details:

  • WRITE on the same file type works, verified.
  • READ reads corrupted data after using the vector mask and switching back to scalar. (heavily indexed, varying block/displacement sizes, all doubles)
  • Verified correct offsets, no stack/heap corruption.
  • OMPI versions < 5 work
  • Not file system dependent, tested on NFS4 and locally mounted FS. Same result.
  • No MPI error is returned
  • The corrupted data seems partially leaked from previous reads (just judging from the double values)
  • Running with one task and just one big block and zero displacement works

Sorry, I cannot provide demo code. However, I can relatively easily test/recompile this scenario from a different branch/version.
Best,
Christian

@edgargabriel
Member

edgargabriel commented Sep 8, 2023

@chhu thank you for the report, but I am afraid I will need a bit more information to identify the problem. This morning I re-ran our internal test suite as well as the hdf5 1.12.2 test suite, and all tests that were passing with 4.1.5 are also passing with the current main branch. (One test from the hdf5 test suite is failing, and I can try to look into that over the weekend, but I double-checked that the same test also fails with 4.1.5, so I don't think it's a regression.)

Is there any chance to create a small reproducer to debug the issue?

@edgargabriel
Member

Just for documentation purposes: the one test from the hdf5-1.12.2 test suite that is failing with Open MPI 4.1.5 also fails when using the romio321 component, so it doesn't seem to be ompio-specific.

In addition, the same test from the hdf5-1.14.2 test suite passes, so it was most likely a bug in the test itself, which is now fixed. So at this point, unless you can provide more information or a reproducer, I have nothing to look into.

@chhu
Author

chhu commented Sep 11, 2023

I understand that without reproducing code this is tough. Since it involves complex domain decomposition, it would be hard for me to strip it down, and the simulation is unfortunately not open source. In the meantime I tried to compile/reproduce everything on a different OS/kernel/environment (Debian, no CUDA, no InfiniBand) and unfortunately got the same corrupted results.

What I could do is share the code responsible for the reads with you; at least you would see the MPI calls involved and their order. I use the same MPI type for reading and writing; I guess this cannot cause problems?

I suspect a stack corruption on the MPI side, possibly even caused by bad parameters from my side. Is it possible to enable an MPI build with libasan (gcc -fsanitize=address)?
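(Editorial note: a common way to attempt such an instrumented build with GCC or Clang is to pass the sanitizer flags at configure time. This is a sketch, not an officially documented Open MPI build mode; the prefix path is a placeholder.)

```shell
# Hypothetical ASan-instrumented Open MPI build. -O1 -g keeps
# AddressSanitizer reports readable; -fno-omit-frame-pointer improves
# stack traces. The configure/make lines are shown commented out.
ASAN_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g -O1"

# ./configure --prefix="$HOME/ompi_asan" \
#             CFLAGS="$ASAN_FLAGS" CXXFLAGS="$ASAN_FLAGS" \
#             LDFLAGS="-fsanitize=address"
# make -j && make install

echo "$ASAN_FLAGS"
```

Note that the application itself must then also be compiled and linked with `-fsanitize=address` for reports to be meaningful.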

@edgargabriel
Member

@chhu can you please ping me by email so we can communicate directly? I am happy to keep things confidential. Note, however, that the parallel I/O part is not part of my current job description, so I am doing it on the side in the evenings and on weekends. (If you cannot find my non-professional email, please ping me on LinkedIn and I will send it to you in the chat.)

Do I understand your statement in the first paragraph correctly, that if you compile Open MPI without GPU support, your test passes? That would be interesting, because that is one area of the parallel I/O code that changed between 4.1.x and 5.0.x, so it could be a hint.

Regarding the address sanitizer, I have not used it with Open MPI yet. I have used valgrind with memcheck a few times, but it only makes sense if the issue is easily reproduced with few processes and in a reasonable amount of time (e.g., a few seconds); otherwise the output is overwhelming.

@chhu
Author

chhu commented Sep 11, 2023

Thanks, just connected via LinkedIn, hope that works.
Unfortunately the bug is independent of the GPU, and it has existed since the beginning of v5. The address sanitizer is surprisingly helpful: it slows the code down only by a factor of 3-5, and it stops and reports only on actual violations, with details on where they happened if the code is compiled with debug symbols.

As I mentioned before, the bug is quirky. It seems to fail only when the view is switched back from the vector field to the scalar field. Maybe I'll find some time to look at the OMPI sources and hunt the bug myself; it should be in MPI_File_read_all or MPI_File_set_view.
The status from MPI_File_read_all also returns error=0 on all ranks.

@edgargabriel
Member

@chhu has provided me with a reproducer directly, and I was able to identify the problem. As part of introducing support for file atomicity in 615f4ef we tried to make the data sieving routines more robust as well, and this introduced the issue identified in this ticket. Since support for file atomicity is not available in the 4.1.x series, the issue doesn't exist there. A patch is coming shortly.

Thank you!
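(The accounting error described above can be pictured with a tiny Python sketch. This is not the actual ompio code; names and sizes are made up. It shows a data-sieving read loop where the fix is to bound the final copy by the accumulated byte count, not by the last read's return value, which is 0 at end-of-file.)

```python
# Sketch of the bug class: a sieving loop reads a large contiguous window,
# then copies the wanted pieces out of a temporary buffer. If the copy
# length were taken from the *last* read's return value (0 at EOF) instead
# of the running total, nothing would be copied out at all.

def sieve_read(file_bytes, offset, window, wanted):
    """Read up to `window` bytes starting at `offset`; return `wanted` bytes."""
    buf = bytearray()
    total = 0
    while total < window:
        chunk = file_bytes[offset + total : offset + window]
        ret_code = len(chunk)      # 0 once we hit end-of-file
        if ret_code == 0:
            break                  # fixed logic: stop, don't retry past EOF
        buf += chunk
        total += ret_code
    # Correct: bound the copy with the accumulated `total`, not `ret_code`.
    return bytes(buf[: min(total, wanted)])

data = bytes(range(10))
print(sieve_read(data, 4, window=8, wanted=4))  # bytes 4..7 of the file
```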

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Sep 15, 2023
As part of introducing atomicity support for ompi v5.0, we also tried to improve the robustness of some file I/O routines. Unfortunately, this also introduced a bug: the ret_code returned by a function does not necessarily contain the number of bytes read or written, but could contain the last value (e.g. 0). That value was, however, used in a subsequent calculation, and we ended up not copying any data out of the temporary buffer used in the data sieving.

This commit also simplifies some of the logic in the while loop; there is no need to retry reading past the end of the file multiple times.

Fixes issue open-mpi#11917

Code was tested with the reproducer provided as part of the issue, our internal test suite, and the hdf5-1.14.2 test suite; all tests pass.

Signed-off-by: Edgar Gabriel <[email protected]>
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Sep 19, 2023
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Sep 19, 2023
(cherry picked from commit fb3b68f)
@janjust
Contributor

janjust commented Sep 19, 2023

Fixed with #11933.

@janjust janjust closed this as completed Sep 19, 2023
bosilca pushed a commit to bosilca/ompi that referenced this issue Feb 14, 2024