
v5+ corrupts data when using MPI IO indexed view #11917


Closed
chhu opened this issue Sep 8, 2023 · 7 comments


@chhu

chhu commented Sep 8, 2023

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Reproduced with v5.0.x branch and main. Works in v4.x and 3.x.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Build:

ompi_info
                 Package: Open MPI huettig@alkione Distribution
                Open MPI: 5.1.0a1
  Open MPI repo revision: v2.x-dev-10918-g9f82d1cfcb
   Open MPI release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 5.1.0a1
                  Prefix: /plp_scr1/utils/mpi_test/ompi_build
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: huettig
           Configured on: Mon Sep  4 14:04:42 UTC 2023
          Configure host: alkione
  Configure command line: '--prefix=/plp_scr1/utils/mpi_test/ompi_build'
                          '--with-cuda=/plp_scr1/utils/cuda'
                          '--with-flex=/flex'
                          '--with-ucx=/plp_scr1/utils/mpi_test/ucx_build'
                          '--enable-mca-no-build=btl-uct'
                          '--disable-man-pages'
                Built by: huettig
                Built on: Mo 4. Sep 14:16:36 UTC 2023
              Built host: alkione
              C bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the gfortran compiler and/or Open
                          MPI, does not support the following: array
                          subsections, direct passthru (where possible) to
                          underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /bin/gcc
  C compiler family name: GNU
      C compiler version: 13.2.0
            C++ compiler: g++
   C++ compiler absolute: /bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
          MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
 Fault Tolerance support: yes
          FT MPI support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
         MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.1.0)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.1.0)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.1.0)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.1.0)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.1.0)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.1.0)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.1.0)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.1.0)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.1.0)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.1.0)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v5.1.0)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.1.0)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.1.0)
             MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.1.0)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                 MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: cuda (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
                          v5.1.0)
                MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
                          v5.1.0)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                  MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.1.0)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.1.0)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v5.1.0)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.1.0)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.1.0)
                MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.1.0)
                 MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                 MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
                          v5.1.0)
                 MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                 MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                 MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.1.0)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.1.0)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v5.1.0)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)

Subs:

 22fe51cb7a961b6060fc5c48e659237cbe162566 3rd-party/openpmix (v1.1.3-3872-g22fe51cb)
 ece4f3c45a07a069e5b8f9c5e641613dfcaeffc3 3rd-party/prrte (psrvr-v2.0.0rc1-4638-gece4f3c45a)
 c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: RedHat 8
  • Computer hardware: AMD x64
  • Network type: Infiniband

Details of the problem

Our simulation uses two big indexed file types, one for scalar, one for vector. There are write and read operations for multiple scalars/vectors in a single file. Here are some details:

  • WRITE on the same file type works, verified.
  • READ reads corrupted data after using the vector mask and switching back to scalar. (heavily indexed, varying block/displacement sizes, all doubles)
  • Verified correct offsets, no stack/heap corruption.
  • OMPI versions < 5 work
  • Not file system dependent, tested on NFS4 and locally mounted FS. Same result.
  • No MPI error is returned
  • The corrupted data seems partially leaked from previous reads (just judging from the double values)
  • Running with one task and just one big block and zero displacement works

Sorry, I cannot provide demo code. However, I can relatively easily test/recompile this scenario from a different branch/version.
Best,
Christian

@edgargabriel
Member

edgargabriel commented Sep 8, 2023

@chhu thank you for the report, but I am afraid I will need a bit more information to identify the problem. This morning I re-ran our internal test suite as well as the hdf5 1.12.2 test suite, and all tests that were passing with 4.1.5 are also passing with the current main branch. (One test from the hdf5 test suite is failing, and I can try to look into that over the weekend, but I double-checked that the same test also fails with 4.1.5, so I don't think it's a regression.)

Is there any chance to create a small reproducer to debug the issue?

@edgargabriel
Member

Just for documentation purposes: the one test from the hdf5-1.12.2 test suite that is failing with Open MPI 4.1.5 also fails when using the romio321 component, so it doesn't seem to be ompio-specific.

In addition, the same test from the hdf5-1.14.2 test suite passes, so it was most likely a bug in the test itself, which is now fixed. So at this point, unless you can provide more information or a reproducer, I have nothing to look into.

@chhu
Author

chhu commented Sep 11, 2023

I understand that without reproducing code this is tough. Since it involves complex domain decomposition, it would be hard for me to strip it down, and the simulation is unfortunately not open source. In the meantime I tried to compile/reproduce everything on a different OS/kernel/environment (Debian, no CUDA, no InfiniBand) and unfortunately got the same corrupted results.

What I could do is share the code responsible for the reads with you; at least you would see the MPI calls involved and their order. I use the same MPI type for reading and writing; I guess this cannot cause problems?

I suspect a stack corruption on the MPI side, possibly even caused by bad parameters from my side. Is it possible to enable an MPI build with libasan (gcc -fsanitize=address)?
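(Editorial note: a common way to attempt such an instrumented build with GCC or Clang is to pass the sanitizer flags at configure time. This is a sketch, not an officially documented Open MPI build mode; the prefix path is a placeholder.)

```shell
# Hypothetical ASan-instrumented Open MPI build. -O1 -g keeps
# AddressSanitizer reports readable; -fno-omit-frame-pointer improves
# stack traces. The configure/make lines are shown commented out.
ASAN_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g -O1"

# ./configure --prefix="$HOME/ompi_asan" \
#             CFLAGS="$ASAN_FLAGS" CXXFLAGS="$ASAN_FLAGS" \
#             LDFLAGS="-fsanitize=address"
# make -j && make install

echo "$ASAN_FLAGS"
```

Note that the application itself must then also be compiled and linked with `-fsanitize=address` for reports to be meaningful.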

@edgargabriel
Member

@chhu can you please ping me by email so we can communicate directly? I am happy to keep things confidential. Note, however, that the parallel I/O part is not part of my current job description, so I am doing it on the side in the evenings and on weekends. (If you cannot find my non-professional email, please ping me on LinkedIn and I will send it to you in the chat.)

Do I understand your statement in the first paragraph correctly, that if you compile Open MPI without GPU support, your test passes? That would be interesting, because that is one area of the parallel I/O code that changed between 4.1.x and 5.0.x, so it could be a hint.

Regarding the address sanitizer, I have not used it with Open MPI yet. I have used valgrind with memcheck a few times, but it only makes sense if the issue is easily reproduced with few processes and in a reasonable amount of time (e.g., a few seconds); otherwise the output is overwhelming.

@chhu
Author

chhu commented Sep 11, 2023

Thanks, just connected via LinkedIn, hope that works.
Unfortunately the bug is independent of the GPU, and it has existed since the beginning of v5. The address sanitizer is surprisingly helpful: it slows the code down only by a factor of 3-5, and it stops and reports only on actual violations, with details on where they happened if the code is compiled with debug symbols.

As I mentioned before, the bug is quirky. It seems to fail only when the view is switched back from the vector field to the scalar field. Maybe I'll find some time to look at the OMPI sources and hunt the bug myself; it should be in MPI_File_read_all or MPI_File_set_view.
The status from MPI_File_read_all also returns error=0 on all ranks.

@edgargabriel
Member

@chhu has provided me with a reproducer directly, and I was able to identify the problem. As part of introducing support for file atomicity in 615f4ef we tried to make the data sieving routines more robust as well, and this introduced the issue identified in this ticket. Since support for file atomicity is not available in the 4.1.x series, the issue doesn't exist there. A patch is coming shortly.

Thank you!
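(The accounting error described above can be pictured with a tiny Python sketch. This is not the actual ompio code; names and sizes are made up. It shows a data-sieving read loop where the fix is to bound the final copy by the accumulated byte count, not by the last read's return value, which is 0 at end-of-file.)

```python
# Sketch of the bug class: a sieving loop reads a large contiguous window,
# then copies the wanted pieces out of a temporary buffer. If the copy
# length were taken from the *last* read's return value (0 at EOF) instead
# of the running total, nothing would be copied out at all.

def sieve_read(file_bytes, offset, window, wanted):
    """Read up to `window` bytes starting at `offset`; return `wanted` bytes."""
    buf = bytearray()
    total = 0
    while total < window:
        chunk = file_bytes[offset + total : offset + window]
        ret_code = len(chunk)      # 0 once we hit end-of-file
        if ret_code == 0:
            break                  # fixed logic: stop, don't retry past EOF
        buf += chunk
        total += ret_code
    # Correct: bound the copy with the accumulated `total`, not `ret_code`.
    return bytes(buf[: min(total, wanted)])

data = bytes(range(10))
print(sieve_read(data, 4, window=8, wanted=4))  # bytes 4..7 of the file
```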

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Sep 15, 2023
As part of introducing atomicity support for ompi v5.0, we also tried to improve the robustness of some file I/O routines. Unfortunately, this also introduced a bug: the ret_code returned by a function does not necessarily contain the number of bytes read or written, but could contain the last value (e.g. 0). That value was, however, used in a subsequent calculation, and we ended up not copying any data out of the temporary buffer used in the data sieving.

This commit also simplifies some of the logic in the while loop; there is no need to retry reading past the end of the file multiple times.

Fixes issue open-mpi#11917

Code was tested with the reproducer provided as part of the issue, our internal test suite, and the hdf5-1.14.2 test suite; all tests pass.

Signed-off-by: Edgar Gabriel <[email protected]>
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Sep 19, 2023
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Sep 19, 2023
(cherry picked from commit fb3b68f)
@janjust
Contributor

janjust commented Sep 19, 2023

Fixed with #11933.

@janjust janjust closed this as completed Sep 19, 2023
bosilca pushed a commit to bosilca/ompi that referenced this issue Feb 14, 2024