MPI_Win_create failure under 4.0.0 when UCX enabled #6201

Closed
AdamSimpson opened this issue Dec 17, 2018 · 15 comments

AdamSimpson commented Dec 17, 2018

What version of Open MPI are you using?

v4.0.0 release

UCX is at v1.4.0

Configured with ./configure --with-ucx

Please describe the system on which you are running

  • Operating system/version: Ubuntu/16.04, Ubuntu/18.04
  • Computer hardware: x86 CPU
  • Network type: single host
  • no IB, no xpmem, no knem

Details of the problem

Calls to MPI_Win_create fail with the following:

$ mpirun -np 2 ./a.out
[HOST] *** An error occurred in MPI_Win_create
[HOST] *** reported by process [858718209,1]
[HOST] *** on communicator MPI_COMM_WORLD
[HOST] *** MPI_ERR_WIN: invalid window
[HOST] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[HOST] ***    and potentially your MPI job)
[HOST] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[HOST] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I can get the test to run using the following variations:

$ mpirun -np 2 --mca pml ^ucx ./a.out
$ mpirun -np 2 --mca pml ob1 --mca btl self,vader ./a.out
$ mpirun -np 2 --mca btl ^vader ./a.out
$ OMPI_MCA_btl_vader_single_copy_mechanism=none mpirun -np 2 ./a.out
$ mpirun -np 2 --mca osc ^rdma ./a.out

Compiling Open MPI without UCX support also avoids the failure.

The tests above were all conducted with test.c, provided below.


test.c:

/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil ; -*- */
/*
 *
 *  (C) 2003 by Argonne National Laboratory.
 *      See COPYRIGHT in top-level directory.
 */
#include <mpi.h>
#include <stdio.h>

#define ELEM_SIZE 8

int main( int argc, char *argv[] )
{
    int     rank;
    int     errors = 0, all_errors = 0;
    int    *flavor, *model, flag;
    void   *buf;
    MPI_Win window;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /** Create using MPI_Win_create() **/

    if (rank > 0)
      MPI_Alloc_mem(rank*ELEM_SIZE, MPI_INFO_NULL, &buf);
    else
      buf = NULL;

    MPI_Win_create(buf, rank*ELEM_SIZE, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &window);
    MPI_Win_get_attr(window, MPI_WIN_CREATE_FLAVOR, &flavor, &flag);

    if (!flag) {
      printf("%d: MPI_Win_create - Error, no flavor\n", rank);
      errors++;
    } else if (*flavor != MPI_WIN_FLAVOR_CREATE) {
      printf("%d: MPI_Win_create - Error, bad flavor (%d)\n", rank, *flavor);
      errors++;
    }   

    MPI_Win_get_attr(window, MPI_WIN_MODEL, &model, &flag);

    if (!flag) {
      printf("%d: MPI_Win_create - Error, no model\n", rank);
      errors++;
    } else if ( ! (*model == MPI_WIN_SEPARATE || *model == MPI_WIN_UNIFIED) ) { 
      printf("%d: MPI_Win_create - Error, bad model (%d)\n", rank, *model);
      errors++;
    }   

    MPI_Win_free(&window);

    if (buf)
      MPI_Free_mem(buf);

    MPI_Finalize();
}
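A small variant of test.c (an untested sketch) that installs MPI_ERRORS_RETURN on MPI_COMM_WORLD can confirm that MPI_Win_create itself is the failing call, and print the error class and message instead of aborting the job:

```c
/* Sketch: same window setup as test.c, but with MPI_ERRORS_RETURN so the
 * error from MPI_Win_create is reported rather than fatal. */
#include <mpi.h>
#include <stdio.h>

#define ELEM_SIZE 8

int main(int argc, char *argv[])
{
    int rank, rc;
    void *buf = NULL;
    MPI_Win window;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return error codes to the caller instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank > 0)
        MPI_Alloc_mem(rank * ELEM_SIZE, MPI_INFO_NULL, &buf);

    rc = MPI_Win_create(buf, rank * ELEM_SIZE, 1, MPI_INFO_NULL,
                        MPI_COMM_WORLD, &window);
    if (rc != MPI_SUCCESS) {
        int eclass, len;
        char msg[MPI_MAX_ERROR_STRING];
        MPI_Error_class(rc, &eclass);
        MPI_Error_string(rc, msg, &len);
        printf("%d: MPI_Win_create failed: class %d (%s)\n", rank, eclass, msg);
    } else {
        MPI_Win_free(&window);
    }

    if (buf)
        MPI_Free_mem(buf);

    MPI_Finalize();
    return 0;
}
```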
jladd-mlnx (Member) commented:
@xinzhao3 please take a look.

hjelmn (Member) commented Dec 20, 2018

Can you run with mpirun --mca osc_base_verbose 100? This is really odd.

AdamSimpson (Author) commented Dec 20, 2018

Let me know if anything else might help track this issue down:

$ mpirun -np 2 --mca osc_base_verbose 100 ./a.out

[HOST] mca: base: components_register: registering framework osc components
[HOST] mca: base: components_register: found loaded component rdma
[HOST] mca: base: components_register: component rdma register function successful
[HOST] mca: base: components_register: found loaded component sm
[HOST] mca: base: components_register: component sm register function successful
[HOST] mca: base: components_register: found loaded component monitoring
[HOST] mca: base: components_register: component monitoring register function successful
[HOST] mca: base: components_register: found loaded component pt2pt
[HOST] mca: base: components_register: component pt2pt register function successful
[HOST] mca: base: components_register: found loaded component ucx
[HOST] mca: base: components_register: component ucx register function successful
[HOST] mca: base: components_open: opening osc components
[HOST] mca: base: components_open: found loaded component rdma
[HOST] mca: base: components_open: found loaded component sm
[HOST] mca: base: components_open: component sm open function successful
[HOST] mca: base: components_open: found loaded component monitoring
[HOST] mca: base: components_open: found loaded component pt2pt
[HOST] mca: base: components_open: found loaded component ucx
[HOST] mca: base: components_open: component ucx open function successful
[HOST] mca: base: components_register: registering framework osc components
[HOST] mca: base: components_register: found loaded component rdma
[HOST] mca: base: components_register: component rdma register function successful
[HOST] mca: base: components_register: found loaded component sm
[HOST] mca: base: components_register: component sm register function successful
[HOST] mca: base: components_register: found loaded component monitoring
[HOST] mca: base: components_register: component monitoring register function successful
[HOST] mca: base: components_register: found loaded component pt2pt
[HOST] mca: base: components_register: component pt2pt register function successful
[HOST] mca: base: components_register: found loaded component ucx
[HOST] mca: base: components_register: component ucx register function successful
[HOST] mca: base: components_open: opening osc components
[HOST] mca: base: components_open: found loaded component rdma
[HOST] mca: base: components_open: found loaded component sm
[HOST] mca: base: components_open: component sm open function successful
[HOST] mca: base: components_open: found loaded component monitoring
[HOST] mca: base: components_open: found loaded component pt2pt
[HOST] mca: base: components_open: found loaded component ucx
[HOST] mca: base: components_open: component ucx open function successful
[HOST] mca: base: close: unloading component monitoring
[HOST] mca: base: close: unloading component monitoring
[HOST] rdma component destroying window with id 3
[HOST] rdma component destroying window with id 3
[HOST] [[12763,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[HOST] *** An error occurred in MPI_Win_create
[HOST] *** reported by process [836435969,0]
[HOST] *** on communicator MPI_COMM_WORLD
[HOST] *** MPI_ERR_WIN: invalid window
[HOST] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[HOST] ***    and potentially your MPI job)

In case the pml and btl debug output is also useful:

$ mpirun -np 2 --mca pml_base_verbose 100 --mca btl_base_verbose 100 --mca osc_base_verbose 100 ./a.out
[HOST] pmix_mca_base_component_repository_open: unable to open mca_gds_ds21: /usr/local/lib/libmca_common_dstore.so.0: undefined symbol: OPAL_MCA_PMIX4X_pmix_pshmem (ignored)
[HOST] pmix_mca_base_component_repository_open: unable to open mca_gds_ds21: /usr/local/lib/libmca_common_dstore.so.0: undefined symbol: OPAL_MCA_PMIX4X_pmix_pshmem (ignored)
[HOST] pmix_mca_base_component_repository_open: unable to open mca_gds_ds21: /usr/local/lib/libmca_common_dstore.so.0: undefined symbol: OPAL_MCA_PMIX4X_pmix_pshmem (ignored)
[HOST] mca: base: components_register: registering framework btl components
[HOST] mca: base: components_register: found loaded component uct
[HOST] mca: base: components_register: registering framework btl components
[HOST] mca: base: components_register: found loaded component uct
[HOST] mca: base: components_register: component uct register function successful
[HOST] mca: base: components_register: component uct register function successful
[HOST] mca: base: components_register: found loaded component self
[HOST] mca: base: components_register: found loaded component self
[HOST] mca: base: components_register: component self register function successful
[HOST] mca: base: components_register: component self register function successful
[HOST] mca: base: components_register: found loaded component sm
[HOST] mca: base: components_register: found loaded component vader
[HOST] mca: base: components_register: found loaded component sm
[HOST] mca: base: components_register: found loaded component vader
[HOST] mca: base: components_register: component vader register function successful
[HOST] mca: base: components_register: component vader register function successful
[HOST] mca: base: components_register: found loaded component tcp
[HOST] mca: base: components_register: found loaded component tcp
[HOST] mca: base: components_register: component tcp register function successful
[HOST] mca: base: components_register: component tcp register function successful
[HOST] mca: base: components_open: opening btl components
[HOST] mca: base: components_open: found loaded component uct
[HOST] mca: base: components_open: opening btl components
[HOST] mca: base: components_open: found loaded component uct
[HOST] mca: base: components_open: component uct open function successful
[HOST] mca: base: components_open: found loaded component self
[HOST] mca: base: components_open: component self open function successful
[HOST] mca: base: components_open: found loaded component vader
[HOST] mca: base: components_open: component vader open function successful
[HOST] mca: base: components_open: found loaded component tcp
[HOST] mca: base: components_open: component tcp open function successful
[HOST] select: initializing btl component uct
[HOST] select: init of component uct returned failure
[HOST] mca: base: close: component uct closed
[HOST] mca: base: close: unloading component uct
[HOST] select: initializing btl component self
[HOST] select: init of component self returned success
[HOST] select: initializing btl component vader
[HOST] select: init of component vader returned success
[HOST] select: initializing btl component tcp
[HOST] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[HOST] btl: tcp: Found match: 127.0.0.1 (lo)
[HOST] btl:tcp: Attempting to bind to AF_INET port 1024
[HOST] btl:tcp: Successfully bound to AF_INET port 1024
[HOST] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[HOST] btl:tcp: examining interface enp129s0f0
[HOST] btl:tcp: using ipv6 interface enp129s0f0
[HOST] btl:tcp: examining interface docker0
[HOST] btl:tcp: using ipv6 interface docker0
[HOST] select: init of component tcp returned success
[HOST] mca: base: components_register: registering framework pml components
[HOST] mca: base: components_register: found loaded component v
[HOST] mca: base: components_register: component v register function successful
[HOST] mca: base: components_register: found loaded component cm
[HOST] mca: base: components_register: component cm register function successful
[HOST] mca: base: components_register: found loaded component ucx
[HOST] mca: base: components_register: component ucx register function successful
[HOST] mca: base: components_register: found loaded component monitoring
[HOST] mca: base: components_register: component monitoring register function successful
[HOST] mca: base: components_register: found loaded component ob1
[HOST] mca: base: components_register: component ob1 register function successful
[HOST] mca: base: components_open: opening pml components
[HOST] mca: base: components_open: found loaded component v
[HOST] mca: base: components_open: component v open function successful
[HOST] mca: base: components_open: found loaded component cm
[HOST] mca: base: close: component cm closed
[HOST] mca: base: close: unloading component cm
[HOST] mca: base: components_open: found loaded component ucx
[HOST] mca: base: components_open: component uct open function successful
[HOST] mca: base: components_open: found loaded component self
[HOST] mca: base: components_open: component self open function successful
[HOST] mca: base: components_open: found loaded component vader
[HOST] mca: base: components_open: component vader open function successful
[HOST] mca: base: components_open: found loaded component tcp
[HOST] mca: base: components_open: component tcp open function successful
[HOST] select: initializing btl component uct
[HOST] select: init of component uct returned failure
[HOST] mca: base: close: component uct closed
[HOST] mca: base: close: unloading component uct
[HOST] select: initializing btl component self
[HOST] select: init of component self returned success
[HOST] select: initializing btl component vader
[HOST] select: init of component vader returned success
[HOST] select: initializing btl component tcp
[HOST] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[HOST] btl: tcp: Found match: 127.0.0.1 (lo)
[HOST] btl:tcp: Attempting to bind to AF_INET port 1024
[HOST] btl:tcp: Attempting to bind to AF_INET port 1025
[HOST] btl:tcp: Successfully bound to AF_INET port 1025
[HOST] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[HOST] btl:tcp: examining interface enp129s0f0
[HOST] btl:tcp: using ipv6 interface enp129s0f0
[HOST] btl:tcp: examining interface docker0
[HOST] btl:tcp: using ipv6 interface docker0
[HOST] select: init of component tcp returned success
[HOST] mca: base: components_register: registering framework pml components
[HOST] mca: base: components_register: found loaded component v
[HOST] mca: base: components_register: component v register function successful
[HOST] mca: base: components_register: found loaded component cm
[HOST] mca: base: components_register: component cm register function successful
[HOST] mca: base: components_register: found loaded component ucx
[HOST] mca: base: components_register: component ucx register function successful
[HOST] mca: base: components_register: found loaded component monitoring
[HOST] mca: base: components_register: component monitoring register function successful
[HOST] mca: base: components_register: found loaded component ob1
[HOST] mca: base: components_register: component ob1 register function successful
[HOST] mca: base: components_open: opening pml components
[HOST] mca: base: components_open: found loaded component v
[HOST] mca: base: components_open: component v open function successful
[HOST] mca: base: components_open: found loaded component cm
[HOST] mca: base: close: component cm closed
[HOST] mca: base: close: unloading component cm
[HOST] mca: base: components_open: found loaded component ucx
[HOST] mca: base: components_open: component ucx open function successful
[HOST] mca: base: components_open: found loaded component monitoring
[HOST] mca: base: components_open: component monitoring open function successful
[HOST] mca: base: components_open: found loaded component ob1
[HOST] mca: base: components_open: component ob1 open function successful
[HOST] mca: base: components_register: registering framework osc components
[HOST] mca: base: components_register: found loaded component rdma
[HOST] mca: base: components_register: component rdma register function successful
[HOST] mca: base: components_register: found loaded component sm
[HOST] mca: base: components_register: component sm register function successful
[HOST] mca: base: components_register: found loaded component monitoring
[HOST] mca: base: components_register: component monitoring register function successful
[HOST] mca: base: components_register: found loaded component pt2pt
[HOST] mca: base: components_register: component pt2pt register function successful
[HOST] mca: base: components_register: found loaded component ucx
[HOST] mca: base: components_register: component ucx register function successful
[HOST] mca: base: components_open: opening osc components
[HOST] mca: base: components_open: found loaded component rdma
[HOST] mca: base: components_open: found loaded component sm
[HOST] mca: base: components_open: component sm open function successful
[HOST] mca: base: components_open: found loaded component monitoring
[HOST] mca: base: components_open: found loaded component pt2pt
[HOST] mca: base: components_open: found loaded component ucx
[HOST] mca: base: components_open: component ucx open function successful
[HOST] select: component v not in the include list
[HOST] select: initializing pml component ucx
[HOST] mca: base: components_open: component ucx open function successful
[HOST] mca: base: components_open: found loaded component monitoring
[HOST] mca: base: components_open: component monitoring open function successful
[HOST] mca: base: components_open: found loaded component ob1
[HOST] mca: base: components_open: component ob1 open function successful
[HOST] select: init returned priority 51
[HOST] select: component monitoring not in the include list
[HOST] select: initializing pml component ob1
[HOST] select: init returned priority 20
[HOST] selected ucx best priority 51
[HOST] select: component ob1 not selected / finalized
[HOST] select: component ucx selected
[HOST] mca: base: close: component v closed
[HOST] mca: base: close: unloading component v
[HOST] mca: base: close: component monitoring closed
[HOST] mca: base: close: unloading component monitoring
[HOST] mca: base: components_register: registering framework osc components
[HOST] mca: base: components_register: found loaded component rdma
[HOST] mca: base: close: component ob1 closed
[HOST] mca: base: close: unloading component ob1
[HOST] mca: base: components_register: component rdma register function successful
[HOST] mca: base: components_register: found loaded component sm
[HOST] mca: base: components_register: component sm register function successful
[HOST] mca: base: components_register: found loaded component monitoring
[HOST] mca: base: components_register: component monitoring register function successful
[HOST] mca: base: components_register: found loaded component pt2pt
[HOST] mca: base: components_register: component pt2pt register function successful
[HOST] mca: base: components_register: found loaded component ucx
[HOST] mca: base: components_register: component ucx register function successful
[HOST] mca: base: components_open: opening osc components
[HOST] mca: base: components_open: found loaded component rdma
[HOST] mca: base: components_open: found loaded component sm
[HOST] mca: base: components_open: component sm open function successful
[HOST] mca: base: components_open: found loaded component monitoring
[HOST] mca: base: components_open: found loaded component pt2pt
[HOST] mca: base: components_open: found loaded component ucx
[HOST] mca: base: components_open: component ucx open function successful
[HOST] select: component v not in the include list
[HOST] select: initializing pml component ucx
[HOST] select: init returned priority 51
[HOST] select: component monitoring not in the include list
[HOST] select: initializing pml component ob1
[HOST] select: init returned priority 20
[HOST] selected ucx best priority 51
[HOST] select: component ob1 not selected / finalized
[HOST] select: component ucx selected
[HOST] mca: base: close: component v closed
[HOST] mca: base: close: unloading component v
[HOST] mca: base: close: component monitoring closed
[HOST] mca: base: close: unloading component monitoring
[HOST] mca: base: close: component ob1 closed
[HOST] mca: base: close: unloading component ob1
[HOST] mca: base: close: unloading component monitoring
[HOST] mca: base: close: unloading component monitoring
[HOST] check:select: rank=0
[HOST] check:select: checking my pml ucx against rank=0 pml ucx
[HOST] mca: bml: Using self btl for send to [[12519,1],0] on node HOST
[HOST] mca: bml: Using vader btl for send to [[12519,1],0] on node HOST
[HOST] btl:tcp: path from 10.36.131.240 to 10.36.131.240: IPV4 PRIVATE SAME NETWORK
[HOST] btl:tcp: path from 10.36.131.240 to 172.17.0.1: IPV4 PRIVATE DIFFERENT NETWORK
[HOST] btl:tcp: path from 172.17.0.1 to 10.36.131.240: IPV4 PRIVATE DIFFERENT NETWORK
[HOST] btl:tcp: path from 172.17.0.1 to 172.17.0.1: IPV4 PRIVATE SAME NETWORK
[HOST] mca: bml: Using tcp btl for send to [[12519,1],1] on node HOST
[HOST] btl:tcp: path from 10.36.131.240 to 10.36.131.240: IPV4 PRIVATE SAME NETWORK
[HOST] btl:tcp: path from 10.36.131.240 to 172.17.0.1: IPV4 PRIVATE DIFFERENT NETWORK
[HOST] btl:tcp: path from 172.17.0.1 to 10.36.131.240: IPV4 PRIVATE DIFFERENT NETWORK
[HOST] btl:tcp: path from 172.17.0.1 to 172.17.0.1: IPV4 PRIVATE SAME NETWORK
[HOST] btl:tcp: path from 10.36.131.240 to 10.36.131.240: IPV4 PRIVATE SAME NETWORK
[HOST] btl:tcp: path from 10.36.131.240 to 172.17.0.1: IPV4 PRIVATE DIFFERENT NETWORK
[HOST] btl:tcp: path from 172.17.0.1 to 10.36.131.240: IPV4 PRIVATE DIFFERENT NETWORK
[HOST] btl:tcp: path from 172.17.0.1 to 172.17.0.1: IPV4 PRIVATE SAME NETWORK
[HOST] mca: bml: Using tcp btl for send to [[12519,1],1] on node HOST
[HOST] mca: bml: Using tcp btl for send to [[12519,1],0] on node HOST
[HOST] btl:tcp: path from 10.36.131.240 to 10.36.131.240: IPV4 PRIVATE SAME NETWORK
[HOST] btl:tcp: path from 10.36.131.240 to 172.17.0.1: IPV4 PRIVATE DIFFERENT NETWORK
[HOST] btl:tcp: path from 172.17.0.1 to 10.36.131.240: IPV4 PRIVATE DIFFERENT NETWORK
[HOST] btl:tcp: path from 172.17.0.1 to 172.17.0.1: IPV4 PRIVATE SAME NETWORK
[HOST] mca: bml: Using tcp btl for send to [[12519,1],0] on node HOST
[HOST] mca: bml: Using self btl for send to [[12519,1],1] on node HOST
[HOST] mca: bml: Using vader btl for send to [[12519,1],1] on node HOST
[HOST] rdma component destroying window with id 3
[HOST] rdma component destroying window with id 3
[HOST] *** An error occurred in MPI_Win_create
[HOST] *** reported by process [820445185,0]
[HOST] *** on communicator MPI_COMM_WORLD
[HOST] *** MPI_ERR_WIN: invalid window
[HOST] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[HOST] ***    and potentially your MPI job)
[HOST] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[HOST] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

hjelmn (Member) commented Dec 21, 2018

Might be worth building with --enable-debug. Something is going on with vader, and it's not clear what without the extra debugging.
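Something like the following (only --with-ucx is from the original report; adjust any other configure flags to match your existing build):

```shell
./configure --with-ucx --enable-debug
make -j && make install
```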

hjelmn (Member) commented Dec 21, 2018

Hmmm. I wonder if a fix is missing from v4.0.0. Can you try master?

AdamSimpson (Author) commented:
I see the same behavior in master. Enabling debug and running from master gives a bit more verbose output, although I'm not familiar enough with it to know if it's useful:

$ mpirun --mca osc_base_verbose 100 -np 2 ./a.out 
[Host] mca: base: components_register: registering framework osc components
[Host] mca: base: components_register: found loaded component rdma
[Host] mca: base: components_register: component rdma register function successful
[Host] mca: base: components_register: found loaded component sm
[Host] mca: base: components_register: component sm register function successful
[Host] mca: base: components_register: found loaded component monitoring
[Host] mca: base: components_register: component monitoring register function successful
[Host] mca: base: components_register: found loaded component pt2pt
[Host] mca: base: components_register: component pt2pt register function successful
[Host] mca: base: components_register: found loaded component ucx
[Host] mca: base: components_register: component ucx register function successful
[Host] mca: base: components_open: opening osc components
[Host] mca: base: components_open: found loaded component rdma
[Host] mca: base: components_open: found loaded component sm
[Host] mca: base: components_open: component sm open function successful
[Host] mca: base: components_open: found loaded component monitoring
[Host] mca: base: components_open: found loaded component pt2pt
[Host] mca: base: components_open: found loaded component ucx
[Host] mca: base: components_open: component ucx open function successful
[Host] mca: base: components_register: registering framework osc components
[Host] mca: base: components_register: found loaded component rdma
[Host] mca: base: components_register: component rdma register function successful
[Host] mca: base: components_register: found loaded component sm
[Host] mca: base: components_register: component sm register function successful
[Host] mca: base: components_register: found loaded component monitoring
[Host] mca: base: components_register: component monitoring register function successful
[Host] mca: base: components_register: found loaded component pt2pt
[Host] mca: base: components_register: component pt2pt register function successful
[Host] mca: base: components_register: found loaded component ucx
[Host] mca: base: components_register: component ucx register function successful
[Host] mca: base: components_open: opening osc components
[Host] mca: base: components_open: found loaded component rdma
[Host] mca: base: components_open: found loaded component sm
[Host] mca: base: components_open: component sm open function successful
[Host] mca: base: components_open: found loaded component monitoring
[Host] mca: base: components_open: found loaded component pt2pt
[Host] mca: base: components_open: found loaded component ucx
[Host] mca: base: components_open: component ucx open function successful
[Host] mca: base: close: unloading component monitoring
[Host] mca: base: close: unloading component monitoring
[Host] selected btl: vader
[Host] selected btl: vader
[Host] creating osc/rdma window of flavor 1 with id 3
[Host] selected btl: vader
[Host] creating osc/rdma window of flavor 1 with id 3
[Host] selected btl: vader
[Host] allocating shared internal state
[Host] allocating shared internal state
[Host] failed to allocate internal state
[Host] rdma component destroying window with id 3
[Host] failed to allocate internal state
[Host] rdma component destroying window with id 3
[Host] *** An error occurred in MPI_Win_create
[Host] *** reported by process [1818296321,1]
[Host] *** on communicator MPI_COMM_WORLD
[Host] *** MPI_ERR_WIN: invalid window
[Host] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Host] ***    and potentially your MPI job)
[Host] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[Host] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

hjelmn (Member) commented Dec 21, 2018

Ok. That helps. I will take a look in the morning.

AdamSimpson (Author) commented:
Here's a bit more debug info that might help:

$ mpirun --allow-run-as-root --mca osc_base_verbose 100 --mca osc_rdma_verbose 100 --mca shmem_base_verbose 100 -np 2 ./a.out 
[HOST:20055] mca: base: components_register: registering framework shmem components
[HOST:20055] mca: base: components_register: found loaded component mmap
[HOST:20055] mca: base: components_register: component mmap register function successful
[HOST:20055] mca: base: components_register: found loaded component posix
[HOST:20055] mca: base: components_register: component posix register function successful
[HOST:20055] mca: base: components_register: found loaded component sysv
[HOST:20055] mca: base: components_register: component sysv register function successful
[HOST:20055] mca: base: components_open: opening shmem components
[HOST:20055] mca: base: components_open: found loaded component mmap
[HOST:20055] mca: base: components_open: component mmap open function successful
[HOST:20055] mca: base: components_open: found loaded component posix
[HOST:20055] mca: base: components_open: component posix open function successful
[HOST:20055] mca: base: components_open: found loaded component sysv
[HOST:20055] mca: base: components_open: component sysv open function successful
[HOST:20055] shmem: base: runtime_query: Auto-selecting shmem components
[HOST:20055] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[HOST:20055] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[HOST:20055] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[HOST:20055] shmem: posix: runtime_query: NO HINT PROVIDED:starting run-time test...
[HOST:20055] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[HOST:20055] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[HOST:20055] shmem: sysv: runtime_query: NO HINT PROVIDED:starting run-time test...
[HOST:20055] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[HOST:20055] shmem: base: runtime_query: (shmem) Selected component [mmap]
[HOST:20055] mca: base: close: unloading component posix
[HOST:20055] mca: base: close: unloading component sysv
[HOST:20055] shmem: base: best_runnable_component_name: Searching for best runnable component.
[HOST:20055] shmem: base: best_runnable_component_name: Found best runnable component: (mmap).
[HOST:20060] mca: base: components_register: registering framework shmem components
[HOST:20060] mca: base: components_register: found loaded component mmap
[HOST:20060] mca: base: components_register: component mmap register function successful
[HOST:20060] mca: base: components_register: found loaded component posix
[HOST:20060] mca: base: components_register: component posix register function successful
[HOST:20060] mca: base: components_register: found loaded component sysv
[HOST:20060] mca: base: components_register: component sysv register function successful
[HOST:20060] mca: base: components_open: opening shmem components
[HOST:20060] mca: base: components_open: found loaded component mmap
[HOST:20060] mca: base: components_open: component mmap open function successful
[HOST:20060] mca: base: components_open: found loaded component posix
[HOST:20060] mca: base: components_open: component posix open function successful
[HOST:20060] mca: base: components_open: found loaded component sysv
[HOST:20060] mca: base: components_open: component sysv open function successful
[HOST:20060] shmem: base: runtime_query: Auto-selecting shmem components
[HOST:20060] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[HOST:20060] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[HOST:20060] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[HOST:20060] shmem: posix: runtime_query: NO HINT PROVIDED:starting run-time test...
[HOST:20060] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[HOST:20060] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[HOST:20060] shmem: sysv: runtime_query: NO HINT PROVIDED:starting run-time test...
[HOST:20060] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[HOST:20060] shmem: base: runtime_query: (shmem) Selected component [mmap]
[HOST:20060] mca: base: close: unloading component posix
[HOST:20060] mca: base: close: unloading component sysv
[HOST:20061] mca: base: components_register: registering framework shmem components
[HOST:20061] mca: base: components_register: found loaded component mmap
[HOST:20061] mca: base: components_register: component mmap register function successful
[HOST:20061] mca: base: components_register: found loaded component posix
[HOST:20061] mca: base: components_register: component posix register function successful
[HOST:20061] mca: base: components_register: found loaded component sysv
[HOST:20061] mca: base: components_register: component sysv register function successful
[HOST:20061] mca: base: components_open: opening shmem components
[HOST:20061] mca: base: components_open: found loaded component mmap
[HOST:20061] mca: base: components_open: component mmap open function successful
[HOST:20061] mca: base: components_open: found loaded component posix
[HOST:20061] mca: base: components_open: component posix open function successful
[HOST:20061] mca: base: components_open: found loaded component sysv
[HOST:20061] mca: base: components_open: component sysv open function successful
[HOST:20061] shmem: base: runtime_query: Auto-selecting shmem components
[HOST:20061] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[HOST:20061] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[HOST:20061] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[HOST:20061] shmem: posix: runtime_query: NO HINT PROVIDED:starting run-time test...
[HOST:20061] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[HOST:20061] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[HOST:20061] shmem: sysv: runtime_query: NO HINT PROVIDED:starting run-time test...
[HOST:20061] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[HOST:20061] shmem: base: runtime_query: (shmem) Selected component [mmap]
[HOST:20061] mca: base: close: unloading component posix
[HOST:20061] mca: base: close: unloading component sysv
[HOST:20060] shmem: mmap: shmem_ds_resetting
[HOST:20060] shmem: mmap: backing store base directory: /dev/shm/vader_segment.HOST.6f7d0001.0
[HOST:20060] shmem: mmap: create successful (id: 15, size: 4194312, name: /dev/shm/vader_segment.HOST.6f7d0001.0)
[HOST:20060] shmem: mmap: attach successful (id: 15, size: 4194312, name: /dev/shm/vader_segment.HOST.6f7d0001.0)
[HOST:20061] shmem: mmap: shmem_ds_resetting
[HOST:20061] shmem: mmap: backing store base directory: /dev/shm/vader_segment.HOST.6f7d0001.1
[HOST:20061] shmem: mmap: create successful (id: 15, size: 4194312, name: /dev/shm/vader_segment.HOST.6f7d0001.1)
[HOST:20061] shmem: mmap: attach successful (id: 15, size: 4194312, name: /dev/shm/vader_segment.HOST.6f7d0001.1)
[HOST:20060] mca: base: components_register: registering framework osc components
[HOST:20060] mca: base: components_register: found loaded component rdma
[HOST:20060] mca: base: components_register: component rdma register function successful
[HOST:20060] mca: base: components_register: found loaded component sm
[HOST:20060] mca: base: components_register: component sm register function successful
[HOST:20060] mca: base: components_register: found loaded component monitoring
[HOST:20060] mca: base: components_register: component monitoring register function successful
[HOST:20060] mca: base: components_register: found loaded component pt2pt
[HOST:20060] mca: base: components_register: component pt2pt register function successful
[HOST:20060] mca: base: components_register: found loaded component ucx
[HOST:20060] mca: base: components_register: component ucx register function successful
[HOST:20060] mca: base: components_open: opening osc components
[HOST:20060] mca: base: components_open: found loaded component rdma
[HOST:20060] mca: base: components_open: found loaded component sm
[HOST:20060] mca: base: components_open: component sm open function successful
[HOST:20060] mca: base: components_open: found loaded component monitoring
[HOST:20060] mca: base: components_open: found loaded component pt2pt
[HOST:20060] mca: base: components_open: found loaded component ucx
[HOST:20060] mca: base: components_open: component ucx open function successful
[HOST:20061] mca: base: components_register: registering framework osc components
[HOST:20061] mca: base: components_register: found loaded component rdma
[HOST:20061] mca: base: components_register: component rdma register function successful
[HOST:20061] mca: base: components_register: found loaded component sm
[HOST:20061] mca: base: components_register: component sm register function successful
[HOST:20061] mca: base: components_register: found loaded component monitoring
[HOST:20061] mca: base: components_register: component monitoring register function successful
[HOST:20061] mca: base: components_register: found loaded component pt2pt
[HOST:20061] mca: base: components_register: component pt2pt register function successful
[HOST:20061] mca: base: components_register: found loaded component ucx
[HOST:20061] mca: base: components_register: component ucx register function successful
[HOST:20061] mca: base: components_open: opening osc components
[HOST:20061] mca: base: components_open: found loaded component rdma
[HOST:20061] mca: base: components_open: found loaded component sm
[HOST:20061] mca: base: components_open: component sm open function successful
[HOST:20061] mca: base: components_open: found loaded component monitoring
[HOST:20061] mca: base: components_open: found loaded component pt2pt
[HOST:20061] mca: base: components_open: found loaded component ucx
[HOST:20061] mca: base: components_open: component ucx open function successful
[HOST:20060] mca: base: close: unloading component monitoring
[HOST:20061] mca: base: close: unloading component monitoring
[HOST:20060] selected btl: vader
[HOST:20061] shmem: mmap: attach successful (id: 15, size: 4194312, name: /dev/shm/vader_segment.HOST.6f7d0001.1)
[HOST:20061] selected btl: vader
[HOST:20060] creating osc/rdma window of flavor 1 with id 3
[HOST:20060] selected btl: vader
[HOST:20061] creating osc/rdma window of flavor 1 with id 3
[HOST:20061] selected btl: vader
[HOST:20060] allocating shared internal state
[HOST:20060] shmem: mmap: shmem_ds_resetting
[HOST:20060] shmem: mmap: backing store base directory: /dev/shm/osc_rdma.HOST.6f7d0001.3
[HOST:20061] allocating shared internal state
[HOST:20060] shmem: mmap: create successful (id: 25, size: 744, name: /dev/shm/osc_rdma.HOST.6f7d0001.3)
[HOST:20060] shmem: mmap: attach successful (id: 25, size: 744, name: /dev/shm/osc_rdma.HOST.6f7d0001.3)
[HOST:20061] shmem: mmap: attach successful (id: 25, size: 744, name: /dev/shm/osc_rdma.HOST.6f7d0001.3)
[HOST:20060] shmem: mmap: unlinking(id: 25, size: 744, name: /dev/shm/osc_rdma.HOST.6f7d0001.3)
[HOST:20061] failed to allocate internal state
[HOST:20061] rdma component destroying window with id 3
[HOST:20060] failed to allocate internal state
[HOST:20060] rdma component destroying window with id 3
[HOST:20061] shmem: mmap: detaching (id: 25, size: 744, name: /dev/shm/osc_rdma.HOST.6f7d0001.3)
[HOST:20060] shmem: mmap: detaching (id: -1, size: 744, name: /dev/shm/osc_rdma.HOST.6f7d0001.3)
[HOST:20061] shmem: mmap: shmem_ds_resetting
[HOST:20060] shmem: mmap: shmem_ds_resetting
[HOST:20060] *** An error occurred in MPI_Win_create
[HOST:20060] *** reported by process [1870462977,0]
[HOST:20060] *** on communicator MPI_COMM_WORLD
[HOST:20060] *** MPI_ERR_WIN: invalid window
[HOST:20060] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[HOST:20060] ***    and potentially your MPI job)
[HOST:20055] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[HOST:20055] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[HOST:20055] mca: base: close: component mmap closed
[HOST:20055] mca: base: close: unloading component mmap

hjelmn (Member) commented Dec 21, 2018

Interesting. This looks like an edge-case initialization failure in osc/rdma. I should have it fixed today.

hjelmn (Member) commented Jan 7, 2019

Couldn't work on this over the break. A combination of bad company policy and VMware suckage. I should get it fixed today.

hjelmn (Member) commented Jan 7, 2019

Ok, I see what is happening. Because pml/ucx is in use, btl/vader is not set up properly. It should be easy enough to fix.
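
For context, a minimal reproducer for this class of failure is sketched below. The test.c attached at the top of the issue is truncated, so this is a generic MPI_Win_create-over-a-heap-buffer pattern (buffer size and layout are assumptions), not the reporter's exact source:

```c
/* Hypothetical minimal reproducer; the original test.c is truncated above,
 * so this sketches the usual MPI_Win_create pattern, not the exact test. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t count = 1024;
    int *buf = malloc(count * sizeof(int));

    MPI_Win win;
    /* On the affected 4.0.0 + UCX builds, this call aborts with
     * MPI_ERR_WIN because osc/rdma fails to allocate its shared
     * internal state (see the "failed to allocate internal state"
     * lines in the log above). */
    MPI_Win_create(buf, count * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_free(&win);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Build with `mpicc` and run with `mpirun -np 2 ./a.out`; on a fixed build the program should exit cleanly.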

hjelmn added a commit to hjelmn/ompi that referenced this issue Jan 7, 2019
This commit fixes a bug where add_procs can incorrectly return an
error when going through the dynamic add_procs path. This doesn't
happen normally, only when pml/ob1 is not in use.

References open-mpi#6201

Signed-off-by: Nathan Hjelm <[email protected]>
hjelmn added a commit to hjelmn/ompi that referenced this issue Jan 7, 2019
This commit fixes a bug where add_procs can incorrectly return an
error when going through the dynamic add_procs path. This doesn't
happen normally, only when pml/ob1 is not in use.

References open-mpi#6201

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 30b8336)
AdamSimpson (Author) commented

Thanks for getting this fixed!

We've tested #6249 against some production applications and it seems to fix the issues we were originally seeing.

hjelmn closed this as completed Jan 16, 2019
hppritcha pushed a commit to hppritcha/ompi that referenced this issue Mar 27, 2019
This commit fixes a bug where add_procs can incorrectly return an
error when going through the dynamic add_procs path. This doesn't
happen normally, only when pml/ob1 is not in use.

References open-mpi#6201

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 30b8336)
liaochenlanruo commented

echo 0 > /proc/sys/kernel/yama/ptrace_scope

nitinpatil1985 commented

I am running Open MPI 2.1.6 with UCX and I am getting the following:

[r2i0n6:342074] *** An error occurred in MPI_Win_create
[r2i0n6:342074] *** reported by process [3024158721,403]
[r2i0n6:342074] *** on communicator MPI COMMUNICATOR 11 DUP FROM 9
[r2i0n6:342074] *** MPI_ERR_WIN: invalid window
[r2i0n6:342074] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[r2i0n6:342074] *** and potentially your MPI job)

Any solution for this?

jsquyres (Member) commented

Please upgrade to a more recent version of Open MPI (e.g., 4.0.1). If you are still having problems, please open a new issue. Thanks.
