problem starting an OMPI job in a mixed BE/LE cluster #639

Closed
larrystevenwise opened this issue Jun 11, 2015 · 10 comments

@larrystevenwise

I'm seeing an error when trying to run a simple OMPI job on a two-node cluster where one node is big-endian (PPC64) and the other is little-endian (x86_64). OMPI 1.8.4 is configured with --enable-heterogeneous:

./configure --with-openib=/usr CC=gcc CXX=g++ F77=gfortran FC=gfortran \
  --enable-mpirun-prefix-by-default --prefix=/usr/mpi/gcc/openmpi-1.8.4/ \
  --with-openib-libdir=/usr/lib64/ --libdir=/usr/mpi/gcc/openmpi-1.8.4/lib64/ \
  --with-contrib-vt-flags=--disable-iotrace --enable-mpi-thread-multiple \
  --with-threads=posix --enable-heterogeneous && make -j8 && make -j8 install

And the job is started this way:

/usr/mpi/gcc/openmpi-1.8.4/bin/mpirun -np 2 -host ppc64,atlas3 \
  --allow-run-as-root --mca btl_openib_addr_include 102.1.1.0/24 \
  --mca btl openib,sm,self /usr/mpi/gcc/openmpi-1.8.4/tests/IMB-3.2/IMB-MPI1 pingpong

But we see the following error. Note that atlas3 reports its vendor ID in the wrong byte order (0x25140000 instead of 0x1425):

The Open MPI receive queue configuration for the OpenFabrics devices
on two nodes are incompatible, meaning that MPI processes on two
specific nodes were unable to communicate with each other. This
generally happens when you are using OpenFabrics devices from
different vendors on the same network. You should be able to use the
mca_btl_openib_receive_queues MCA parameter to set a uniform receive
queue configuration for all the devices in the MPI job, and therefore
be able to run successfully.

Local host: ppc64-rhel71
Local adapter: cxgb4_0 (vendor 0x1425, part ID 21505)
Local queues: P,65536,64

Remote host: atlas3
Remote adapter: (vendor 0x25140000, part ID 22282240)
Remote queues:
P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
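
For context: 0x1425 (Chelsio's PCI vendor ID), stored as a 32-bit little-endian value and read back on a big-endian host without conversion, comes out as exactly the 0x25140000 shown above. A minimal standalone C sketch (not Open MPI code) demonstrating the swap:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t vendor_id = 0x1425;

        /* Reverse all four bytes: what a big-endian host sees when it
         * reads a little-endian 32-bit value without converting it. */
        uint32_t swapped = ((vendor_id & 0x000000ffU) << 24) |
                           ((vendor_id & 0x0000ff00U) << 8)  |
                           ((vendor_id & 0x00ff0000U) >> 8)  |
                           ((vendor_id & 0xff000000U) >> 24);

        printf("native:  0x%x\n", vendor_id);  /* prints 0x1425 */
        printf("swapped: 0x%x\n", swapped);    /* prints 0x25140000 */
        return 0;
    }

The help text suggests forcing a uniform configuration (e.g. --mca btl_openib_receive_queues P,65536,64 on both ends), but the byte-swapped vendor ID indicates the modex data itself is arriving unconverted, which the next comment pins down.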

@larrystevenwise larrystevenwise self-assigned this Jun 11, 2015
@larrystevenwise
Author

Steve,

MCA_BTL_OPENIB_MODEX_MSG_{HTON,NTOH} do not convert all the fields of the mca_btl_openib_modex_message_t struct.

I would start here ...

Cheers,

Gilles
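
To illustrate the failure mode Gilles describes (the field names below are hypothetical; this is a sketch, not the actual mca_btl_openib_modex_message_t layout): if the HTON/NTOH macros convert only some fields of the struct, every field they skip arrives byte-swapped on the other architecture, exactly like the vendor ID in the error above.

    #include <stdint.h>
    #include <arpa/inet.h>  /* htonl/htons */

    /* Hypothetical modex message; the real struct lives in the
     * openib BTL sources. */
    typedef struct {
        uint32_t mm_vendor_id;
        uint32_t mm_part_id;
        uint16_t mm_lid;
        uint16_t mm_qp_count;
    } modex_message_t;

    /* Broken: converts only one field, so mm_vendor_id and friends
     * cross the wire in host byte order. */
    #define MODEX_MSG_HTON_PARTIAL(m)       \
        do {                                \
            (m).mm_lid = htons((m).mm_lid); \
        } while (0)

    /* Correct: every multi-byte field gets converted. */
    #define MODEX_MSG_HTON(m)                           \
        do {                                            \
            (m).mm_vendor_id = htonl((m).mm_vendor_id); \
            (m).mm_part_id   = htonl((m).mm_part_id);   \
            (m).mm_lid       = htons((m).mm_lid);       \
            (m).mm_qp_count  = htons((m).mm_qp_count);  \
        } while (0)

A matching NTOH macro would apply the inverse conversions on receipt.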

@larrystevenwise
Author

I'm waiting to get time on the P7 BE node to debug this further.

@jsquyres jsquyres added this to the Open MPI v1.10.0 milestone Jul 25, 2015
@jsquyres
Member

@larrystevenwise Any progress?

@jsquyres jsquyres modified the milestones: Open MPI v1.10.0, Open MPI v1.10.1 Sep 2, 2015
@jsquyres
Member

jsquyres commented Sep 2, 2015

@larrystevenwise Any progress?

@larrystevenwise
Author

No, unfortunately...

@jsquyres
Member

Pushing this off to v1.10.2.

@jsquyres jsquyres modified the milestones: Open MPI v1.10.2, Open MPI v1.10.1 Oct 20, 2015
@jsquyres jsquyres modified the milestones: v1.10.3, v1.10.2 Nov 10, 2015
@jsquyres
Member

Per http://www.open-mpi.org/community/lists/devel/2015/11/18354.php, this is not a critical bug fix. Pushing to v1.10.3.

@rhc54
Contributor

rhc54 commented May 4, 2016

Pushing to 2.1, as I doubt we'll deal with this in the 1.10 series.

@rhc54 rhc54 modified the milestones: v2.1.0, v1.10.3 May 4, 2016
jsquyres added a commit to jsquyres/ompi that referenced this issue Sep 19, 2016
darray type incorrectly created or interpreted in mpi_alltoallw (former open-mpi/ompi@open-mpi#965)
@hppritcha hppritcha modified the milestones: Future, v2.1.0 Jan 24, 2017
@hppritcha
Member

Moving this to Future because we have no plans to fix --enable-heterogeneous.

@jjhursey
Member

Open MPI has dropped support for big-endian PPC and only supports ppc64le. I'm going to close this ticket, but if the problem persists in a ppc64le environment, please reopen with more details.
