
OMPI 4.0.0rc5 fails to see slots on Cray XC40 #5973


Closed
devreal opened this issue Oct 25, 2018 · 14 comments

@devreal
Contributor

devreal commented Oct 25, 2018

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Open MPI 4.0.0rc5 downloaded from GitHub

Please describe the system on which you are running

Cray XC40


Details of the problem

I'm trying to run jobs with multiple processes per node using Open MPI's mpirun (so that I can set MCA parameters directly), but this fails:

$ mpirun -n 2 -N 2 hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

Running a single process per node works as expected:

$ mpirun -n 2 -N 1 hostname
nid01189
nid01188

I can run applications linked against Open MPI using aprun (with PMI_NO_FORK set), and I can pass --oversubscribe to mpirun to run with multiple ranks per node.

Is this the expected behavior? Each node has 24 cores / 48 hardware threads, so I would expect Open MPI to detect that many available slots.

I'm attaching the config.log. It seems that support for Cray PMI is properly detected.
config.log
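As a side note, mpirun can also be given the slot count explicitly through a hostfile; this is a hedged sketch of a possible workaround, not something from the report itself (the node names are taken from the output above, and the slot count assumes the 24-core nodes described here):

```shell
# Declare 24 slots per node explicitly (adjust NIDs to your allocation)
cat > myhosts <<'EOF'
nid01188 slots=24
nid01189 slots=24
EOF

# Then launch against the hostfile instead of the detected allocation:
# mpirun --hostfile myhosts -n 2 -N 2 hostname
```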

@rhc54
Contributor

rhc54 commented Oct 26, 2018

Try adding --display-allocation to your cmd line and let's see what was detected. If this is a debug build, you can also add --mca ras_base_verbose 10 to the cmd line for more detailed information.

@devreal
Contributor Author

devreal commented Oct 26, 2018

Here is the output from a debug build:

$ mpirun -n 2 -N 2 --bind-to socket --display-allocation --mca ras_base_verbose 10 hostname 
[mom05:02155] mca: base: components_register: registering framework ras components
[mom05:02155] mca: base: components_register: found loaded component simulator
[mom05:02155] mca: base: components_register: component simulator register function successful
[mom05:02155] mca: base: components_register: found loaded component alps
[mom05:02155] mca: base: components_register: component alps register function successful
[mom05:02155] mca: base: components_register: found loaded component slurm
[mom05:02155] mca: base: components_register: component slurm register function successful
[mom05:02155] mca: base: components_register: found loaded component tm
[mom05:02155] mca: base: components_register: component tm register function successful
[mom05:02155] mca: base: components_open: opening ras components
[mom05:02155] mca: base: components_open: found loaded component simulator
[mom05:02155] mca: base: components_open: found loaded component alps
[mom05:02155] mca: base: components_open: component alps open function successful
[mom05:02155] mca: base: components_open: found loaded component slurm
[mom05:02155] mca: base: components_open: component slurm open function successful
[mom05:02155] mca: base: components_open: found loaded component tm
[mom05:02155] mca: base: components_open: component tm open function successful
[mom05:02155] mca:base:select: Auto-selecting ras components
[mom05:02155] mca:base:select:(  ras) Querying component [simulator]
[mom05:02155] mca:base:select:(  ras) Querying component [alps]
[mom05:02155] ras:alps: available for selection
[mom05:02155] mca:base:select:(  ras) Query of component [alps] set priority to 75
[mom05:02155] mca:base:select:(  ras) Querying component [slurm]
[mom05:02155] mca:base:select:(  ras) Querying component [tm]
[mom05:02155] mca:base:select:(  ras) Query of component [tm] set priority to 100
[mom05:02155] mca:base:select:(  ras) Selected component [tm]
[mom05:02155] mca: base: close: unloading component simulator
[mom05:02155] mca: base: close: unloading component alps
[mom05:02155] mca: base: close: component slurm closed
[mom05:02155] mca: base: close: unloading component slurm
[mom05:02155] [[57051,0],0] ras:base:allocate
[mom05:02155] [[57051,0],0] ras:tm:allocate:discover: got hostname 5775
[mom05:02155] [[57051,0],0] ras:tm:allocate:discover: not found -- added to list
[mom05:02155] [[57051,0],0] ras:tm:allocate:discover: got hostname 5776
[mom05:02155] [[57051,0],0] ras:tm:allocate:discover: not found -- added to list
[mom05:02155] [[57051,0],0] ras:base:node_insert inserting 2 nodes
[mom05:02155] [[57051,0],0] ras:base:node_insert node 5775 slots 1
[mom05:02155] [[57051,0],0] ras:base:node_insert node 5776 slots 1

======================   ALLOCATED NODES   ======================
	5775: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
	5776: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================

======================   ALLOCATED NODES   ======================
	5775: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
	5776: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[mom05:02155] [[57051,0],0] ras:tm:finalize: success (nothing to do)
[mom05:02155] mca: base: close: unloading component tm

It seems that PBS is chosen over ALPS. The PBS environment:

$ env | grep PBS_
PBS_VERSION=TORQUE-6.1.2.h1
PBS_JOBNAME=STDIN
PBS_ENVIRONMENT=PBS_INTERACTIVE
PBS_O_TZ=Europe/Berlin
PBS_O_WORKDIR=/zhome/academic/HLRS/hlrs/hpcjschu/src/openmpi/ompi-4.0.0rc5/build
PBS_TASKNUM=1
PBS_O_HOME=/zhome/academic/HLRS/hlrs/hpcjschu
PBS_WALLTIME=1140
PBS_GPUFILE=/var/spool/torque/aux//2079452.hazelhen-batch.hww.hlrs.degpu
PBS_MOMPORT=15003
PBS_O_QUEUE=test
PBS_O_LANG=en_US.UTF-8
PBS_JOBCOOKIE=61D3933309DE81284C2DFD039CCD5942
PBS_NODENUM=0
PBS_NUM_NODES=2
PBS_O_SHELL=/bin/bash
PBS_JOBID=2079452.hazelhen-batch.hww.hlrs.de
PBS_O_HOST=eslogin002
PBS_VNODENUM=0
PBS_QUEUE=test
PBS_MICFILE=/var/spool/torque/aux//2079452.hazelhen-batch.hww.hlrs.demic
PBS_O_SUBMIT_FILTER=/opt/torque/tools/torque_submitfilter
PBS_NP=2
PBS_NUM_PPN=1
PBS_O_SERVER=hazelhen-batch.hww.hlrs.de
PBS_NODEFILE=/var/spool/torque/aux//2079452.hazelhen-batch.hww.hlrs.de
PBS_O_PATH=/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/dash-0.3.0-tasking/bin/:/opt/cray/pe/perftools/7.0.4/bin:/opt/cray/pe/papi/5.6.0.4/bin:/opt/cray/rca/2.2.16-6.0.5.0_15.34__g5e09e6d.ari/bin:/opt/cray/alps/6.5.29-6.0.5.1_3.1__gc22dc90.ari/sbin:/opt/cray/job/2.2.2-6.0.5.0_8.47__g3c644b5.ari/bin:/opt/hlrs/system/wrappers/bin:/opt/hlrs/system/ws/87c091d/bin:/opt/torque//sbin:/opt/torque//bin:/opt/moab/bin:/opt/moab/sbin:/opt/cray/pe/mpt/7.7.0/gni/bin:/opt/cray/pe/craype/2.5.15/bin:/sw/hazelhen-cle6/hlrs/compiler/intel/Compiler/18.0.1.163/compilers_and_libraries_2018.1.163/linux/bin/intel64:/opt/gcc/7.3.0/bin:/opt/cray/elogin/eproxy/2.0.22-6.0.5.0_2.1__g1ebe45c.ari/bin:/opt/cray/pe/modules/3.2.11.1/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/cray/pe/bin
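Note that PBS_NUM_PPN=1 and PBS_NP=2 above already imply one slot per node: a Torque allocation lists each host once per requested processor (ppn), which is where the slots=1 in the allocation dump comes from. A small illustrative sketch (the nodefile contents below are hypothetical, standing in for a real $PBS_NODEFILE from -l nodes=2:ppn=1):

```shell
# Hypothetical $PBS_NODEFILE contents for -l nodes=2:ppn=1:
# one line per granted slot, so each node appears once
printf 'nid05775\nnid05776\n' > nodefile.example

# Counting repeated hostnames shows 1 slot per node,
# matching "slots=1" in the ALLOCATED NODES output
sort nodefile.example | uniq -c
```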

I also tried increasing the priority of the ALPS component, but to no avail:

$ mpirun -n 2 -N 2 --bind-to socket --display-allocation --mca ras_base_verbose 10 --mca ras_alps_priority 101 hostname
[mom05:02705] mca: base: components_register: registering framework ras components
[mom05:02705] mca: base: components_register: found loaded component simulator
[mom05:02705] mca: base: components_register: component simulator register function successful
[mom05:02705] mca: base: components_register: found loaded component alps
[mom05:02705] mca: base: components_register: component alps register function successful
[mom05:02705] mca: base: components_register: found loaded component slurm
[mom05:02705] mca: base: components_register: component slurm register function successful
[mom05:02705] mca: base: components_register: found loaded component tm
[mom05:02705] mca: base: components_register: component tm register function successful
[mom05:02705] mca: base: components_open: opening ras components
[mom05:02705] mca: base: components_open: found loaded component simulator
[mom05:02705] mca: base: components_open: found loaded component alps
[mom05:02705] mca: base: components_open: component alps open function successful
[mom05:02705] mca: base: components_open: found loaded component slurm
[mom05:02705] mca: base: components_open: component slurm open function successful
[mom05:02705] mca: base: components_open: found loaded component tm
[mom05:02705] mca: base: components_open: component tm open function successful
[mom05:02705] mca:base:select: Auto-selecting ras components
[mom05:02705] mca:base:select:(  ras) Querying component [simulator]
[mom05:02705] mca:base:select:(  ras) Querying component [alps]
[mom05:02705] ras:alps: available for selection
[mom05:02705] mca:base:select:(  ras) Query of component [alps] set priority to 101
[mom05:02705] mca:base:select:(  ras) Querying component [slurm]
[mom05:02705] mca:base:select:(  ras) Querying component [tm]
[mom05:02705] mca:base:select:(  ras) Query of component [tm] set priority to 100
[mom05:02705] mca:base:select:(  ras) Selected component [alps]
[mom05:02705] mca: base: close: unloading component simulator
[mom05:02705] mca: base: close: component slurm closed
[mom05:02705] mca: base: close: unloading component slurm
[mom05:02705] mca: base: close: unloading component tm
[mom05:02705] [[56353,0],0] ras:base:allocate
[mom05:02705] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
[mom05:02705] ras:alps:allocate: parser_ini
[mom05:02705] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
[mom05:02705] ras:alps:allocate: Skipping ALPS configuration file: "/etc/alps.conf" (No such file or directory).
[mom05:02705] ras:alps:allocate: Trying ALPS configuration file: "/etc/opt/cray/alps/alps.conf"
[mom05:02705] ras:alps:allocate: parser_separated_columns
[mom05:02705] ras:alps:allocate: Located ALPS scheduler file: "/alps_shared/appinfo"
[mom05:02705] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
[mom05:02705] ras:alps:allocate: begin processing appinfo file
[mom05:02705] ras:alps:allocate: file /alps_shared/appinfo read
[mom05:02705] ras:alps:allocate: 244 entries in file
[mom05:02705] ras:alps:allocate: read data for resId 15432882 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15432882 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433112 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433112 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433119 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433119 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433143 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433143 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433143 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433149 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433149 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433461 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433461 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433463 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433463 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433513 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433513 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433524 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433524 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433613 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433613 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433619 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433619 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433627 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433627 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433628 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433628 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433652 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433652 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433664 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433664 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433668 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433668 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433673 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433673 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433679 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433679 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433682 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433682 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433696 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433696 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433701 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433701 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433702 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433702 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433707 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433707 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433710 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433710 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433713 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433713 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433717 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433717 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433718 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433718 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433719 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433719 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433722 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433722 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433724 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433724 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433728 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433728 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433730 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433730 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433746 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433746 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433753 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433753 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433756 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433756 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433759 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433759 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433768 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433768 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433772 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433772 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433786 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433786 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433787 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433787 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433789 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433789 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433790 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433790 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433805 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433805 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433807 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433807 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433811 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433811 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433813 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433813 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433816 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433816 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433823 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433823 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433824 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433824 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433828 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433828 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433831 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433831 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433856 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433856 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433865 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433865 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433869 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433869 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433874 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15433874 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434069 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434069 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434089 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434089 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434104 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434104 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434135 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434135 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434140 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434140 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434145 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434145 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434146 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434146 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434147 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434147 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434148 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434148 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434151 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434151 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434152 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434152 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434153 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434153 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434156 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434156 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434162 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434162 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434165 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434165 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434166 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434166 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434177 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434177 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434179 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434179 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434181 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434181 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434185 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434185 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434190 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434190 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434192 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434192 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434197 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434197 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434200 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434200 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434203 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434203 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434204 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434204 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434206 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434206 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434207 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434207 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434209 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434209 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434216 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434216 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434221 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434221 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434223 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434223 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434228 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434228 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434232 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434245 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434245 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434246 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434246 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434252 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434252 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434252 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434256 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434256 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434257 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434257 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434262 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434262 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434265 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434265 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434266 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434266 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434267 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434267 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434268 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434268 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434270 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434270 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434271 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434271 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434273 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434273 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434279 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434279 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434281 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434281 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434284 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434284 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434292 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434292 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434293 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434293 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434294 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434294 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434295 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434295 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434297 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434297 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434297 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434300 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434309 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434309 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434310 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434310 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434311 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434311 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434314 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434314 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434316 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434316 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434317 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434317 - myId 15434320
[mom05:02705] ras:alps:allocate: read data for resId 15434320 - myId 15434320
[mom05:02705] ras:alps:read_appinfo(modern): processing NID 5775 with 1 slots
[mom05:02705] ras:alps:read_appinfo(modern): processing NID 5776 with 1 slots
[mom05:02705] ras:alps:allocate: success
[mom05:02705] [[56353,0],0] ras:base:node_insert inserting 2 nodes
[mom05:02705] [[56353,0],0] ras:base:node_insert node 5775 slots 1
[mom05:02705] [[56353,0],0] ras:base:node_insert node 5776 slots 1

======================   ALLOCATED NODES   ======================
	5775: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
	5776: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================

======================   ALLOCATED NODES   ======================
	5775: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
	5776: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[mom05:02705] ras:alps:finalize: success (nothing to do)
[mom05:02705] mca: base: close: unloading component alps

@rhc54
Contributor

rhc54 commented Oct 26, 2018

Hmmm...well that is certainly very confusing. It looks like you have both ALPS and PBS on the same system? How would/does that work?

You can eliminate the PBS support by configuring OMPI with --without-tm. Alternatively, you can put --mca ras alps --mca plm alps on your cmd line.

For whatever reason, both the ALPS and PBS allocation parsers agree that you were only allocated 1 slot on each of NID 5775 and 5776. The PBS environment confirms that assignment. Afraid I don't know enough about the ALPS parser to understand that output.

@hppritcha ?

Guess you could also check that you don't have a default hostfile or -host set in your environment or MCA param file.
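One way to run that check is to grep the usual sources of a stray hostfile setting; this is a hedged sketch, and the file under demo/ is a fabricated stand-in for the real per-user file at $HOME/.openmpi/mca-params.conf:

```shell
# Fabricated example of a per-user MCA params file pinning a default hostfile
mkdir -p demo/.openmpi
cat > demo/.openmpi/mca-params.conf <<'EOF'
orte_default_hostfile = /home/user/hosts
EOF

# Audit the usual sources of a stray hostfile/host setting:
grep -H 'hostfile' demo/.openmpi/mca-params.conf  # per-user file (really $HOME/.openmpi/...)
env | grep '^OMPI_MCA_' || true                   # environment-level MCA overrides
```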

@devreal
Contributor Author

devreal commented Oct 26, 2018

In case it's of any use, here is the debug output of the plm component:

$ mpirun -n 2 -N 2 --bind-to socket --display-allocation --mca ras_base_verbose 0 --mca plm_base_verbose 10 --mca ras alps --mca plm alps --mca plm_alps_debug true hostname
[mom12:48155] mca: base: components_register: registering framework plm components
[mom12:48155] mca: base: components_register: found loaded component alps
[mom12:48155] mca: base: components_register: component alps register function successful
[mom12:48155] mca: base: components_open: opening plm components
[mom12:48155] mca: base: components_open: found loaded component alps
[mom12:48155] mca: base: components_open: component alps open function successful
[mom12:48155] mca:base:select: Auto-selecting plm components
[mom12:48155] mca:base:select:(  plm) Querying component [alps]
[mom12:48155] [[INVALID],INVALID] plm:alps: available for selection
[mom12:48155] mca:base:select:(  plm) Query of component [alps] set priority to 100
[mom12:48155] mca:base:select:(  plm) Selected component [alps]
[mom12:48155] plm:base:set_hnp_name: initial bias 48155 nodename hash 4256695528
[mom12:48155] plm:base:set_hnp_name: final jobfam 17739
[mom12:48155] [[17739,0],0] plm:base:receive start comm
[mom12:48155] [[17739,0],0] plm:base:setup_job
[mom12:48155] [[17739,0],0] plm:base:setup_vm
[mom12:48155] [[17739,0],0] plm:base:setup_vm creating map
[mom12:48155] [[17739,0],0] plm:base:setup_vm add new daemon [[17739,0],1]
[mom12:48155] [[17739,0],0] plm:base:setup_vm assigning new daemon [[17739,0],1] to node 4783
[mom12:48155] [[17739,0],0] plm:base:setup_vm add new daemon [[17739,0],2]
[mom12:48155] [[17739,0],0] plm:base:setup_vm assigning new daemon [[17739,0],2] to node 4782
[mom12:48155] plm:alps: final top-level argv:
[mom12:48155] plm:alps:     aprun -n 2 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 orted -mca ess_base_jobid 1162543104 -mca ess_base_vpid 1 -mca ess_base_num_procs 3 -mca orte_node_regex mom[2:12],[4:4783,4782]@0(3) -mca orte_hnp_uri 1162543104.0;tcp://10.128.13.154,193.196.155.85:16185 --mca ras_base_verbose 0 --mca plm_base_verbose 10 --mca ras alps --mca plm_alps_debug true
[mom12:48155] [[17739,0],0] plm:alps: final top-level argv:
	aprun -n 2 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 orted -mca ess_base_jobid "1162543104" -mca ess_base_vpid "1" -mca ess_base_num_procs "3" -mca orte_node_regex "mom[2:12],[4:4783,4782]@0(3)" -mca orte_hnp_uri "1162543104.0;tcp://10.128.13.154,193.196.155.85:16185" --mca ras_base_verbose "0" --mca plm_base_verbose "10" --mca ras "alps" --mca plm_alps_debug "true"
[nid04783:39204] mca: base: components_register: registering framework plm components
[nid04783:39204] mca: base: components_register: found loaded component alps
[nid04783:39204] mca: base: components_register: component alps register function successful
[nid04783:39204] mca: base: components_open: opening plm components
[nid04783:39204] mca: base: components_open: found loaded component alps
[nid04783:39204] mca: base: components_open: component alps open function successful
[nid04783:39204] mca:base:select: Auto-selecting plm components
[nid04783:39204] mca:base:select:(  plm) Querying component [alps]
[nid04783:39204] [[17739,0],1] plm:alps: available for selection
[nid04783:39204] mca:base:select:(  plm) Query of component [alps] set priority to 100
[nid04783:39204] mca:base:select:(  plm) Selected component [alps]
[nid04782:39143] mca: base: components_register: registering framework plm components
[nid04782:39143] mca: base: components_register: found loaded component alps
[nid04782:39143] mca: base: components_register: component alps register function successful
[nid04782:39143] mca: base: components_open: opening plm components
[nid04782:39143] mca: base: components_open: found loaded component alps
[nid04782:39143] mca: base: components_open: component alps open function successful
[nid04782:39143] mca:base:select: Auto-selecting plm components
[nid04782:39143] mca:base:select:(  plm) Querying component [alps]
[nid04782:39143] [[17739,0],2] plm:alps: available for selection
[nid04782:39143] mca:base:select:(  plm) Query of component [alps] set priority to 100
[nid04782:39143] mca:base:select:(  plm) Selected component [alps]
[nid04783:39204] [[17739,0],1] plm:base:receive start comm
[nid04782:39143] [[17739,0],2] plm:base:receive start comm
[mom12:48155] [[17739,0],0] plm:base:orted_report_launch from daemon [[17739,0],2]
[mom12:48155] [[17739,0],0] plm:base:orted_report_launch from daemon [[17739,0],2] on node nid04782
[mom12:48155] [[17739,0],0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:24L2:24L1:24C:48H:x86_64:le FROM NODE nid04782
[mom12:48155] [[17739,0],0] TOPOLOGY ALREADY RECORDED
[mom12:48155] [[17739,0],0] plm:base:orted_report_launch completed for daemon [[17739,0],2] at contact (null)
[mom12:48155] [[17739,0],0] plm:base:orted_report_launch job [17739,0] recvd 2 of 3 reported daemons
[mom12:48155] [[17739,0],0] plm:base:orted_report_launch from daemon [[17739,0],1]
[mom12:48155] [[17739,0],0] plm:base:orted_report_launch from daemon [[17739,0],1] on node nid04783
[mom12:48155] [[17739,0],0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:24L2:24L1:24C:48H:x86_64:le FROM NODE nid04783
[mom12:48155] [[17739,0],0] TOPOLOGY ALREADY RECORDED
[mom12:48155] [[17739,0],0] plm:base:orted_report_launch completed for daemon [[17739,0],1] at contact (null)
[mom12:48155] [[17739,0],0] plm:base:orted_report_launch job [17739,0] recvd 3 of 3 reported daemons

======================   ALLOCATED NODES   ======================
	4783: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
	4782: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[mom12:48155] [[17739,0],0] plm:alps: terminating orteds
[mom12:48155] [[17739,0],0] plm:base:orted_cmd sending orted_exit commands
[mom12:48155] [[17739,0],0] plm:alps: terminated orteds
[nid04783:39204] [[17739,0],1] plm:base:receive stop comm
[nid04783:39204] mca: base: close: component alps closed
[nid04783:39204] mca: base: close: unloading component alps
[mom12:48155] [[17739,0],0] plm:base:receive stop comm
[mom12:48155] mca: base: close: component alps closed
[mom12:48155] mca: base: close: unloading component alps

@hppritcha
Member

does the application launch if you use aprun rather than mpirun?

@hppritcha
Member

sorry didn't see the note above about working with aprun.

@hppritcha
Member

I only have access to a system running PBSpro but will see if I can reproduce this.

@hppritcha
Member

@devreal what qsub arguments are you using?

@hppritcha
Member

@devreal could you submit a new job and paste the PBS_NODEFILE content?

@devreal
Contributor Author

devreal commented Oct 30, 2018

qsub command:

$ qsub -I -X -lnodes=2,walltime=00:19:00 -q test

PBS Nodefile:

$ cat $PBS_NODEFILE
2015
2016

I would expect ALPS to provide Open MPI with the correct allocation information, as that is (from what I understand) the way aprun gets its information.

@rhc54
Contributor

rhc54 commented Oct 30, 2018

FWIW: that tells OMPI to assign 1 slot from each of those nodes. We only auto-detect slots in unmanaged environments. So you would indeed get 1 slot if the ras/tm component is used.

I don't know how the info gets filtered thru ALPS and into aprun. Best guess is that ALPS is also seeing only node assignments, but aprun defaults to assuming slots=cores - which is not what OMPI assumes in a managed environment.
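If the allocation really does report only one slot per node, one possible workaround (a sketch, untested on this system; the node names are copied from the log above, and whether a user hostfile overrides the ALPS/TM allocation may depend on your ras settings) is an explicit hostfile whose slots= entries match the hardware:

```
# myhosts -- explicitly grant 24 slots on each allocated node
# (node names are examples taken from the log above)
nid04783 slots=24
nid04782 slots=24
```

and then something like mpirun --hostfile myhosts -n 48 ./app.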

@hppritcha
Member

could you add a ppn parameter, like

qsub -I -X -lnodes=2,ppn=32, ....

alps is being told by PBS that you're only requesting 1 process per node. I believe this may be a site-dependent configuration. That being said, for PBS Pro I always tell PBS exactly how many slots/node I want, e.g.

qsub -I -l place=scatter,select=2:ncpus=32:mpiprocs=32:plset=in32

@devreal
Contributor Author

devreal commented Oct 30, 2018

Our PBS does not allow me to define the number of processes per node:

$ qsub -I -X -lnodes=2,ppn=24,walltime=00:19:00 -q test
qsub: submit error (Unknown resource type  Resource_List.ppn)

It seems to be a site-specific configuration issue rather than an Open MPI problem then. I guess I will go with --oversubscribe.
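For reference, a sketch of that workaround (the rank counts assume the 24-core/48-thread nodes described above, and ./app is a placeholder; the guard only makes the snippet safe to run where mpirun is absent):

```shell
# Allow more ranks per node than the single slot reported by PBS/ALPS.
if command -v mpirun >/dev/null 2>&1; then
  mpirun --oversubscribe -n 48 -N 24 ./app
else
  echo "mpirun not on PATH"
fi
```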

@hppritcha
Member

closing as this does not appear to be either an Open MPI or a Cray problem, but rather something about how the site has configured PBS.
