Map by node yield on idle is broken again #2585

Closed
artpol84 opened this issue Dec 15, 2016 · 12 comments

@artpol84
Contributor

artpol84 commented Dec 15, 2016

@rhc54 @jladd-mlnx @hppritcha @jsquyres in PR #1489 we fixed the case where the map-by node option was forcing the "oversubscription" flag to be set, which subsequently caused ompi_mpi_yield_when_idle to be dropped.

Now I see that the same problem was re-introduced in 437f5b4. Why does this keep happening? This has already been propagated to v2.x, and the PR mentioned above was never ported to v2.0.x.

@rhc54
Contributor

rhc54 commented Dec 15, 2016

why is the world not flat? who knows? things happen

I'll take another look - world isn't ending.

@artpol84
Contributor Author

I think part of the issue is that the corresponding PR #1540 was submitted Apr 13 at 9:22 AM and merged at 10:29 AM the same day, without any review requests.

@rhc54
Contributor

rhc54 commented Dec 15, 2016

Relax, dude - we don't do reviews on master, and this fixed another problem. The key now is to find a way to fix this one without breaking the other use case.

🍻

@artpol84
Contributor Author

The fact that we don't do reviews for master doesn't mean we need to avoid them :).
I'd be happy to take a look at things related to ORTE and PMIx when I have the time. Please reference me on such PRs in the future.

@rhc54
Contributor

rhc54 commented Dec 15, 2016

😆 I have no idea how to judge when someone will have the time or not, how long to wait to see, what is worth referring to whom, etc. I'm looking at it now to see if I can replicate the problem. Not sure I agree with my earlier fix, so I'll poke around some more.

Meantime, it really would help if you folks could fix the UCX breakage - otherwise, we have no choice but to ignore that test.

@artpol84
Contributor Author

artpol84 commented Dec 15, 2016

Just reference me, and I'll do my best to be responsible: I'll either respond with comments or let you know that I'm unable to.

As for UCX - we are working on it now. The API has changed and not all of the PRs have been merged yet. I would suggest waiting for this to be fixed instead of ignoring the test. History clearly shows that the Mellanox Jenkins is capable of catching very complicated problems, so I would wait in favor of code correctness.

@rhc54
Contributor

rhc54 commented Dec 15, 2016

We can argue all this off-line someday. Meantime, I'm unable to replicate this issue:

$ mpirun -host rhc001,rhc002 -map-by node --display-devel-map -n 4 hostname
 Data for JOB [12729,1] offset 0

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYNODE:NOOVERSUBSCRIBE  Ranking policy: NODE
 Binding policy: SOCKET:IF-SUPPORTED  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
 	Num new daemons: 0	New daemon starting vpid INVALID
 	Num nodes: 2

 Data for node: rhc001	State: 3	Flags: 19
 	Daemon: [[12729,0],0]	Daemon launched: True
 	Num slots: 24	Slots in use: 2	Oversubscribed: FALSE
 	Num slots allocated: 24	Max slots: 0
 	Num procs: 2	Next node_rank: 2
 	Data for proc: [[12729,1],0]
 		Pid: 0	Local rank: 0	Node rank: 0	App rank: 0
 		State: INITIALIZED	App_context: 0
 		Locale:  NODE
 		Binding: [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../..]
 	Data for proc: [[12729,1],2]
 		Pid: 0	Local rank: 1	Node rank: 1	App rank: 2
 		State: INITIALIZED	App_context: 0
 		Locale:  NODE
 		Binding: [../../../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

 Data for node: rhc002	State: 1	Flags: 1b
 	Daemon: [[12729,0],1]	Daemon launched: True
 	Num slots: 24	Slots in use: 2	Oversubscribed: FALSE
 	Num slots allocated: 24	Max slots: 0
 	Num procs: 2	Next node_rank: 2
 	Data for proc: [[12729,1],1]
 		Pid: 0	Local rank: 0	Node rank: 0	App rank: 1
 		State: INITIALIZED	App_context: 0
 		Locale:  NODE
 		Binding: [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../..]
 	Data for proc: [[12729,1],3]
 		Pid: 0	Local rank: 1	Node rank: 1	App rank: 3
 		State: INITIALIZED	App_context: 0
 		Locale:  NODE
 		Binding: [../../../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
rhc001
rhc001
rhc002.cluster
rhc002.cluster

Note that the "oversubscribed" flag is false. Can you post me the output showing how you got something different?

@artpol84
Contributor Author

artpol84 commented Dec 15, 2016

The problem is not the oversubscription itself. Here is a demonstration. Run the following script:

$ cat test.sh
#!/bin/bash
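# print the yield-when-idle MCA value that the launcher exported into our environment (empty if nothing set it)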

echo "yield_enabled="$OMPI_MCA_mpi_yield_when_idle

And here are a few different launches:

$ mpirun --npernode 1 -np 2 --bind-to core --mca btl self --mca spml ucx --mca pml ucx --mca mpi_add_procs_cutoff 0 --mca pmix_base_async_modex 1 --mca pmix_base_collect_data 0 ./test.sh
yield_enabled=0
yield_enabled=0
$ mpirun --map-by node -np 2 --bind-to core --mca btl self --mca spml ucx --mca pml ucx ./test.sh
yield_enabled=0
yield_enabled=0

OMPI_MCA_mpi_yield_when_idle is explicitly set to 0 in the environment, which overrides the default value of true.
I even see this with --map-by core, so the issue seems wider now, if I'm not mistaken:

$ mpirun --map-by core -np 32 --bind-to core --mca btl self --mca spml ucx --mca pml ucx ./test.sh
yield_enabled=0
yield_enabled=0
yield_enabled=0
. . .
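For completeness, explicitly asking for yielding on the command line should take precedence over whatever the launcher exports (assuming an explicit --mca setting is honored over the launcher-provided default, which I have not verified here):

$ mpirun --map-by node -np 2 --bind-to core --mca mpi_yield_when_idle 1 ./test.sh

With that, test.sh should print yield_enabled=1 on both ranks.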

A launch under PMI2 (obviously) doesn't introduce this, and everything works fine:

$ env OMPI_MCA_pml=ucx OMPI_MCA_spml=ucx OMPI_MCA_btl=self srun --cpu_bind=cores --mpi=pmix -N 2 --ntasks-per-node=1 ./test.sh
yield_enabled=
yield_enabled=

@artpol84
Contributor Author

This is set in the schizo component:
https://github.com/open-mpi/ompi/blob/master/orte/mca/schizo/ompi/schizo_ompi.c#L839:L844
when oversubscription is not detected. So my original conclusion may be incorrect. It seems like we always decide not to yield if we see no oversubscription.
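In other words, roughly the following decision (a sketch in shell terms of what I read there, not the actual C code; "oversubscribed" is just a placeholder for whatever ORTE detected for the job):

#!/bin/bash
# Sketch only: the real logic lives in C inside schizo_ompi.c and runs in mpirun.
oversubscribed="no"    # placeholder for ORTE's oversubscription detection

if [ "$oversubscribed" = "yes" ]; then
    export OMPI_MCA_mpi_yield_when_idle=1   # presumably yield when slots are oversubscribed
else
    export OMPI_MCA_mpi_yield_when_idle=0   # no oversubscription detected: never yield
fi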

@artpol84
Contributor Author

artpol84 commented Dec 15, 2016

But this doesn't work the same way for PMI2 launches, which is what confused me.

@artpol84
Contributor Author

OK, first of all @rhc54, my apologies - it seems that the functionality is not broken.

At the time of #1489 the line:
https://github.com/open-mpi/ompi/pull/1489/files#diff-644b3764c21e7439ae8b6555d270a248R114
was important to ensure that the default binding is applied when map-by node is enabled but no binding is explicitly provided; back then the application ended up unbound in that case.

However, now the binding is applied when map-by node is used and no explicit bind-to is specified. At least on my machine:

$ mpirun --map-by node -np 2 --display-devel-map sleep 1
 Data for JOB [60598,1] offset 0

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYNODE:NOOVERSUBSCRIBE  Ranking policy: NODE
 Binding policy: CORE:IF-SUPPORTED  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
        Num new daemons: 0      New daemon starting vpid INVALID
        Num nodes: 2

 Data for node: cn1   State: 3
        Daemon: [[60598,0],1]   Daemon launched: True
        Num slots: 28   Slots in use: 1 Oversubscribed: FALSE
        Num slots allocated: 28 Max slots: 0
        Num procs: 1    Next node_rank: 1
        Data for proc: [[60598,1],0]
                Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
                State: INITIALIZED      App_context: 0
                Locale:  NODE
                Binding: [B/././././././././././././.][./././././././././././././.]

 Data for node: cn2   State: 3
        Daemon: [[60598,0],2]   Daemon launched: True
        Num slots: 28   Slots in use: 1 Oversubscribed: FALSE
        Num slots allocated: 28 Max slots: 0
        Num procs: 1    Next node_rank: 1
        Data for proc: [[60598,1],1]
                Pid: 0  Local rank: 0   Node rank: 0    App rank: 1
                State: INITIALIZED      App_context: 0
                Locale:  NODE
                Binding: [B/././././././././././././.][./././././././././././././.]
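For what it's worth, a quicker way to check just the bindings (assuming this mpirun build has the option) would be:

$ mpirun --map-by node -np 2 --report-bindings sleep 1

which prints a per-process binding report without the full devel map.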

@rhc54
Contributor

rhc54 commented Dec 15, 2016

Yeah, that sounds right. No worries - glad you were able to work it out. MPI is always aggressive, and so we don't yield unless we determine we really have to, or the user tells us that's what they want.

rhc54 closed this as completed Dec 15, 2016