Map by node yield on idle is broken again #2585
Comments
Why is the world not flat? Who knows? Things happen. I'll take another look - the world isn't ending.
I think that part of the issue is that the corresponding PR #1540 was submitted on Apr 13 at 9:22 AM and merged on Apr 13 at 10:29 AM without any review requests.
Relax, dude - we don't do reviews on master, and this fixed another problem. The key now is to find a way to fix this one without breaking the other use case. 🍻
The fact that we don't do reviews for master doesn't mean that we need to avoid them :).
😆 I have no idea how to judge whether someone will have the time or not, how long to wait, or what is worth referring to whom, etc. I'm looking at it now to see if I can replicate the problem. I'm not sure I agree with my earlier fix, so I'll poke around some more. Meantime, it really would help if you folks could fix the UCX breakage - otherwise, we have no choice but to ignore that test.
Just reference me, and I'll do my best to be responsible and respond with comments, or I'll let you know that I'm unable to. As for UCX - we are working on it now. The API has changed and not all of the PRs are merged yet. I would suggest waiting for this to be fixed instead of ignoring the test. History clearly shows that the Mellanox Jenkins is capable of catching very complicated things, so I would wait in favor of code correctness.
We can argue all this off-line someday. Meantime, I'm unable to replicate this issue:
$ mpirun -host rhc001,rhc002 -map-by node --display-devel-map -n 4 hostname
Data for JOB [12729,1] offset 0
Mapper requested: NULL Last mapper: round_robin Mapping policy: BYNODE:NOOVERSUBSCRIBE Ranking policy: NODE
Binding policy: SOCKET:IF-SUPPORTED Cpu set: NULL PPR: NULL Cpus-per-rank: 1
Num new daemons: 0 New daemon starting vpid INVALID
Num nodes: 2
Data for node: rhc001 State: 3 Flags: 19
Daemon: [[12729,0],0] Daemon launched: True
Num slots: 24 Slots in use: 2 Oversubscribed: FALSE
Num slots allocated: 24 Max slots: 0
Num procs: 2 Next node_rank: 2
Data for proc: [[12729,1],0]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
State: INITIALIZED App_context: 0
Locale: NODE
Binding: [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../..]
Data for proc: [[12729,1],2]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 2
State: INITIALIZED App_context: 0
Locale: NODE
Binding: [../../../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
Data for node: rhc002 State: 1 Flags: 1b
Daemon: [[12729,0],1] Daemon launched: True
Num slots: 24 Slots in use: 2 Oversubscribed: FALSE
Num slots allocated: 24 Max slots: 0
Num procs: 2 Next node_rank: 2
Data for proc: [[12729,1],1]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 1
State: INITIALIZED App_context: 0
Locale: NODE
Binding: [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../..]
Data for proc: [[12729,1],3]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 3
State: INITIALIZED App_context: 0
Locale: NODE
Binding: [../../../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
rhc001
rhc001
rhc002.cluster
rhc002.cluster
Note that the "oversubscribed" flag is false. Can you post me the output showing how you got something different?
The problem is not in the oversubscription itself. Here is a demonstration. Run the following script:
$ cat test.sh
#!/bin/bash
echo "yield_enabled="$OMPI_MCA_mpi_yield_when_idle
And here are a few different launches:
Launch under PMI2 doesn't introduce this (obviously) and all works fine:
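As a rough sketch of the kind of comparison being made here (the hostnames, the SLURM/PMI2 setup for the direct launch, and the script path are assumed for illustration and are not details from the report), the two launch styles would look something like:
$ mpirun -host rhc001,rhc002 -map-by node -n 4 ./test.sh   # launched through mpirun/ORTE
$ srun --mpi=pmi2 -n 4 ./test.sh                           # direct launch under SLURM with PMI2
The question is whether OMPI_MCA_mpi_yield_when_idle ends up set in the processes' environment under one launch method but not the other.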
This is set in the schizo:
But this doesn't work the same way for PMI2 launches, which confused me.
OK, first of all @rhc54, my apologies - it seems that the functionality is not broken. At the time of #1489 the line: However, now it looks like the binding is applied if
Yeah, that sounds right. No worries - glad you were able to work it out. MPI is always aggressive, and so we don't yield unless we determine we really have to, or the user tells us that's what they want.
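For users who do want the yield-on-idle behavior, it can also be requested explicitly through the MCA parameter discussed in this thread; a sketch (the application name is a placeholder):
$ mpirun --mca mpi_yield_when_idle 1 -map-by node -n 4 ./my_app
$ export OMPI_MCA_mpi_yield_when_idle=1   # equivalent, set through the environment
Either form makes the progress engine yield the CPU when it has nothing to do, trading some latency for friendliness to co-located processes.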
@rhc54 @jladd-mlnx @hppritcha @jsquyres In PR #1489 we fixed the case where the map-by node option was forcing the "oversubscription" flag to be set, which subsequently caused ompi_mpi_yield_when_idle to be dropped.
Now I see that this same problem has been re-introduced in 437f5b4. Why does this keep happening? This has already been propagated to v2.x. The mentioned PR wasn't ported to v2.0.x.