Skip to content

Update to hwloc 2.0.0a #3951

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 1, 2017
Merged

Update to hwloc 2.0.0a #3951

merged 2 commits into from
Aug 1, 2017

Conversation

rhc54
Copy link
Contributor

@rhc54 rhc54 commented Jul 21, 2017

Silence some osc/rdma warnings - @hjelmn please comment

Signed-off-by: Ralph Castain [email protected]

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 21, 2017

@bosilca There are a couple of places in treematch that need to be updated, but they reside upstream. @ggouaillardet set flags to generate compile-time warnings due to the deprecated hwloc functions.

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 21, 2017

@jsquyres please check distance calculations in usnic

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 22, 2017

@ggouaillardet @hjelmn Can someone look at the openib btl distance calculation? It obviously didn't translate correctly to the new API.

@open-mpi open-mpi deleted a comment from ibm-ompi Jul 23, 2017
@bgoglin
Copy link
Contributor

bgoglin commented Jul 24, 2017

2 things to fix in the distance code:

  • when you pass &nr, &matrix to get a distance matrix, nr should be set to 1 before the call. If nr == 0 before the call, you'll get nr updated but nothing in matrix (which is a good way to find out if there's a matrix without retrieving it and having to release it later). See the doc in hwloc/distances.h
 * On input, \p nr points to the number of distances that may be stored in \p distances.
 * On output, \p nr points to the number of distances that were actually found, even if some of them couldn't be stored in \p distances.
 * Distances that couldn't be stored are ignored, but the function still returns
 * success (\c 0). The caller may find out by comparing the value pointed by \p nr before and after the function call.
  • Objects are not in logical order in the matrix. You should iterate over the obj array to find the index i and j of your objects, and then use those indexes to get the distance from matrix[i+j*nbobjs]. I'll clarify that in the doc.

@jsquyres
Copy link
Member

It doesn't look like this PR is binding processes properly.

When I build thusly:

$ ./configure --prefix=/home/jsquyres/bogus --disable-oshmem --with-usnic --with-libfabric=/home/jsquyres/libfabric-current/install --enable-mpirun-prefix-by-default --enable-debug --enable-mem-debug --enable-mem-profile --disable-mpi-fortran --enable-debug --enable-mem-debug --enable-picky

And when I run inside a SLURM salloc thusly:

$ $ mpirun --report-bindings -np 4 --npernode 2 --mca btl vader,self,usnic --mca btl_base_verbose 100 hostname
[mpi005:31684] MCW rank 0 bound to : 
[mpi005:31684] MCW rank 1 bound to : 
[mpi006:17919] MCW rank 2 bound to : 
[mpi006:17919] MCW rank 3 bound to : 
mpi005
mpi005
mpi006
mpi006

you can see that the --report-bindings output is empty.

Hmm. After more testing, this is even happening on master (without this PR). Even if I specify --bind-to:

# Do not specify a binding
$ mpirun --report-bindings -np 4 --npernode 2 --mca btl vader,self,usnic --mca btl_base_verbose 100 hostname
[mpi006:18492] MCW rank 2 bound to : 
[mpi006:18492] MCW rank 3 bound to : 
[mpi005:32281] MCW rank 0 bound to : 
[mpi005:32281] MCW rank 1 bound to : 
mpi005
mpi005
mpi006
mpi006

# Bind to core
$ mpirun --bind-to core --report-bindings -np 4 --npernode 2 --mca btl vader,self,usnic --mca btl_base_verbose 100 hostname
[mpi006:18441] MCW rank 2 bound to : 
[mpi006:18441] MCW rank 3 bound to : 
[mpi005:32213] MCW rank 0 bound to : 
[mpi005:32213] MCW rank 1 bound to : 
mpi006
mpi006
mpi005
mpi005

# Bind to socket
$ mpirun --bind-to socket --report-bindings -np 4 --npernode 2 --mca btl vader,self,usnic --mca btl_base_verbose 100 hostname
[mpi005:32246] MCW rank 0 bound to : 
[mpi005:32246] MCW rank 1 bound to : 
[mpi006:18458] MCW rank 2 bound to : 
[mpi006:18458] MCW rank 3 bound to : 
mpi005
mpi005
mpi006
mpi006

The usnic BTL only uses hwloc NUMA distances if the processes are bound, so I can't really test the new hwloc stuff until this binding stuff is fixed.

...am I doing something stupid / wrong?

Copy link
Contributor

@bgoglin bgoglin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks OK for my first comment but you still need to address my second bullet:
"Objects are not in logical order in the matrix. You should iterate over the obj array to find the index i and j of your objects, and then use those indexes to get the distance from matrix[i+j*nbobjs]. I'll clarify that in the doc."

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 25, 2017

@bgoglin Perhaps I am missing something, but we do iterate over the object array to get the indices right before computing the location in the matrix. Is there something else required?

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 25, 2017

@jsquyres I'll check the binding on master - was working last I checked

@bgoglin
Copy link
Contributor

bgoglin commented Jul 25, 2017

I can't find any occurence of "->objs" in changes listed at https://github.com/open-mpi/ompi/pull/3951/files. If you were iterating, we would see things like "if (myobj == distances->objs[x])".

Am I looking at the right place?

In opal/mca/btl/openib/btl_openib_component.c:2480 2486 2516 and 2523, we use logical_index, there's no iteration over the objs array. Same in usnic.

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 25, 2017

@bgoglin Okay, I think I'm beginning to understand - your earlier comments confused me as we are looking up the index of the objects we are trying to reference. What you're saying is that they are in a different place in the distances matrix, and so we have to look for them there as well.

Sigh - I can see that is going to be really hard to explain clearly in the docs ☹️ . I'll try to update the code before I run out the door.

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 25, 2017

@ggouaillardet It looks like your earlier modifications to handle both hwloc2 and earlier versions isn't quite correct. For example, the per-object mapping (ppr mapper) isn't working correctly any more, and neither is the code that prints out the binding report (opal_hwloc_base_cset2mapstr). I'm heading out on vacation for a few days - can you please track it down and fix?

Your choice of where/how to do it. You can fix it on master if you prefer, or can create a PR against this one.

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 25, 2017

@bgoglin Look okay now?

@bgoglin
Copy link
Contributor

bgoglin commented Jul 25, 2017

@rhc54 Yeah thanks. Do you want a helper function hwloc_distances_get_values_between_objs(distances, obj1, obj2, &value1, &value2) ?

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 25, 2017

Might be an idea to make things easier for folks - I don't think there is anything wrong with the current code. I just think it may be hard to communicate to folks. Helper function might eliminate that issue.

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 26, 2017

@ggouaillardet I fixed master here: #3963

Update to support passing of HWLOC shmem topology to client procs
Update use of distance API per @bgoglin
Have the openib component lookup its object in the distance matrix
Bring usnic up-to-date
Restore binding for hwloc2

Signed-off-by: Ralph Castain <[email protected]>
@rhc54
Copy link
Contributor Author

rhc54 commented Jul 26, 2017

@jsquyres @ggouaillardet I have brought this completely up-to-date and binding works again for both hwloc 1.11 and 2.0. Please check distance calculations - Brice has looked them over for correctness, but I'd like to ensure we still get the right answers.

@ggouaillardet
Copy link
Contributor

thanks Ralph,

i was working on a very similar patch.

is yours enough to make bind_to_cpuset() from orte/mca/rmaps/base/rmaps_base_binding.c happy again ?
i could not find the trigger to exercise this code path :-(

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 26, 2017

@ggouaillardet yeah, i realized just a little while ago that we still needed to filter the cpus against the user-provided cpu list.

bind_to_cpuset is activated by providing --cpu-list 0-5 to the mpirun cmd line. It binds each proc to that specified list of cpus.

There is still something wrong in the rmaps_base_frame.c logic, though it isn't due to these changes. When you specify a cpu-list, we should allow you to still specify a binding policy inside that envelope. In other words, if you don't specify a binding policy, then we bind each proc to the entire envelope. Otherwise, we are supposed to bind each proc within the cpu-list envelope according to the binding policy.

Right now, rmaps_base_frame.c will error out if you try to specify both --cpu-list and a binding policy. Would you have a chance to poke at that (I need to get some sleep)?

@ggouaillardet
Copy link
Contributor

@rhc54 ok, i will give it a try

@rhc54
Copy link
Contributor Author

rhc54 commented Jul 28, 2017

@jsquyres don't forget to re-try this 😄

@rhc54
Copy link
Contributor Author

rhc54 commented Aug 1, 2017

@jsquyres @hppritcha I would really like to merge this sooner rather than later. So far as we can tell, everything is now working. I won't be on the call this week, but could someone please check to see if anyone has concerns about commit it?

@jsquyres
Copy link
Member

jsquyres commented Aug 1, 2017

Now that binding has been fixed, this passes all the sanity tests that I manually checked.

@jsquyres
Copy link
Member

jsquyres commented Aug 1, 2017

@rhc54 No one had any objections to this PR on the Webex today. Question, though: does the external hwloc component handle hwloc x2.x? Or does it only handle hwloc v1.x? I ask because it would probably be a little weird to have an internal hwloc that supports hwloc v2.x but then our external hwloc component doesn't support hwloc v2.x.

@rhc54
Copy link
Contributor Author

rhc54 commented Aug 1, 2017

The external component handles both hwloc v1.x and v2.x release branches. Reason: the abstraction glue written by @ggouaillardet and myself is based on macros and direct code all conditioned on the HWLOC version number. So there is no abstraction glue in the components themselves.

@rhc54 rhc54 merged commit f39ce67 into open-mpi:master Aug 1, 2017
@rhc54 rhc54 deleted the topic/hwloc2 branch August 1, 2017 21:18
@jsquyres
Copy link
Member

jsquyres commented Aug 1, 2017

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants