Skip to content

ompi/info: introduce support for the mpi_memory_alloc_kinds info object (II) #13055

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 11, 2025

Conversation

edgargabriel
Copy link
Member

@edgargabriel edgargabriel commented Jan 27, 2025

This PR introduces support for the mpi_memory_alloc_kinds and mpi_assert_memory_alloc_kinds info objects as defined in the MPI 4.1 document and in the side document specifying the values for the three supported accelerator kinds.
The logic is surprisingly twisted and I would not be surprised if some of the code here would have to be revised over time.

There are two ways on how the user can specify the mpi_memory_alloc_kinds info object: either as a runtime argument to mpiexec, namely -memory-alloc-kinds, or as part of the info-object passed to MPI_Session_init. There are two predefined memor-alloc-kinds, namely system and mpi, the latter having three potential restrictors. When user retrieves the info object e.g. using MPI_Comm_get_info on MPI_COMM_WORLD, the MPI library is allowed to return more memory-kinds than requested by the user.

This freedom has been applied in the following manner here:

  • if the user does not request the system and/or mpi memory-alloc-kinds, we add them to the list of provided memkinds
  • if the user does not request one of the GPU memory kinds (cuda, ze, rocm), we do not add them to the list of supported memkinds, even if the library has been compiled with GPU support and it would be available. The goal is in a second commit to allow to use this mechanism to turn off GPU support in the library if support for GPU memory has not been requested by the user. This commit doesn't do that yet however.
  • All unknown/unsupported memkinds are removed and not returned when retrieving the info object.

This default provided memkind is applied to all new {Comm/File/Window} objects, unless the user sets mpi_assert_memory_alloc_kinds during object creation (e.g. MPI_Comm_dup_with_info or MPI_File_open). The user can restrict the memory kinds supported by the object with mpi_assert_memory_alloc_kinds, i.e. setting this info object will influence mpi_memory_alloc_kinds on that object. (I am not sure whether there is another example in the MPI spec where providing one info object influences the value of another info object, but the MPI 4.1 specification is pretty clear in my understanding that this is what is expected to happen.)

Another difference in the handling of mpi_assert_memory_alloc_kinds is that if a provided value is not recognized, the entire mpi_assert_memory_alloc_kinds is dropped/ignored, not just the unrecognized memkind itself. (This is my reading of the specification on what is expected to happen).


Lastly, there could probably be some discussion whether the location chosen for the majority of the code is appropriate (i.e. ompi/info), a directory which so-far has contained the code to manage the MPI_Info object in general. After some discussion I thought this is the most appropriate locations, since

  • this is MPI related know-how, i.e. it cannot go into opal
  • the code is general and has to be used by Communicators, Windows, Files, and Sessions/Instances

@edgargabriel
Copy link
Member Author

edgargabriel commented Jan 27, 2025

As a side note, I do have a bunch of tests that I developed as part of the project, and we might need to find a good spot for where to put them. The tests do require some human interaction and the output is platform dependent, but its a good starting point in my opinion.

For the World model:

egabriel@amdhome:~/memkind-tests$ mpirun --mca coll ^hcoll --memory-alloc-kinds system,x:1,y:1,y:2,rocm:device,x:asd  -np 2 ./memkind_test
Comm: MPI_COMM_SELF rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device
Comm: MPI_COMM_WORLD rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device
Comm: MPI_COMM_SELF rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device
Comm: MPI_Comm_dup rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device
Comm: MPI_Comm_dup after Comm_set_info rank: 0 Value of memory_alloc_kind info object is rocm,system,mpi
Comm: MPI_Comm_dup_with_info & INFO_NULL rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device
Comm: MPI_Comm_dup_with_info & info_assert rank: 0 Value of memory_alloc_kind info object is mpi
Comm: MPI_Comm_dup_with_info & info_assert rank: 0 Value of assert_memory_alloc_kind info object is mpi
Comm: MPI_Comm_dup of comm_dup'ed with info_assert rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device
Comm: MPI_Comm_dup_with_info & info_invalid rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device

Comm: MPI_Comm_create rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device

Comm: MPI_Intercomm_create rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device
Comm: MPI_Intercomm_create rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device
Comm: MPI_Intercomm_merge rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device

File: MPI_File_open with INFO_NULL rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device
File: MPI_File_open with info_assert rank: 0 Value of memory_alloc_kind info object is mpi
File: MPI_File_open with info_invalid rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device

Win: MPI_Win_create rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm:device

and for Sessions:

egabriel@amdhome:~/memkind-tests$ mpirun --mca coll ^hcoll  -np 2 ./memkind_test_sessions
Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm
Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm
Comm: Comm_create_from_group with INFO_NULL rank: 0 Value of memory_alloc_kind info object is system,mpi:alloc_mem,mpi:win_allocate,mpi:win_allocate_shared,rocm
Comm: Comm_create_from_group with info_assert rank: 0 Value of memory_alloc_kind info object is mpi

@hppritcha
Copy link
Member

Why does HCOLL need to be excluded?

@edgargabriel
Copy link
Member Author

Why does HCOLL need to be excluded?

hcoll needs to be excluded if you compile with ROCm support, they don't seem to like each other

return num_unique;
}

#if 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to leave this debug code in the PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep the code in there, I am 99.9% sure that we will have use for that in the near future. Furthermore, there are other components/code parts that also have dump functionality in the code base. The #if 0 is just to get rid of compiler warning about unused static function

@edgargabriel edgargabriel force-pushed the topic/memkind-support branch 4 times, most recently from b495610 to 3314f1d Compare February 10, 2025 16:58
@edgargabriel
Copy link
Member Author

I have updated the PR based on the feedback that I received on the hybrid WG mailing list.
Specifically:

  • we now return all the memkinds requested by the user, even if they overlap. For example, before this updated patch, if the user requested "mpi,mpi:alloc_mem", we only returned "mpi" (since it was the superset that includes the second memkind requested). Now we return both strings requested by the user, since it seems like the language in MPI 4.1 does request this behavior explicitely.
  • the memkind value on derived communicators is now not derived from the parent communicator/object anymore, but from the instance. This change impacts communicators that were created on a parent communicator that had its memkind values changed freom the default one, e.g. MPI_Comm_dup_with_info + mpi_assert_memory_alloc_kind

The third item, namely clarification of the behavior of info objects passed to MPI_Comm_spawn and friends is still pending with the MPI Forum. I would suggest to not hold up this pr for that purpose, we can fix that retro-actively.

@edgargabriel edgargabriel force-pushed the topic/memkind-support branch 2 times, most recently from f1d9f85 to 3f7fef1 Compare February 10, 2025 18:09
Comment on lines 704 to 706
newcomp->super.s_info = OBJ_NEW(opal_info_t);
ompi_info_memkind_copy_or_set (&comm->instance->super, &newcomp->super, info);
if (info) {
newcomp->super.s_info = OBJ_NEW(opal_info_t);
opal_info_dup(info, &(newcomp->super.s_info));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we dup the provided info before handling the memkind? Aren't we risking to overwrite the memkind info?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are correct, I reverted the order

Comment on lines 399 to 407
static void mca_accelerator_cuda_ze_memkind (ompi_memkind_t *memkind)
{
memkind->im_name = NULL;
memkind->im_no_restrictors = false;
memkind->im_num_restrictors = 0;

return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static void mca_accelerator_cuda_ze_memkind (ompi_memkind_t *memkind)
{
memkind->im_name = NULL;
memkind->im_no_restrictors = false;
memkind->im_num_restrictors = 0;
return;
}
static void mca_accelerator_null_memkind (ompi_memkind_t *memkind)
{
memkind->im_name = NULL;
memkind->im_no_restrictors = false;
memkind->im_num_restrictors = 0;
return;
}

We don't need to return system here, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't see where this callback is used

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argh, this (and your comment below for the ze component) were originally fixed in the version before I did the rebase on the wrong machine and the force-push lost that work. Anyway, it is fixed now again, thanks!

You are btw. correct that we register the get_memkind() method with the null component although it is not actually invoked from anyway, its just for symmetry/completeness

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the 'system' and 'mpi' memkinds are set in the info_memkind.c, independent of the accelerator component used.

@edgargabriel edgargabriel force-pushed the topic/memkind-support branch 3 times, most recently from 084c609 to cdba5b2 Compare February 10, 2025 21:24
@edgargabriel
Copy link
Member Author

@devreal thank you for your review, I think I incorporated all your requests/suggestions. Whenever you have a few minutes, can you please re-review?

Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, some more things I didn't notice earlier. I'm ok with not having an MCA parameter for now but I find having the undocumented environment variable strange.

Comment on lines 464 to 484
const char *ompi_info_memkind_cb (opal_infosubscriber_t *obj, const char *key, const char *value)
{
return value;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return NULL here so signal that the info value stored in the object has not changed. Returning value signals that we accepted the new value and that the subscriber mechanism will change the stored value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried changing it to return NULL instead of value, in that case however it stops working, it looks like the callback is also invoked in the initial setting of the value (which doesn't seem to occur when I return NULL instead of the value)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes, I forgot about the first time a value is set. How about testing for whether the value is set and rejecting any subsequent attempts? We don't want users to be able to change the key AFAIR.

Suggested change
const char *ompi_info_memkind_cb (opal_infosubscriber_t *obj, const char *key, const char *value)
{
return value;
}
/**
* Callback invoked by the info subscriber mechanism.
* Accepts only the first value set.
*/
const char *ompi_info_memkind_cb (opal_infosubscriber_t *obj, const char *key, const char *value)
{
int flag, valuelen;
opal_info_get_valuelen(obj->s_info, key, &valuelen, &flag);
return (flag && valuelen) ? NULL : value;
}

Copy link
Member Author

@edgargabriel edgargabriel Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is correct either. If the user provided mpi-assert-memory-alloc-kinds, we calculate a new string 'value' for mpi-memory-alloc-kinds (e.g. at communicator creation) and want that value to be set, not what was provided (and would be available after the info_dup).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a chat with @devreal and I think we came up with the correct solution for this problem, change has been pushed and integrated into the PR

@edgargabriel edgargabriel force-pushed the topic/memkind-support branch 2 times, most recently from 9052fda to c2c408d Compare February 11, 2025 15:43
add code to handle the memkind info objects defined in MPI 4.1

Signed-off-by: Edgar Gabriel <[email protected]>
Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@edgargabriel
Copy link
Member Author

bot:retest

@edgargabriel
Copy link
Member Author

/azp run

edgargabriel and others added 2 commits February 11, 2025 13:21
add an API to the accelerator component to retrieve the
memory_alloc_kind information that is supported by the component.
The values stored/returned are based on the side document that is
about to be ratified, see

https://github.com/mpi-forum/mem-alloc/blob/main/mem_alloc.tex

Signed-off-by: Edgar Gabriel <[email protected]>
in intercomm_create_from_group, we need to set the grp_instance value for the leader_group.
This issue was exposed with the new mmemkind code and the mpi4py testsuite.

Signed-off-by: Edgar Gabriel <[email protected]>
@edgargabriel edgargabriel merged commit 4f87c6b into open-mpi:main Feb 11, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants