
Conversation

@hppritcha (Member)

While running some MPI stress tests on an HPE SS11 network, we had to fall back to using the OFI BTL path. That in turn revealed some places in the BTL where we need a function similar to the MTL_OFI_RETRY_UNTIL_DONE macro.

So, as a first step, this PR moves the macro to the OFI common layer and has it invoke the more general opal_progress function.

Additional changes needed in the OFI BTL will be applied in subsequent PRs.

The OFI MTL will require more work: the situation hit with the HPE CXI provider indicates a need to implement some kind of send-backlog queueing mechanism in the MTL, rather than simply spinning on the OFI CQs hoping for progress at the OFI provider level.
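For reference, here is a minimal sketch of the retry-until-done pattern being discussed. The macro name, include paths, and usage below are illustrative only, not necessarily the exact code moved by this PR; the idea is simply to reissue the libfabric call while it returns -FI_EAGAIN and drive the general-purpose opal_progress() between attempts.

```c
/* Illustrative sketch only -- names are hypothetical, not the exact
 * macro moved by this PR.  Pattern: reissue the libfabric call while
 * it returns -FI_EAGAIN, turning the general opal_progress() crank
 * between attempts so other components get a chance to drain queues. */
#include <rdma/fi_errno.h>              /* FI_EAGAIN */
#include "opal/runtime/opal_progress.h" /* opal_progress() */

#define COMMON_OFI_RETRY_UNTIL_DONE(FUNC, RETURN)   \
    do {                                            \
        do {                                        \
            (RETURN) = (FUNC);                      \
            if (0 == (RETURN)) {                    \
                break;                              \
            }                                       \
            if (-FI_EAGAIN == (RETURN)) {           \
                opal_progress();                    \
            }                                       \
        } while (-FI_EAGAIN == (RETURN));           \
    } while (0)

/* Hypothetical usage in a send path:
 *
 *   ssize_t rc;
 *   COMMON_OFI_RETRY_UNTIL_DONE(
 *       fi_send(ep, buf, len, NULL, dest_addr, context), rc);
 *   if (0 != rc) {
 *       // hard error: anything other than success or retryable EAGAIN
 *   }
 */
```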

@wenduwan (Contributor)

> The OFI MTL will require more work: the situation hit with the HPE CXI provider indicates a need to implement some kind of send-backlog queueing mechanism in the MTL, rather than simply spinning on the OFI CQs hoping for progress at the OFI provider level.

@hppritcha Could you elaborate on the problem? What is the queue needed for?

@hppritcha (Member Author) commented Mar 18, 2024

Well, what appears to happen is that with certain communication patterns (e.g., each MPI process posting a large number of isends before making any other MPI calls that might "sink" some of them, i.e., probing or receiving), the OFI CXI provider gets plugged up and can't itself make progress. So our OFI MTL gets stuck spinning on the call to fi_tsend or equivalent. Calling fi_cq_read doesn't help, owing to this blockage in the OFI CXI provider. To the application it looks like it gets "stuck" in an MPI_Isend. With the OMPI OB1 PML this isn't a problem, since it has an add_request_to_send_pending method that handles the case where sends cannot be absorbed by the OFI provider.
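To make the pattern concrete, here is a toy sketch of the kind of communication pattern described (illustrative only, not the actual stress test; the burst size and message shape are made up): every rank fires off a large batch of isends before any receive is posted, so nothing on the target side frees provider-side resources until much later.

```c
/* Toy illustration of the problematic pattern -- not the actual test.
 * Each rank posts a burst of isends before posting any receives, so
 * the provider has nothing on the target side to drain into until the
 * receive loop finally starts. */
#include <mpi.h>
#include <stdlib.h>

#define NSENDS 4096  /* made-up burst size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sbuf = calloc(NSENDS, sizeof(double));
    double *rbuf = calloc(NSENDS, sizeof(double));
    MPI_Request *sreq = malloc(NSENDS * sizeof(MPI_Request));

    /* Burst of isends to the next rank, no receives posted yet.  If the
     * provider cannot absorb the burst and the MTL just spins retrying,
     * the application appears to hang inside MPI_Isend. */
    for (int i = 0; i < NSENDS; i++) {
        MPI_Isend(&sbuf[i], 1, MPI_DOUBLE, (rank + 1) % size, i,
                  MPI_COMM_WORLD, &sreq[i]);
    }

    /* Receives are only posted afterwards. */
    for (int i = 0; i < NSENDS; i++) {
        MPI_Recv(&rbuf[i], 1, MPI_DOUBLE, (rank + size - 1) % size, i,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Waitall(NSENDS, sreq, MPI_STATUSES_IGNORE);
    free(sbuf); free(rbuf); free(sreq);
    MPI_Finalize();
    return 0;
}
```

With OB1's pending-send list, sends the provider cannot absorb are queued and retried later from the progress loop instead of blocking inside the isend path.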

@wenduwan (Contributor)

May I ask about the scale of the application? I want to understand if the problem is unique to CXI or common with EFA.

Is the reason for the stuck progress understood?

@hppritcha (Member Author)

Actually, it only takes around 400-450 MPI processes to observe the hang behavior.

@wenduwan (Contributor)

> Actually, it only takes around 400-450 MPI processes to observe the hang behavior.

Hmmmm this is not particularly large in any sense. Do you have a reproducer program for me to try?

Is it possible to be a libfabric bug?

@wenduwan (Contributor) left a review comment:


Minor formatting issue.

@hppritcha (Member Author)

The behavior of the CXI provider (and it's a little confusing, because what's in the libfabric GitHub repo at https://github.com/ofiwg/libfabric/tree/main/prov/cxi/src isn't exactly what's installed on our system) indicates there's some kind of resource (send IDs or something used for rendezvous) that gets exhausted, and isn't replenished until something happens on the receiver end of the message (as I said above, probing for a message or posting a receive) to return one or more of these IDs. I doubt EFA encounters this problem. I'm pretty sure this isn't a problem with generic libfabric functionality like the linked-list support, etc.

There's nothing wrong per se with the app's comm pattern. It's probably not the smartest way to do things, but it worked at scale on our Omni-Path (PSM2) clusters and older Cray XC/XE systems, as well as on IBM Power9/IB systems using Spectrum MPI.

@wenduwan (Contributor)

@hppritcha Thanks for shedding light on the problem.

In that case I guess it's worth fixing the CXI rendezvous protocol (which might be harder). I feel somewhat uneasy about introducing another layer of overhead in the MTL; in fact, we are looking for ways to strip out indirection and let the network handle bursts.

@hppritcha (Member Author)

I'd prefer to leave the OFI MTL alone as well for now - modulo this change.

hppritcha force-pushed the move_ofi_retry_macro branch from b0b9bdc to 9c54bb0 on March 18, 2024 at 18:08
hppritcha merged commit 42e1cd8 into open-mpi:main on Mar 18, 2024