
Conversation

@hppritcha (Member)

While running some MPI stress tests on an HPE SS11 network, we had to fall back to using the OFI BTL path. That in turn revealed some places in the BTL where we need a function similar to the MTL_OFI_RETRY_UNTIL_DONE macro.

So, as a first step, this PR moves the macro to the OFI common layer and has it invoke the more general opal_progress function.

Additional changes needed in the OFI BTL will be applied in subsequent PRs.

The OFI MTL will require more work: the situation hit with the HPE CXI provider indicates a need to implement some kind of send-backlog queueing mechanism in the MTL, rather than simply spinning on the OFI CQs hoping for progress at the OFI provider level.
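For reference, here is a minimal sketch of the retry-until-done pattern being discussed. The macro name, include paths, and usage below are illustrative only, not necessarily the exact code moved by this PR; the idea is simply to reissue the libfabric call while it returns -FI_EAGAIN and drive the general-purpose opal_progress() between attempts.

```c
/* Illustrative sketch only -- names are hypothetical, not the exact
 * macro moved by this PR.  Pattern: reissue the libfabric call while
 * it returns -FI_EAGAIN, turning the general opal_progress() crank
 * between attempts so other components get a chance to drain queues. */
#include <rdma/fi_errno.h>              /* FI_EAGAIN */
#include "opal/runtime/opal_progress.h" /* opal_progress() */

#define COMMON_OFI_RETRY_UNTIL_DONE(FUNC, RETURN)   \
    do {                                            \
        do {                                        \
            (RETURN) = (FUNC);                      \
            if (0 == (RETURN)) {                    \
                break;                              \
            }                                       \
            if (-FI_EAGAIN == (RETURN)) {           \
                opal_progress();                    \
            }                                       \
        } while (-FI_EAGAIN == (RETURN));           \
    } while (0)

/* Hypothetical usage in a send path:
 *
 *   ssize_t rc;
 *   COMMON_OFI_RETRY_UNTIL_DONE(
 *       fi_send(ep, buf, len, NULL, dest_addr, context), rc);
 *   if (0 != rc) {
 *       // hard error: anything other than success or retryable EAGAIN
 *   }
 */
```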

@wenduwan (Contributor)

> The OFI MTL will require more work: the situation hit with the HPE CXI provider indicates a need to implement some kind of send-backlog queueing mechanism in the MTL, rather than simply spinning on the OFI CQs hoping for progress at the OFI provider level.

@hppritcha Could you elaborate on the problem? What is the queue needed for?

@hppritcha (Member Author) commented Mar 18, 2024

Well, what appears to happen is that with certain communication patterns (e.g., each MPI process posting a large number of isends before making any other MPI calls that might "sink" some of them, i.e., probing or receiving), the OFI CXI provider gets plugged up and can't itself make progress. So our OFI MTL gets stuck spinning on the call to fi_tsend or equivalent. Calling fi_cq_read doesn't help, owing to this blockage in the OFI CXI provider. To the application it looks like it gets "stuck" in an MPI_Isend. With the OMPI OB1 PML this isn't a problem, since it has an add_request_to_send_pending method that handles the case where sends cannot be absorbed by the OFI provider.
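To make the pattern concrete, here is a toy sketch of the kind of communication pattern described (illustrative only, not the actual stress test; the burst size and message shape are made up): every rank fires off a large batch of isends before any receive is posted, so nothing on the target side frees provider-side resources until much later.

```c
/* Toy illustration of the problematic pattern -- not the actual test.
 * Each rank posts a burst of isends before posting any receives, so
 * the provider has nothing on the target side to drain into until the
 * receive loop finally starts. */
#include <mpi.h>
#include <stdlib.h>

#define NSENDS 4096  /* made-up burst size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sbuf = calloc(NSENDS, sizeof(double));
    double *rbuf = calloc(NSENDS, sizeof(double));
    MPI_Request *sreq = malloc(NSENDS * sizeof(MPI_Request));

    /* Burst of isends to the next rank, no receives posted yet.  If the
     * provider cannot absorb the burst and the MTL just spins retrying,
     * the application appears to hang inside MPI_Isend. */
    for (int i = 0; i < NSENDS; i++) {
        MPI_Isend(&sbuf[i], 1, MPI_DOUBLE, (rank + 1) % size, i,
                  MPI_COMM_WORLD, &sreq[i]);
    }

    /* Receives are only posted afterwards. */
    for (int i = 0; i < NSENDS; i++) {
        MPI_Recv(&rbuf[i], 1, MPI_DOUBLE, (rank + size - 1) % size, i,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Waitall(NSENDS, sreq, MPI_STATUSES_IGNORE);
    free(sbuf); free(rbuf); free(sreq);
    MPI_Finalize();
    return 0;
}
```

With OB1's pending-send list, sends the provider cannot absorb are queued and retried later from the progress loop instead of blocking inside the isend path.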

@wenduwan (Contributor)

May I ask about the scale of the application? I want to understand if the problem is unique to CXI or common with EFA.

Is the reason for the stuck progress understood?

@hppritcha (Member Author)

Actually, it only takes around 400-450 MPI processes to observe the hang behavior.

@wenduwan (Contributor)

> Actually, it only takes around 400-450 MPI processes to observe the hang behavior.

Hmmmm this is not particularly large in any sense. Do you have a reproducer program for me to try?

Is it possible to be a libfabric bug?

@wenduwan (Contributor) left a review comment:


Minor formatting issue.

@hppritcha (Member Author)

The behavior of the CXI provider (and it's a little confusing, because what's in the libfabric GitHub repo at https://github.com/ofiwg/libfabric/tree/main/prov/cxi/src isn't exactly what's installed on our system) indicates there's some kind of resource (send IDs or something used for rendezvous) that gets exhausted, and isn't replenished until something happens on the receiver end of the message (as I said above, probing for a message or posting a receive) to return one or more of these IDs. I doubt EFA encounters this problem. I'm pretty sure this isn't a problem with generic libfabric functionality like the linked-list support, etc.

There's nothing wrong per se with the app's comm pattern. It's probably not the smartest way to do things, but it worked at scale on our Omni-Path (PSM2) clusters and older Cray XC/XE systems, as well as on IBM Power9/IB systems using Spectrum MPI.

@wenduwan (Contributor)

@hppritcha Thanks for shedding light on the problem.

In that case I guess it's worth fixing the CXI rendezvous protocol (which might be harder). I feel somewhat uneasy about introducing another layer of overhead in the MTL; in fact, we are looking for ways to strip out indirection and let the network handle bursts.

@hppritcha (Member Author)

I'd prefer to leave the OFI MTL alone as well for now - modulo this change.

hppritcha force-pushed the move_ofi_retry_macro branch from b0b9bdc to 9c54bb0 on March 18, 2024 at 18:08
hppritcha merged commit 42e1cd8 into open-mpi:main on Mar 18, 2024