OFI: move OFI_RETRY_UNTIL_DONE to common #12412
Conversation
@hppritcha Could you elaborate on the problem? What is the queue needed for?
Well, what appears to happen is that with some communication patterns (like each MPI process posting a large number of isends before doing any other MPI calls that might "sink" some of them, i.e. probing or receiving), the OFI CXI provider gets plugged up and can't make progress itself. So our OFI MTL gets plugged up spinning in the MTL_OFI_RETRY_UNTIL_DONE loop.
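The pattern described above might look roughly like this minimal, hypothetical reproducer sketch (not the actual application; the batch size NSENDS is an assumption, since the real count isn't given):

```c
/* Hypothetical sketch: every rank floods its neighbor with isends
 * before any rank posts a matching receive or probe, so nothing on
 * the receiver side can return provider-level rendezvous resources. */
#include <mpi.h>
#include <stdlib.h>

#define NSENDS 4096  /* assumed; tune until provider resources run out */

int main(int argc, char **argv)
{
    int rank, size, payload, buf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int to   = (rank + 1) % size;          /* right neighbor */
    int from = (rank - 1 + size) % size;   /* left neighbor */
    MPI_Request *reqs = malloc(NSENDS * sizeof(MPI_Request));
    payload = rank;

    /* Phase 1: post every send up front; nothing sinks them yet. */
    for (int i = 0; i < NSENDS; i++) {
        MPI_Isend(&payload, 1, MPI_INT, to, i, MPI_COMM_WORLD, &reqs[i]);
    }

    /* Phase 2: only now drain the matching receives. */
    for (int i = 0; i < NSENDS; i++) {
        MPI_Recv(&buf, 1, MPI_INT, from, i, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Waitall(NSENDS, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
    MPI_Finalize();
    return 0;
}
```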
May I ask about the scale of the application? I want to understand whether the problem is unique to CXI or shared with EFA. Is the reason for the stuck progress understood?
It actually only takes around 400-450 MPI processes to observe the hang behavior.
Hmmm, this is not particularly large in any sense. Do you have a reproducer program for me to try? Could it be a libfabric bug?
The behavior of the CXI provider (and it's a little confusing, because what's in the libfabric GitHub repo at https://github.com/ofiwg/libfabric/tree/main/prov/cxi/src isn't exactly what's installed on our system) indicates there's some kind of resource (send IDs or something used for rendezvous) that gets exhausted until something happens on the receiver end of the message (like I said above, probing for a message or posting a receive) to return one or more of these IDs. I doubt EFA encounters this problem. I'm pretty sure this isn't a problem with generic libfabric functionality like linked-list support, etc. There's nothing wrong per se with the app's comm pattern; it's probably not the smartest way to do things, but it worked at scale on our Omni-Path (PSM2) clusters and older Cray XC/XE systems, as well as on IBM POWER9/IB systems using Spectrum MPI.
@hppritcha Thanks for shedding light on the problem. In that case I guess it's worth fixing the CXI rendezvous protocol (which might be harder). I feel somewhat uneasy about introducing another layer of overhead in the MTL; in fact, we are looking for ways to strip down indirection and let the network handle bursts.
I'd prefer to leave the OFI MTL alone as well for now, modulo this change.
In running some MPI stress tests on an HPE SS11 network, we had to fall back to using the OFI BTL path. That in turn revealed some places in the BTL where we need to use a function similar to the MTL_OFI_RETRY_UNTIL_DONE macro.
So, as a first step, move this macro to the OFI common layer and invoke the more general opal_progress function. That's the content of this PR.
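A minimal sketch of what the relocated macro could look like, modeled on the retry-on-FI_EAGAIN semantics of the existing MTL_OFI_RETRY_UNTIL_DONE; the macro name, header paths, and layout here are illustrative, not the actual patch:

```c
/* Sketch of a common-layer retry macro: retry a libfabric call that
 * returns -FI_EAGAIN, driving opal_progress() between attempts so
 * other components can advance and provider resources can free up. */
#include <rdma/fi_errno.h>
#include "opal/runtime/opal_progress.h"

#define OFI_RETRY_UNTIL_DONE(FUNC, RETURN)        \
    do {                                          \
        do {                                      \
            RETURN = (FUNC);                      \
            if (0 == (RETURN)) {                  \
                break;                            \
            }                                     \
            if (-FI_EAGAIN == (RETURN)) {         \
                opal_progress();                  \
            }                                     \
        } while (-FI_EAGAIN == (RETURN));         \
    } while (0)
```

A call site would then look something like `OFI_RETRY_UNTIL_DONE(fi_send(ep, buf, len, NULL, dest_addr, &ctx), ret);`, with the caller handling any non-EAGAIN error in `ret`.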
Additional changes needed in the OFI BTL will be applied in subsequent PRs.
The OFI MTL will require more work, as the situation hit with the HPE CXI provider indicates a need to implement some kind of send-backlog queueing mechanism in the MTL rather than simply spinning on the OFI CQs hoping for progress at the OFI provider level.
Signed-off-by: Howard Pritchard <[email protected]>
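The send-backlog idea floated above is future work, not part of this PR; a purely hypothetical sketch of the shape it might take (all names here are invented for illustration, and a real OPAL list would also need OBJ_CONSTRUCT/OBJ_CLASS setup):

```c
/* Hypothetical: instead of spinning in a retry macro when the
 * provider returns -FI_EAGAIN, park the send on a queue and retry
 * it from the component's progress function. */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_errno.h>
#include "opal/class/opal_list.h"

typedef struct {
    opal_list_item_t super;     /* makes the request queueable */
    struct fid_ep  *ep;
    const void     *buf;
    size_t          len;
    fi_addr_t       dest;
    void           *context;
} pending_send_t;

static opal_list_t send_backlog;  /* per-endpoint in practice */

/* Try the send once; on -FI_EAGAIN, queue it instead of spinning. */
static ssize_t post_or_queue(pending_send_t *req)
{
    ssize_t rc = fi_send(req->ep, req->buf, req->len, NULL,
                         req->dest, req->context);
    if (-FI_EAGAIN == rc) {
        opal_list_append(&send_backlog, &req->super);
        rc = 0;  /* caller sees success; completion arrives later */
    }
    return rc;
}

/* Called from the progress loop: drain as much backlog as fits. */
static void retry_backlog(void)
{
    pending_send_t *req;
    while (NULL != (req = (pending_send_t *)
                        opal_list_remove_first(&send_backlog))) {
        if (-FI_EAGAIN == fi_send(req->ep, req->buf, req->len, NULL,
                                  req->dest, req->context)) {
            /* Still no room: put it back and stop for this pass. */
            opal_list_prepend(&send_backlog, &req->super);
            break;
        }
    }
}
```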