Avoid calling twice into opal_mutex_lock() in ompi_mpi_instance_init() #10720

Closed
wants to merge 1 commit into from

Conversation

janciesko
Contributor

Avoid the deadlock caused by calling opal_mutex_lock twice on the same thread. Since ompi_mpi_instance_init_common is called only from ompi_mpi_instance_init, this refactor should be otherwise equivalent.

@ompiteam-bot

Can one of the admins verify this patch?

@devreal
Contributor

devreal commented Aug 25, 2022

ok to test

@janciesko janciesko force-pushed the fix_instance_locking branch from 8e90958 to 4f604e7 Compare August 26, 2022 17:15
@devreal
Contributor

devreal commented Aug 26, 2022

I'm not sure if this is connected to your patch but there seem to be instances where the instance_lock is released twice:

instance_lock is a recursive lock and the title makes it sound like you're removing lock calls...

@janciesko
Contributor Author

janciesko commented Aug 26, 2022

Good point, the PR might be sidestepping the issue then. It might be that the recursive lock does not work. Currently, this sequence deadlocks:

opal_mutex_lock(&instance_lock);
opal_mutex_lock(&instance_lock);
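
For comparison, here is a minimal sketch of the same double-lock pattern using plain POSIX threads rather than the OPAL or Qthreads APIs. The helper name `second_lock_result` is hypothetical. It locks a mutex twice on one thread and reports what the second call returns. An error-checking mutex is used for the non-recursive case so that the relock is reported as an error instead of hanging.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: create a mutex of the given type, lock it twice on
 * the calling thread, and return the result of the second lock call.
 * Returns -1 if the mutex could not be set up or locked the first time. */
int second_lock_result(int mutex_type)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t lock;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0) {
        return -1;
    }
    if (pthread_mutexattr_settype(&attr, mutex_type) != 0 ||
        pthread_mutex_init(&lock, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    if (pthread_mutex_lock(&lock) != 0) {
        pthread_mutex_destroy(&lock);
        return -1;
    }

    /* Second acquisition by the same thread: succeeds for a recursive
     * mutex, fails with EDEADLK for an error-checking one. */
    rc = pthread_mutex_lock(&lock);

    /* Release once per successful acquisition. */
    if (rc == 0) {
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    return rc;
}
```

Calling it with `PTHREAD_MUTEX_RECURSIVE` should return 0, while `PTHREAD_MUTEX_ERRORCHECK` should return `EDEADLK`. With `PTHREAD_MUTEX_NORMAL` the second call would block forever, which matches the hang described above.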

@gpaulsen
Member

@janciesko What's your plan with this PR?

@janciesko
Contributor Author

This error is specific to the Qthreads ULT support: Qthreads currently does not support recursive mutexes. The correct fix seems to be to add recursive mutex support to Qthreads. This would avoid the deadlock in ompi for that sequence of lock acquisitions.
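
To illustrate what recursive mutex support involves, here is a rough sketch that layers owner and depth tracking over non-recursive primitives. It uses POSIX threads as a stand-in and is not the Qthreads API. The names `rec_mutex_t`, `rec_init`, `rec_lock` and `rec_unlock` are hypothetical. A ULT runtime would need its own notion of task identity in place of `pthread_self()`.

```c
#include <pthread.h>

/* Hypothetical recursive mutex built on a non-recursive guard mutex. */
typedef struct {
    pthread_mutex_t guard;    /* protects owner and depth */
    pthread_cond_t  released; /* signaled when depth drops to 0 */
    pthread_t       owner;    /* only meaningful while depth > 0 */
    unsigned        depth;    /* 0 means unowned */
} rec_mutex_t;

int rec_init(rec_mutex_t *m)
{
    m->depth = 0;
    if (pthread_mutex_init(&m->guard, NULL) != 0) {
        return -1;
    }
    if (pthread_cond_init(&m->released, NULL) != 0) {
        pthread_mutex_destroy(&m->guard);
        return -1;
    }
    return 0;
}

void rec_lock(rec_mutex_t *m)
{
    pthread_t self = pthread_self();

    pthread_mutex_lock(&m->guard);
    if (m->depth > 0 && pthread_equal(m->owner, self)) {
        /* Re-entry by the current owner: just count it. */
        m->depth++;
    } else {
        /* Wait until no thread owns the lock, then take ownership. */
        while (m->depth > 0) {
            pthread_cond_wait(&m->released, &m->guard);
        }
        m->owner = self;
        m->depth = 1;
    }
    pthread_mutex_unlock(&m->guard);
}

/* Returns 0 on success, -1 if the caller does not hold the lock. */
int rec_unlock(rec_mutex_t *m)
{
    int rc = 0;

    pthread_mutex_lock(&m->guard);
    if (m->depth == 0 || !pthread_equal(m->owner, pthread_self())) {
        rc = -1; /* unbalanced or foreign unlock */
    } else if (--m->depth == 0) {
        pthread_cond_signal(&m->released);
    }
    pthread_mutex_unlock(&m->guard);
    return rc;
}
```

With this scheme, two `rec_lock` calls on the same thread must be matched by two `rec_unlock` calls. A third `rec_unlock` returns -1, which is one way to surface a lock that is released more often than it was acquired.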

@janciesko
Contributor Author

We can close this PR and I will open a new PR with the updated calls to Qthreads recursive locking.

@awlauria awlauria closed this Aug 30, 2022