ompi/runtime: add instead of set opal_progress_event_flags #9573


Merged: 1 commit, Nov 2, 2021

Conversation

@wzamazon (Contributor) commented Oct 20, 2021

Currently, ompi_mpi_init() calls

 opal_progress_set_event_flag(OPAL_EVLOOP_NONBLOCK),

with the intention of ensuring OPAL_EVLOOP_NONBLOCK is set
in opal_progress_event_flag.

However, this call removes other existing flags
(like OPAL_EVLOOP_ONCE) from opal_progress_event_flag,
which can cause deadlock.

This patch addresses the issue by adding OPAL_EVLOOP_NONBLOCK
to that flag.

Signed-off-by: Wei Zhang [email protected]
(cherry picked from commit f22d897)

@wzamazon wzamazon requested a review from bosilca October 20, 2021 14:11
@wzamazon (Contributor, Author)

The gcc9 build failed, but the error message is vague:

 ...
  CC       pbsend_init_f.lo
  CC       pbuffer_attach_f.lo
  CC       pbuffer_detach_f.lo
  CC       pcancel_f.lo
  CC       pcart_coords_f.lo
  CC       pcart_create_f.lo
  CC       pcartdim_get_f.lo
  CC       pcart_get_f.lo
  CC       pcart_map_f.lo
  CC       pcart_rank_f.lo
  CC       pcart_shift_f.lo
  CC       pcart_sub_f.lo
  CC       pclose_port_f.lo
FATAL: command execution failed
Command Close created at
	at hudson.remoting.Command.<init>(Command.java:70)
	at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1312)
	at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1310)
	at hudson.remoting.Channel.close(Channel.java:1486)
	at hudson.remoting.Channel.close(Channel.java:1453)
	at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1318)
	at hudson.remoting.Channel$1.handle(Channel.java:607)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:81)
...

@jsquyres (Member)

That usually means a Jenkins worker died abnormally. Just re-run the test.

@jsquyres (Member)

bot:ompi:retest

@gpaulsen (Member)

bot:aws:retest

@wzamazon (Contributor, Author)

All CI tests passed. Ready for review.

@awlauria awlauria added this to the v5.0.0 milestone Oct 25, 2021