ompi,opal/progress: preserve the OPAL_EVLOOP_ONCE flag for opal_event_loop #9480


Merged: 2 commits into open-mpi:master on Oct 8, 2021

Conversation


@wzamazon wzamazon commented Oct 6, 2021

addresses #9447


@rhc54 rhc54 left a comment

Looks correct to me

```diff
@@ -999,7 +999,7 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided,
        CPU utilization for the remainder of MPI_INIT when we are
        blocking on RTE-level events, but may greatly reduce non-TCP
        latency. */
-    opal_progress_set_event_flag(OPAL_EVLOOP_NONBLOCK);
+    opal_progress_add_event_flag(OPAL_EVLOOP_NONBLOCK);
```

opal_progress_set_event_flag returns the old flag. I would prefer not to add a second function but instead do

  int old_flag = opal_progress_set_event_flag(OPAL_EVLOOP_NONBLOCK);
  if( !(old_flag & OPAL_EVLOOP_NONBLOCK) ) {
      opal_progress_set_event_flag( old_flag | OPAL_EVLOOP_NONBLOCK );
  }


Thank you! I think this is a great idea! I removed the commit that introduced the new function and implemented the change as you suggested (with a minor adjustment).

```c
 * OPAL_EVLOOP_NONBLOCK | OPAL_EVLOOP_ONCE
 *
 * OPAL_EVLOOP_NONBLOCK means opal_event_loop() should NOT block on
 * waiting for active events.
```

We talked about this on the issue, and this is not exactly what this flag is about, because as long as there are new events, opal_progress never returns.


I agree. I adjusted the comment in the new revision. The point I wanted to make is that this flag only impacts the behavior when there are no active events.

The comment for opal_progress_set_event_flag states that the default
opal_progress_event_flags is OPAL_EVLOOP_ONELOOP, which is not
accurate, because OPAL_EVLOOP_ONELOOP was removed in commit
33c3b71 and replaced by
OPAL_EVLOOP_NONBLOCK | OPAL_EVLOOP_ONCE.

This patch fixes the comment about the default argument.

It also fixes a typo in the argument section: "vlags" -> "flags"

Signed-off-by: Wei Zhang <[email protected]>
@wzamazon wzamazon requested a review from bosilca October 7, 2021 18:11

wzamazon commented Oct 7, 2021

I ran the following tests to verify the patch:

- Intel MPI Benchmarks IMB-MPI1 (with btl/tcp)
- OSU micro-benchmarks (latency, bw, mbw_mr)
- applications such as LAMMPS (with the LJ and EAM test cases) and HPCG

```diff
@@ -393,7 +393,7 @@ static void evhandler_reg_callbk(pmix_status_t status,
 int ompi_mpi_init(int argc, char **argv, int requested, int *provided,
                   bool reinit_ok)
 {
-    int ret;
+    int ret, old_event_flags;
```

This will lead to compiler warnings when OPAL_ENABLE_PROGRESS_THREADS is not 0


Fixed in new revision.


bosilca commented Oct 7, 2021

The comment indicates that this was done with the goal of improving latency for non-TCP cases. The only way for this to affect latency would be in a setup with multiple recvs posted. Did you notice any performance impact in your tests?

Currently, ompi_mpi_init() calls

     opal_progress_set_event_flag(OPAL_EVLOOP_NONBLOCK),

with the intention of ensuring OPAL_EVLOOP_NONBLOCK is set
in opal_progress_event_flag.

However, this call removes the other existing flags
(such as OPAL_EVLOOP_ONCE) from opal_progress_event_flag,
which can cause deadlock.

This patch addresses the issue by adding OPAL_EVLOOP_NONBLOCK
to that flag instead of overwriting it.

Signed-off-by: Wei Zhang <[email protected]>

wzamazon commented Oct 8, 2021

@bosilca I tested btl/tcp using osu_latency, osu_bw, osu_allreduce (1152 ranks), and osu_allgather (1152 ranks).

For osu_latency and osu_bw, the performance difference is within 2 standard deviations.

For osu_allreduce and osu_allgather, the change improved performance for some small message sizes (difference > 2 standard deviations). For other message sizes, the difference is again within 2 standard deviations.

The following is the osu_latency result. (All data is a 3-run average; the number in parentheses is the standard deviation.)

message size (byte) latency master (us) latency change (us)
0 25.997(0.831) 26.297(1.453)
1 26.077(0.861) 26.353(1.438)
2 26.073(0.838) 26.393(1.398)
4 26.103(0.866) 26.363(1.456)
8 26.267(0.926) 26.507(1.437)
16 26.700(0.837) 26.987(1.437)
32 26.723(0.848) 26.980(1.408)
64 26.803(0.871) 27.090(1.446)
128 26.920(0.885) 27.217(1.455)
256 26.887(0.818) 27.113(1.301)
512 27.073(0.831) 27.283(1.382)
1024 27.517(0.806) 27.727(1.358)
2048 28.880(0.875) 29.133(1.493)
4096 32.017(0.900) 32.370(1.596)
8192 36.247(0.922) 36.653(1.598)
16384 43.527(0.549) 44.197(1.010)
32768 62.220(1.135) 63.667(1.809)
65536 144.183(2.560) 148.737(4.473)
131072 160.163(3.794) 165.380(5.524)
262144 207.117(4.628) 213.630(8.208)
524288 308.577(2.488) 313.717(3.570)
1048576 510.897(2.138) 521.623(3.471)
2097152 945.130(5.538) 957.323(8.837)
4194304 2479.843(5.042) 2494.190(4.837)

The following is osu_bw result

message size master bw (MB/s) PR bw (MB/s)
1 0.420(0.000) 0.440(0.017)
2 0.843(0.012) 0.883(0.025)
4 1.707(0.032) 1.773(0.049)
8 3.397(0.078) 3.557(0.055)
16 6.923(0.240) 7.300(0.121)
32 12.883(0.159) 13.317(0.803)
64 25.960(0.411) 26.843(1.609)
128 51.387(0.405) 53.673(2.943)
256 100.690(0.159) 104.640(4.947)
512 196.107(1.101) 203.910(8.602)
1024 355.663(3.354) 369.007(13.903)
2048 633.607(1.240) 652.680(27.496)
4096 1045.380(10.372) 1098.347(61.761)
8192 1187.243(0.316) 1187.500(0.279)
16384 1189.383(0.374) 1190.260(0.575)
32768 1190.400(0.287) 1189.970(0.428)
65536 1189.460(0.026) 1189.320(0.092)
131072 1190.247(0.090) 1190.247(0.029)
262144 1190.743(0.035) 1190.777(0.012)
524288 1190.973(0.012) 1190.963(0.006)
1048576 1191.090(0.000) 1191.090(0.000)
2097152 1191.150(0.000) 1191.150(0.000)
4194304 1191.180(0.000) 1191.180(0.000)

The following is osu_allreduce result (1152 ranks)

Message Size (byte) runtime of master (us) runtime with change (us)
4 367.493(3.056) 362.583(7.996)
8 372.257(3.376) 359.910(2.336) *
16 376.060(6.276) 366.697(3.107) *
32 382.667(8.927) 366.840(4.064) *
64 381.560(2.428) 365.317(2.168) *
128 378.713(4.709) 368.447(5.622) *
256 381.547(5.734) 366.250(3.176) *
512 390.820(9.925) 378.867(15.143)
1024 390.630(2.761) 404.360(46.346)
2048 406.670(3.026) 413.597(51.617)
4096 428.060(3.152) 429.250(17.812)
8192 482.337(10.986) 464.437(2.115)
16384 849.813(6.832) 833.370(18.671)
32768 886.503(6.343) 863.500(11.619)
65536 956.223(34.957) 936.930(41.237)
131072 1056.793(10.053) 1028.327(3.336)
262144 1443.797(8.492) 1426.460(8.382)
524288 2261.233(27.292) 2261.187(20.286)
1048576 3957.807(23.805) 4086.620(259.036)

The following is osu_allgather result (1152 ranks)

Message Size (byte) runtime of master (us) runtime with change (us)
1 427.737(9.860) 416.587(13.676)
2 463.977(59.748) 409.407(2.984)
4 471.393(58.288) 472.467(106.369)
8 450.167(19.312) 443.510(21.062)
16 448.740(5.359) 431.650(0.740)
32 500.977(4.160) 483.120(1.084) *
64 605.590(2.843) 588.547(1.729) *
128 965.243(4.690) 950.857(5.975)
256 1556.810(4.089) 1541.183(7.976)
512 2825.667(10.573) 2819.557(16.394)
1024 5861.893(194.741) 5837.000(222.494)
2048 13155.333(954.938) 12828.160(108.980)
4096 18867.147(209.805) 18402.710(51.116)
8192 23308.963(577.618) 22562.967(297.951)
16384 34344.003(1185.396) 33130.357(729.296)
32768 90554.647(876.167) 90038.390(709.740)
65536 120246.920(1905.017) 119608.250(1074.850)
131072 176621.410(3097.940) 176961.527(1877.809)
262144 297212.507(6643.612) 296812.370(3248.363)

I will run the same tests on btl/ofi (on EFA) to get data for the non-TCP case.


bosilca commented Oct 8, 2021

The patch looks good. Let's wait for the performance results with ofi before merging this.


wzamazon commented Oct 8, 2021

I ran the same latency, bw, allreduce (1152 ranks), and allgather (1152 ranks) tests on mtl/ofi; the conclusion is similar to that for btl/tcp:

  1. No significant performance difference is shown on osu_latency and osu_bw.
  2. No significant performance difference on osu_allreduce.
  3. For some small message sizes (1 to 32 bytes), osu_allgather shows a minor improvement with the patch.

The following is the data:

osu_latency; the data shown in the table is a 3-run average, and the number in parentheses is the standard deviation:

message size (byte) latency of master branch (us) latency of PR (us)
0 20.040(0.026) 19.440(0.072)
1 20.033(0.012) 19.377(0.015)
2 19.987(0.029) 19.370(0.026)
4 19.987(0.023) 19.347(0.021)
8 20.000(0.026) 19.353(0.031)
16 19.993(0.015) 19.347(0.031)
32 20.027(0.006) 19.363(0.015)
64 20.060(0.036) 19.400(0.044)
128 20.090(0.017) 19.430(0.020)
256 20.163(0.015) 19.487(0.032)
512 20.303(0.042) 19.683(0.031)
1024 20.733(0.006) 20.153(0.015)
2048 21.733(0.012) 21.107(0.012)
4096 23.920(0.050) 23.283(0.012)
8192 28.703(0.050) 28.220(0.050)
16384 31.323(0.081) 30.787(0.244)
32768 35.933(0.429) 35.340(0.602)
65536 43.090(0.383) 42.353(0.864)
131072 100.943(0.762) 99.220(1.897)
262144 139.983(1.858) 139.600(2.843)
524288 248.917(3.746) 247.977(6.629)
1048576 454.467(7.285) 454.170(12.318)
2097152 900.713(18.564) 898.767(9.322)
4194304 1719.867(34.016) 1685.350(23.085)

osu_bw; the data shown in the table is a 3-run average, and the number in parentheses is the standard deviation:

message size (byte) bw of master branch (MB/s) bw of PR (MB/s)
1 0.667(0.006) 0.673(0.006)
2 1.337(0.006) 1.360(0.010)
4 2.703(0.012) 2.757(0.015)
8 5.413(0.025) 5.520(0.026)
16 10.837(0.042) 11.033(0.080)
32 21.653(0.095) 22.103(0.133)
64 43.220(0.240) 44.097(0.258)
128 86.637(0.210) 88.077(0.626)
256 166.580(0.359) 169.853(1.018)
512 311.783(7.760) 321.183(1.204)
1024 574.277(5.231) 588.377(17.137)
2048 1146.297(1.776) 1183.527(15.046)
4096 2243.013(17.998) 2274.647(19.084)
8192 4052.420(15.792) 4099.797(6.380)
16384 5485.043(43.018) 5490.327(102.710)
32768 6970.863(91.388) 6989.997(66.010)
65536 7978.243(260.555) 7727.900(128.444)
131072 7863.903(228.957) 7851.770(69.654)
262144 8393.657(14.863) 8463.930(68.671)
524288 7809.413(57.062) 7878.497(39.805)
1048576 7597.517(11.825) 7632.283(15.187)
2097152 7409.633(31.589) 7415.340(8.892)
4194304 7206.303(5.963) 7190.293(15.232)

Data for allreduce (1152 ranks)

message size (byte) runtime of master branch (us) runtime of PR (us)
4 376.920(6.993) 358.887(1.267)
8 375.193(4.903) 362.313(6.407)
16 379.043(4.907) 367.320(2.416)
32 378.983(3.327) 365.033(0.025)
64 379.040(2.325) 370.940(8.364)
128 389.123(6.489) 382.817(6.643)
256 382.267(4.264) 366.070(2.691)
512 390.613(5.006) 375.397(6.106)
1024 400.167(9.363) 382.817(2.488)
2048 432.973(49.838) 387.823(5.430)
4096 465.160(66.191) 417.010(2.587)
8192 484.983(5.152) 465.047(1.867)
16384 845.573(4.358) 847.223(41.713)
32768 882.293(3.700) 862.260(7.083)
65536 975.670(58.772) 907.817(3.435)
131072 1066.440(19.032) 1031.603(12.494)
262144 1449.860(9.174) 1420.890(8.808)
524288 2300.900(129.546) 2241.980(32.615)
1048576 3965.573(40.579) 3899.763(29.497)

Data for allgather (1152 ranks)

message size (byte) runtime of master branch (us) runtime of PR (us)
1 425.447(4.432) 406.455(2.694)
2 434.780(17.089) 413.325(1.605)
4 426.730(1.569) 409.230(1.640)
8 443.733(16.541) 414.485(1.082)
16 464.647(3.406) 439.060(8.047)
32 500.337(4.117) 483.250(1.612)
64 607.377(0.766) 591.445(6.852)
128 966.753(1.692) 957.025(3.967)
256 1564.667(2.786) 1554.670(9.914)
512 2872.187(19.323) 2850.595(20.004)
1024 5955.720(69.344) 6190.235(249.121)
2048 12713.397(113.910) 13065.555(420.608)
4096 18586.457(91.567) 18599.110(7.821)
8192 22954.580(191.734) 22555.385(154.722)
16384 33934.100(429.840) 33016.585(340.055)
32768 90881.490(4.837) 90116.695(329.349)
65536 119714.870(411.499) 120449.680(978.268)
131072 176020.830(647.948) 177742.485(657.136)
262144 295092.300(1978.021) 299041.975(2652.039)

@bosilca bosilca merged commit 5257cc3 into open-mpi:master Oct 8, 2021