ompi,opal/progress: preserve the OPAL_EVLOOP_ONCE flag for opal_event_loop #9480


Merged: 2 commits into open-mpi:master on Oct 8, 2021

Conversation


@wzamazon wzamazon commented Oct 6, 2021

addresses #9447


@rhc54 rhc54 left a comment

Looks correct to me

```diff
@@ -999,7 +999,7 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided,
        CPU utilization for the remainder of MPI_INIT when we are
        blocking on RTE-level events, but may greatly reduce non-TCP
        latency. */
-    opal_progress_set_event_flag(OPAL_EVLOOP_NONBLOCK);
+    opal_progress_add_event_flag(OPAL_EVLOOP_NONBLOCK);
```

opal_progress_set_event_flag returns the old flag. I would prefer not to add a second function but instead do

  int old_flag = opal_progress_set_event_flag(OPAL_EVLOOP_NONBLOCK);
  if( !(old_flag & OPAL_EVLOOP_NONBLOCK) ) {
      opal_progress_set_event_flag( old_flag | OPAL_EVLOOP_NONBLOCK );
  }


Thank you! I think this is a great idea! I removed the commit that introduced the new function and implemented the change as you suggested (with a minor adjustment).

```c
 * OPAL_EVLOOP_NONBLOCK | OPAL_EVLOOP_ONCE
 *
 * OPAL_EVLOOP_NONBLOCK means opal_event_loop() should NOT block on
 * waiting for active events.
```

We talked about this on the issue, and this is not exactly what this flag is about, because as long as there are new events, opal_progress never returns.


I agree. I adjusted the comment in the new revision. The point I wanted to make is that this flag only impacts the behavior when there are no active events.

The comment for opal_progress_set_event_flag states that the default
opal_progress_event_flags is OPAL_EVLOOP_ONELOOP, which is not
accurate, because OPAL_EVLOOP_ONELOOP was removed in commit
33c3b71 and replaced by
OPAL_EVLOOP_NONBLOCK | OPAL_EVLOOP_ONCE.

This patch fixes the comment about the default argument.

It also fixes a typo in the argument section: "vlags" -> "flags"

Signed-off-by: Wei Zhang <[email protected]>
@wzamazon wzamazon requested a review from bosilca October 7, 2021 18:11

wzamazon commented Oct 7, 2021

I ran the following tests to verify the patch:

- Intel MPI Benchmarks IMB-MPI1 (with btl/tcp)
- OSU micro-benchmarks (latency, bw, mbw_mr)
- applications such as LAMMPS (with the LJ and EAM test cases) and HPCG

```diff
@@ -393,7 +393,7 @@ static void evhandler_reg_callbk(pmix_status_t status,
 int ompi_mpi_init(int argc, char **argv, int requested, int *provided,
                   bool reinit_ok)
 {
-    int ret;
+    int ret, old_event_flags;
```

This will lead to compiler warnings when OPAL_ENABLE_PROGRESS_THREADS is not 0


Fixed in new revision.


bosilca commented Oct 7, 2021

The comment indicates that this was done with the goal of improving latency for non-TCP cases. The only way for this to affect latency would be in a setup with multiple recvs posted. Did you notice any performance impact in your tests?

Currently, ompi_mpi_init() calls

     opal_progress_set_event_flag(OPAL_EVLOOP_NONBLOCK),

with the intention of ensuring OPAL_EVLOOP_NONBLOCK is set
in opal_progress_event_flag.

However, this call removes the other existing flags
(such as OPAL_EVLOOP_ONCE) from opal_progress_event_flag,
which can cause deadlock.

This patch addresses the issue by adding OPAL_EVLOOP_NONBLOCK
to that flag instead of overwriting it.

Signed-off-by: Wei Zhang <[email protected]>

wzamazon commented Oct 8, 2021

@bosilca I tested btl/tcp using osu_latency, osu_bw, osu_allreduce (1152 ranks), and osu_allgather (1152 ranks).

For osu_latency and osu_bw, the performance difference is within 2 standard deviations.

For osu_allreduce and osu_allgather, the change improved performance for some small message sizes (difference > 2 standard deviations). For other message sizes, the difference is again within 2 standard deviations.

The following is the osu_latency result. (All data is a 3-run average; the number in parentheses is the standard deviation.)

message size (byte) latency master (us) latency change (us)
0 25.997(0.831) 26.297(1.453)
1 26.077(0.861) 26.353(1.438)
2 26.073(0.838) 26.393(1.398)
4 26.103(0.866) 26.363(1.456)
8 26.267(0.926) 26.507(1.437)
16 26.700(0.837) 26.987(1.437)
32 26.723(0.848) 26.980(1.408)
64 26.803(0.871) 27.090(1.446)
128 26.920(0.885) 27.217(1.455)
256 26.887(0.818) 27.113(1.301)
512 27.073(0.831) 27.283(1.382)
1024 27.517(0.806) 27.727(1.358)
2048 28.880(0.875) 29.133(1.493)
4096 32.017(0.900) 32.370(1.596)
8192 36.247(0.922) 36.653(1.598)
16384 43.527(0.549) 44.197(1.010)
32768 62.220(1.135) 63.667(1.809)
65536 144.183(2.560) 148.737(4.473)
131072 160.163(3.794) 165.380(5.524)
262144 207.117(4.628) 213.630(8.208)
524288 308.577(2.488) 313.717(3.570)
1048576 510.897(2.138) 521.623(3.471)
2097152 945.130(5.538) 957.323(8.837)
4194304 2479.843(5.042) 2494.190(4.837)

The following is osu_bw result

message size master bw (MB/s) PR bw (MB/s)
1 0.420(0.000) 0.440(0.017)
2 0.843(0.012) 0.883(0.025)
4 1.707(0.032) 1.773(0.049)
8 3.397(0.078) 3.557(0.055)
16 6.923(0.240) 7.300(0.121)
32 12.883(0.159) 13.317(0.803)
64 25.960(0.411) 26.843(1.609)
128 51.387(0.405) 53.673(2.943)
256 100.690(0.159) 104.640(4.947)
512 196.107(1.101) 203.910(8.602)
1024 355.663(3.354) 369.007(13.903)
2048 633.607(1.240) 652.680(27.496)
4096 1045.380(10.372) 1098.347(61.761)
8192 1187.243(0.316) 1187.500(0.279)
16384 1189.383(0.374) 1190.260(0.575)
32768 1190.400(0.287) 1189.970(0.428)
65536 1189.460(0.026) 1189.320(0.092)
131072 1190.247(0.090) 1190.247(0.029)
262144 1190.743(0.035) 1190.777(0.012)
524288 1190.973(0.012) 1190.963(0.006)
1048576 1191.090(0.000) 1191.090(0.000)
2097152 1191.150(0.000) 1191.150(0.000)
4194304 1191.180(0.000) 1191.180(0.000)

The following is osu_allreduce result (1152 ranks)

Message Size (byte) runtime of master (us) runtime with change (us)
4 367.493(3.056) 362.583(7.996)
8 372.257(3.376) 359.910(2.336) *
16 376.060(6.276) 366.697(3.107) *
32 382.667(8.927) 366.840(4.064) *
64 381.560(2.428) 365.317(2.168) *
128 378.713(4.709) 368.447(5.622) *
256 381.547(5.734) 366.250(3.176) *
512 390.820(9.925) 378.867(15.143)
1024 390.630(2.761) 404.360(46.346)
2048 406.670(3.026) 413.597(51.617)
4096 428.060(3.152) 429.250(17.812)
8192 482.337(10.986) 464.437(2.115)
16384 849.813(6.832) 833.370(18.671)
32768 886.503(6.343) 863.500(11.619)
65536 956.223(34.957) 936.930(41.237)
131072 1056.793(10.053) 1028.327(3.336)
262144 1443.797(8.492) 1426.460(8.382)
524288 2261.233(27.292) 2261.187(20.286)
1048576 3957.807(23.805) 4086.620(259.036)

The following is osu_allgather result (1152 ranks)

Message Size (byte) runtime of master (us) runtime with change (us)
1 427.737(9.860) 416.587(13.676)
2 463.977(59.748) 409.407(2.984)
4 471.393(58.288) 472.467(106.369)
8 450.167(19.312) 443.510(21.062)
16 448.740(5.359) 431.650(0.740)
32 500.977(4.160) 483.120(1.084) *
64 605.590(2.843) 588.547(1.729) *
128 965.243(4.690) 950.857(5.975)
256 1556.810(4.089) 1541.183(7.976)
512 2825.667(10.573) 2819.557(16.394)
1024 5861.893(194.741) 5837.000(222.494)
2048 13155.333(954.938) 12828.160(108.980)
4096 18867.147(209.805) 18402.710(51.116)
8192 23308.963(577.618) 22562.967(297.951)
16384 34344.003(1185.396) 33130.357(729.296)
32768 90554.647(876.167) 90038.390(709.740)
65536 120246.920(1905.017) 119608.250(1074.850)
131072 176621.410(3097.940) 176961.527(1877.809)
262144 297212.507(6643.612) 296812.370(3248.363)

I will run the same tests on btl/ofi (on EFA) to get data for the non-TCP case.


bosilca commented Oct 8, 2021

The patch looks good. Let's wait for the performance results with ofi before merging this.


wzamazon commented Oct 8, 2021

I ran the same latency, bw, allreduce (1152 ranks), and allgather (1152 ranks) tests on mtl/ofi; the conclusion is similar to that for btl/tcp:

  1. No significant performance difference is shown on osu_latency and osu_bw.
  2. No significant performance difference on osu_allreduce.
  3. For some small message sizes (1 to 32 bytes), osu_allgather shows a minor improvement with the patch.

The following is the data:

osu_latency; the data shown in the table is a 3-run average, and the number in parentheses is the standard deviation:

message size (byte) latency of master branch (us) latency of PR (us)
0 20.040(0.026) 19.440(0.072)
1 20.033(0.012) 19.377(0.015)
2 19.987(0.029) 19.370(0.026)
4 19.987(0.023) 19.347(0.021)
8 20.000(0.026) 19.353(0.031)
16 19.993(0.015) 19.347(0.031)
32 20.027(0.006) 19.363(0.015)
64 20.060(0.036) 19.400(0.044)
128 20.090(0.017) 19.430(0.020)
256 20.163(0.015) 19.487(0.032)
512 20.303(0.042) 19.683(0.031)
1024 20.733(0.006) 20.153(0.015)
2048 21.733(0.012) 21.107(0.012)
4096 23.920(0.050) 23.283(0.012)
8192 28.703(0.050) 28.220(0.050)
16384 31.323(0.081) 30.787(0.244)
32768 35.933(0.429) 35.340(0.602)
65536 43.090(0.383) 42.353(0.864)
131072 100.943(0.762) 99.220(1.897)
262144 139.983(1.858) 139.600(2.843)
524288 248.917(3.746) 247.977(6.629)
1048576 454.467(7.285) 454.170(12.318)
2097152 900.713(18.564) 898.767(9.322)
4194304 1719.867(34.016) 1685.350(23.085)

osu_bw; the data shown in the table is a 3-run average, and the number in parentheses is the standard deviation:

message size (byte) bw of master branch (MB/s) bw of PR (MB/s)
1 0.667(0.006) 0.673(0.006)
2 1.337(0.006) 1.360(0.010)
4 2.703(0.012) 2.757(0.015)
8 5.413(0.025) 5.520(0.026)
16 10.837(0.042) 11.033(0.080)
32 21.653(0.095) 22.103(0.133)
64 43.220(0.240) 44.097(0.258)
128 86.637(0.210) 88.077(0.626)
256 166.580(0.359) 169.853(1.018)
512 311.783(7.760) 321.183(1.204)
1024 574.277(5.231) 588.377(17.137)
2048 1146.297(1.776) 1183.527(15.046)
4096 2243.013(17.998) 2274.647(19.084)
8192 4052.420(15.792) 4099.797(6.380)
16384 5485.043(43.018) 5490.327(102.710)
32768 6970.863(91.388) 6989.997(66.010)
65536 7978.243(260.555) 7727.900(128.444)
131072 7863.903(228.957) 7851.770(69.654)
262144 8393.657(14.863) 8463.930(68.671)
524288 7809.413(57.062) 7878.497(39.805)
1048576 7597.517(11.825) 7632.283(15.187)
2097152 7409.633(31.589) 7415.340(8.892)
4194304 7206.303(5.963) 7190.293(15.232)

Data for allreduce (1152 ranks)

message size (byte) runtime of master branch (us) runtime of PR (us)
4 376.920(6.993) 358.887(1.267)
8 375.193(4.903) 362.313(6.407)
16 379.043(4.907) 367.320(2.416)
32 378.983(3.327) 365.033(0.025)
64 379.040(2.325) 370.940(8.364)
128 389.123(6.489) 382.817(6.643)
256 382.267(4.264) 366.070(2.691)
512 390.613(5.006) 375.397(6.106)
1024 400.167(9.363) 382.817(2.488)
2048 432.973(49.838) 387.823(5.430)
4096 465.160(66.191) 417.010(2.587)
8192 484.983(5.152) 465.047(1.867)
16384 845.573(4.358) 847.223(41.713)
32768 882.293(3.700) 862.260(7.083)
65536 975.670(58.772) 907.817(3.435)
131072 1066.440(19.032) 1031.603(12.494)
262144 1449.860(9.174) 1420.890(8.808)
524288 2300.900(129.546) 2241.980(32.615)
1048576 3965.573(40.579) 3899.763(29.497)

Data for allgather (1152 ranks)

message size (byte) runtime of master branch (us) runtime of PR (us)
1 425.447(4.432) 406.455(2.694)
2 434.780(17.089) 413.325(1.605)
4 426.730(1.569) 409.230(1.640)
8 443.733(16.541) 414.485(1.082)
16 464.647(3.406) 439.060(8.047)
32 500.337(4.117) 483.250(1.612)
64 607.377(0.766) 591.445(6.852)
128 966.753(1.692) 957.025(3.967)
256 1564.667(2.786) 1554.670(9.914)
512 2872.187(19.323) 2850.595(20.004)
1024 5955.720(69.344) 6190.235(249.121)
2048 12713.397(113.910) 13065.555(420.608)
4096 18586.457(91.567) 18599.110(7.821)
8192 22954.580(191.734) 22555.385(154.722)
16384 33934.100(429.840) 33016.585(340.055)
32768 90881.490(4.837) 90116.695(329.349)
65536 119714.870(411.499) 120449.680(978.268)
131072 176020.830(647.948) 177742.485(657.136)
262144 295092.300(1978.021) 299041.975(2652.039)

@bosilca bosilca merged commit 5257cc3 into open-mpi:master Oct 8, 2021