| 1 | +Tuning Collectives |
| 2 | +================== |
| 3 | + |
| 4 | +Open MPI's ``coll`` framework provides a number of components implementing |
| 5 | +collective communication, including: ``han``, ``libnbc``, ``self``, ``ucc``, ``base``, |
| 6 | +``hcoll``, ``sync``, ``xhc``, ``accelerator``, ``basic``, ``ftagree``, ``inter``, ``portals4``, |
| 7 | +and ``tuned``. Some of these components may not be available depending on how |
| 8 | +Open MPI was compiled and what hardware is available on the system. A run-time |
| 9 | +decision, based on each component's self-reported priority, selects which |
| 10 | +component will be used. These priorities may be adjusted on the command line |
| 11 | +or with any of the other usual ways of setting MCA variables, giving us a way |
| 12 | +to influence or override component selection. In the end, which of the |
| 13 | +available components is selected depends on a number of factors such as the |
| 14 | +underlying hardware and whether or not a specific collective is provided by |
| 15 | +the component, as not all components implement all collectives. However, there |
| 16 | +is always a fallback ``base`` component that steps in and takes over when another |
| 17 | +component fails to provide an implementation. |
| 18 | + |
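| | +As a hedged illustration of adjusting a priority through an MCA parameter file |
| | +rather than the command line, the fragment below raises the priority of the |
| | +``tuned`` component. The value ``100`` is an arbitrary choice for illustration, |
| | +and ``$HOME/.openmpi/mca-params.conf`` is assumed to be the per-user parameter |
| | +file read by your installation. |
| | + |
| | +.. code-block:: sh |
| | + |
| | +   # $HOME/.openmpi/mca-params.conf (assumed location) |
| | +   # Illustrative value only; a higher priority makes tuned more likely |
| | +   # to be selected over the other available components. |
| | +   coll_tuned_priority = 100 |
| | + |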
| 19 | +The remainder of this section describes the tuning options available in the |
| 20 | +``tuned`` collective component. |
| 21 | + |
| 22 | +Fixed, Forced, and Dynamic Decisions |
| 23 | +------------------------------------ |
| 24 | +Open MPI's ``tuned`` collective component has three modes of operation: *fixed |
| 25 | +decision*, *forced algorithm*, and *dynamic decision*. Since different |
| 26 | +collective algorithms perform better in different situations, the purpose of |
| 27 | +these modes is to select an appropriate algorithm each time a collective is |
| 28 | +called. |
| 29 | + |
| 30 | +In the *fixed decision* mode a decision tree, essentially a large set of nested |
| 31 | +if-then-else-if blocks with baked-in communicator and message size thresholds |
| 32 | +derived by measuring performance on existing clusters, selects one of a number |
| 33 | +of available algorithms for the specific collective being called. *Fixed |
| 34 | +decision* is the default. The *fixed decision* mode can work well if your |
| 35 | +cluster hardware is similar to the systems profiled when constructing the |
| 36 | +decision tree. However, in some cases the *fixed decision* rules can yield |
| 37 | +less than ideal performance, for instance when your hardware differs |
| 38 | +substantially from that which was used to derive the fixed rules decision |
| 39 | +tree. |
| 40 | + |
| 41 | +In the *dynamic decision* mode a user can provide a set of rules encoded in an |
| 42 | +ASCII file that tell the ``tuned`` component which algorithm should be used for |
| 43 | +one or more collectives as a function of communicator and message size. For |
| 44 | +collectives for which rules are not provided the *fixed decision* mode is used. |
| 45 | +More detail about the dynamic decision mode and rules file can be found in |
| 46 | +section :ref:`Rules File<RulesFile>`. |
| 47 | + |
| 48 | +In the *forced algorithm* mode one can specify an algorithm for a specific |
| 49 | +collective on the command line. This is discussed in detail in section |
| 50 | +:ref:`Forcing an Algorithm<ForceAlg>`. |
| 51 | + |
| 52 | +.. _ForceAlg: |
| 53 | + |
| 54 | +Forcing an Algorithm |
| 55 | +-------------------- |
| 56 | + |
| 57 | +The simplest method of tuning is to force the ``tuned`` collective component to |
| 58 | +use a specific algorithm for all calls to a specific collective. This can be |
| 59 | +done on the command line, or by any of the other usual ways of setting MCA |
| 60 | +variables. For example, the following command line sets the algorithm to |
| 61 | +linear for all ``MPI_Alltoall`` calls in the entire run: |
| 62 | + |
| 63 | +.. code-block:: sh |
| 64 | + |
| 65 | + shell$ mpirun ... --mca coll_tuned_use_dynamic_rules 1 \ |
| 66 | + --mca coll_tuned_alltoall_algorithm 1 ... |
| 67 | + |
| 68 | +When an algorithm is selected this way, the *fixed decision* rules associated |
| 69 | +with that collective are short-circuited. This is an easy way to select a |
| 70 | +collective algorithm and is most effective when there is a single communicator |
| 71 | +and message sizes are static within the run. On the other hand, when the |
| 72 | +message size for a given collective varies during the run, or there are |
| 73 | +multiple communicators of different sizes, the effectiveness of this approach |
| 74 | +may be limited. The :ref:`rules file <RulesFile>` provides a more comprehensive |
| 75 | +and flexible approach to tuning. For collectives where an algorithm is not |
| 76 | +forced on the command line, the *fixed decision* rules are used. |
| 77 | + |
| 78 | +.. _RulesFile: |
| 79 | + |
| 80 | +Dynamic Decisions and the Rules File |
| 81 | +------------------------------------ |
| 82 | + |
| 83 | +Given that the best choice of algorithm for a given collective depends on a |
| 84 | +number of factors only known at run time, and that some of these factors may |
| 85 | +vary within a run, setting an algorithm on the command line is often an |
| 86 | +ineffective means of tuning. The rules file provides a means of choosing |
| 87 | +an algorithm at run time based on communicator and message size. The rules |
| 88 | +file can be specified on the command line, or by any of the other usual ways |
| 89 | +of setting MCA variables. The file is parsed during initialization and takes |
| 90 | +effect thereafter. |
| 91 | + |
| 92 | +.. code-block:: sh |
| 93 | + |
| 94 | + shell$ mpirun ... --mca coll_tuned_use_dynamic_rules 1 \ |
| 95 | + --mca coll_tuned_dynamic_rules_filename /path/to/my_rules.conf ... |
| 96 | + |
| 97 | +The loaded set of rules is then used to select the algorithm |
| 98 | +to use based on the collective, the communicator size, and the message size. |
| 99 | +Collectives for which rules have not been specified in the file will make use of |
| 100 | +the *fixed decision* rules as usual. |
| 101 | + |
| 102 | +Dynamic tuning files are organized in this format: |
| 103 | + |
| 104 | +.. code-block:: sh |
| 105 | + :linenos: |
| 106 | + |
| 107 | + 1 # Number of collectives |
| 108 | + 1 # Collective ID |
| 109 | + 1 # Number of comm sizes |
| 110 | + 2 # Comm size |
| 111 | + 2 # Number of message sizes |
| 112 | + 0 1 0 0 # Message size 0, algorithm 1, topo and segmentation at 0 |
| 113 | + 1024 2 0 0 # Message size 1024, algorithm 2, topo and segmentation at 0 |
| 114 | + |
| 115 | +The rules file effectively defines, for one or more collectives, a function of |
| 116 | +two variables which, given communicator size and message size, returns an |
| 117 | +algorithm id to use for the call. This mechanism allows one to specify, for each |
| 118 | +collective, an algorithm for any number of ranges of message and communicator |
| 119 | +sizes. As communicators are constructed, a search of the rules table is made |
| 120 | +using the communicator size to select a set of message size rules to be used |
| 121 | +with that communicator. Later, as the collective is invoked, a search of the |
| 122 | +message size rules associated with the communicator is made. The rule with the |
| 123 | +largest message size not exceeding that of the call specifies the algorithm used. |
| 124 | +The actual definition of *message size* is dependent on the collective in |
| 125 | +question; see the section on :ref:`collectives and algorithms<CollectivesAndAlgorithms>` |
| 126 | +for details. |
| 127 | + |
| 128 | +One may provide rules for as many collectives, communicator sizes, and message |
| 129 | +sizes as desired. Simply repeat the sections as needed and adjust the relevant |
| 130 | +count parameters. One must always provide a rule for message size of zero. |
| 131 | +Message size rules are expected in ascending order. The last two parameters in |
| 132 | +the rule may or may not be used and have different meanings depending on the |
| 133 | +collective and algorithm. As of this writing, not all of the relevant control |
| 134 | +parameters can be set by the rules file (see issue #12589). |
| 135 | + |
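| | +As a sketch of how the sections repeat, the following hypothetical rules file |
| | +covers two collectives, using the collective and algorithm ids from the tables |
| | +in :ref:`collectives and algorithms<CollectivesAndAlgorithms>`. The communicator |
| | +sizes and message size thresholds are invented for illustration and are not |
| | +tuning recommendations; listing the collectives and communicator sizes in |
| | +ascending order is an assumption made here for safety. |
| | + |
| | +.. code-block:: sh |
| | + |
| | +   2             # Number of collectives |
| | +   2             # Collective ID (Allreduce) |
| | +   1             # Number of comm sizes |
| | +   8             # Comm size |
| | +   2             # Number of message sizes |
| | +   0 3 0 0       # Message size 0, algorithm 3 (recursive_doubling) |
| | +   65536 4 0 0   # Message size 65536, algorithm 4 (ring) |
| | +   3             # Collective ID (Alltoall) |
| | +   2             # Number of comm sizes |
| | +   4             # Comm size |
| | +   1             # Number of message sizes |
| | +   0 1 0 0       # Message size 0, algorithm 1 (linear) |
| | +   16            # Comm size |
| | +   2             # Number of message sizes |
| | +   0 3 0 0       # Message size 0, algorithm 3 (modified_bruck) |
| | +   4096 2 0 0    # Message size 4096, algorithm 2 (pairwise) |
| | + |
| | +Under these rules, once the comm-size-16 Alltoall block has been selected for a |
| | +communicator, a call with a message size below 4096 uses algorithm 3 and a |
| | +call with a message size larger than 4096 uses algorithm 2. |
| | + |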
| 136 | +.. _CollectivesAndAlgorithms: |
| 137 | + |
| 138 | +Collectives and their Algorithms |
| 139 | +-------------------------------- |
| 140 | +The following table lists the collectives |
| 141 | +implemented by the ``tuned`` collective component along with the enumeration value identifying each. |
| 142 | +It is this value that must be used in the rules file when specifying a set of rules. |
| 143 | +The definition of *message size* is dependent on the collective and is given in the table. |
| 144 | +Tables describing the algorithms available for each collective, and their identifiers, are linked. |
| 145 | + |
| 146 | +.. csv-table:: Collectives |
| 147 | + :header: "Collective", "Id", "Message Size" |
| 148 | + :widths: 20, 10, 65 |
| 149 | + |
| 150 | + :ref:`Allgather<Allgather>`, 0, "datatype size * comm size * number of elements in send buffer" |
| 151 | + :ref:`Allgatherv<Allgatherv>`, 1, "datatype size * sum of number of elements that are to be received from each process (sum of recvcounts)" |
| 152 | + :ref:`Allreduce<Allreduce>`, 2, "datatype size * number of elements in send buffer" |
| 153 | + :ref:`Alltoall<Alltoall>`, 3, "datatype size * comm size * number of elements to send to each process" |
| 154 | + :ref:`Alltoallv<Alltoallv>`, 4, "not used" |
| 155 | + :ref:`Barrier<Barrier>`, 6, "not used" |
| 156 | + :ref:`Bcast<Bcast>`, 7, "datatype size * number of entries in buffer" |
| 157 | + :ref:`Exscan<Exscan>`, 8, "datatype size * comm size" |
| 158 | + :ref:`Gather<Gather>`, 9, "datatype size * comm size * number of elements in send buffer" |
| 159 | + :ref:`Reduce<Reduce>`, 11, "datatype size * number of elements in send buffer" |
| 160 | + :ref:`Reduce_scatter<Reduce_scatter>`, 12, "datatype size * sum of number of elements in result distributed to each process (sum of recvcounts)" |
| 161 | + :ref:`Reduce_scatter_block<Reduce_scatter_block>`, 13, "datatype size * comm size * element count per block" |
| 162 | + :ref:`Scan<Scan>`, 14, "datatype size * comm size" |
| 163 | + :ref:`Scatter<Scatter>`, 15, "datatype size * number of elements in send buffer" |
| 164 | + |
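| | +As a worked example of these definitions, assume a communicator of 16 |
| | +processes, an 8-byte datatype, and 100 elements in the Allreduce send buffer |
| | +or sent to each process in Alltoall (numbers chosen purely for illustration). |
| | +The message size compared against the rules file thresholds then differs by |
| | +collective: |
| | + |
| | +.. math:: |
| | + |
| | +   \text{Allreduce: } 8 \times 100 = 800 \text{ bytes} \qquad |
| | +   \text{Alltoall: } 8 \times 16 \times 100 = 12800 \text{ bytes} |
| | + |
| | +A threshold written for one collective therefore does not carry over |
| | +directly to another. |
| | + |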
| 165 | +.. _Allgather: |
| 166 | + |
| 167 | +Allgather (Id=0) |
| 168 | +~~~~~~~~~~~~~~~~ |
| 169 | + |
| 170 | +.. csv-table:: Allgather Algorithms |
| 171 | + :header: "Id", "Name", "Description" |
| 172 | + :widths: 10, 25, 60 |
| 173 | + |
| 174 | + 0, "ignore", "Use fixed rules" |
| 175 | + 1, "linear", "..." |
| 176 | + 2, "bruck-k-fanout", "..." |
| 177 | + 3, "recursive_doubling", "..." |
| 178 | + 4, "ring", "..." |
| 179 | + 5, "neighbor", "..." |
| 180 | + 6, "two_proc", "..." |
| 181 | + 7, "sparbit", "..." |
| 182 | + 8, "direct-messaging", "..." |
| 183 | + |
| 184 | +.. _Allgatherv: |
| 185 | + |
| 186 | +Allgatherv (Id=1) |
| 187 | +~~~~~~~~~~~~~~~~~ |
| 188 | + |
| 189 | +.. csv-table:: Allgatherv Algorithms |
| 190 | + :header: "Id", "Name", "Description" |
| 191 | + :widths: 10, 25, 60 |
| 192 | + |
| 193 | + 0, "ignore", "Use fixed rules" |
| 194 | + 1, "default", "..." |
| 195 | + 2, "bruck", "..." |
| 196 | + 3, "ring", "..." |
| 197 | + 4, "neighbor", "..." |
| 198 | + 5, "two_proc", "..." |
| 199 | + 6, "sparbit", "..." |
| 200 | + |
| 201 | +.. _Allreduce: |
| 202 | + |
| 203 | +Allreduce (Id=2) |
| 204 | +~~~~~~~~~~~~~~~~ |
| 205 | + |
| 206 | +.. csv-table:: Allreduce Algorithms |
| 207 | + :header: "Id", "Name", "Description" |
| 208 | + :widths: 10, 25, 60 |
| 209 | + |
| 210 | + 0, "ignore", "Use fixed rules" |
| 211 | + 1, "basic_linear", "..." |
| 212 | + 2, "nonoverlapping", "..." |
| 213 | + 3, "recursive_doubling", "..." |
| 214 | + 4, "ring", "..." |
| 215 | + 5, "segmented_ring", "..." |
| 216 | + 6, "rabenseifner", "..." |
| 217 | + 7, "allgather_reduce", "..." |
| 218 | + |
| 219 | +.. _Alltoall: |
| 220 | + |
| 221 | +Alltoall (Id=3) |
| 222 | +~~~~~~~~~~~~~~~ |
| 223 | + |
| 224 | +.. csv-table:: Alltoall Algorithms |
| 225 | + :header: "Id", "Name", "Description" |
| 226 | + :widths: 10, 25, 60 |
| 227 | + |
| 228 | + 0, "ignore", "Use fixed rules" |
| 229 | + 1, "linear", "Launches all non-blocking send/recv pairs and waits for them to complete." |
| 230 | + 2, "pairwise", "For comm size P, implemented as P rounds of blocking MPI_Sendrecv" |
| 231 | + 3, "modified_bruck", "An algorithm exploiting network packet quantization to achieve O(log P) time complexity for comm size P. Typically best for very small message sizes." |
| 232 | + 4, "linear_sync", "Keep N non-blocking MPI_Isend/Irecv pairs in flight at all times. N is set by the coll_tuned_alltoall_max_requests MCA variable." |
| 233 | + 5, "two_proc", "An implementation tailored for alltoall between 2 ranks; it is not used otherwise." |
| 234 | + |
| 235 | +.. _Alltoallv: |
| 236 | + |
| 237 | +Alltoallv (Id=4) |
| 238 | +~~~~~~~~~~~~~~~~ |
| 239 | + |
| 240 | +.. csv-table:: Alltoallv Algorithms |
| 241 | + :header: "Id", "Name", "Description" |
| 242 | + :widths: 10, 25, 60 |
| 243 | + |
| 244 | + 0, "ignore", "Use fixed rules" |
| 245 | + 1, "basic_linear", "..." |
| 246 | + 2, "pairwise", "..." |
| 247 | + |
| 248 | +.. _Barrier: |
| 249 | + |
| 250 | +Barrier (Id=6) |
| 251 | +~~~~~~~~~~~~~~ |
| 252 | + |
| 253 | +.. csv-table:: Barrier Algorithms |
| 254 | + :header: "Id", "Name", "Description" |
| 255 | + :widths: 10, 25, 60 |
| 256 | + |
| 257 | + 0, "ignore", "Use fixed rules" |
| 258 | + 1, "linear", "..." |
| 259 | + 2, "double_ring", "..." |
| 260 | + 3, "recursive_doubling", "..." |
| 261 | + 4, "bruck", "..." |
| 262 | + 5, "two_proc", "..." |
| 263 | + 6, "tree", "..." |
| 264 | + |
| 265 | +.. _Bcast: |
| 266 | + |
| 267 | +Bcast (Id=7) |
| 268 | +~~~~~~~~~~~~ |
| 269 | + |
| 270 | +.. csv-table:: Bcast Algorithms |
| 271 | + :header: "Id", "Name", "Description" |
| 272 | + :widths: 10, 25, 60 |
| 273 | + |
| 274 | + 0, "ignore", "Use fixed rules" |
| 275 | + 1, "basic_linear", "..." |
| 276 | + 2, "chain", "..." |
| 277 | + 3, "pipeline", "..." |
| 278 | + 4, "split_binary_tree", "..." |
| 279 | + 5, "binary_tree", "..." |
| 280 | + 6, "binomial", "..." |
| 281 | + 7, "knomial", "..." |
| 282 | + 8, "scatter_allgather", "..." |
| 283 | + 9, "scatter_allgather_ring", "..." |
| 284 | + |
| 285 | +.. _Exscan: |
| 286 | + |
| 287 | +Exscan (Id=8) |
| 288 | +~~~~~~~~~~~~~ |
| 289 | + |
| 290 | +.. csv-table:: Exscan Algorithms |
| 291 | + :header: "Id", "Name", "Description" |
| 292 | + :widths: 10, 25, 60 |
| 293 | + |
| 294 | + 0, "ignore", "Use fixed rules" |
| 295 | + 1, "linear", "..." |
| 296 | + 2, "recursive_doubling", "..." |
| 297 | + |
| 298 | +.. _Gather: |
| 299 | + |
| 300 | +Gather (Id=9) |
| 301 | +~~~~~~~~~~~~~ |
| 302 | + |
| 303 | +.. csv-table:: Gather Algorithms |
| 304 | + :header: "Id", "Name", "Description" |
| 305 | + :widths: 10, 25, 60 |
| 306 | + |
| 307 | + 0, "ignore", "Use fixed rules" |
| 308 | + 1, "basic_linear", "..." |
| 309 | + 2, "binomial", "..." |
| 310 | + 3, "linear_sync", "..." |
| 311 | + |
| 312 | +.. _Reduce: |
| 313 | + |
| 314 | +Reduce (Id=11) |
| 315 | +~~~~~~~~~~~~~~ |
| 316 | + |
| 317 | +.. csv-table:: Reduce Algorithms |
| 318 | + :header: "Id", "Name", "Description" |
| 319 | + :widths: 10, 25, 60 |
| 320 | + |
| 321 | + 0, "ignore", "Use fixed rules" |
| 322 | + 1, "linear", "..." |
| 323 | + 2, "chain", "..." |
| 324 | + 3, "pipeline", "..." |
| 325 | + 4, "binary", "..." |
| 326 | + 5, "binomial", "..." |
| 327 | + 6, "in-order_binary", "..." |
| 328 | + 7, "rabenseifner", "..." |
| 329 | + 8, "knomial", "..." |
| 330 | + |
| 331 | +.. _Reduce_scatter: |
| 332 | + |
| 333 | +Reduce_scatter (Id=12) |
| 334 | +~~~~~~~~~~~~~~~~~~~~~~ |
| 335 | + |
| 336 | +.. csv-table:: Reduce_scatter Algorithms |
| 337 | + :header: "Id", "Name", "Description" |
| 338 | + :widths: 10, 25, 60 |
| 339 | + |
| 340 | + 0, "ignore", "Use fixed rules" |
| 341 | + 1, "non-overlapping", "..." |
| 342 | + 2, "recursive_halving", "..." |
| 343 | + 3, "ring", "..." |
| 344 | + 4, "butterfly", "..." |
| 345 | + |
| 346 | +.. _Reduce_scatter_block: |
| 347 | + |
| 348 | +Reduce_scatter_block (Id=13) |
| 349 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 350 | + |
| 351 | +.. csv-table:: Reduce_scatter_block Algorithms |
| 352 | + :header: "Id", "Name", "Description" |
| 353 | + :widths: 10, 25, 60 |
| 354 | + |
| 355 | + 0, "ignore", "Use fixed rules" |
| 356 | + 1, "basic_linear", "..." |
| 357 | + 2, "recursive_doubling", "..." |
| 358 | + 3, "recursive_halving", "..." |
| 359 | + 4, "butterfly", "..." |
| 360 | + |
| 361 | +.. _Scan: |
| 362 | + |
| 363 | +Scan (Id=14) |
| 364 | +~~~~~~~~~~~~ |
| 365 | + |
| 366 | +.. csv-table:: Scan Algorithms |
| 367 | + :header: "Id", "Name", "Description" |
| 368 | + :widths: 10, 25, 60 |
| 369 | + |
| 370 | + 0, "ignore", "Use fixed rules" |
| 371 | + 1, "linear", "..." |
| 372 | + 2, "recursive_doubling", "..." |
| 373 | + |
| 374 | +.. _Scatter: |
| 375 | + |
| 376 | +Scatter (Id=15) |
| 377 | +~~~~~~~~~~~~~~~ |
| 378 | + |
| 379 | +.. csv-table:: Scatter Algorithms |
| 380 | + :header: "Id", "Name", "Description" |
| 381 | + :widths: 10, 25, 60 |
| 382 | + |
| 383 | + 0, "ignore", "Use fixed rules" |
| 384 | + 1, "basic_linear", "..." |
| 385 | + 2, "binomial", "..." |
| 386 | + 3, "linear_nb", "..." |
| 387 | + |