Skip to content

Commit 2e21c2e

Browse files
authored
Merge pull request #12665 from jsquyres/pr/v5.0.x/coll-tuned-doc-tuning
v5.0.x: improve documentation for the coll tuned component
2 parents 580c452 + 1c29777 commit 2e21c2e

File tree

2 files changed

+388
-0
lines changed

2 files changed

+388
-0
lines changed

docs/tuning-apps/coll-tuned.rst

Lines changed: 387 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,387 @@
1+
Tuning Collectives
2+
==================
3+
4+
Open MPI's ``coll`` framework provides a number of components implementing
5+
collective communication, including: ``han``, ``libnbc``, ``self``, ``ucc`` ``base``,
6+
``hcoll``, ``sync``, ``xhc``, ``accelerator``, ``basic``, ``ftagree``, ``inter``, ``portals4``,
7+
and ``tuned``. Some of these components may not be available depending on how
8+
Open MPI was compiled and what hardware is available on the system. A run-time
9+
decision based on each component's self reported priority, selects which
10+
component will be used. These priorities may be adjusted on the command line
11+
or with any of the other usual ways of setting MCA variables, giving us a way
12+
to influence or override component selection. In the end, which of the
13+
available components is selected depends on a number of factors such as the
14+
underlying hardware and the whether or not a specific collective is provided by
15+
the component as not all components implement all collectives. However, there
16+
is always a fallback ``base`` component that steps in and takes over when another
17+
component fails to provide an implementation.
18+
19+
The remainder of this section describes the tuning options available in the
20+
``tuned`` collective component.
21+
22+
Fixed, Forced, and Dynamic Decisions
23+
------------------------------------
24+
Open MPI's ``tuned`` collective component has three modes of operation, *fixed
25+
decision*, *forced algorithm*, and *dynamic decision*. Since different
26+
collective algorithms perform better in different situations, the purpose of
27+
these modes is, when a collective is called, to select an appropriate
28+
algorithm.
29+
30+
In the *fixed decision* mode a decision tree, essentially a large set of nested
31+
if-then-else-if blocks with baked in comm and message size thresholds derived
32+
by measuring performance on existing clusters, selects one of a number of
33+
available algorithms for the specific collective being called. *Fixed
34+
decision* is the default. The *fixed decision* mode can work well if your
35+
cluster hardware is similar to the systems profiled when constructing the
36+
decision tree. However, in some cases the *fixed decision* rules can yield
37+
less than ideal performance. For instance, when your hardware differs
38+
substantially from that which was used to derive the fixed rules decision
39+
tree.
40+
41+
In the *dynamic decision* mode a user can provide a set of rules encoded in an
42+
ASCII file that tell the ``tuned`` component which algorithm should be for one or
43+
more collectives as a function of communicator and message size. For the
44+
collectives which rules are not provided the *fixed decision* mode is used.
45+
More detail about the dynamic decision mode and rules file can be found in
46+
section :ref:`Rules File<RulesFile>`.
47+
48+
In the *forced algorithm* mode one can specify an algorithm for a specific
49+
collective on the command line. This is discussed in detail in section
50+
:ref:`Forcing an Algorithm<ForceAlg>`.
51+
52+
.. _ForceAlg:
53+
54+
Forcing an Algorithm
55+
--------------------
56+
57+
The simplest method of tuning is to force the ``tuned`` collective component to
58+
use a specific algorithm for all calls to a specific collective. This can be
59+
done on the command line, or by any of the other usual ways of setting MCA
60+
variables. For example, the following command line sets the algorithm to
61+
linear for all MPI_Alltoall calls in the entire run.
62+
63+
.. code-block:: sh
64+
65+
shell$ mpirun ... --mca coll_tuned_use_dynamic_rules 1 \
66+
--mca coll_tuned_alltoall_algorithm 1 ...
67+
68+
When an algorithm is selected this way, the *fixed decision* rules associated
69+
with that collective are short circuited. This is an easy way to select a
70+
collective algorithm and is most effective when there is a single communicator
71+
and message sizes are static with in the run. On the other hand, when the
72+
message size for a given collective varies during the run, or there are
73+
multiple communicators of different sizes, the effectiveness of this approach
74+
may be limited. The :ref:`rules file <RulesFile>` provides a more comprehensive
75+
and flexible approach to tuning. For collectives where an algorithm is not
76+
forced on the command line, the *fixed decision* rules are used.
77+
78+
.. _RulesFile:
79+
80+
Dynamic Decisions and the Rules File
81+
------------------------------------
82+
83+
Given that the best choice of algorithm for a given collective depends on a
84+
number of factors only known at run time, and that some of these factors may
85+
vary with in a run, setting an algorithm on the command line often is an
86+
ineffective means of tuning. The rules file provides a means of choosing
87+
an algorithm at run time based on communicator and message size. The rules
88+
file can be specified on the command line, or the other usual ways to set MCA
89+
variables. The file is parsed during initialization and takes affect there
90+
after.
91+
92+
.. code-block:: sh
93+
94+
shell$ mpirun ... --mca coll_tuned_use_dynamic_rules 1 \
95+
--mca coll_tuned_dynamic_rules_filename /path/to/my_rules.conf ...
96+
97+
The loaded set of rules then are used to select the algorithm
98+
to use based on the collective, the communicator size, and the message size.
99+
Collectives for which rules have not be specified in the file will make use of
100+
the *fixed decision* rules as usual.
101+
102+
Dynamic tuning files are organized in this format:
103+
104+
.. code-block:: sh
105+
:linenos:
106+
107+
1 # Number of collectives
108+
1 # Collective ID
109+
1 # Number of comm sizes
110+
2 # Comm size
111+
2 # Number of message sizes
112+
0 1 0 0 # Message size 0, algorithm 1, topo and segmentation at 0
113+
1024 2 0 0 # Message size 1024, algorithm 2, topo and segmentation at 0
114+
115+
The rules file effectively defines, for one or more collectives, a function of
116+
two variables, which given communicator and message size, returns an algorithm
117+
id to use for the call. This mechanism allows one to specify for each
118+
collective, an algorithm for any number of ranges of message and communicator
119+
sizes. As communicators are constructed, a search of the rules table is made
120+
using the communicator size to select a set of message size rules to be used
121+
with that communicator. Later as the collective is invoked, a search of the
122+
message size rules associated with the communicator is made. The rule with the
123+
nearest (less than) matching message size specifies the algorithm that is used.
124+
The actual definition of *message size* is dependent on the collective in
125+
question, see the section on :ref:`collectives and algorithms<CollectivesAndAlgorithms>`
126+
for details.
127+
128+
One may provide rules for as many collectives, communicator sizes, and message
129+
sizes as desired. Simply repeat the sections as needed and adjust the relevant
130+
count parameters. One must always provide a rule for message size of zero.
131+
Message size rules are expected in ascending order. The last two parameters in
132+
the rule may or may not be used and have different meaning depending on the
133+
collective and algorithm. As of writing not all of the relevant control
134+
parameters can be set by the rules file (See issue #12589).
135+
136+
.. _CollectivesAndAlgorithms:
137+
138+
Collectives and their Algorithms
139+
--------------------------------
140+
The following table lists the collectives
141+
implemented by the ``tuned`` collective component along with the enumeration value identifying it.
142+
It is this value that must be used in the rules file when specifying a set of rules.
143+
The definition of *message size* is dependent on the collective and is given in the table.
144+
Tables describing the algorithms available for each collective, and there identifiers, are linked.
145+
146+
.. csv-table:: Collectives
147+
:header: "Collective", "Id", "Message Size"
148+
:widths: 20, 10, 65
149+
150+
:ref:`Allgather<Allgather>`, 0, "datatype size * comm size * number of elements in send buffer"
151+
:ref:`Allgatherv<Allgatherv>`, 1, "datatype size * sum of number of elements that are to be received from each process (sum of recvcounts)"
152+
:ref:`Allreduce<Allreduce>`, 2, "datatype size * number of elements in send buffer"
153+
:ref:`Alltoall<Alltoall>`, 3, "datatype size * comm size * number of elements to send to each process"
154+
:ref:`Alltoallv<Alltoallv>`, 4, "not used"
155+
:ref:`Barrier<Barrier>`, 6, "not used"
156+
:ref:`Bcast<Bcast>`, 7, "datatype size * number of entries in buffer"
157+
:ref:`Exscan<Exscan>`, 8, "datatype size * comm size"
158+
:ref:`Gather<Gather>`, 9, "datatype size * comm size * number of elements in send buffer"
159+
:ref:`Reduce<Reduce>`, 11, "datatype size * number of elements in send buffer"
160+
:ref:`Reduce_scatter<Reduce_scatter>`, 12, "datatype size * sum of number of elements in result distributed to each process (sum of recvcounts)"
161+
:ref:`Reduce_scatter_block<Reduce_scatter_block>`, 13, "datatype size * comm size * element count per block"
162+
:ref:`Scan<Scan>`, 14, "datatype size * comm size"
163+
:ref:`Scatter<Scatter>`, 15, "datatype size * number of elements in send buffer"
164+
165+
.. _Allgather:
166+
167+
Allgather (Id=0)
168+
~~~~~~~~~~~~~~~~
169+
170+
.. csv-table:: Allgather Algorithms
171+
:header: "Id", "Name", "Description"
172+
:widths: 10, 25, 60
173+
174+
0, "ignore", "Use fixed rules"
175+
1, "linear", "..."
176+
2, "bruck-k-fanout", "..."
177+
3, "recursive_doubling", "..."
178+
4, "ring", "..."
179+
5, "neighbor", "..."
180+
6, "two_proc", "..."
181+
7, "sparbit", "..."
182+
8, "direct-messaging", "..."
183+
184+
.. _Allgatherv:
185+
186+
Allgatherv (Id=1)
187+
~~~~~~~~~~~~~~~~~
188+
189+
.. csv-table:: Allgatherv Algorithms
190+
:header: "Id", "Name", "Description"
191+
:widths: 10, 25, 60
192+
193+
0, "ignore", "Use fixed rules"
194+
1, "default", "..."
195+
2, "bruck", "..."
196+
3, "ring", "..."
197+
4, "neighbor", "..."
198+
5, "two_proc", "..."
199+
6, "sparbit", "..."
200+
201+
.. _Allreduce:
202+
203+
Allreduce (Id=2)
204+
~~~~~~~~~~~~~~~~
205+
206+
.. csv-table:: Allreduce Algorithms
207+
:header: "Id", "Name", "Description"
208+
:widths: 10, 25, 60
209+
210+
0, "ignore", "Use fixed rules"
211+
1, "basic_linear", "..."
212+
2, "nonoverlapping", "..."
213+
3, "recursive_doubling", "..."
214+
4, "ring", "..."
215+
5, "segmented_ring", "..."
216+
6, "rabenseifner", "..."
217+
7, "allgather_reduce", "..."
218+
219+
.. _Alltoall:
220+
221+
Alltoall (Id=3)
222+
~~~~~~~~~~~~~~~
223+
224+
.. csv-table:: Alltoall Algorithms
225+
:header: "Id", "Name", "Description"
226+
:widths: 10, 25, 60
227+
228+
0, "ignore", "Use fixed rules"
229+
1, "linear", "Launches all non-blocking send/recv pairs and wait for them to complete."
230+
2, "pairwise", "For comm size P, implemented as P rounds of blocking MPI_Sendrecv"
231+
3, "modified_bruck", "An algorithm exploiting network packet quantization to achieve O(log) time complexity. Typically best for very small message sizes."
232+
4, "linear_sync", "Keep N non-blocking MPI_Isend/Irecv pairs in flight at all times. N is set by the coll_tuned_alltoall_max_requests MCA variable."
233+
5, "two_proc", "An implementation tailored for alltoall between 2 ranks, otherwise it is not used."
234+
235+
.. _Alltoallv:
236+
237+
Alltoallv (Id=4)
238+
~~~~~~~~~~~~~~~~
239+
240+
.. csv-table:: Alltoallv Algorithms
241+
:header: "Id", "Name", "Description"
242+
:widths: 10, 25, 60
243+
244+
0, "ignore", "Use fixed rules"
245+
1, "basic_linear", "..."
246+
2, "pairwise", "..."
247+
248+
.. _Barrier:
249+
250+
Barrier (Id=6)
251+
~~~~~~~~~~~~~~
252+
253+
.. csv-table:: Barrier Algorithms
254+
:header: "Id", "Name", "Description"
255+
:widths: 10, 25, 60
256+
257+
0, "ignore", "Use fixed rules"
258+
1, "linear", "..."
259+
2, "double_ring", "..."
260+
3, "recursive_doubling", "..."
261+
4, "bruck", "..."
262+
5, "two_proc", "..."
263+
6, "tree", "..."
264+
265+
.. _Bcast:
266+
267+
Bcast (Id=7)
268+
~~~~~~~~~~~~
269+
270+
.. csv-table:: Bcast Algorithms
271+
:header: "Id", "Name", "Description"
272+
:widths: 10, 25, 60
273+
274+
0, "ignore", "Use fixed rules"
275+
1, "basic_linear", "..."
276+
2, "chain", "..."
277+
3, "pipeline", "..."
278+
4, "split_binary_tree", "..."
279+
5, "binary_tree", "..."
280+
6, "binomial", "..."
281+
7, "knomial", "..."
282+
8, "scatter_allgather", "..."
283+
9, "scatter_allgather_ring", "..."
284+
285+
.. _Exscan:
286+
287+
Exscan (Id=8)
288+
~~~~~~~~~~~~~
289+
290+
.. csv-table:: Exscan Algorithms
291+
:header: "Id", "Name", "Description"
292+
:widths: 10, 25, 60
293+
294+
0, "ignore", "Use fixed rules"
295+
1, "linear", "..."
296+
2, "recursive_doubling", "..."
297+
298+
.. _Gather:
299+
300+
Gather (Id=9)
301+
~~~~~~~~~~~~~
302+
303+
.. csv-table:: Gather Algorithms
304+
:header: "Id", "Name", "Description"
305+
:widths: 10, 25, 60
306+
307+
0, "ignore", "Use fixed rules"
308+
1, "basic_linear", "..."
309+
2, "binomial", "..."
310+
3, "linear_sync", "..."
311+
312+
.. _Reduce:
313+
314+
Reduce (Id=11)
315+
~~~~~~~~~~~~~~
316+
317+
.. csv-table:: Reduce Algorithms
318+
:header: "Id", "Name", "Description"
319+
:widths: 10, 25, 60
320+
321+
0, "ignore", "Use fixed rules"
322+
1, "linear", "..."
323+
2, "chain", "..."
324+
3, "pipeline", "..."
325+
4, "binary", "..."
326+
5, "binomial", "..."
327+
6, "in-order_binary", "..."
328+
7, "rabenseifner", "..."
329+
8, "knomial", "..."
330+
331+
.. _Reduce_scatter:
332+
333+
Reduce_scatter (Id=12)
334+
~~~~~~~~~~~~~~~~~~~~~~
335+
336+
.. csv-table:: Reduce_scatter Algorithms
337+
:header: "Id", "Name", "Description"
338+
:widths: 10, 25, 60
339+
340+
0, "ignore", "Use fixed rules"
341+
1, "non-overlapping", "..."
342+
2, "recursive_halving", "..."
343+
3, "ring", "..."
344+
4, "butterfly", "..."
345+
346+
.. _Reduce_scatter_block:
347+
348+
Reduce_scatter_block (Id=13)
349+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
350+
351+
.. csv-table:: Reduce_scatter_block Algorithms
352+
:header: "Id", "Name", "Description"
353+
:widths: 10, 25, 60
354+
355+
0, "ignore", "Use fixed rules"
356+
1, "basic_linear", "..."
357+
2, "recursive_doubling", "..."
358+
3, "recursive_halving", "..."
359+
4, "butterfly", "..."
360+
361+
.. _Scan:
362+
363+
Scan (Id=14)
364+
~~~~~~~~~~~~
365+
366+
.. csv-table:: Scan Algorithms
367+
:header: "Id", "Name", "Description"
368+
:widths: 10, 25, 60
369+
370+
0, "ignore", "Use fixed rules"
371+
1, "linear", "..."
372+
2, "recursive_doubling", "..."
373+
374+
.. _Scatter:
375+
376+
Scatter (Id=15)
377+
~~~~~~~~~~~~~~~
378+
379+
.. csv-table:: Scatter Algorithms
380+
:header: "Id", "Name", "Description"
381+
:widths: 10, 25, 60
382+
383+
0, "ignore", "Use fixed rules"
384+
1, "basic_linear", "..."
385+
2, "binomial", "..."
386+
3, "linear_nb", "..."
387+

docs/tuning-apps/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,5 +16,6 @@ components that can be tuned to affect behavior at run time.
1616
large-clusters/index
1717
affinity
1818
mpi-io/index
19+
coll-tuned
1920
benchmarking
2021
heterogeneity

0 commit comments

Comments
 (0)