Commit 1b1dd85

opal/ofi: update nic selection function doc
The documentation needs an update to reflect latest implementation. The original cpuset matching logic has been replaced with a new distance calculation algorithm. This change also clarifies the round-robin selection process when we need to break a tie.

Signed-off-by: Wenduo Wang <[email protected]>
(cherry picked from commit 3aba0bb)
1 parent f9800fd commit 1b1dd85

2 files changed: +50 -46 lines changed

opal/mca/common/ofi/common_ofi.c

Lines changed: 8 additions & 4 deletions
@@ -670,10 +670,10 @@ static int get_provider_distance(struct fi_info *provider, hwloc_topology_t topo
 /**
  * @brief Get the nearest device to the current thread
  *
- * Use the PMIx server or calculate the device distances, then out of the set of
- * returned distances find the subset of the nearest devices. This can be
- * 0 or more.
- * If there are multiple equidistant devices, break the tie using the rank.
+ * Compute the distances from the current thread to each NIC in provider_list,
+ * and select the NIC with the shortest distance.
+ * If there are multiple equidistant devices, break the tie using local rank
+ * to balance NIC utilization.
  *
  * @param[in] topoloy hwloc topology
  * @param[in] provider_list List of providers to select from
@@ -936,6 +936,10 @@ struct fi_info *opal_common_ofi_select_provider(struct fi_info *provider_list,
     package_rank = get_package_rank(process_info);

 #if OPAL_OFI_PCI_DATA_AVAILABLE
+    /**
+     * If provider PCI BDF information is available, we calculate its physical distance
+     * to the current process, and select the provider with the shortest distance.
+     */
     ret = get_nearest_nic(opal_hwloc_topology, provider_list, num_providers, package_rank,
                           &provider);
     if (OPAL_SUCCESS == ret) {
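
The tie-break described in the updated get_nearest_nic comment can be illustrated with a small standalone sketch. This is not the Open MPI implementation: pick_nearest_nic is a hypothetical helper, and it assumes the per-NIC distances have already been computed into a plain array (the real code derives them from the hwloc topology and fi_info data). The rank passed in mirrors the package_rank argument visible in the call above.

```c
#include <stddef.h>

/*
 * Hypothetical helper (illustration only, not part of common_ofi.c).
 * Given one precomputed distance per NIC, return the index of the NIC to use.
 * Among the NICs that share the minimum distance, the choice is
 * package_rank % (number of equidistant NICs).
 * Returns -1 if the list is empty.
 */
static int pick_nearest_nic(const unsigned *distances, size_t num_nics,
                            unsigned package_rank)
{
    if (NULL == distances || 0 == num_nics) {
        return -1;
    }

    /* Pass 1: find the minimum distance and count how many NICs share it. */
    unsigned min_dist = distances[0];
    size_t num_nearest = 1;
    for (size_t i = 1; i < num_nics; i++) {
        if (distances[i] < min_dist) {
            min_dist = distances[i];
            num_nearest = 1;
        } else if (distances[i] == min_dist) {
            num_nearest++;
        }
    }

    /* Pass 2: walk to the (package_rank % num_nearest)-th nearest NIC. */
    size_t target = package_rank % num_nearest;
    for (size_t i = 0; i < num_nics; i++) {
        if (distances[i] == min_dist) {
            if (0 == target) {
                return (int) i;
            }
            target--;
        }
    }
    return -1; /* not reached: at least one NIC has the minimum distance */
}
```

With distances {20, 10, 10, 30}, two NICs tie at distance 10, so successive ranks alternate between indices 1 and 2 rather than all landing on the first nearest NIC.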

opal/mca/common/ofi/common_ofi.h

Lines changed: 42 additions & 42 deletions
@@ -135,47 +135,47 @@ OPAL_DECLSPEC int opal_common_ofi_providers_subset_of_list(struct fi_info *provi
 /**
  * Selects NIC (provider) based on hardware locality
  *
- * In multi-nic situations, use hardware topology to pick the "best"
- * of the selected NICs.
- * There are 3 main cases that this covers:
- *
- *   1. If the first provider passed into this function is the only valid
- *      provider, this provider is returned.
- *
- *   2. If there is more than 1 provider that matches the type of the first
- *      provider in the list, and the BDF data
- *      is available then a provider is selected based on locality of device
- *      cpuset and process cpuset and tries to ensure that processes
- *      are distributed evenly across NICs. This has two separate
- *      cases:
- *
- *      i. There is one or more provider local to the process:
- *
- *         (local rank % number of providers of the same type
- *         that share the process cpuset) is used to select one
- *         of these providers.
- *
- *      ii. There is no provider that is local to the process:
- *
- *         (local rank % number of providers of the same type)
- *         is used to select one of these providers
- *
- *   3. If there is more than 1 providers of the same type in the
- *      list, and the BDF data is not available (the ofi version does
- *      not support fi_info.nic or the provider does not support BDF)
- *      then (local rank % number of providers of the same type) is
- *      used to select one of these providers
- *
- * @param provider_list (IN) struct fi_info* An initially selected
- *                      provider NIC. The provider name and
- *                      attributes are used to restrict NIC
- *                      selection. This provider is returned if the
- *                      NIC selection fails.
- *
- * @param provider (OUT) struct fi_info* object with the selected
- *                 provider if the selection succeeds
- *                 if the selection fails, returns the fi_info
- *                 object that was initially provided.
+ * The selection is based on the following priority:
+ *
+ * Single-NIC:
+ *
+ *   If only 1 provider is available, always return that provider.
+ *
+ * Multi-NIC:
+ *
+ *   1. If the process is NOT bound, pick a NIC using (local rank % number
+ *      of providers of the same type). This gives a fair chance to each
+ *      qualified NIC and balances overall utilization.
+ *
+ *   2. If the process is bound, we compare providers in the list that have
+ *      the same type as the first provider, and find the provider with the
+ *      shortest distance to the current process.
+ *
+ *      i. If the provider has PCI BDF data, we attempt to compute the
+ *         distance between the NIC and the current process cpuset. The NIC
+ *         with the shortest distance is returned.
+ *
+ *         * For equidistant NICs, we select a NIC in round-robin fashion
+ *           using the package rank of the current process, i.e. (package
+ *           rank % number of providers with the same distance).
+ *
+ *      ii. If we cannot compute the distance between the NIC and the
+ *         current process, e.g. PCI BDF data is not available, a NIC will be
+ *         selected in a round-robin fashion using package rank, i.e. (package
+ *         rank % number of providers of the same type).
+ *
+ * @param[in] provider_list struct fi_info* An initially selected
+ *                          provider NIC. The provider name and
+ *                          attributes are used to restrict NIC
+ *                          selection. This provider is returned if the
+ *                          NIC selection fails.
+ *
+ * @param[in] process_info opal_process_info_t* The current process info
+ *
+ * @param[out] provider struct fi_info* object with the selected
+ *                      provider if the selection succeeds
+ *                      if the selection fails, returns the fi_info
+ *                      object that was initially provided.
  *
  * All errors should be recoverable and will return the initially provided
  * provider. However, if an error occurs we can no longer guarantee
@@ -184,7 +184,7 @@ OPAL_DECLSPEC int opal_common_ofi_providers_subset_of_list(struct fi_info *provi
  *
  */
 OPAL_DECLSPEC struct fi_info *opal_common_ofi_select_provider(struct fi_info *provider_list,
-    opal_process_info_t *process_info);
+    opal_process_info_t *process_info);

 /**
  * Obtain EP endpoint name
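
The selection priority documented for opal_common_ofi_select_provider can be read as a small decision table. The sketch below only restates that table; it is not the actual function. The enum, classify_selection, and round_robin_index are hypothetical names, and the three inputs are simplifying assumptions standing in for what the real code derives from fi_info and opal_process_info_t.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the documented cases (illustration only). */
enum nic_policy {
    NIC_NONE,             /* empty provider list */
    NIC_SINGLE,           /* only one provider: always return it */
    NIC_RR_LOCAL_RANK,    /* unbound: local rank % number of providers */
    NIC_NEAREST_DISTANCE, /* bound, distance computable: shortest distance,
                             ties broken by package rank */
    NIC_RR_PACKAGE_RANK   /* bound, no distance: package rank % number of
                             providers */
};

/*
 * Map the documented conditions to a policy. `distance_available` stands in
 * for "PCI BDF data is present and the distance could be computed".
 */
static enum nic_policy classify_selection(size_t num_providers,
                                          bool process_is_bound,
                                          bool distance_available)
{
    if (0 == num_providers) {
        return NIC_NONE;
    }
    if (1 == num_providers) {
        return NIC_SINGLE;
    }
    if (!process_is_bound) {
        return NIC_RR_LOCAL_RANK;
    }
    return distance_available ? NIC_NEAREST_DISTANCE : NIC_RR_PACKAGE_RANK;
}

/* Round-robin index used by both modulo-based cases; 0 for an empty set. */
static size_t round_robin_index(unsigned rank, size_t num_candidates)
{
    return (0 == num_candidates) ? 0 : rank % num_candidates;
}
```

For example, an unbound process with local rank 5 and 4 providers of the same type falls under NIC_RR_LOCAL_RANK and would use index round_robin_index(5, 4), which is 1.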
