coll/han: dynamic selection does not work for simple algorithms #9883

Open
mkurnosov opened this issue Jan 17, 2022 · 9 comments
@mkurnosov
Contributor

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

9704f0f (master)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

4d07260d9f79bb7f328b1fc9107b45e683cf2c4e ../../../../3rd-party/openpmix (v1.1.3-3319-g4d07260d)
9ac0b7ecee2c97c357bf6751fdaab7a10e62df14 ../../../../3rd-party/prrte (psrvr-v2.0.0rc1-4133-g9ac0b7ec)

Please describe the system on which you are running

  • Operating system/version: Linux 4.16.3-301.fc28.x86_64
  • Computer hardware:
  • Network type: InfiniBand

Details of the problem

Dynamic selection via MCA parameters does not work for the simple algorithms. A simple algorithm (coll_han_use_simple_<op>) splits the global communicator into intra- and inter-node sub-communicators with the HAN component disabled (mca_coll_han_comm_create()):

opal_info_set(&comm_info, "ompi_comm_coll_preference", "tuned,^han");

For this reason, on the sub-communicators the simple algorithm uses the collective operation from the component with the highest priority.

In the following example we want to choose Bcast from the tuned component for intra- and inter-node communication, but the simple algorithm calls Bcast from the basic component (the component with the highest priority).

mpiexec --host cn2:8,cn3:8,cn4:8,cn5:8,cn6:8 --n 40 \
        --map-by core --bind-to core --mca pml ucx \
        --mca coll_basic_priority 90 \
        --mca coll_libnbc_priority 10 \
        --mca coll_adapt_priority 0 \
        --mca coll_sm_priority 0 \
        --mca coll_han_priority 100 \
        --mca coll_han_bcast_dynamic_intra_node_module 4 \
        --mca coll_han_bcast_dynamic_inter_node_module 4 \
        --mca coll_han_use_simple_bcast 1 \
        ./bcast_test
@devreal
Contributor

devreal commented Jan 18, 2022

@mkurnosov Thanks for the report! I'm not quite sure I can follow the problem here. Do you see an error occurring with the command you posted? I ran it locally and the command succeeds for me. The call to opal_info_set should override the MCA selection for the underlying collectives implementation used by HAN...

@devreal devreal self-assigned this Jan 18, 2022
@mkurnosov
Contributor Author

@devreal The program terminates without an error. In the example above, we expect HAN to call Bcast from the tuned component, but it uses the basic component. This can be caught with a debug printf in basic/bcast.

@bosilca
Member

bosilca commented Jan 18, 2022

This case should be handled in the current code. Check the code at the end of the check_components function in ompi/mca/coll/base/coll_base_comm_select.c line 450. After removing all excluded collective components we reorder the selected components using the include order, which should have given you tuned first.

I don't have your test and I could not find it in ompi-tests. So, let's try the following patch. Please run with --mca coll_base_verbose 10 to see which components are reordered and how.

diff --git a/ompi/mca/coll/base/coll_base_comm_select.c b/ompi/mca/coll/base/coll_base_comm_select.c
index fcdb8649eb..db039c2328 100644
--- a/ompi/mca/coll/base/coll_base_comm_select.c
+++ b/ompi/mca/coll/base/coll_base_comm_select.c
@@ -461,6 +461,9 @@ static opal_list_t *check_components(opal_list_t * components,
             if (0 == strcmp(item->ac_component_name, coll_include[idx])) {
                 opal_list_remove_item(selectable, &item->super);
                 opal_list_append(selectable, &item->super);
+                opal_output_verbose(10, ompi_coll_base_framework.framework_output,
+                                    "coll:base:comm_select: component %s reordered based on info key (these messages appear in the reverse order)",
+                                    component->mca_component_name);
                 break;
             }
         }

@mkurnosov
Contributor Author

@bosilca
Thanks for the patch. The problem is that the coll_include[] array is empty (count_include = 0) because the info key does not include any components (only "^han"). See coll_han_subcomms.c, line 119:

    opal_info_set(&comm_info, "ompi_comm_coll_preference", "^han");
    opal_info_set(&comm_info, "ompi_comm_coll_han_topo_level", "INTRA_NODE");
    ompi_comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                         &comm_info, low_comm);
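
To illustrate why the include list ends up empty, here is a minimal sketch of splitting such a preference string. It is a simplified model, not the code in coll_base_comm_select.c: `struct coll_preference`, `parse_preference()` and the fixed-size limits are hypothetical, and the rule that a leading `^` marks an excluded component is an assumption based on the strings shown in this thread.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 8
#define MAX_NAME_LEN 32

/* Hypothetical container for a parsed preference string. */
struct coll_preference {
    char include[MAX_NAMES][MAX_NAME_LEN];
    char exclude[MAX_NAMES][MAX_NAME_LEN];
    size_t count_include;
    size_t count_exclude;
};

/* Copy at most MAX_NAME_LEN - 1 characters of a token into dst. */
static void copy_token(char *dst, const char *src, size_t len)
{
    if (len >= MAX_NAME_LEN) {
        len = MAX_NAME_LEN - 1;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/*
 * Split a comma-separated preference string (assumed syntax).
 * Tokens starting with '^' go to the exclude list (without the '^');
 * all other tokens go to the include list. Empty tokens and tokens
 * beyond MAX_NAMES are ignored.
 */
static void parse_preference(const char *value, struct coll_preference *pref)
{
    pref->count_include = 0;
    pref->count_exclude = 0;

    const char *p = value;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");
        if (len > 0 && p[0] == '^') {
            if (len > 1 && pref->count_exclude < MAX_NAMES) {
                copy_token(pref->exclude[pref->count_exclude++], p + 1, len - 1);
            }
        } else if (len > 0 && pref->count_include < MAX_NAMES) {
            copy_token(pref->include[pref->count_include++], p, len);
        }
        p += len;
        if (*p == ',') {
            p++; /* skip the separator */
        }
    }
}
```

In this model, "^han" yields count_include == 0 and one excluded name, so there is nothing to reorder by, while "tuned,^han" puts tuned in the include list.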

@bosilca
Member

bosilca commented Jan 19, 2022

Then it is not clear to me what issue you are trying to solve. The tuned priority is 30 by default and you manually set the priority of basic to 90. The simple communicator duplication method in HAN does not preferentially select tuned, and as a result the basic algorithms will be used (because of the forced priority). All this seems normal.
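
The outcome described here can be reproduced with a small model. The following is a minimal sketch, not Open MPI code: `struct coll_component`, `select_by_priority()` and `issue_components` are hypothetical names, and the selection is assumed to reduce to "drop the excluded component, then take the highest priority". The priorities are the ones from the mpiexec command in this issue, with tuned at its default of 30.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified model of a coll component: a name and a priority. */
struct coll_component {
    const char *name;
    int priority;
};

/*
 * Return the highest-priority component whose name differs from `excluded`
 * (pass NULL to exclude nothing), or NULL if no component remains.
 * On equal priorities the earlier entry is kept.
 */
static const struct coll_component *
select_by_priority(const struct coll_component *comps, size_t n,
                   const char *excluded)
{
    const struct coll_component *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (excluded != NULL && strcmp(comps[i].name, excluded) == 0) {
            continue; /* models the "^han" exclusion */
        }
        if (best == NULL || comps[i].priority > best->priority) {
            best = &comps[i];
        }
    }
    return best;
}

/* Priorities taken from the command line in this issue; tuned keeps its default. */
static const struct coll_component issue_components[] = {
    { "han",    100 },
    { "basic",   90 },
    { "tuned",   30 },
    { "libnbc",  10 },
    { "adapt",    0 },
    { "sm",       0 },
};
```

With "han" excluded, `select_by_priority(issue_components, 6, "han")` returns the basic entry (priority 90), which matches the behaviour reported in this issue.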

@mkurnosov
Contributor Author

I just want to point out that HAN's MCA parameters for module selection are ignored:

        --mca coll_han_bcast_dynamic_intra_node_module 4
        --mca coll_han_bcast_dynamic_inter_node_module 4

@bosilca
Member

bosilca commented Jan 20, 2022

That's a different issue; it has nothing to do with priorities or with the ompi_comm_coll_preference info key. The dynamic selection of the module should be done by the get_module function in coll_han_dynamic.c, where it picks, among all available modules, the one indicated by the user rule.

I will try to take a look tomorrow to see why it does not do the right thing. What test are you using?

@mkurnosov
Contributor Author

I am using a simple Bcast test and a patched ompi with debug printf calls in coll/{basic,tuned}:

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, commsize;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &commsize);

    int root = 0;
    int count = 1024;
    uint8_t *buf = malloc(sizeof(*buf) * count);

    if (rank == root) {
        for (int i = 0; i < count; i++) {
            buf[i] = (i + 1) % 256;
        }
    }

    MPI_Bcast(buf, count, MPI_UINT8_T, root, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}

mpiexec:

mpiexec --host cn2:2,cn3:2 --n 4 \
        --map-by core --bind-to core --mca pml ucx \
        --mca coll_base_verbose 0 \
        --mca coll_basic_priority 90 \
        --mca coll_libnbc_priority 10 \
        --mca coll_adapt_priority 0 \
        --mca coll_sm_priority 0 \
        --mca coll_han_priority 100 \
        --mca coll_han_bcast_dynamic_intra_node_module 4 \
        --mca coll_han_bcast_dynamic_inter_node_module 4 \
        --mca coll_han_use_simple_bcast 1 \
        ./bcast_test

@gkatev
Contributor

gkatev commented Feb 18, 2022

I've also been looking into HAN's code these days, and was a bit confused by the component selection, especially with the dynamic MCA parameters.

From what I've gathered:

  • The _simple functions (I have only actually looked at allreduce) utilize mca_coll_han_comm_create_new().

    • mca_coll_han_comm_create_new() creates intra/inter modules, simply excluding HAN.
      opal_info_set(&comm_info, "ompi_comm_coll_preference", "^han");
      ...
      ompi_comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, &comm_info, low_comm);
      ...
      ompi_comm_split_with_info(comm, low_rank, w_rank, &comm_info, up_comm, false);
      
    • Therefore, the selected module on each sub-comm is determined according to the priorities.
  • The non-simple functions utilize mca_coll_han_comm_create (no _new suffix (!)).

    • mca_coll_han_comm_create() creates 2 modules for each topo-level (intra/inter), with fixed preferences.
      opal_info_set(&comm_info, "ompi_comm_coll_preference", "tuned,^han");
      ompi_comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, &comm_info, &(low_comms[0]));
      ...
      opal_info_set(&comm_info, "ompi_comm_coll_preference", "sm,^han");
      ompi_comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, &comm_info, &(low_comms[1]));
      ...
      opal_info_set(&comm_info, "ompi_comm_coll_preference", "libnbc,^han");
      ompi_comm_split_with_info(comm, low_rank, w_rank, &comm_info, &(up_comms[0]), false);
      ...
      opal_info_set(&comm_info, "ompi_comm_coll_preference", "adapt,^han");
      ompi_comm_split_with_info(comm, low_rank, w_rank, &comm_info, &(up_comms[1]), false);
      
    • These modules are cached, and the one among them to be used can be selected via coll_han_<coll>_low_module, coll_han_<coll>_up_module (non-dynamic).
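
The fixed preferences listed above can be summarized as a small lookup table. This is only an illustrative sketch: `enum topo_level`, `cached_preference()` and the two arrays are hypothetical names, and the indices 0 and 1 simply follow the `low_comms[]` / `up_comms[]` indices in the snippet above. How the real coll_han_<coll>_low_module and coll_han_<coll>_up_module values map onto them is not verified here.

```c
#include <stddef.h>

/* Hypothetical labels for the two topology levels discussed in this thread. */
enum topo_level { TOPO_INTRA_NODE, TOPO_INTER_NODE };

/* Preference strings as quoted from mca_coll_han_comm_create() above. */
static const char *const low_prefs[2] = { "tuned,^han", "sm,^han" };
static const char *const up_prefs[2]  = { "libnbc,^han", "adapt,^han" };

/*
 * Return the preference string used when creating the cached
 * sub-communicator for a topology level and module index (0 or 1),
 * or NULL if the index is out of range.
 */
static const char *cached_preference(enum topo_level level, int module_index)
{
    if (module_index < 0 || module_index > 1) {
        return NULL;
    }
    return (level == TOPO_INTRA_NODE) ? low_prefs[module_index]
                                      : up_prefs[module_index];
}
```

The table makes the point of this comment visible: the set of candidate components per level is hard-coded at creation time, and nothing in it depends on the dynamic rules.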

Where it gets strange is that these are the only occurrences of ompi_comm_split_ in HAN, and the dynamic rules do not seem to be included in the creation at all.

I see how the selection code in get_module() works, but it looks like it only ever gets called on GLOBAL_COMMUNICATOR, and it does not create any sub-communicators.

Furthermore, setting something like --mca coll_han_allreduce_dynamic_global_communicator_module 3 (3 = coll/tuned) causes the application to crash, presumably because HAN's allreduce then gets called, but with tuned's module (the 4th case in mca_coll_han_allreduce_intra_dynamic()).

4 participants