
smsc/xpmem: Refactor reg-cache to use tree's find() instead of iterate() #11358


Merged: 2 commits, Jan 30, 2024
8 changes: 5 additions & 3 deletions opal/mca/btl/sm/btl_sm_component.c
@@ -24,6 +24,8 @@
* Copyright (c) 2019-2021 Google, Inc. All rights reserved.
* Copyright (c) 2021 Nanook Consulting. All rights reserved.
* Copyright (c) 2022 IBM Corporation. All rights reserved.
+ * Copyright (c) 2022 Computer Architecture and VLSI Systems (CARV)
+ *                    Laboratory, ICS Forth. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -437,9 +439,9 @@ void mca_btl_sm_poll_handle_frag(mca_btl_sm_hdr_t *hdr, struct mca_btl_base_endp
.cbdata = reg->cbdata};

if (hdr->flags & MCA_BTL_SM_FLAG_SINGLE_COPY) {
-        void *ctx = MCA_SMSC_CALL(map_peer_region, endpoint->smsc_endpoint, /*flags=*/0,
-                                  hdr->sc_iov.iov_base, hdr->sc_iov.iov_len,
-                                  &segments[1].seg_addr.pval);
+        void *ctx = MCA_SMSC_CALL(map_peer_region, endpoint->smsc_endpoint,
+                                  MCA_RCACHE_FLAGS_PERSIST, hdr->sc_iov.iov_base,
+                                  hdr->sc_iov.iov_len, &segments[1].seg_addr.pval);
assert(NULL != ctx);

segments[1].seg_len = hdr->sc_iov.iov_len;
9 changes: 3 additions & 6 deletions opal/mca/smsc/xpmem/smsc_xpmem_component.c
@@ -1,6 +1,8 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2021 Google, Inc. All rights reserved.
+ * Copyright (c) 2022 Computer Architecture and VLSI Systems (CARV)
+ *                    Laboratory, ICS Forth. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -76,10 +78,7 @@ static int mca_smsc_xpmem_component_open(void)

static int mca_smsc_xpmem_component_close(void)
{
-    if (mca_smsc_xpmem_module.vma_module) {
-        OBJ_RELEASE(mca_smsc_xpmem_module.vma_module);
-    }
-
+    /* nothing to do */
return OPAL_SUCCESS;
}

@@ -161,7 +160,5 @@ static mca_smsc_module_t *mca_smsc_xpmem_component_enable(void)
mca_smsc_xpmem_component.log_attach_align
= opal_min(opal_max(mca_smsc_xpmem_component.log_attach_align, 12), 25);

-    mca_smsc_xpmem_module.vma_module = mca_rcache_base_vma_module_alloc();
-
return &mca_smsc_xpmem_module.super;
}
8 changes: 4 additions & 4 deletions opal/mca/smsc/xpmem/smsc_xpmem_internal.h
@@ -1,6 +1,8 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2021 Google, Inc. All rights reserved.
+ * Copyright (c) 2022 Computer Architecture and VLSI Systems (CARV)
+ *                    Laboratory, ICS Forth. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -42,6 +44,8 @@ struct mca_smsc_xpmem_endpoint_t {
xpmem_apid_t apid;
/** maximum address we can attach to on this peer */
uintptr_t address_max;
+    /** cache of xpmem attachments created using this endpoint */
+    mca_rcache_base_vma_module_t *vma_module;
};

typedef struct mca_smsc_xpmem_endpoint_t mca_smsc_xpmem_endpoint_t;
@@ -67,10 +71,6 @@ typedef struct mca_smsc_xpmem_component_t mca_smsc_xpmem_component_t;

struct mca_smsc_xpmem_module_t {
mca_smsc_module_t super;

Member
I am skeptical about the benefit of this particular change. If we assume the communication library is only about point-to-point communications without reuse of buffers, then this change might make sense. But I don't think this is the most obvious pattern in which MPI is used, and as soon as we add collective communications to the mix, sharing registrations is essential.

Let's imagine an MPI_Allreduce on a very fat node. The temporary buffer used in the collective will be registered for most peers (depending on the collective algorithm, but more than once) instead of being registered once. I understand this is a balance between search time (which with your approach is now logarithmic) and total number of registrations.

It would be interesting to see the impact of the proposed change in different scenarios, where shared and individual registrations can be assessed. Can you run a benchmark (preferably IMB) covering the different registration patterns of the user and temporary buffers? Basically, MPI_Allreduce, MPI_Allgather, and MPI_Alltoall should give a good picture of the impact.

Contributor Author

I agree we want sharing of registrations between components in the future. I have in mind that we should implement a structure that caches associations of processes to smsc/xpmem endpoints. Then, when a component requests an endpoint, we return an already existing one if there is one, achieving sharing of xpmem attachments to the specific process between different components.
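Roughly what I have in mind, purely as a hypothetical sketch (nothing below exists in this PR; the cache, its keying, and the helper name are all made up for illustration, and it assumes smsc endpoints are ref-counted objects):

/* Hypothetical: one shared proc -> smsc endpoint cache per node/module, so any
 * component asking for the same peer reuses the same endpoint and, with this PR,
 * its per-endpoint attachment cache. */
#include "opal/class/opal_hash_table.h"
#include "opal/mca/smsc/smsc.h"
#include "opal/util/proc.h"

static opal_hash_table_t endpoint_cache; /* assumed constructed and
                                            opal_hash_table_init()ed at enable time */

static mca_smsc_endpoint_t *get_shared_endpoint(opal_proc_t *peer_proc)
{
    mca_smsc_endpoint_t *ep = NULL;
    uint64_t key = (uint64_t) (uintptr_t) peer_proc; /* key on the peer proc */

    if (OPAL_SUCCESS == opal_hash_table_get_value_uint64(&endpoint_cache, key,
                                                         (void **) &ep)) {
        OBJ_RETAIN(ep); /* hand back the existing endpoint; attachments get shared */
        return ep;
    }

    ep = MCA_SMSC_CALL(get_endpoint, peer_proc); /* first request for this peer */
    if (NULL != ep) {
        opal_hash_table_set_value_uint64(&endpoint_cache, key, ep);
    }
    return ep;
}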

If I understood your point correctly, we shouldn't have more total registrations with find() compared to iterate(). And the previous code did not achieve sharing between different components either: even though it kept all registrations in the same structure, it filtered them by smsc endpoint. Now the registrations are spread across more trees, but the number of regs per endpoint remains the same.
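To make the find()/iterate() comparison concrete, here is a rough before/after sketch of the lookup path (simplified, not the literal patch; the rcache VMA signatures are approximated from opal/mca/rcache/base and may differ slightly, and mca_smsc_xpmem_module.vma_module refers to the pre-patch global tree):

#include "opal/mca/rcache/base/rcache_base_vma.h"
#include "opal/mca/smsc/xpmem/smsc_xpmem_internal.h"

struct check_ctx {
    mca_smsc_xpmem_endpoint_t *endpoint;
    mca_rcache_base_registration_t *reg;
};

/* Before: all peers' attachments live in one global tree, so the callback has to
 * filter on reg->alloc_base, which the old code set to the owning endpoint. */
static int check_reg_cb(mca_rcache_base_registration_t *reg, void *ctx_in)
{
    struct check_ctx *ctx = (struct check_ctx *) ctx_in;
    if (reg->alloc_base == (void *) ctx->endpoint) {
        ctx->reg = reg;
        return 1; /* non-success return stops the iteration */
    }
    return OPAL_SUCCESS; /* wrong peer, keep scanning overlapping regs */
}

static mca_rcache_base_registration_t *lookup_before(mca_smsc_xpmem_endpoint_t *endpoint,
                                                     uintptr_t base, size_t size)
{
    struct check_ctx ctx = {.endpoint = endpoint, .reg = NULL};
    (void) mca_rcache_base_vma_iterate(mca_smsc_xpmem_module.vma_module,
                                       (unsigned char *) base, size, /*partial_ok=*/true,
                                       check_reg_cb, &ctx);
    return ctx.reg; /* cost grows with the overlapping regs of all peers */
}

/* After: each endpoint owns its own tree, so one logarithmic find() suffices and no
 * per-registration filtering is needed. */
static mca_rcache_base_registration_t *lookup_after(mca_smsc_xpmem_endpoint_t *endpoint,
                                                    uintptr_t base, size_t size)
{
    mca_rcache_base_registration_t *reg = NULL;
    (void) mca_rcache_base_vma_find(endpoint->vma_module, (void *) base, size, &reg);
    return reg;
}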

I've run some numbers to demonstrate the potential of this change; some of them are attached to the original comment above. For example, in my collectives component (which uses smsc/xpmem, and which I actually plan to open a PR for in the upcoming days) there's a non-hierarchical allreduce version in which, for small messages, the root attaches to every other rank's buffer and performs the reduction itself. With 64 procs and with iterate(), there are 63 regs in the root's cache, plus a couple more resulting from the supporting MPI operations in the microbenchmark. With find(), there are 1 to 3 regs per rank (besides the shorter search time, the regs are also spread out over multiple trees). As a result, in this scenario with 64 procs @ 4K message size, the average allreduce latency improves from 100 us to 42 us.

I can also upload some graphs for these measurements to make examination easier.

Member
Please upload the graphs, thanks.

-    /** cache of xpmem attachments. this cache holds attachments for all peers. the registrations
-     * are differentiated by the alloc_base which is set to the endpoint. */
-    mca_rcache_base_vma_module_t *vma_module;
};

typedef struct mca_smsc_xpmem_module_t mca_smsc_xpmem_module_t;