Merged
3 changes: 2 additions & 1 deletion sycl/plugins/level_zero/pi_level_zero.cpp
@@ -4126,7 +4126,8 @@ piEnqueueMemBufferMap(pi_queue Queue, pi_mem Buffer, pi_bool BlockingMap,
     piEventsWait(NumEventsInWaitList, EventWaitList);
     if (Buffer->MapHostPtr) {
       *RetMap = Buffer->MapHostPtr + Offset;
-      memcpy(*RetMap, pi_cast<char *>(Buffer->getZeHandle()) + Offset, Size);
+      if (!(MapFlags & CL_MAP_WRITE_INVALIDATE_REGION))
+        memcpy(*RetMap, pi_cast<char *>(Buffer->getZeHandle()) + Offset, Size);
     } else {
       *RetMap = pi_cast<char *>(Buffer->getZeHandle()) + Offset;
     }
21 changes: 6 additions & 15 deletions sycl/source/detail/scheduler/commands.cpp
@@ -1141,21 +1141,12 @@ cl_int MemCpyCommand::enqueueImp() {

   auto RawEvents = getPiEvents(EventImpls);

-  // Omit copying if mode is discard one.
-  // TODO: Handle this at the graph building time by, for example, creating
-  // empty node instead of memcpy.
-  if (MDstReq.MAccessMode == access::mode::discard_read_write ||
-      MDstReq.MAccessMode == access::mode::discard_write ||
-      MSrcAllocaCmd->getMemAllocation() == MDstAllocaCmd->getMemAllocation()) {
-    Command::waitForEvents(Queue, EventImpls, Event);
-  } else {
-    MemoryManager::copy(
-        MSrcAllocaCmd->getSYCLMemObj(), MSrcAllocaCmd->getMemAllocation(),
-        MSrcQueue, MSrcReq.MDims, MSrcReq.MMemoryRange, MSrcReq.MAccessRange,
-        MSrcReq.MOffset, MSrcReq.MElemSize, MDstAllocaCmd->getMemAllocation(),
-        MQueue, MDstReq.MDims, MDstReq.MMemoryRange, MDstReq.MAccessRange,
-        MDstReq.MOffset, MDstReq.MElemSize, std::move(RawEvents), Event);
-  }
+  MemoryManager::copy(
Contributor

> CUDA requires no change. OpenCL will need to be changed in the library itself.

Does this mean that the unconditional copy is currently expected to slow down OpenCL execution?
And why does CUDA not need a change?

Contributor Author

Maybe my commit comment could be clearer. There are two cases: with unified memory and without.

The line of code you are commenting on is the path taken when we do not have unified memory. In that case, the underlying backend doesn't really matter: SYCL itself schedules the individual memory reads and writes that maintain coherency, and this optimization avoids the needless memory read.
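
To make the non-unified-memory case concrete, here is a minimal sketch of the decision, assuming a simplified model. `AccessMode`, `CopyPlan`, and `planMemoryMove` are hypothetical names invented for illustration and are not part of the SYCL runtime; the real logic lives in the scheduler's graph builder.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical stand-in for sycl::access::mode (illustration only).
enum class AccessMode {
  read,
  write,
  read_write,
  discard_write,
  discard_read_write
};

// Hypothetical description of a scheduled full-buffer copy.
struct CopyPlan {
  std::size_t Bytes;
};

// Decide at graph-building time whether a copy has to be scheduled.
// With a discard mode the old contents are never observed, so no copy
// is planned at all instead of enqueueing one and skipping it later.
std::optional<CopyPlan> planMemoryMove(AccessMode Mode,
                                       std::size_t BufferBytes) {
  assert(BufferBytes > 0 && "empty buffers are not modeled in this sketch");
  if (Mode == AccessMode::discard_write ||
      Mode == AccessMode::discard_read_write)
    return std::nullopt;
  // Every other mode may read existing data, so the whole buffer is copied.
  return CopyPlan{BufferBytes};
}
```

Under these assumptions, `planMemoryMove(AccessMode::discard_write, 64)` yields an empty optional, while `planMemoryMove(AccessMode::read_write, 64)` yields a plan covering all 64 bytes.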

The comment that CUDA requires no change applies to the case where there is unified memory. There we schedule paired map/unmap operations, and any optimization has to be performed by the backend (or its PI interface). In the case of Level Zero, this PR adds the optimization to its PI interface. In the case of CUDA, it looks like the PI interface is already performing the optimization (and microbenchmarks confirm that). For OpenCL, the CL_MAP_WRITE_INVALIDATE_REGION flag is being passed by the PI to the OpenCL plugin, but no optimization seems to be occurring (as tested by a simple benchmark).
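
The map-side optimization for the unified-memory case can be sketched roughly as follows. This is a simplified model, not the Level Zero plugin code: `ShadowedBuffer`, `mapBuffer`, and the `Map*` constants are hypothetical names, and the flag values are local stand-ins rather than the OpenCL `cl_map_flags` definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical flag type and values, for illustration only.
using MapFlags = std::uint32_t;
constexpr MapFlags MapRead = 1u << 0;
constexpr MapFlags MapWrite = 1u << 1;
constexpr MapFlags MapWriteInvalidateRegion = 1u << 2;

// Hypothetical buffer with a host shadow copy that is handed out on map.
struct ShadowedBuffer {
  std::vector<char> Device;   // stands in for the device allocation
  std::vector<char> Host;     // host-side shadow returned to the caller
  std::size_t CopiesDone = 0; // counts device-to-host copies, for testing
};

// Map [Offset, Offset + Size) of the buffer into host memory.
// When the caller promises to overwrite the region (write-invalidate),
// its current contents are irrelevant, so the device-to-host copy is skipped.
char *mapBuffer(ShadowedBuffer &Buf, MapFlags Flags, std::size_t Offset,
                std::size_t Size) {
  assert(Buf.Host.size() == Buf.Device.size());
  assert(Offset <= Buf.Device.size() && Size <= Buf.Device.size() - Offset);
  char *Mapped = Buf.Host.data() + Offset;
  if (!(Flags & MapWriteInvalidateRegion)) {
    std::memcpy(Mapped, Buf.Device.data() + Offset, Size);
    ++Buf.CopiesDone;
  }
  return Mapped;
}
```

In this model, mapping with `MapWriteInvalidateRegion` returns the host pointer without touching `CopiesDone`, whereas mapping with `MapRead` or `MapWrite` first refreshes the requested region from the device side.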

Contributor

> In the case of CUDA, it looks like the PI interface is already performing the optimization (and microbenchmarks confirm that)

Could you double check and spot it in the CUDA PI plugin sources?

> For OpenCL, the CL_MAP_WRITE_INVALIDATE_REGION flag is being passed by the PI to the OpenCL plugin, but no optimization seems to be occurring (as tested by a simple benchmark).

It is strange that OpenCL doesn't optimize this. Are you going to follow up with the OpenCL team?

Contributor Author

@cperkinsintel cperkinsintel Dec 7, 2020

> Could you double check and spot it in the CUDA PI plugin sources?

It's at

CL_MAP_WRITE_INVALIDATE_REGION))) {

with a supporting comment, but I never step-traced it. The benchmark does confirm the difference, though.

> It is strange that OpenCL doesn't optimize this. Are you going to follow up with the OpenCL team?

Agreed. That is the plan.

+      MSrcAllocaCmd->getSYCLMemObj(), MSrcAllocaCmd->getMemAllocation(),
+      MSrcQueue, MSrcReq.MDims, MSrcReq.MMemoryRange, MSrcReq.MAccessRange,
+      MSrcReq.MOffset, MSrcReq.MElemSize, MDstAllocaCmd->getMemAllocation(),
+      MQueue, MDstReq.MDims, MDstReq.MMemoryRange, MDstReq.MAccessRange,
+      MDstReq.MOffset, MDstReq.MElemSize, std::move(RawEvents), Event);

   return CL_SUCCESS;
 }
17 changes: 11 additions & 6 deletions sycl/source/detail/scheduler/graph_builder.cpp
@@ -353,12 +353,17 @@ Command *Scheduler::GraphBuilder::insertMemoryMove(MemObjRecord *Record,
     Record->MHostAccess = MapMode;
   } else {

-    // Full copy of buffer is needed to avoid loss of data that may be caused
-    // by copying specific range from host to device and backwards.
-    NewCmd =
-        new MemCpyCommand(*AllocaCmdSrc->getRequirement(), AllocaCmdSrc,
-                          *AllocaCmdDst->getRequirement(), AllocaCmdDst,
-                          AllocaCmdSrc->getQueue(), AllocaCmdDst->getQueue());
+    if ((Req->MAccessMode == access::mode::discard_write) ||
+        (Req->MAccessMode == access::mode::discard_read_write)) {
+      return nullptr;
+    } else {
+      // Full copy of buffer is needed to avoid loss of data that may be caused
+      // by copying specific range from host to device and backwards.
+      NewCmd =
+          new MemCpyCommand(*AllocaCmdSrc->getRequirement(), AllocaCmdSrc,
+                            *AllocaCmdDst->getRequirement(), AllocaCmdDst,
+                            AllocaCmdSrc->getQueue(), AllocaCmdDst->getQueue());
+    }
   }

   for (Command *Dep : Deps) {