[Partitioner] Add cost functions to partitioner #2441
Thanks @nrsatish for working on it! If possible, could you please add some open-source info about the roofline (for example, why the roofline is computed as the max of the compute time, which is only computed for MatMul nodes right now)?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
// Get the product of batch, output height, output dims, output channels
totalOps = resultDims[0];
for (int i = 1; i < resultDims.size(); i++) {
  totalOps *= resultDims[i];
Here "i" should be size_t; otherwise, the type check will fail.
Usually, it can be written as "for (size_t i = 1, e = resultDims.size(); i < e; i++)".
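For illustration, the suggested loop form can be sketched as a standalone helper. The name productOfDims and the std::vector<size_t> parameter are assumptions made for this example; they are not taken from the PR, which operates on resultDims inside the Partitioner.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the PR): multiplies all dimensions of a
// result shape together, using size_t for the index so that the comparison
// against dims.size() is between values of the same unsigned type.
uint64_t productOfDims(const std::vector<size_t> &dims) {
  assert(!dims.empty() && "expected at least one dimension");
  uint64_t total = dims[0];
  // Cache the size in 'e' so it is evaluated once, as the reviewer suggests.
  for (size_t i = 1, e = dims.size(); i < e; i++) {
    total *= dims[i];
  }
  return total;
}
```

With this sketch, a result shape of {2, 3, 4} gives 24 elements.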
/// Calculate compute ops. Currently only computed for Matmul, Conv, FC
/// TODO: think about whether this is better off computed inside a Node.
uint64_t totalOps = 0;
Just want to double check again here: in the future, do we need to add the computation for each node?
Yes we do. At least for memory bytes if not flops.
But for most ops, flops is less important. There are only a handful of ops here that will be at all compute bound.
/// TODO: think about whether this is better off computed inside a Node.
uint64_t totalOps = 0;
if (node.getKind() == Kinded::Kind::MatMulNodeKind) {
  auto *MMN = llvm::dyn_cast<MatMulNode>(&node);
I prefer using "switch". If we need to add more node types here, "switch" looks better :)
Makes sense.
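A rough sketch of what the switch-based dispatch might look like, kept independent of the real Node classes. OpKind, OpShape, and estimateOps are hypothetical names invented for this example, and counting ops as result elements times reduction length is an assumption, not the PR's exact formula.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Kinded::Kind; only the kinds named in the
// PR comment (Matmul, Conv, FC) plus a catch-all are listed.
enum class OpKind { MatMul, Convolution, FullyConnected, Other };

// Hypothetical summary of a node's shape, assumed for this sketch.
struct OpShape {
  uint64_t resultElems;  // number of elements in the result tensor
  uint64_t reductionLen; // multiply-accumulates per result element
};

// One case per node kind; supporting a new kind means adding one case
// instead of another branch in an if/else chain.
uint64_t estimateOps(OpKind kind, const OpShape &shape) {
  switch (kind) {
  case OpKind::MatMul:
  case OpKind::Convolution:
  case OpKind::FullyConnected:
    return shape.resultElems * shape.reductionLen;
  case OpKind::Other:
    return 0; // treated as not compute bound in this sketch
  }
  assert(false && "unhandled OpKind");
  return 0;
}
```

Because the switch has no default label, compilers that warn on unhandled enumerators can flag a newly added kind that has no case yet.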
LGTM! Thanks a lot for this work!
…to fill in compute and memory bandwidth bound times for ops
9e0bc2b to 77d074c
Description:
The PR adds cost functions in terms of compute and memory bandwidth costs so that later stages in partitioning can use them.
Testing:
Added a test with manually computed bounds to PartitionerTest.
Documentation:
The PR adds a field to the Partitioner class in Partitioner.h: ComputeTimeMapTy computeTime_. It also adds a function to fill in this field.
The field is a map from each Node in the Function being partitioned to the roofline for the corresponding op. The roofline is computed as the max of the compute time and the SRAM/DRAM read+write times for the inputs and outputs of the node. To compute these rooflines, fields have been added to the DeviceInfo struct in RuntimeTypes.h.
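As a minimal sketch of the roofline described above: the op is assumed to take as long as its slowest resource, so the estimate is the max of the compute time and the memory transfer times. DeviceRates, NodeCost, and rooflineSeconds are hypothetical names for this example and do not reflect the actual fields added to DeviceInfo; how bytes are attributed to SRAM versus DRAM is also left as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical device parameters (illustrative, not the real DeviceInfo).
struct DeviceRates {
  double peakOpsPerSec;   // peak compute throughput
  double sramBytesPerSec; // SRAM read+write bandwidth
  double dramBytesPerSec; // DRAM read+write bandwidth
};

// Hypothetical per-node cost inputs.
struct NodeCost {
  uint64_t totalOps;  // compute ops (0 for ops that are not modeled)
  uint64_t sramBytes; // input+output bytes assumed to move through SRAM
  uint64_t dramBytes; // input+output bytes assumed to move through DRAM
};

// Roofline estimate: the node is bound by whichever of compute, SRAM
// traffic, or DRAM traffic takes the longest.
double rooflineSeconds(const NodeCost &cost, const DeviceRates &dev) {
  assert(dev.peakOpsPerSec > 0 && dev.sramBytesPerSec > 0 &&
         dev.dramBytesPerSec > 0 && "rates must be positive");
  double computeTime = static_cast<double>(cost.totalOps) / dev.peakOpsPerSec;
  double sramTime = static_cast<double>(cost.sramBytes) / dev.sramBytesPerSec;
  double dramTime = static_cast<double>(cost.dramBytes) / dev.dramBytesPerSec;
  return std::max({computeTime, sramTime, dramTime});
}
```

For example, a node with 2000 ops, 100 SRAM bytes, and 400 DRAM bytes on a device rated at 1000 ops/s and 100 bytes/s for each memory would be DRAM bound at 4 seconds under this model.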
The PR is related to Graph Partitioning #2298