
[Partitioner] Add cost functions to partitioner #2441


Merged
beicy merged 1 commit into pytorch:master on Mar 6, 2019

Conversation

@nrsatish (Contributor) commented on Feb 25, 2019

Description:
The PR adds cost functions, expressed in terms of compute and memory-bandwidth costs, so that later stages of partitioning can use them.

Testing:
Adds a test with manually computed bounds to PartitionerTest.

Documentation:
The PR adds a field to the Partitioner class in Partitioner.h: ComputeTimeMapTy computeTime_. It also adds a function to fill in this map.

The field is a map from each Node in the Function being partitioned to the corresponding roofline estimate for that op. The roofline is computed as the max of the compute time and the SRAM/DRAM read+write times for the node's inputs and outputs. To compute these rooflines, fields have been added to the DeviceInfo struct in RuntimeTypes.h.
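For illustration, the roofline estimate amounts to taking the larger of the compute-bound and bandwidth-bound times. A minimal sketch, assuming DeviceInfo gains fields along the lines of peakCompute (flops/s), peakSramBw/peakDramBw (bytes/s), and sramCapacity (bytes); these names and the include path are assumptions, not necessarily the exact ones in the PR:

#include <algorithm>
#include <cstdint>

#include "glow/Runtime/RuntimeTypes.h" // assumed location of DeviceInfo

// Sketch: roofline time for one node, given its total flops and the total
// bytes it reads and writes.
float rooflineTime(uint64_t totalOps, uint64_t totalBytes,
                   const glow::runtime::DeviceInfo &dev) {
  // Time if the node were purely compute bound.
  float computeTime = static_cast<float>(totalOps) / dev.peakCompute;
  // Time if the node were purely bandwidth bound; use SRAM bandwidth when the
  // data fits on chip, otherwise DRAM bandwidth.
  float bw = (totalBytes <= dev.sramCapacity) ? dev.peakSramBw : dev.peakDramBw;
  float memTime = static_cast<float>(totalBytes) / bw;
  // Whichever resource saturates first bounds the achievable runtime.
  return std::max(computeTime, memTime);
}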

The PR is related to Graph Partitioning #2298

@nrsatish nrsatish changed the title [WIP][glow][partitioning] Add cost functions to partitioner [glow][partitioning] Add cost functions to partitioner Feb 25, 2019
@beicy beicy changed the title [glow][partitioning] Add cost functions to partitioner [Partitioner] Add cost functions to partitioner Feb 25, 2019
@beicy (Contributor) commented on Feb 25, 2019

Thanks @nrsatish for working on this! If possible, could you please add some open-source references about the roofline model (for example, explaining why the roofline is computed as the max of the compute time and the memory read/write times, which is currently only done for MatMul nodes)?
In addition, I changed the title of this PR.

stale bot commented on Mar 5, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

// Get the product of batch, output height, output dims, output channels
totalOps = resultDims[0];
for (int i = 1; i < resultDims.size(); i++) {
totalOps *= resultDims[i];
Contributor:

here "i" should be size_t, otherwise, the type check will fail.
Usually, it can be wrote as "for (size_t i = 1, e = resultDims.size(); i < e; i++)"
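For reference, the quoted loop with that fix applied (same fragment as above, just with a size_t index):

// Product of all result dimensions, using size_t for the index so the
// comparison against resultDims.size() is well typed.
uint64_t totalOps = resultDims[0];
for (size_t i = 1, e = resultDims.size(); i < e; i++) {
  totalOps *= resultDims[i];
}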


/// Calculate compute ops. Currently only computed for Matmul, Conv, FC
/// TODO: think about whether this is better off computed inside a Node.
uint64_t totalOps = 0;
Contributor:

Just want to double check again here: in the future, do we need to add the computation for each node?

Contributor (Author):

Yes, we do. At least for memory bytes, if not flops.

Contributor (Author):

But for most ops, flops matter less; only a handful of ops here will be compute bound at all.
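For a memory-bound op, the estimate reduces to summing the bytes of its inputs and outputs and dividing by the device bandwidth. A rough sketch (the accessor and field names here are assumptions based on Glow's Node, NodeValue, and DeviceInfo interfaces, not the PR's exact code):

// Sketch: bandwidth-bound time for a node, as total bytes moved over peak
// DRAM bandwidth.
float memoryBoundTime(glow::Node &node, const glow::runtime::DeviceInfo &dev) {
  uint64_t totalBytes = 0;
  // Bytes read: sum the sizes of all input operands.
  for (unsigned i = 0, e = node.getNumInputs(); i < e; i++) {
    totalBytes += node.getNthInput(i).getType()->getSizeInBytes();
  }
  // Bytes written: sum the sizes of all results.
  for (unsigned i = 0, e = node.getNumResults(); i < e; i++) {
    totalBytes += node.getType(i)->getSizeInBytes();
  }
  return static_cast<float>(totalBytes) / dev.peakDramBw;
}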

/// TODO: think about whether this is better off computed inside a Node.
uint64_t totalOps = 0;
if (node.getKind() == Kinded::Kind::MatMulNodeKind) {
auto *MMN = llvm::dyn_cast<MatMulNode>(&node);
Contributor:

I prefer using "switch". If we need to add more node types here, "switch" looks better :)

Contributor (Author):

Makes sense.
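A sketch of what that switch-based dispatch could look like, as a drop-in for the if above (illustrative only; the FLOP formula and the FC/Conv cases are placeholders, not the final PR code):

uint64_t totalOps = 0;
switch (node.getKind()) {
case Kinded::Kind::MatMulNodeKind: {
  auto *MMN = llvm::cast<MatMulNode>(&node);
  auto lhsDims = MMN->getLHS().dims();
  auto resultDims = MMN->getResult().dims();
  // [M x K] * [K x N] -> [M x N]: one multiply and one add per (m, n, k).
  totalOps = 2ULL * resultDims[0] * resultDims[1] * lhsDims[1];
  break;
}
case Kinded::Kind::FullyConnectedNodeKind:
case Kinded::Kind::ConvolutionNodeKind:
  // Analogous product-of-dimensions formulas for FC and Conv.
  break;
default:
  // Other ops are treated as memory bound; totalOps stays 0.
  break;
}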

@beicy (Contributor) left a comment:

LGTM! Thanks a lot for this work!

…to fill in compute and memory bandwidth bound times for ops
@nrsatish nrsatish force-pushed the preprocess-parallel branch from 9e0bc2b to 77d074c on March 6, 2019 at 01:42
@beicy beicy merged commit 74f88b3 into pytorch:master Mar 6, 2019