[Partitioner] Add cost functions to partitioner #2441

Merged

beicy merged 1 commit into pytorch:master from nrsatish:preprocess-parallel

Mar 6, 2019

Contributor

nrsatish commented Feb 25, 2019 •

edited by beicy


Description:
The PR adds cost functions in terms of compute and memory bandwidth costs so that later stages in partitioning can use them.

Testing:
Add a test with manually computed bounds into PartitionerTest.

Documentation:
The PR adds a field to the Partitioner class, ComputeTimeMapTy computeTime_ in Partitioner.h, and a function that fills it in.

The field maps each Node in the Function being partitioned to the roofline estimate for that op. The roofline is the maximum of the compute time and the SRAM/DRAM read and write times for the node's inputs and outputs. To compute these rooflines, new fields have been added to the DeviceInfo struct in RuntimeTypes.h.
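To make the roofline definition concrete, here is a minimal sketch of the calculation. The `DeviceParams` struct, its field names, and the `rooflineTime` function are hypothetical and are not the fields this PR adds to DeviceInfo; the sketch also assumes the SRAM and DRAM byte counts are already known for the node.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical device parameters (illustrative names, not Glow's DeviceInfo).
struct DeviceParams {
  double peakOpsPerSec;   // peak compute throughput
  double sramBytesPerSec; // SRAM bandwidth
  double dramBytesPerSec; // DRAM bandwidth
};

// Roofline estimate for one op: the op cannot finish faster than its
// slowest resource, so take the max of the per-resource times.
double rooflineTime(uint64_t totalOps, uint64_t sramBytes, uint64_t dramBytes,
                    const DeviceParams &dev) {
  assert(dev.peakOpsPerSec > 0 && dev.sramBytesPerSec > 0 &&
         dev.dramBytesPerSec > 0);
  double computeTime = static_cast<double>(totalOps) / dev.peakOpsPerSec;
  double sramTime = static_cast<double>(sramBytes) / dev.sramBytesPerSec;
  double dramTime = static_cast<double>(dramBytes) / dev.dramBytesPerSec;
  return std::max({computeTime, sramTime, dramTime});
}
```

An op with zero counted compute ops still gets a nonzero estimate here, because its read and write traffic bounds the time.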

The PR is related to Graph Partitioning #2298

facebook-github-bot added the CLA Signed label

nrsatish changed the title ~~[WIP][glow][partitioning] Add cost functions to partitioner~~ [glow][partitioning] Add cost functions to partitioner

beicy reviewed

lib/Partitioner/Partitioner.cpp Outdated

beicy reviewed

lib/Partitioner/Partitioner.cpp Outdated

beicy changed the title ~~[glow][partitioning] Add cost functions to partitioner~~ [Partitioner] Add cost functions to partitioner

Contributor

beicy commented Feb 25, 2019

Thanks @nrsatish for working on it! If possible, could you please add some open-source info about roofline (like why the roofline is computed as the max of compute time (only for MatMul nodes right now))?
In addition, I changed the title of this PR.

opti-mix reviewed

include/glow/Partitioner/Partitioner.h Outdated

include/glow/Partitioner/Partitioner.h Outdated

include/glow/Partitioner/Partitioner.h Outdated

include/glow/Partitioner/Partitioner.h Outdated

include/glow/Partitioner/Partitioner.h Outdated

lib/Partitioner/Partitioner.cpp Outdated

lib/Partitioner/Partitioner.cpp Outdated

lib/Partitioner/Partitioner.cpp Outdated

lib/Partitioner/Partitioner.cpp Outdated

lib/Partitioner/Partitioner.cpp Outdated

stale bot commented Mar 5, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

stale bot added stale_will_be_closed and removed stale_will_be_closed labels

beicy reviewed


lib/Partitioner/Partitioner.cpp

    
    // Get the product of batch, output height, output dims, output channels
    totalOps = resultDims[0];
    for (int i = 1; i < resultDims.size(); i++) {
      totalOps *= resultDims[i];

Contributor

beicy Mar 5, 2019

Here, "i" should be size_t; otherwise, the type check will fail.
Usually, it can be written as "for (size_t i = 1, e = resultDims.size(); i < e; i++)".
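The suggested loop form can be sketched as a standalone function. This is an illustration only: `productOfDims` and the use of `std::vector<size_t>` are assumptions standing in for the `resultDims` loop in the diff, and the empty-input case returning 0 is a choice made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Multiply all dimensions together, using an unsigned index and caching
// the size once so the comparison never mixes signed and unsigned types.
uint64_t productOfDims(const std::vector<size_t> &dims) {
  if (dims.empty()) {
    return 0; // assumption: no dimensions means no work
  }
  uint64_t total = dims[0];
  for (size_t i = 1, e = dims.size(); i < e; i++) {
    total *= dims[i];
  }
  return total;
}
```

Comparing an `int` index against the unsigned result of `size()` typically triggers a sign-comparison warning, which is likely what fails when warnings are treated as errors.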

beicy reviewed

lib/Partitioner/Partitioner.cpp Outdated

beicy reviewed


lib/Partitioner/Partitioner.cpp

    
    /// Calculate compute ops. Currently only computed for Matmul, Conv, FC
    /// TODO: think about whether this is better off computed inside a Node.
    uint64_t totalOps = 0;

Contributor

beicy Mar 5, 2019

Just want to double-check here: in the future, do we need to add this computation for every node type?

Contributor Author

nrsatish Mar 5, 2019

Yes we do. At least for memory bytes if not flops.

Contributor Author

nrsatish Mar 5, 2019

But for most ops, flops is less important. There are only a handful of ops here that will be at all compute bound.

beicy reviewed


lib/Partitioner/Partitioner.cpp

    
    /// TODO: think about whether this is better off computed inside a Node.
    uint64_t totalOps = 0;
    if (node.getKind() == Kinded::Kind::MatMulNodeKind) {
      auto *MMN = llvm::dyn_cast<MatMulNode>(&node);

Contributor

beicy Mar 5, 2019

I prefer using "switch". If we need to add more node types here, "switch" looks better :)

Contributor Author

nrsatish Mar 5, 2019

Makes sense.
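A rough sketch of the switch-based layout discussed above. Everything here is hypothetical: the `NodeKind` enum, the `estimateOps` function, and its parameters are stand-ins rather than Glow's `Kinded::Kind` or node classes, and counting one multiply-accumulate per reduction step is an assumption of this sketch, not the PR's formula.

```cpp
#include <cstdint>

// Hypothetical node kinds (stand-in for Glow's Kinded::Kind values).
enum class NodeKind { MatMul, Conv, FullyConnected, Other };

// Estimate compute ops for a node. Only the handful of potentially
// compute-bound kinds get a nonzero count; everything else is left to
// the memory-bandwidth side of the roofline.
uint64_t estimateOps(NodeKind kind, uint64_t outputElems,
                     uint64_t reductionSize) {
  switch (kind) {
  case NodeKind::MatMul:
  case NodeKind::Conv:
  case NodeKind::FullyConnected:
    // Assumption: one multiply-accumulate per reduction step per output.
    return outputElems * reductionSize;
  default:
    return 0;
  }
}
```

Adding another compute-heavy kind then means adding one `case` label, rather than another `else if` with its own cast.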

beicy approved these changes


Contributor

beicy left a comment

LGTM! Thanks a lot for this work!


Add data structures for compute and communication time; add function to fill in compute and memory bandwidth bound times for ops

77d074c

nrsatish force-pushed the preprocess-parallel branch from 9e0bc2b to 77d074c Compare

March 6, 2019 01:42

beicy merged commit 74f88b3 into pytorch:master
