Add support for rand_like op in fusion compiler #9795


Closed
wants to merge 8 commits

Conversation

runtian-zhou

Enabled support for generating random numbers in the fusion compiler. Currently a Philox RNG implementation from TensorFlow is used, since NVRTC couldn't resolve the curand.h header correctly. The two implementations have exactly the same behavior according to our tests.
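For readers unfamiliar with counter-based RNGs, the following is a hedged sketch of Philox4x32-10, the scheme the PR inlines as a string literal. The constants are the standard Philox round/key-schedule constants (also used by TensorFlow and cuRAND); the PR's actual code layout may differ.

```cpp
#include <array>
#include <cstdint>

// Sketch of Philox4x32-10: a stateless, counter-based generator.
// Given a 128-bit counter and a 64-bit key (the seed), ten rounds of
// cheap 32x32->64 multiplies produce four pseudo-random 32-bit words.
using Counter = std::array<uint32_t, 4>;
using Key = std::array<uint32_t, 2>;

static void mulhilo(uint32_t a, uint32_t b, uint32_t& hi, uint32_t& lo) {
  uint64_t p = static_cast<uint64_t>(a) * b;  // full 64-bit product
  hi = static_cast<uint32_t>(p >> 32);
  lo = static_cast<uint32_t>(p);
}

Counter philox4x32_10(Counter ctr, Key key) {
  const uint32_t M0 = 0xD2511F53, M1 = 0xCD9E8D57;   // round multipliers
  const uint32_t W0 = 0x9E3779B9, W1 = 0xBB67AE85;   // key-schedule bumps
  for (int round = 0; round < 10; ++round) {
    uint32_t hi0, lo0, hi1, lo1;
    mulhilo(M0, ctr[0], hi0, lo0);
    mulhilo(M1, ctr[2], hi1, lo1);
    ctr = {hi1 ^ ctr[1] ^ key[0], lo1, hi0 ^ ctr[3] ^ key[1], lo0};
    key[0] += W0;
    key[1] += W1;
  }
  return ctr;  // four pseudo-random 32-bit words per invocation
}
```

Because the generator is a pure function of (counter, key), the same seed and offset always reproduce the same words, which is what makes bit-exact agreement with another Philox implementation possible in the first place.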

@ezyang
Contributor

ezyang commented Jul 24, 2018

@pytorchbot retest this please

@ezyang
Contributor

ezyang commented Jul 25, 2018

needs rebasing, sorry

@@ -80,12 +84,297 @@ struct TensorInfo {
IndexType strides[N];
};
)");
constexpr auto rand_support_literal = R"(

auto cuda_compilation_unit_template = CodeTemplate(R"(
${type_declarations}

extern "C" __global__
void ${kernelName}(IndexType totalElements, ${formals}) {
void ${kernelName}(IndexType totalElements, ${formals} ${RandParam}) {

// well.
if(has_random) {
auto gen_ = THCRandom_getGenerator(at::globalContext().getTHCState());
uint64_t offset = gen_->state.philox_seed_offset.fetch_add(20);


compilation_unit = cu.str();
nvrtcProgram program;
TORCH_NVRTC_CHECK(nvrtcCreateProgram(&program, compilation_unit.c_str(), NULL, 0, nullptr, nullptr));

std::string compute = "--gpu-architecture=compute_" + std::to_string(prop.major) + std::to_string(prop.minor);
std::vector<const char *> args = {"--std=c++11", compute.c_str()};
std::vector<const char *> args = {"--std=c++11", compute.c_str(), "-default-device"};


@@ -310,9 +600,11 @@ std::string encodeRHS(Node * n) {
}

std::vector<ConcatDesc> emitCompilationUnit(std::ostream & out,
bool& has_random,

Contributor

@ezyang ezyang left a comment


This is a really tight PR, I like it a lot! Good to go after the nits are fixed.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@runtian-zhou
Author

@pytorchbot retest this please

Contributor

@zdevito zdevito left a comment


Structure looks good. I have a few questions about how random number seeds are handled. For instance, if we have two calls to rand inside a kernel, it doesn't look like we change the amount we increment the seed accordingly. Can you explain how this works?

if(i == 0) buf = philox();
uint32 ret = buf[i];
i = (i + 1) % 4;
static uint32 FLOAT_MASK = (1 << 24) - 1;
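The buffering pattern quoted above can be illustrated on the host (this is an illustration, not the PR's device code): one philox() invocation yields four 32-bit words, which are handed out one at a time, and the generator is only re-invoked every fourth draw.

```cpp
#include <array>
#include <cstdint>

// Illustration of the review snippet's buffering: philox() is called
// once per four draws, with `i` cycling through the 4-word buffer.
struct BufferedRng {
  int i = 0;                        // index into the 4-word buffer
  int philox_calls = 0;             // bookkeeping for this illustration
  std::array<uint32_t, 4> buf{};

  std::array<uint32_t, 4> philox() {  // stand-in for the real generator
    ++philox_calls;
    return {1u, 2u, 3u, 4u};
  }

  uint32_t next() {
    if (i == 0) buf = philox();     // refill every fourth draw
    uint32_t ret = buf[i];
    i = (i + 1) % 4;
    return ret;
  }
};
```

Eight draws trigger exactly two philox() invocations; this draws-per-call ratio is what ties the per-launch offset increment to the number of rand calls in the kernel.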

// well.
if(has_random && this->backend() == at::kCUDA) {
auto gen_ = THCRandom_getGenerator(at::globalContext().getTHCState());
uint64_t offset = gen_->state.philox_seed_offset.fetch_add(20);
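The offset bookkeeping above can be sketched in isolation: each kernel launch atomically reserves a block of the Philox subsequence by bumping a shared counter, so successive launches never reuse random numbers. The increment of 20 mirrors the snippet; whether it should scale with the number of rand calls in the kernel is exactly the question raised in this review.

```cpp
#include <atomic>
#include <cstdint>

// Sketch of per-launch subsequence reservation via an atomic offset.
// fetch_add returns the previous value: this launch starts there, and
// the next launch starts `draws` further along, so ranges never overlap.
std::atomic<uint64_t> philox_seed_offset{0};

uint64_t reserve_offset(uint64_t draws) {
  return philox_seed_offset.fetch_add(draws);
}
```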


@@ -321,9 +619,12 @@ std::string encodeRHS(Node * n) {
}

std::vector<ConcatDesc> emitCompilationUnit(std::ostream & out,
bool* has_random,

}
PHILOX_DEVICE_INLINE float operator()() {
if(i == 0) buf = philox();
uint32 ret = buf[i];

if(i == 0) buf = philox();
uint32 ret = buf[i];
i = (i + 1) % 4;
const uint32 FLOAT_MASK = (1 << 24) - 1;
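The FLOAT_MASK step in the snippet above converts a raw 32-bit word to a uniform float: keep the low 24 bits (a float's significand width) and scale by 2^-24, which lands exactly in [0, 1) with no rounding surprises. A minimal host-side sketch:

```cpp
#include <cstdint>

// uint32 -> uniform float in [0, 1): mask to 24 bits, scale by 2^-24.
// 24 bits fit a float's significand exactly, so every result is exact
// and the maximum value is 1 - 2^-24, strictly below 1.
float uint32_to_float(uint32_t ret) {
  const uint32_t FLOAT_MASK = (1u << 24) - 1;     // 0x00FFFFFF
  const float FLOAT_DIVISOR = 1.0f / (1u << 24);  // 2^-24
  return (ret & FLOAT_MASK) * FLOAT_DIVISOR;
}
```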


@runtian-zhou
Author

@pytorchbot retest this please

@@ -88,11 +94,116 @@ struct TensorInfo {
};
)");

// The reason why we used TensorFlow's philox implementation is that currently

};

// Constants are picked from https://www.doornik.com/research/randomdouble.pdf
#define M_RAN_INVM32 2.32830643653869628906e-010

counter.x += nlo;
if (counter.x < nlo)
nhi++;
counter.y += nhi;
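The carry handling in the snippet above can be shown in a self-contained sketch: the 128-bit Philox counter is stored as four 32-bit limbs, and adding a 64-bit skip value must propagate a carry from the low limb into the next one. The struct and function names here are illustrative, not the PR's.

```cpp
#include <cstdint>

// 128-bit counter as four 32-bit limbs, as in the Philox state above.
struct Ctr { uint32_t x, y, z, w; };

// Advance the counter by n draws, mirroring the quoted carry logic.
void skip_ahead(Ctr& counter, uint64_t n) {
  uint32_t nlo = static_cast<uint32_t>(n);
  uint32_t nhi = static_cast<uint32_t>(n >> 32);
  counter.x += nlo;
  if (counter.x < nlo)  // unsigned wraparound means a carry occurred
    nhi++;
  counter.y += nhi;
  // (a full implementation would keep carrying into z and w as well)
}
```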

@ngimel
Collaborator

ngimel commented Jul 31, 2018

This looks better now. I have a more general question: do we want the philox generator as a string literal, or as a header that's redistributed with binaries and can thus be supplied to NVRTC as a header rather than as part of a source file? The advantage of the header is that when THC generation moves to philox too (to match the jit), it can reuse this header; otherwise THC will either have to use curand or carry a copy-paste of what's in this PR, and neither option is great IMO.

@runtian-zhou
Author

I wasn't quite sure how we should redistribute the header. I think NVRTC needs either the path to the header file or the actual string literal, rather than just an include directive.

@ngimel
Collaborator

ngimel commented Jul 31, 2018

I think it does need the path to the header file, but pytorch should be able to provide that at runtime if it's part of the pytorch binary install?

@ezyang
Contributor

ezyang commented Aug 1, 2018

I agree it's "better" for the header to be in an actual file, but to do this we have to solve some redistribution problems, where the JIT code doesn't "know" where we installed ATen/TH headers, so what file should it pass to the compiler? (You can hardcode the filepath into the binary, but congratulations, your binary is no longer relocatable). It's just generally easier to make things work if you have the string in the binary.

This isn't a fatal problem; for example, you can get the header location from Python and pass it in (like how torch/utils/cpp_extension.py does it). It's just more work. @zrt95, it's up to you; if you want to merge the PR as is to start with I'm OK with that.

@runtian-zhou
Author

I think I'd prefer to merge it as is.

@ezyang
Contributor

ezyang commented Aug 1, 2018

you have my blessing

@ssnl
Collaborator

ssnl commented Aug 1, 2018

@pytorchbot retest this please

@mcarilli
Collaborator

mcarilli commented Aug 1, 2018

LGTM. I copied the new Philox implementation and benchmarked it on my Titan V (roughly equivalent to V100). Performance is decent and I'm not observing any pesky local memory use. The values produced match cuRAND's for the same seed, sequence numbers, and offsets.

Contributor

@facebook-github-bot facebook-github-bot left a comment


SsnL is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
Summary:
Enabled support for generating random numbers in the fusion compiler. Currently a Philox RNG implementation from TensorFlow is used, since NVRTC couldn't resolve the curand.h header correctly. The two implementations have exactly the same behavior according to our tests.
Pull Request resolved: pytorch#9795

Differential Revision: D8999029

Pulled By: SsnL

fbshipit-source-id: f0d2616a699a942e2f370bdb02ac77b9c463d7b8