Add support for rand_like op in fusion compiler #9795
Conversation
@pytorchbot retest this please
needs rebasing, sorry
torch/csrc/jit/fusion_compiler.cpp
@@ -80,12 +84,297 @@ struct TensorInfo {
  IndexType strides[N];
};
)");
constexpr auto rand_support_literal = R"(
auto cuda_compilation_unit_template = CodeTemplate(R"(
${type_declarations}

extern "C" __global__
- void ${kernelName}(IndexType totalElements, ${formals}) {
+ void ${kernelName}(IndexType totalElements, ${formals} ${RandParam}) {
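For context, a minimal sketch of the shape of kernel this template produces once ${RandParam} is substituted; every name below is an illustrative assumption, not the PR's actual generated code.

// Illustrative expansion only: a trivial fused kernel whose formals are
// followed by the seed/offset pair that ${RandParam} would inject.
typedef unsigned int IndexType;

extern "C" __global__
void fused_kernel_0(IndexType totalElements, float* output,
                    unsigned long long seed, unsigned long long offset) {
  IndexType linearIndex = blockIdx.x * blockDim.x + threadIdx.x;
  if (linearIndex < totalElements) {
    // the generated body would draw from a per-thread Philox stream keyed
    // on (seed, linearIndex, offset); a constant stands in for it here
    output[linearIndex] = 0.5f;
  }
}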
torch/csrc/jit/fusion_compiler.cpp
// well.
if(has_random) {
  auto gen_ = THCRandom_getGenerator(at::globalContext().getTHCState());
  uint64_t offset = gen_->state.philox_seed_offset.fetch_add(20);
compilation_unit = cu.str();
nvrtcProgram program;
TORCH_NVRTC_CHECK(nvrtcCreateProgram(&program, compilation_unit.c_str(), NULL, 0, nullptr, nullptr));

std::string compute = "--gpu-architecture=compute_" + std::to_string(prop.major) + std::to_string(prop.minor);
- std::vector<const char *> args = {"--std=c++11", compute.c_str()};
+ std::vector<const char *> args = {"--std=c++11", compute.c_str(), "-default-device"};
torch/csrc/jit/fusion_compiler.cpp
@@ -310,9 +600,11 @@ std::string encodeRHS(Node * n) {
}

std::vector<ConcatDesc> emitCompilationUnit(std::ostream & out,
                                            bool& has_random,
This is a really tight PR, I like it a lot! Good to go after the nits are fixed.
Force-pushed from cacc6a4 to a681197
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Force-pushed from 1db3ae4 to dae9db0
@pytorchbot retest this please
Structure looks good. I have a few questions about how random number seeds are handled. For instance, if we have two calls to rand inside a kernel, it doesn't look like we change the amount we increment the seed accordingly. Can you explain how this works?
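For readers following the thread: a hedged host-side sketch of how a counter-based generator can get away with a fixed per-launch increment, assuming (as the diff above suggests) that each philox() call consumes one counter value and yields four 32-bit outputs. The helper below is illustrative, not the PR's code.

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> philox_seed_offset{0};

// Reserve a block of counter values before a launch. Each thread derives
// its private counter from (seed, thread_id, offset), so the global offset
// only needs to advance once per launch, by enough to cover the worst-case
// number of philox() calls any single thread makes -- not once per rand op.
uint64_t reserve_offset(uint64_t max_philox_calls_per_thread) {
  return philox_seed_offset.fetch_add(max_philox_calls_per_thread);
}

Read this way, the fixed fetch_add(20) in the diff would budget up to 20 philox() calls (80 draws) per thread per launch regardless of how many rand ops the fused kernel contains, though whether that budget is actually enforced is exactly what the question above is probing.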
torch/csrc/jit/fusion_compiler.cpp
if(i == 0) buf = philox();
uint32 ret = buf[i];
i = (i + 1) % 4;
static uint32 FLOAT_MASK = (1 << 24) - 1;
torch/csrc/jit/fusion_compiler.cpp
// well.
if(has_random && this->backend() == at::kCUDA) {
  auto gen_ = THCRandom_getGenerator(at::globalContext().getTHCState());
  uint64_t offset = gen_->state.philox_seed_offset.fetch_add(20);
torch/csrc/jit/fusion_compiler.cpp
@@ -321,9 +619,12 @@ std::string encodeRHS(Node * n) {
}

std::vector<ConcatDesc> emitCompilationUnit(std::ostream & out,
                                            bool* has_random,
torch/csrc/jit/fusion_compiler.cpp
}
PHILOX_DEVICE_INLINE float operator()() {
  if(i == 0) buf = philox();
  uint32 ret = buf[i];
torch/csrc/jit/fusion_compiler.cpp
if(i == 0) buf = philox();
uint32 ret = buf[i];
i = (i + 1) % 4;
const uint32 FLOAT_MASK = (1 << 24) - 1;
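As a side note on the mask: floats carry a 24-bit significand, so keeping 24 random bits and scaling gives a uniform value in [0, 1). A standalone sketch of that mapping (assumed helper, not the PR's exact code):

#include <cstdint>

// Map a raw 32-bit Philox draw to a float in [0, 1). The mask keeps the
// low 24 bits so every representable step of 2^-24 is equally likely.
float uint32_to_float(uint32_t ret) {
  const uint32_t FLOAT_MASK = (1u << 24) - 1;
  const float FLOAT_DIVISOR = 1.0f / (1u << 24);
  return (ret & FLOAT_MASK) * FLOAT_DIVISOR;
}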
Force-pushed from 005153c to 32b5a1e
@pytorchbot retest this please
torch/csrc/jit/fusion_compiler.cpp
@@ -88,11 +94,116 @@ struct TensorInfo {
};
)");

// The reason why we used TensorFlow's philox implementation is that currently
torch/csrc/jit/fusion_compiler.cpp
};

// Constants are picked from https://www.doornik.com/research/randomdouble.pdf
#define M_RAN_INVM32 2.32830643653869628906e-010
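For the curious: M_RAN_INVM32 is exactly 2^-32, and Doornik's scheme in the linked paper turns a 32-bit draw into a double by treating it as signed and recentering. A minimal sketch under that reading (the helper name is an assumption):

#include <cstdint>

#define M_RAN_INVM32 2.32830643653869628906e-010  /* == 2^-32 */

// Doornik's conversion: interpret the draw as a signed 32-bit integer,
// scale into [-0.5, 0.5), then shift into [0, 1).
double uint32_to_double(uint32_t x) {
  return (int32_t)x * M_RAN_INVM32 + 0.5;
}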
// Unsigned adds wrap on overflow, so checking counter.x < nlo after the
// add detects the carry into the next word of the 128-bit Philox counter.
counter.x += nlo;
if (counter.x < nlo)
  nhi++;
counter.y += nhi;
This looks better now. I have a more general question: do we want the philox generator as a string literal, or as a header that's redistributed with binaries and can therefore be supplied to nvrtc as a header rather than as part of a source file? The pro of going with the header is that when THC generation is moved to use philox too (to match the jit), it can reuse this header; otherwise THC will either have to use curand or carry a copy-paste of what's in this PR, and neither of those options is great IMO.
Force-pushed from 32b5a1e to 73ae53b
I wasn't quite sure how we should redistribute the header. I think NVRTC needs either the path to the header file or the actual string literal, rather than just an include directive.
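For reference, NVRTC can take header contents inline rather than filesystem paths: nvrtcCreateProgram accepts an array of header bodies plus the names that resolve their #include directives. A hedged sketch of how the philox header could be supplied that way (the function and its string arguments are assumptions, not variables from this PR):

#include <nvrtc.h>
#include <string>

nvrtcProgram compile_with_header(const std::string& kernel_src,
                                 const std::string& philox_header) {
  nvrtcProgram program;
  const char* headers[]       = { philox_header.c_str() };  // header body
  const char* include_names[] = { "philox.cuh" };  // resolves #include "philox.cuh"
  TORCH_NVRTC_CHECK(nvrtcCreateProgram(
      &program, kernel_src.c_str(), nullptr,
      1, headers, include_names));
  return program;
}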
Force-pushed from 73ae53b to 981a1e5
Force-pushed from 981a1e5 to d802d7a
I think it does need a path to the header file, but pytorch should be able to provide that at runtime if it's part of the pytorch binary install?
I agree it's "better" for the header to be in an actual file, but to do this we have to solve some redistribution problems: the JIT code doesn't "know" where we installed the ATen/TH headers, so what file should it pass to the compiler? (You can hardcode the filepath into the binary, but congratulations, your binary is no longer relocatable.) It's just generally easier to make things work if you have the string in the binary. This isn't a fatal problem; for example, you can get the header location from Python and pass it in (like how …)
I think I'd prefer to merge it as is.
you have my blessing
@pytorchbot retest this please
LGTM. I copied the new Philox implementation and benchmarked it on my Titan V (roughly equivalent to V100). Performance is decent and I'm not observing any pesky local memory use. The values produced match Curand for the same seed, sequence numbers, and offsets.
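For anyone wanting to reproduce that comparison, a hedged sketch of the cuRAND reference side: raw 32-bit Philox4_32-10 draws for a fixed seed/subsequence/offset, to be diffed on the host against the JIT implementation's output. The kernel name and output layout are illustrative assumptions.

#include <curand_kernel.h>
#include <cstdint>

// Each thread writes n raw draws from its own Philox stream.
__global__ void curand_reference(uint64_t seed, uint64_t offset,
                                 unsigned int* out, int n) {
  curandStatePhilox4_32_10_t state;
  // subsequence = thread id, matching the one-stream-per-thread convention
  curand_init(seed, threadIdx.x, offset, &state);
  for (int i = 0; i < n; ++i)
    out[threadIdx.x * n + i] = curand(&state);
}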
SsnL is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Enabled support for generating random numbers in the fusion compiler. Currently a Philox RNG implementation from TensorFlow is used, as NVRTC couldn't resolve the curand.h header correctly. The two implementations should have exactly the same behavior according to our tests. Pull Request resolved: pytorch#9795 Differential Revision: D8999029 Pulled By: SsnL fbshipit-source-id: f0d2616a699a942e2f370bdb02ac77b9c463d7b8