🐛 Describe the bug

Recent changes in PR #1800 made it possible for multi-output siblings to run out of sync after the computeAt pass. Here is a minimal repro:
```cpp
TEST_F(NVFuserTest, FusionInlineRepro_CUDA) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  TensorView* tv0 = makeContigTensor(2);
  fusion.addInput(tv0);

  auto tv1 = set(tv0);
  auto tvs = Welford(tv1, {1});
  auto tvo = set(tvs.var_sum);
  fusion.addOutput(tvo);

  tvo->split(0, 16);
  tvo->axis(1)->parallelize(ParallelType::Unroll);

  tv0->computeAt(tvo, -1, ComputeAtMode::BestEffort);

  fusion.printMath();

  TORCH_INTERNAL_ASSERT(
      tvs.var_sum->getComputeAtPosition() == tvs.avg->getComputeAtPosition());
}
```
Snippet from the output:
```
T2_l[ iS21{( ceilDiv(i1, 16) )}, iS22{16}, rS5{i2} ] ca_pos( 2 ) produce_pos( 2)(Avg),
T3_l[ iS13{( ceilDiv(i1, 16) )}, iS14{16}, rS7{i2} ] ca_pos( 1 ) produce_pos( 2)(Var),
T4_l[ iS19{( ceilDiv(i1, 16) )}, iS20{16}, rS9{i2} ] ca_pos( 2 ) produce_pos( 2)(Count)
  = Welford ( T1_l[ iS15{( ceilDiv(i1, 16) )}, iS16{16}, iS3{i2} ] ca_pos( 2 )(Avg), allreduce = 0 )
```
Note the difference in ca_pos across the three outputs.

The siblings being out of sync is one problem; moreover, a ca_pos of 2 should not be allowed here at all, as it would lead to confusion at the expression sorting stage.
A possible fix, I'm guessing, would be to incorporate getMaxPosAll of all siblings when propagating among them: https://github.com/csarofeen/pytorch/blob/devel/torch/csrc/jit/codegen/cuda/inline_propagator.cpp#L253
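For illustration, a rough sketch of what I mean (not a drop-in patch; I'm assuming getMaxPosAll(tv) from the linked MaxPosCalculator is callable here and that sibling outputs are reachable via tv->definition()->outputs()):

```cpp
// Sketch only: take the most conservative (smallest) max inline position
// across all sibling outputs of a multi-output expression such as Welford,
// so that avg / var_sum / N always end up with the same ca_pos.
size_t getMaxPosAllSiblings(MaxPosCalculator& calc, TensorView* tv) {
  size_t max_pos = calc.getMaxPosAll(tv);
  if (tv->definition() != nullptr) {
    for (auto sibling :
         ir_utils::filterByType<TensorView>(tv->definition()->outputs())) {
      // Includes tv itself; min with its own value is a no-op.
      max_pos = std::min(max_pos, calc.getMaxPosAll(sibling));
    }
  }
  return max_pos;
}
```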
Versions

TO BE UPDATED