Nvfuser code bump 030122 #73627

Closed · wants to merge 690 commits
Commits (690)
157c57b
Print smem error info (#1157)
naoyam Sep 28, 2021
68dec55
Use WARP_SIZE instead of 32 (#1158)
naoyam Sep 28, 2021
9ebcb2a
Prevent unused variable warning (#1159)
naoyam Sep 29, 2021
21884ea
Fix computation of thread predicate with broadcast (#1163)
naoyam Oct 1, 2021
bc98e3c
[WIP] Channels last refactor (#1118)
jjsjann123 Oct 2, 2021
211185f
Revert "Revert D30752939: [pytorch][PR] nvfuser update" (#65137) (#1170)
jjsjann123 Oct 5, 2021
e8ecaa3
Merge remote-tracking branch 'upstream/master' into HEAD
jjsjann123 Oct 6, 2021
22015e4
Missed pr from upstream merge (#1175)
csarofeen Oct 7, 2021
28dc2a1
Explicitly track loops and IterDomains that do not contribute to inde…
naoyam Oct 7, 2021
d5c8abe
Change indexing and predication to address non-exact threading dimens…
naoyam Oct 8, 2021
9f3ebec
Allow setting contiguity of tensors (#1161)
naoyam Oct 8, 2021
88c7ee6
Place unswitched shared memory allocations outside of unswitched doma…
naoyam Oct 8, 2021
b4b1954
range-based for loop (#1176)
jjsjann123 Oct 8, 2021
8d973d2
cleanup (#1177)
naoyam Oct 8, 2021
7be3f85
Revert "Revert D31227448: [pytorch][PR] fixing sorting in stride indi…
jjsjann123 Oct 8, 2021
b6e1b10
Detect parallelization with predicated parallel types (#1166)
naoyam Oct 8, 2021
8cb7129
[JIT] Initialize CUDA context before launching fused kernel (#65064) …
jjsjann123 Oct 8, 2021
239621f
Inline thread predicates even when unswitched (#1174)
naoyam Oct 9, 2021
38d1870
softmax/backward dtype argument support (#1180)
jjsjann123 Oct 11, 2021
bb37524
Do not predicate non-exact parallel domains when generating unswitch …
naoyam Oct 11, 2021
3c84a33
adding removeInplaceOperations pass for nvfuser (#1186)
jjsjann123 Oct 12, 2021
dd8ed0b
Fix predicates and indexing for vectorization with unswitch and unrol…
csarofeen Oct 12, 2021
bebe584
Merging the shift-specific predicate logic into the main predicate lo…
naoyam Oct 12, 2021
bee312c
Disable NVTX recording with PYTORCH_NVFUSER_DIABLE_NVTX (#1192)
naoyam Oct 12, 2021
f10afcd
issue 1189 repro and fix (#1193)
shmsong Oct 13, 2021
c5cd42d
Parser refactor (#1191)
jjsjann123 Oct 13, 2021
47e33bf
Type Promotion and Special Number Test Cases (#1188)
rdspring1 Oct 13, 2021
5154c53
Perf Tuning and Schedulers refactor (#1073)
csarofeen Oct 14, 2021
ffb9b24
Benchmark refactoring, add backwards benchmarks. (#1190)
csarofeen Oct 14, 2021
bed9edc
Benchmark fix for warp reduced softmax (#1195)
shmsong Oct 15, 2021
851c2fc
Fix Issue #1201 - __bfloat2float error (#1202)
rdspring1 Oct 18, 2021
6279ee1
Fix Threshold and Clamp Type Promotion (#1168)
csarofeen Oct 18, 2021
5769022
Minor fixes and warp padding propagation in parallelize all like. (#1…
csarofeen Oct 18, 2021
ab3fda7
Merge commit '7baf4f6b12bbd05aba2baea6534eae4ca8f2982b' into HEAD
jjsjann123 Oct 19, 2021
06a7303
initialize registry before accessing! (#1206)
jjsjann123 Oct 19, 2021
fa5a163
Merge pull request #1204 from csarofeen/devel_master_update_10_11_21
csarofeen Oct 19, 2021
28bdce1
Cuda fusion guard with profile ivalue (#1197)
jjsjann123 Oct 21, 2021
78641ef
Bn fp16 io alias update (#1207)
jjsjann123 Oct 21, 2021
c85126d
Fix dependency check in reduction schedulers involving welford ops (#…
shmsong Oct 21, 2021
d4d68bc
Fix expr sorting and loop nest generation. (#1209)
csarofeen Oct 22, 2021
eb718e6
Rewrite BN backwards to be 2 reduction approach, not 4. (#1211)
csarofeen Oct 22, 2021
3fe6949
change WelfordLargeNormalization test (#1214)
shmsong Oct 22, 2021
23e3f6e
Validate grid reduction predication (#1215)
naoyam Oct 22, 2021
1826a87
Minor sort refactor in reduction_utils.cpp (#1216)
csarofeen Oct 26, 2021
efeda74
Support TensorIndex outputs of kir::Expr (#1224)
naoyam Oct 26, 2021
2489ab9
Update Type Promotion Rules (#1217)
rdspring1 Oct 28, 2021
5a93f93
Change TensorDomain contiguity to be getMaybeRFactorDomain size (#1196)
rdspring1 Oct 28, 2021
967c0cc
Fix negative position in reducitonop (#1231)
naoyam Oct 29, 2021
a8e8d4b
Type Promotion Fix (#1236)
rdspring1 Oct 29, 2021
347d2a9
Add predicates for thread dimensions in unswitched predicates (#1222)
naoyam Oct 29, 2021
750ebd5
Add fp16/fp32 autocasting to JIT/TorchScript (#63939) (#1242)
jjsjann123 Nov 1, 2021
aef12b5
fixing bfloat16 test failures (#1246)
jjsjann123 Nov 2, 2021
3acb5a4
Merge commit 'cd51d2a3ecc8ac579bee910f6bafe41a4c41ca80' into HEAD
jjsjann123 Nov 3, 2021
f94c087
Merge pull request #1248 from csarofeen/master_merge_1101
csarofeen Nov 3, 2021
c40bcc1
Minor changes to benchmarks (#1232)
csarofeen Nov 3, 2021
7edd2ac
fixing _batch_norm_impl_index(_backward) in shape expression (#1228)
jjsjann123 Nov 3, 2021
2689322
fixing removeOutputUsedOnlyInDtype pass (#1227)
jjsjann123 Nov 4, 2021
9a6ae64
code cleaning (#1251)
jjsjann123 Nov 4, 2021
6df8783
Print full fusion in segmenter debug. (#1235)
csarofeen Nov 4, 2021
055ca94
Vectorization detection fix in schedulers (#1249)
csarofeen Nov 4, 2021
a7cfa21
Quick fixes on linear split to `matmul` + `add_optional` (#1253)
jjsjann123 Nov 4, 2021
a59334d
Predicate reference IterDomains that are mapped with consumer root do…
naoyam Nov 4, 2021
7c9d1ac
Fix/Improve Persistent buffer computation (#1237)
csarofeen Nov 4, 2021
ad82dfa
Add reset exprs util in SegmentedGroup (#1219)
shmsong Nov 4, 2021
c235daf
Code changed, update C++ tests. (#1256)
csarofeen Nov 4, 2021
310bab8
conv2d passes added to separate bias add (#1226)
jjsjann123 Nov 4, 2021
deeff39
Always schedule some parallelization on iter domain in reductions for…
csarofeen Nov 5, 2021
7fe0b7b
format and warning cleanup (#1220)
shmsong Nov 8, 2021
ba24a0f
Initial support of strided gather (#1262)
naoyam Nov 10, 2021
b8dfd8c
Code bump 11 5 clean up (#1263)
jjsjann123 Nov 10, 2021
34322b1
Minor fix to rfactor, we can rfactor trivial reductions now, which is…
csarofeen Nov 10, 2021
b9e0f74
Refactor grid and block reductions. (#1267)
csarofeen Nov 11, 2021
49d91ee
Fix loop lowering and expr sorting, by making sure loop dependencies …
csarofeen Nov 16, 2021
75261f6
Rewrite grid synchronization (#1260)
csarofeen Nov 16, 2021
317bcd2
Cross-block persistent support in codegen (#1268)
csarofeen Nov 16, 2021
8fcef3b
Remove assert in blockWelford (#1273)
naoyam Nov 16, 2021
cf61f34
Make non-divisible splits not change extents used in indexing (#1270)
naoyam Nov 17, 2021
7ca3758
View Support - Cpp Only (#1245)
rdspring1 Nov 18, 2021
a87821d
Sibling fusion pr (#1278)
jjsjann123 Nov 19, 2021
b620917
fixing removeProfilingNodes duplicated functions (#1282)
jjsjann123 Nov 23, 2021
6b2f316
horizontal fusion patch (#1283)
jjsjann123 Nov 23, 2021
43e4f8a
Arbitrary permutation support in codegen integration (#1271)
jjsjann123 Nov 23, 2021
e251b0a
Nvfuser code bump 11 5 (#67943) (#1285)
jjsjann123 Nov 23, 2021
976c8d9
Native dropout cherry pick (#1286)
jjsjann123 Nov 24, 2021
68040a1
Adding parsing of threshold_backward and _softmax for LTC (#1288)
kevinstephano Nov 30, 2021
992916d
revert autodiff add_0 changes and tests (#1287)
jjsjann123 Dec 1, 2021
1f5702b
Merge remote-tracking branch 'upstream/master' into HEAD
jjsjann123 Dec 1, 2021
67fa0f0
Merge pull request #1290 from csarofeen/devel_master_bump_11_30
csarofeen Dec 2, 2021
d9710d6
Fix keep_dim with negative positions (#1294)
naoyam Dec 3, 2021
1002379
Fallback for kernel expr eval. (#1298)
csarofeen Dec 6, 2021
63e10b9
Disallow welford in normalization scheduler (#1297)
csarofeen Dec 6, 2021
1145c12
Generate predicate expressions using consumers (#1300)
naoyam Dec 7, 2021
9224858
Recompile for register usage (#1296)
csarofeen Dec 7, 2021
c3777cd
Make the reduction work buffer volatile (#1301)
naoyam Dec 8, 2021
062e72a
Reduction scheduler refactor (#1299)
csarofeen Dec 8, 2021
b25d182
PYTORCH_NVFUSER_ONE_OP_FUSION=1 will take all nodes nvFuser supports,…
jjsjann123 Dec 8, 2021
e408152
clang-format (#1303)
naoyam Dec 8, 2021
1f55fc4
View Support - Python (#1261)
rdspring1 Dec 8, 2021
ff009fb
Codegen fixes and test patches for pre-volta device (#1304)
jjsjann123 Dec 12, 2021
dada9aa
code sanitization cherry-picked from upstream push (#1295)
jjsjann123 Dec 13, 2021
00e297a
Remove the option to use Int* as gather window sizes (#1307)
naoyam Dec 13, 2021
09a9438
Simplify gather predicate generation. (#1308)
naoyam Dec 13, 2021
e01e5bf
Disable fast math (#1323)
csarofeen Dec 15, 2021
eeb4d0c
Explicitly track all unmappable dims in compute at. (#1324)
csarofeen Dec 15, 2021
e2f287a
Make dispatch of KIR and FusionIR more similar. (#1314)
csarofeen Dec 15, 2021
578e6a9
Fix test names. (#1329)
csarofeen Dec 15, 2021
7e84e15
add missing terminating " character (#1330)
crcrpar Dec 16, 2021
541bd77
Type Promotion Fixes (#1322)
rdspring1 Dec 16, 2021
1a616d9
Implement a visitor like class for KIR, move passes to it. (#1332)
csarofeen Dec 17, 2021
ef62e4e
fixing conv2d decomposition and tests (#1333)
jjsjann123 Dec 18, 2021
2158dba
Create mutator class for kir and refactor passes (#1336)
csarofeen Dec 21, 2021
59cbf76
Refactor get allocation information in lower_utils (#1337)
csarofeen Dec 21, 2021
2672320
Alias copy patch (#1338)
jjsjann123 Dec 21, 2021
b308af2
Add rsub for functorch support. (#1342)
kevinstephano Dec 23, 2021
f236ee9
Fixes patches from ltc aot autograd etc (#1340)
jjsjann123 Dec 23, 2021
9fb69ab
Nvfuser code bump 12 5 (#69964) (#1345)
jjsjann123 Jan 1, 2022
be3267d
Fix segfault. (#1357)
csarofeen Jan 6, 2022
d29fb48
Collection of refactoring in nvFuser lowering (#1339)
csarofeen Jan 6, 2022
24313d9
Clang format (#1360)
naoyam Jan 6, 2022
2c40949
Support more flexible padding sizes in shift and gather (#1334)
naoyam Jan 7, 2022
9e0c9af
clang-tidy (#1363)
naoyam Jan 7, 2022
7ce469c
Print CA info only when FIR (#1364)
naoyam Jan 7, 2022
99be762
Transposing scalar tensor patch (#1361)
jjsjann123 Jan 7, 2022
850200c
Build error fix (and clang-format) (#1368)
naoyam Jan 8, 2022
4f6c999
Fixes #1310 - alias_copy assertion in fallback path (#1335)
rdspring1 Jan 8, 2022
34ac15d
Segment independent component on fusion graph (#1370)
shmsong Jan 18, 2022
d78a0c4
Avoid constructing a new TV in parsing. (#1374)
csarofeen Jan 19, 2022
a2a0f54
Remove Kernel IR join infrastructure with Fusion (#1373)
csarofeen Jan 19, 2022
6b66dce
clang-format (#1394)
naoyam Jan 19, 2022
589cbca
Pass inputs to compileFusion to avoid redundant compilation (#1395)
naoyam Jan 20, 2022
0da82c4
Double buffering support (#1381)
naoyam Jan 20, 2022
39082d7
Some minor fixes. (#1401)
csarofeen Jan 21, 2022
e4aa436
Graph lint error patch (#1378)
jjsjann123 Jan 21, 2022
5a4716a
new clang-format binary hash (#1398)
shmsong Jan 22, 2022
5f8de6f
Verify vectorization eligibility for intermediate tensors (#1402)
naoyam Jan 24, 2022
08559e7
Add nullptr protection. (#1407)
csarofeen Jan 24, 2022
d730b56
Merge commit '17540c5c80f5c6cd4e0fee42ec47d881e46f47f9' into HEAD
jjsjann123 Jan 25, 2022
6c9aacf
Fix non-unrolled reduction scheduling. (#1409)
csarofeen Jan 26, 2022
bb70ddc
Detect non-concretized broadcast domains (#1412)
naoyam Jan 28, 2022
a638157
support device promotion on scalar tensor in nvfuser (#1400)
jjsjann123 Jan 28, 2022
f1c943f
Merge remote-tracking branch 'csarofeen/devel' into HEAD
jjsjann123 Jan 30, 2022
d3e5eb1
Merge remote-tracking branch 'upstream/master' into HEAD
jjsjann123 Jan 31, 2022
a5d44ac
Merge remote-tracking branch 'upstream/master' into upstream_master_b…
jjsjann123 Jan 31, 2022
f9f20c7
Add tanh-backward support (#1420)
rdspring1 Feb 1, 2022
8627b94
Merge branch 'devel' of https://www.github.com/csarofeen/pytorch into…
csarofeen Feb 1, 2022
78c82c5
Merge pull request #1414 from csarofeen/upstream_master_bump_012622
csarofeen Feb 1, 2022
b88b0ea
Set nondet_tol to 1e-5 for gradcheck in test_unary_ops (#1423)
rdspring1 Feb 1, 2022
e0082e7
Add log-softmax, mean, var, and std operations (#1417)
rdspring1 Feb 3, 2022
54df0ed
fixing stride order for expanded tensor (#71665) (#1431)
jjsjann123 Feb 3, 2022
d0e47f0
Additional type promotion tests involving cpu scalars (#1415)
shmsong Feb 3, 2022
e5df130
Add basic support for dtype ComplexFloat, ComplexDouble (#1427)
zasdfgbnm Feb 7, 2022
9233f91
Cleans up some of the old IR transformation code with kir::ExprMutato…
naoyam Feb 7, 2022
35c1704
Avoid some unnecessary predicates. (#1429)
csarofeen Feb 7, 2022
5069bb3
Map everything between multiple outputs even for the CA Parallel Map …
naoyam Feb 7, 2022
fd935ef
RMSNorm with tests/benchmarking (shapes based off of HuggingFace T5 o…
eqy Feb 8, 2022
4e7ff71
Do not inline allocated scalars (#1434)
naoyam Feb 8, 2022
16df2b8
print 0-dim tensors as tensors (#1442)
naoyam Feb 8, 2022
3c9c1f1
Index Hoisting (#1426)
naoyam Feb 9, 2022
9bcc35a
Move welford to use nvfuser_index_t, pipe it through as a compile tim…
csarofeen Feb 9, 2022
d7635d0
Use ParallelMap in expr sorting (#1436)
csarofeen Feb 9, 2022
498a00b
Extend use of SimplyfingIrBuilder (#1448)
naoyam Feb 9, 2022
2d979a2
derive heuristics in intermediate reduction groups (#1447)
shmsong Feb 10, 2022
8d37c89
View copy patch (#1450)
jjsjann123 Feb 10, 2022
cec1d9d
patching dtype for int32 and bool casting (#1449)
jjsjann123 Feb 10, 2022
c65ee8b
test fix (#1456)
naoyam Feb 10, 2022
966fc25
Fix input vectorization in the pointwise scheduler. (#1459)
csarofeen Feb 10, 2022
44e8c15
Rework vectorized load/stores. (#1457)
csarofeen Feb 11, 2022
40833b3
Fix 896 and 1446 (#1461)
csarofeen Feb 11, 2022
f9af139
Refactoring of loop materialization (#1452)
naoyam Feb 11, 2022
3134d3a
Index hoist follow-up (#1458)
naoyam Feb 11, 2022
0cb4552
Remove debug print (#1463)
naoyam Feb 11, 2022
e36a9fa
fixing parsing rule for empty reduction axes (#1454)
jjsjann123 Feb 11, 2022
5528331
Issue 1444 (#1462)
csarofeen Feb 11, 2022
592fd47
disabling 0-dim cuda tensor reduction/normalization (#1453)
jjsjann123 Feb 12, 2022
e634139
Do not do special handling for broadcast domains in parallel dimensio…
naoyam Feb 15, 2022
409081c
Merge remote-tracking branch 'upstream/master' into HEAD
jjsjann123 Feb 16, 2022
4a67e67
Type inference bug fix (#1443)
zasdfgbnm Feb 16, 2022
63d7bba
Add complex scalar support (#1433)
zasdfgbnm Feb 17, 2022
37e8fe5
disabling reduction fusion on size-0 tensors (#1469)
jjsjann123 Feb 22, 2022
fd941ba
Assert on zero size dimension. (#1470)
csarofeen Feb 22, 2022
91ad149
Axes patch (#1476)
jjsjann123 Feb 22, 2022
5c9c01e
patching aten failures on python tests (#1475)
jjsjann123 Feb 22, 2022
32e44c9
Fix type computation for complex abs (#1482)
zasdfgbnm Feb 23, 2022
1716aff
Int bool tensor support (#1479)
jjsjann123 Feb 23, 2022
e29574a
Fix RAW placement in outer most scope. (#1474)
csarofeen Feb 23, 2022
f4b2b64
Fixing codegen of fused warp reduce and broadcast (#1483)
naoyam Feb 23, 2022
573fecc
Add tensor.view(dtype) overload support (#1481)
zasdfgbnm Feb 23, 2022
1b916ac
Enable some C++ tests for complex (#1472)
zasdfgbnm Feb 23, 2022
80c140a
Minor opinfo fixes (#1478)
jjsjann123 Feb 23, 2022
9615362
Casted alias (#1480)
jjsjann123 Feb 23, 2022
fa12901
Remove some NOLINTNEXTLINE (#1485)
zasdfgbnm Feb 23, 2022
5859c76
Squeeze scalar tensor (#1489)
jjsjann123 Feb 24, 2022
4dc7e39
quick patch to drop aliased output from pushing to stack (#1471)
jjsjann123 Feb 24, 2022
2466430
Merge remote-tracking branch 'upstream/master' into HEAD
jjsjann123 Feb 24, 2022
5fdf2f9
Debug improvements (#1328)
jjsjann123 Feb 24, 2022
09495bb
Parser update on size 0 (#1490)
jjsjann123 Feb 25, 2022
8180436
Minor cleanup in runtime files. (#1465)
csarofeen Feb 26, 2022
2d08fce
Fix vector reset for double buffered tensor on registers (#1491)
shmsong Feb 26, 2022
7fcec1a
WAR For Issue #1487 (#1492)
csarofeen Feb 26, 2022
fca0186
Minor fix for trivial reductions. (#1496)
csarofeen Feb 28, 2022
8ee6e92
Fix concrete domain selection with view rfactor domains (#1494)
naoyam Feb 28, 2022
3b5364f
Fixes to enable view fusion in LTC (#1451)
rdspring1 Feb 28, 2022
4796cd3
Merge remote-tracking branch 'upstream/master' into HEAD
jjsjann123 Mar 1, 2022
96e3160
Merge remote-tracking branch 'origin/devel' into HEAD
jjsjann123 Mar 1, 2022
ba2f501
Reshape fix (#1499)
jjsjann123 Mar 1, 2022
b02fbbd
Merge pull request #1468 from csarofeen/upstream_master_bump_021622
csarofeen Mar 1, 2022
e4dde70
Merge remote-tracking branch 'upstream/master' into HEAD
jjsjann123 Mar 2, 2022
293a225
code cleaning
jjsjann123 Mar 2, 2022
c345c14
fixing ci failures
jjsjann123 Mar 2, 2022
2088e09
Merge remote-tracking branch 'upstream/master' into nvfuser_code_bump…
jjsjann123 Mar 2, 2022
e92d791
clang-format
jjsjann123 Mar 2, 2022
2451f0a
adding native_batch_norm_backward into parser (#1501)
jjsjann123 Mar 3, 2022
922a480
fixing suffix for cuda tests
jjsjann123 Mar 4, 2022
0e3d897
Merge remote-tracking branch 'upstream/master' into nvfuser_code_bump…
jjsjann123 Mar 4, 2022
41543ee
Minor fix on reference replay (#1505)
shmsong Mar 7, 2022
34f8eb9
InstanceNorm Channels Last 3D Benchmarks + InstanceNormBackward (#1438)
eqy Mar 8, 2022
8ad176e
Merge remote-tracking branch 'upstream/master' into nvfuser_code_bump…
jjsjann123 Mar 9, 2022
8b9e069
Merge remote-tracking branch 'csarofeen/devel' into nvfuser_code_bump…
jjsjann123 Mar 9, 2022
54ee465
revert CI workflow changes
jjsjann123 Mar 9, 2022
5c814c7
Global memory communication (#1484)
csarofeen Mar 9, 2022
a38a920
Fused grid reduction and broadcast (#1495)
naoyam Mar 11, 2022
3994af8
Change default gelu to use erf() instead of normcdf() (#1513)
rdspring1 Mar 14, 2022
37e4060
Print strides of input and output tensors (#1520)
naoyam Mar 15, 2022
69b760f
fixing contiguous on broadcasted dimension (#1519)
jjsjann123 Mar 15, 2022
70ce86a
Disallow passing immediate scalars as inputs (#1522)
naoyam Mar 16, 2022
4dec550
Fixing test_linear to use a 3-D input tensor instead of 2-D one. The…
kevinstephano Mar 16, 2022
f65601a
Select output reference tensor correctly for pointwise scheduler (#1500)
rdspring1 Mar 16, 2022
21d0ac6
Make sure WAR syncs for gather ops are properly generated. (#1516)
csarofeen Mar 17, 2022
3191bd3
Relaxing validation of aligned vectorization (part 1) (#1508)
naoyam Mar 17, 2022
9522b1b
Disable index hoisting with PYTORCH_NVFUSER_DISABLE_INDEX_HOIST (#1526)
naoyam Mar 18, 2022
5ba9343
Fixes #1523 (#1525)
naoyam Mar 18, 2022
8eb5e9a
fixing top level CMakeLists.txt
jjsjann123 Mar 21, 2022
29693a6
moving test/cpp/jit/[nvfuser_tests] to torch/csrc/jit/codegen/cuda/te…
jjsjann123 Mar 21, 2022
4f88085
Merge remote-tracking branch 'upstream/master' into HEAD
jjsjann123 Mar 21, 2022
6df7b77
Mma operator and volta mma integration (#1439)
shmsong Mar 21, 2022
dc28c22
Merge remote-tracking branch 'csarofeen/devel' into HEAD
jjsjann123 Mar 21, 2022
dd6d838
Fix Issue 1502 (#1503)
rdspring1 Mar 22, 2022
9454c5d
hot-patching fused_reduction string length
jjsjann123 Mar 22, 2022
2b96b42
Merge remote-tracking branch 'origin/devel' into HEAD
jjsjann123 Mar 22, 2022
2a433e5
Revert "hot-patching fused_reduction string length"
jjsjann123 Mar 22, 2022
d5589fd
concatenation in stringify
jjsjann123 Mar 22, 2022
6c08642
lint
jjsjann123 Mar 22, 2022
506c436
flake8
jjsjann123 Mar 22, 2022
6bccf87
Merge remote-tracking branch 'upstream/master' into nvfuser_code_bump…
jjsjann123 Mar 22, 2022
a8c1bdb
fixing build with include quote
jjsjann123 Mar 22, 2022
5af17a1
silence clang-tidy warning on goto from gtest
jjsjann123 Mar 22, 2022
63d8a8f
clang-tidy
jjsjann123 Mar 22, 2022
7d97b52
clang-format
jjsjann123 Mar 22, 2022
4f19cbb
clang-tidy one last time
jjsjann123 Mar 22, 2022
4537f6c
Merge remote-tracking branch 'upstream/master' into HEAD
jjsjann123 Mar 28, 2022

Files changed

2 changes: 2 additions & 0 deletions benchmarks/cpp/nvfuser/CMakeLists.txt
@@ -10,6 +10,8 @@ if(USE_CUDA)
instance_norm.cpp
layer_norm.cpp
layer_norm_backward.cpp
rms_norm.cpp
rms_norm_backward.cpp
lstm_cell.cpp
reduction.cpp
softmax.cpp
138 changes: 119 additions & 19 deletions benchmarks/cpp/nvfuser/instance_norm.cpp
@@ -14,12 +14,18 @@

using namespace torch::jit::fuser::cuda;

static void setupInstanceNorm(Fusion* fusion, DataType dtype) {
static void setupInstanceNorm(
Fusion* fusion,
DataType dtype,
bool channels_last_3d = false) {
TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half);

FusionGuard fg(fusion);

auto input = makeContigTensor(4, dtype);
if (channels_last_3d) {
input = makeContigTensor(5, dtype);
}
auto weight = makeContigTensor(1, dtype);
auto bias = makeContigTensor(1, dtype);
auto running_mean = makeContigTensor(1, DataType::Float);
@@ -51,7 +57,8 @@ static void setupInstanceNorm(Fusion* fusion, DataType dtype) {
running_var,
kTraining,
momentum_ptr,
eps_ptr);
eps_ptr,
channels_last_3d);

auto output = unaryOp(UnaryOpType::Relu, norm.output);

@@ -67,7 +74,8 @@ static void setupInstanceNorm(Fusion* fusion, DataType dtype) {
static void NvFuserScheduler_InstanceNorm(
benchmark::State& benchmark_state,
FusionExecutorCache* fusion_executor_cache,
DataType dtype) {
DataType dtype,
bool channels_last_3d = false) {
TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half);

std::vector<int64_t> input_shape{
@@ -76,27 +84,37 @@ static void NvFuserScheduler_InstanceNorm(
benchmark_state.range(1),
benchmark_state.range(1)};

std::vector<int64_t> input_shape_3d{
benchmark_state.range(0),
benchmark_state.range(1),
benchmark_state.range(1),
benchmark_state.range(1),
benchmark_state.range(2)};

// inputs
at::manual_seed(0);
auto options =
at::TensorOptions().dtype(data_type_to_aten(dtype)).device(at::kCUDA, 0);
auto fp32_options =
at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
at::Tensor at_x = at::randn(input_shape, options);
at::Tensor at_weight = at::ones({input_shape[1]}, options);
at::Tensor at_bias = at::zeros({input_shape[1]}, options);
at::Tensor at_mean = at::zeros({input_shape[1]}, fp32_options);
at::Tensor at_var = at::ones({input_shape[1]}, fp32_options);
at::Tensor at_x =
at::randn(channels_last_3d ? input_shape_3d : input_shape, options);
at::Tensor at_weight = at::ones({benchmark_state.range(2)}, options);
at::Tensor at_bias = at::zeros({benchmark_state.range(2)}, options);
at::Tensor at_mean = at::zeros({benchmark_state.range(2)}, fp32_options);
at::Tensor at_var = at::ones({benchmark_state.range(2)}, fp32_options);

std::vector<c10::IValue> aten_inputs = {
at_x, at_weight, at_bias, at_mean, at_var};
std::vector<at::Tensor> outputs;

runBenchmarkIterations(benchmark_state, fusion_executor_cache, aten_inputs);

const size_t kSize =
input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3];
const size_t kChannels = input_shape[1];
const size_t kSize = channels_last_3d
? input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3] *
input_shape[4]
: input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3];
const size_t kChannels = benchmark_state.range(2);

// Read: x, weight, bias, running_mean, running_var
// Write: y, running_mean, running_var
@@ -108,14 +126,23 @@

static void Baseline_InstanceNorm(
benchmark::State& benchmark_state,
DataType dtype) {
DataType dtype,
bool channels_last_3d = false) {
TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half);

std::vector<int64_t> input_shape{
benchmark_state.range(0),
benchmark_state.range(2),
benchmark_state.range(1),
benchmark_state.range(1)};
std::vector<int64_t> input_shape_3d{
benchmark_state.range(0),
benchmark_state.range(2),
benchmark_state.range(1),
benchmark_state.range(1),
benchmark_state.range(1),
};

const float kMomentum = 0.1;
const float kEps = 1e-5;
const auto aten_dtype = data_type_to_aten(dtype);
@@ -126,10 +153,15 @@ static void Baseline_InstanceNorm(
at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);

at::Tensor at_x = at::randn(input_shape, options);
at::Tensor at_weight = at::ones({input_shape[1]}, options);
at::Tensor at_bias = at::zeros({input_shape[1]}, options);
at::Tensor at_mean = at::zeros({input_shape[1]}, fp32_options);
at::Tensor at_var = at::ones({input_shape[1]}, fp32_options);
if (channels_last_3d) {
at_x = at::randn(
input_shape_3d,
options.memory_format(c10::MemoryFormat::ChannelsLast3d));
}
at::Tensor at_weight = at::ones({benchmark_state.range(2)}, options);
at::Tensor at_bias = at::zeros({benchmark_state.range(2)}, options);
at::Tensor at_mean = at::zeros({benchmark_state.range(2)}, fp32_options);
at::Tensor at_var = at::ones({benchmark_state.range(2)}, fp32_options);

auto ato_weight = c10::optional<at::Tensor>(at_weight);
auto ato_bias = c10::optional<at::Tensor>(at_bias);
@@ -159,9 +191,11 @@ static void Baseline_InstanceNorm(
cudaDeviceSynchronize();
}

const size_t kSize =
input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3];
const size_t kChannels = input_shape[1];
const size_t kSize = channels_last_3d
? input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3] *
input_shape[4]
: input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3];
const size_t kChannels = benchmark_state.range(2);

// Read: x, weight, bias, running_mean, running_var
// Write: y, running_mean, running_var
@@ -181,6 +215,11 @@ static void Baseline_InstanceNorm_fp16(benchmark::State& benchmark_state) {
Baseline_InstanceNorm(benchmark_state, DataType::Half);
}

static void Baseline_InstanceNorm_fp32_channels_last_3d(
benchmark::State& benchmark_state) {
Baseline_InstanceNorm(benchmark_state, DataType::Float, true);
}

//------------------------------------------------------------------------------

NVFUSER_BENCHMARK_DEFINE(
@@ -195,6 +234,43 @@ NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm_fp32)
->Unit(benchmark::kMicrosecond)
->UseManualTime();

NVFUSER_BENCHMARK_DEFINE(
NvFuserScheduler_InstanceNorm3d_channels_last_fp32,
setupInstanceNorm,
NvFuserScheduler_InstanceNorm,
DataType::Float,
true);

NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm3d_channels_last_fp32)
->RangeMultiplier(2)
->Ranges({{1, 8}, {128, 128}, {32, 32}})
->Unit(benchmark::kMicrosecond)
->UseManualTime();

NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm3d_channels_last_fp32)
->RangeMultiplier(2)
->Ranges({{1, 8}, {64, 64}, {64, 64}})
->Unit(benchmark::kMicrosecond)
->UseManualTime();

NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm3d_channels_last_fp32)
->RangeMultiplier(2)
->Ranges({{1, 8}, {32, 32}, {128, 128}})
->Unit(benchmark::kMicrosecond)
->UseManualTime();

NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm3d_channels_last_fp32)
->RangeMultiplier(2)
->Ranges({{1, 8}, {16, 16}, {256, 256}})
->Unit(benchmark::kMicrosecond)
->UseManualTime();

NVFUSER_BENCHMARK_RUN(NvFuserScheduler_InstanceNorm3d_channels_last_fp32)
->RangeMultiplier(2)
->Ranges({{1, 8}, {4, 8}, {320, 320}})
->Unit(benchmark::kMicrosecond)
->UseManualTime();

NVFUSER_BENCHMARK_DEFINE(
NvFuserScheduler_InstanceNorm_fp16,
setupInstanceNorm,
@@ -220,4 +296,28 @@ BENCHMARK(Baseline_InstanceNorm_fp16)
->Unit(benchmark::kMicrosecond)
->UseManualTime();

BENCHMARK(Baseline_InstanceNorm_fp32_channels_last_3d)
->RangeMultiplier(2)
->Ranges({{2, 8}, {128, 128}, {32, 32}})
->Unit(benchmark::kMicrosecond)
->UseManualTime();

BENCHMARK(Baseline_InstanceNorm_fp32_channels_last_3d)
->RangeMultiplier(2)
->Ranges({{2, 8}, {64, 64}, {64, 64}})
->Unit(benchmark::kMicrosecond)
->UseManualTime();

BENCHMARK(Baseline_InstanceNorm_fp32_channels_last_3d)
->RangeMultiplier(2)
->Ranges({{2, 8}, {16, 16}, {256, 256}})
->Unit(benchmark::kMicrosecond)
->UseManualTime();

BENCHMARK(Baseline_InstanceNorm_fp32_channels_last_3d)
->RangeMultiplier(2)
->Ranges({{2, 8}, {4, 8}, {320, 320}})
->Unit(benchmark::kMicrosecond)
->UseManualTime();

//------------------------------------------------------------------------------
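
Editor's note on the registrations above: they sweep a (batch, spatial, channels) grid, where benchmark_state.range(0) supplies the batch size, range(1) the spatial extent, and range(2) the channel count. With RangeMultiplier(2), a registration such as Ranges({{1, 8}, {128, 128}, {32, 32}}) expands the first pair geometrically (1, 2, 4, 8) while the single-valued pairs stay fixed. As a rough illustration of what the new channels_last_3d baseline path measures, here is a minimal standalone ATen sketch; it is not part of the PR, and the shapes, flags, and byte estimate are chosen for illustration only:

// Standalone sketch of the channels-last-3d baseline path: NCDHW sizes with a
// ChannelsLast3d memory format, mirroring Baseline_InstanceNorm above.
#include <ATen/ATen.h>
#include <c10/core/MemoryFormat.h>
#include <cuda_runtime.h>

#include <cstdio>

int main() {
  // One (batch, spatial, channels) point from the grids above.
  const int64_t n = 8, s = 128, c = 32;

  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);

  at::Tensor x = at::randn(
      {n, c, s, s, s},
      options.memory_format(c10::MemoryFormat::ChannelsLast3d));
  at::Tensor weight = at::ones({c}, options);
  at::Tensor bias = at::zeros({c}, options);
  at::Tensor running_mean = at::zeros({c}, options);
  at::Tensor running_var = at::ones({c}, options);

  // use_input_stats=true plays the role of kTraining in the fused setup.
  at::Tensor y = at::instance_norm(
      x, weight, bias, running_mean, running_var,
      /*use_input_stats=*/true, /*momentum=*/0.1, /*eps=*/1e-5,
      /*cudnn_enabled=*/false);
  cudaDeviceSynchronize();

  // Rough bytes-moved estimate in the spirit of the benchmark's accounting:
  // read x, weight, bias, running_mean, running_var; write y and the stats.
  const int64_t elems = n * c * s * s * s;
  const int64_t bytes = (2 * elems + 6 * c) * int64_t(sizeof(float));
  std::printf("approx bytes per iteration: %lld\n", (long long)bytes);
  return 0;
}

The fused nvFuser variant differs only in how the layout is expressed: it takes a contiguous 5-D tensor whose last logical dimension is the channel count (makeContigTensor(5, dtype) with an NDHWC-shaped input), rather than an NCDHW tensor carrying a channels-last memory format.
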
4 changes: 2 additions & 2 deletions benchmarks/cpp/nvfuser/layer_norm.cpp
@@ -46,8 +46,8 @@ static void setupLayerNorm(Fusion* fusion, DataType dtype) {

auto output = layer_norm_results.output;

if (dtype == DataType::Half) {
output = castOp(DataType::Half, output);
if (dtype != DataType::Float) {
output = castOp(dtype, output);
}

fusion->addOutput(output);
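
Editor's note: both layer-norm benchmark setups now cast their outputs back to the requested I/O dtype whenever that dtype is not Float, instead of special-casing Half, so other reduced-precision dtypes (e.g. BFloat16) are handled as well; the backward benchmark below applies the same generalization to all three gradients. A minimal sketch of the shared idea follows, assuming the nvFuser headers these benchmark files already pull in (castOp and DataType come from the codegen/cuda arith and type headers); the helper name is hypothetical and not part of the PR:

// Sketch only: compute stays in Float, and results are cast back to the
// benchmark's I/O dtype unless that dtype is already Float.
#include <torch/csrc/jit/codegen/cuda/arith.h>
#include <torch/csrc/jit/codegen/cuda/type.h>

namespace {

using namespace torch::jit::fuser::cuda;

TensorView* castToIoType(TensorView* tv, DataType dtype) {
  return dtype != DataType::Float ? castOp(dtype, tv) : tv;
}

} // namespace

With such a helper, setupLayerNorm reduces to output = castToIoType(output, dtype), and setupLayerNorm_BWD would apply the same call to grad_input, grad_weight, and grad_bias.
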
9 changes: 4 additions & 5 deletions benchmarks/cpp/nvfuser/layer_norm_backward.cpp
@@ -61,13 +61,12 @@ static void setupLayerNorm_BWD(Fusion* fusion, DataType dtype) {
auto layer_norm_results = layer_norm_backward(
grad_out, input, {1}, mean, rstd, weight, bias, {true, true, true});

if (dtype == DataType::Half) {
if (dtype != DataType::Float) {
layer_norm_results.grad_input =
castOp(DataType::Half, layer_norm_results.grad_input);
layer_norm_results.grad_bias =
castOp(DataType::Half, layer_norm_results.grad_bias);
castOp(dtype, layer_norm_results.grad_input);
layer_norm_results.grad_bias = castOp(dtype, layer_norm_results.grad_bias);
layer_norm_results.grad_weight =
castOp(DataType::Half, layer_norm_results.grad_weight);
castOp(dtype, layer_norm_results.grad_weight);
}

fusion->addOutput(layer_norm_results.grad_input);