slow test failure: test_nvfuser_extremal_values_masked_amin_cuda_float32

### 🐛 Describe the bug

To reproduce:
```
PYTORCH_TEST_WITH_SLOW=1 python test/test_jit_cuda_fuser.py -v -k test_nvfuser_extremal_values_masked_amin_cuda_float32
```

Fusion math is:
```
Inputs:
  T0_g[ iS0{i1}, iS1{i2}, bS2{1}, iS3{i4} ], float
  T1_g[ 0 ], float
  T2_g[ iS4{i5}, iS5{i6}, bS6{1}, iS7{i7} ], bool
Outputs:
  T9_g[ 0 ], float

%kernel_math {
T3_l[ iS8{i5}, iS9{i6}, bS10{1}, iS11{i7} ]
   = T2_g[ iS4{i5}, iS5{i6}, bS6{1}, iS7{i7} ];
T4_l[ iS12{i5}, iS13{i6}, bS14{1}, iS15{i7} ]
   = T3_l[ iS8{i5}, iS9{i6}, bS10{1}, iS11{i7} ];
T5_l[ bS16{1}, bS17{1}, bS18{1}, bS19{1} ]
   = broadcast( T1_g[ 0 ] )
T6_l[ iS20{i5}, iS21{i6}, bS22{1}, iS23{i7} ]
   = where(T4_l[ iS12{i5}, iS13{i6}, bS14{1}, iS15{i7} ]
  , T0_g[ iS0{i1}, iS1{i2}, bS2{1}, iS3{i4} ]
  , T5_l[ bS16{1}, bS17{1}, bS18{1}, bS19{1} ]);
T7_l[ iS24{i5}, iS25{i6}, iS26{i7} ]
   = squeeze( T6_l[ iS20{i5}, iS21{i6}, bS22{1}, iS23{i7} ] )
T8_l[ rS27{i5}, rS28{i6}, rS29{i7} ]
   = reduction( T7_l[ iS24{i5}, iS25{i6}, iS26{i7} ], op = fmin, initial value = double(inf), allreduce = false )
T9_g[ 0 ]
   = T8_l[ rS27{i5}, rS28{i6}, rS29{i7} ];
}
```

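However, the given input T2 has shape `(3, 2, 1, 1)`. Its last dimension was not marked as broadcast: in the fusion it appears as the iteration domain `iS7{i7}` rather than as a `bS` domain. I think that, as a result, our system assumes `i4 == i7` during codegen and generates code based on that.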
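To make the suspected mismatch concrete, here is a minimal pure-Python sketch. It is not nvFuser code: `domain_kinds`, `broadcast_index`, and `masked_amin` are hypothetical helpers written only for this illustration, and the T0 shape `(3, 2, 1, 4)` and the fill value are assumptions, since the issue only states the shape of T2. The sketch labels axes from concrete sizes and evaluates `where` followed by a full min reduction using the usual size-1 broadcasting rule.

```python
from itertools import product


def domain_kinds(shape):
    """Label each axis 'b' (extent 1, broadcast) or 'i' (iteration) from concrete sizes."""
    return ["b" if extent == 1 else "i" for extent in shape]


def broadcast_index(index, shape):
    """Map an output index to an input index by pinning size-1 axes to 0."""
    return tuple(0 if extent == 1 else i for i, extent in zip(index, shape))


def masked_amin(data, data_shape, mask, mask_shape, fill):
    """Reference for where(mask, data, fill) followed by a min over all axes.

    `data` and `mask` are dicts keyed by index tuples.
    """
    out_shape = tuple(max(d, m) for d, m in zip(data_shape, mask_shape))
    result = float("inf")  # same initial value as the fmin reduction in the dump
    for index in product(*(range(n) for n in out_shape)):
        keep = mask[broadcast_index(index, mask_shape)]
        value = data[broadcast_index(index, data_shape)] if keep else fill
        result = min(result, value)
    return result


# T2's shape comes from the issue; T0's shape is an assumption for illustration.
t0_shape = (3, 2, 1, 4)
t2_shape = (3, 2, 1, 1)

t0 = {idx: float(100 * idx[0] + 10 * idx[1] + idx[3])
      for idx in product(*(range(n) for n in t0_shape))}
# Mask out every element whose first index is 0.
t2 = {idx: idx[0] != 0 for idx in product(*(range(n) for n in t2_shape))}

t2_kinds = domain_kinds(t2_shape)
reference = masked_amin(t0, t0_shape, t2, t2_shape, fill=float("inf"))
```

With these inputs, `t2_kinds` marks both trailing axes of T2 as broadcast, whereas the fusion dump above treats the last axis as `iS7{i7}`. A kernel that indexes T2 along that axis as if its extent matched T0's would read past the single element that `broadcast_index` pins to 0 here.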

### Versions

TOT devel

slow test failure: test_nvfuser_extremal_values_masked_amin_cuda_float32 #2169

🐛 Describe the bug

Versions

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

slow test failure: test_nvfuser_extremal_values_masked_amin_cuda_float32 #2169

Description

🐛 Describe the bug

Versions

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions