Summary
To reproduce:

```shell
TRITON_ALWAYS_COMPILE=1 TRITON_DUMP_DIR=my_directory_2 TRITON_KERNEL_DUMP=1 pytest -s -v test/prototype/mx_formats/test_custom_cast.py -k "test_fp4_triton_unscaled_cast"
```
TTIR (bad on the left, good on the right; no real differences):
https://www.diffchecker.com/ueX5YZw4
TTGIR:
https://www.diffchecker.com/M5PS6QJg/
PTX (this is where the differences show up):
https://www.diffchecker.com/8mseNnKA/
Activity
(Title changed from "New Pytorch Triton breaks custom cast kernel" to "New Pytorch Triton breaks custom cast kernel MX")

CliveUnger commented on Mar 20, 2025
I started to take a look at this and found that it fails on both Hopper and Blackwell machines. I narrowed it down to a single culprit commit in Triton where bf16 op lowering is offloaded to LLVM instead of a custom code conversion.
Here is the commit and related PR:
Looking closer at the kernel, it seems that the non-denormal exponents are not being biased correctly. Currently I'm trying to understand what is going wrong in the vectorized LLVM code.
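For anyone following along, here is a minimal reference sketch of what "biasing the exponent" means for this cast. It assumes the fp4 target is the standard MX e2m1 layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit); it is an illustration of the expected math, not the Triton kernel's actual code path:

```python
import math

# e2m1 fp4: sign(1) | exponent(2, bias=1) | mantissa(1)
# Normal values:   (1 + m/2) * 2^(e_biased - 1)  for e_biased in {1, 2, 3}
# Denormal values: (m/2)                          when e_biased == 0
E2M1_BIAS = 1

def e2m1_encode(x: float) -> int:
    """Encode an exactly-representable float as a 4-bit e2m1 value."""
    sign = 1 if x < 0 else 0
    x = abs(x)
    if x < 1.0:
        # Denormal range: only 0.0 and 0.5 are representable.
        e_biased = 0
        m = int(round(x * 2))
    else:
        e_unbiased = int(math.floor(math.log2(x)))
        e_biased = e_unbiased + E2M1_BIAS   # <-- the biasing step in question
        m = int(round((x / 2 ** e_unbiased - 1) * 2))
    return (sign << 3) | (e_biased << 1) | m

def e2m1_decode(bits: int) -> float:
    """Decode a 4-bit e2m1 value back to a float."""
    sign = -1.0 if bits & 0b1000 else 1.0
    e_biased = (bits >> 1) & 0b11
    m = bits & 0b1
    if e_biased == 0:
        return sign * (m / 2)
    return sign * (1 + m / 2) * 2 ** (e_biased - E2M1_BIAS)
```

If the bias is dropped or applied twice on the normal path, every non-denormal value lands one binade off, which would match the symptom described above.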
davidberard98 commented on Mar 24, 2025
@drisspg @CliveUnger @danielvegamyhre any ideas why this failure didn't show up in CI?
jerryzh168 commented on May 1, 2025
@drisspg is this fixed now?