FlashMLA PyTorch

PyTorch implementation of FlashMLA.

FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for serving variable-length sequences. Currently released: BF16 support; paged KV cache with a block size of 64.
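
To illustrate what a paged KV cache with a block size of 64 means for variable-length sequences, here is a minimal pure-Python sketch of the indexing involved. It is not this project's API: the helper names (`blocks_needed`, `build_block_tables`, `locate`) and the simple sequential block allocation are assumptions made for illustration only.

```python
BLOCK_SIZE = 64  # block size stated in this README


def blocks_needed(seq_len: int) -> int:
    """Number of 64-token blocks required to hold seq_len tokens (hypothetical helper)."""
    return (seq_len + BLOCK_SIZE - 1) // BLOCK_SIZE


def build_block_tables(seq_lens):
    """Assign physical block ids to each sequence.

    Assumption: blocks are handed out sequentially from a single pool.
    A real allocator may reuse freed blocks in any order.
    """
    tables = []
    next_free = 0
    for n in seq_lens:
        k = blocks_needed(n)
        tables.append(list(range(next_free, next_free + k)))
        next_free += k
    return tables


def locate(block_table, pos: int):
    """Map a logical token position to (physical block id, offset within block)."""
    return block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE


if __name__ == "__main__":
    # Three sequences of different lengths share one block pool.
    tables = build_block_tables([100, 64, 1])
    print(tables)
    # Token 70 of the first sequence lives in its second block.
    print(locate(tables[0], 70))
```

Because each sequence only holds as many blocks as its length requires, short and long sequences can share one cache without padding every sequence to the longest length.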