The specialized kernel seems to be slower than the generic one; see [this comment](https://github.com/FluxML/Flux.jl/pull/1921#issuecomment-1086576955).