Conversation

@danielzgtg (Collaborator) commented Aug 10, 2025

perf_battery -mp Kokoro_espeak.gguf -nt 4:

  • Before: 1586.519654 ms, 87.7512%
  • After: 1475.965502 ms, 81.6744%
  • Overall: (-110.554152 ms, -6.97%, -6.0768pp).

TODO: Does not work yet. The output sound is all wrong.

ggml_vec_dot_f32, as used by ggml_compute_forward_mul_mat, is surprisingly slow here. Im2col produces massive matmuls like [8281,1408]@[1408,128], whose working set spills all the way into L3 cache.

This PR aims to keep the per-channel kernels in L1d cache. For this, I just used a naïve matmul, since our channel count of 128 is large, and performed the convolution by accumulating in a sliding window. The input data is only read once. The targeted bottleneck convolutions are kokoro\.decoder\.generator\.noise_blocks\.\d\.resblock\.\d\.convs\d_weight.
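One plausible loop ordering for such a direct (im2col-free) convolution can be sketched as below. The function name, the [OC][K][IC] weight layout, and the valid-padding simplification are illustrative assumptions, not the actual ggml kernel: with each output channel processed in turn, its K*IC*4-byte kernel slab (about 5.5 KiB at K=11, IC=128) stays hot in L1d while the input is streamed past it, and no duplicated im2col buffer is ever materialized.

```c
/* Hypothetical sketch of a direct 1-D convolution using the transposed
 * [OC][K][IC] weight layout described in the PR. Per output channel the
 * kernel slab is K*IC*4 bytes (~5.5 KiB at K=11, IC=128), which fits in
 * L1d. "Valid" convolution only; padding/stride/dilation are omitted for
 * brevity. Not the actual ggml implementation. */
#include <stddef.h>

void conv1d_direct(const float *w,   /* [OC][K][IC], transposed layout   */
                   const float *x,   /* [T][IC], time-major input        */
                   float *y,         /* [T_out][OC], T_out = T - K + 1   */
                   int OC, int IC, int K, int T) {
    int T_out = T - K + 1;           /* valid convolution, no padding    */
    for (int oc = 0; oc < OC; ++oc) {
        const float *w_oc = w + (size_t)oc * K * IC;  /* hot in L1d     */
        for (int t = 0; t < T_out; ++t) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                const float *xk = x + (size_t)(t + k) * IC;
                const float *wk = w_oc + (size_t)k * IC;
                for (int ic = 0; ic < IC; ++ic)
                    acc += wk[ic] * xk[ic];           /* naïve dot      */
            }
            y[(size_t)t * OC + oc] = acc;
        }
    }
}
```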

@danielzgtg (Collaborator, Author) commented

#96 inspired this, but the targeted weights are different. I plan to make breaking changes later to Kokoro_GGUF, but not in this PR.

  • convs[12]_weight here is currently [128,128,K∈{3,7,11}]. This PR transposes it at runtime to [128,K,128], and I'd like to make this permanent.
  • decoder_blocks\.[012]\.conv1_weight there is [1024,IC=1090,3]. IC is not a multiple of 32, and this prevents Q[458] quantization even after transposing. We can just pad that up to 1120 and perhaps use a view to discard the extra elements afterwards.
  • decoder_blocks\.[012]\.conv2_weight does not have the odd IC size problem, and should be as easy as convs[12]_weight.
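The two follow-up items above could look roughly like the following helper. The function name, the source [OC][IC][K] layout, and the zero-fill policy are assumptions for illustration, not actual Kokoro_GGUF code: the weight is transposed to [OC][K][IC_pad] with IC rounded up to a multiple of 32 (1090 → 1120) so block-quantized rows stay aligned.

```c
/* Hypothetical helper: transpose a conv weight from [OC][IC][K] to
 * [OC][K][IC_pad], where IC_pad rounds IC up to a multiple of 32
 * (1090 -> 1120). The padding tail is zero-filled by calloc so it is
 * inert in dot products. Not actual Kokoro_GGUF code. */
#include <stdlib.h>

static int round_up32(int n) { return (n + 31) / 32 * 32; }

float *transpose_pad_weight(const float *w, int OC, int IC, int K) {
    int IC_pad = round_up32(IC);
    float *out = calloc((size_t)OC * K * IC_pad, sizeof(float));
    if (!out) return NULL;
    for (int oc = 0; oc < OC; ++oc)
        for (int ic = 0; ic < IC; ++ic)
            for (int k = 0; k < K; ++k)
                out[((size_t)oc * K + k) * IC_pad + ic] =
                    w[((size_t)oc * IC + ic) * K + k];
    return out;
}
```

A view over the first IC columns of each [IC_pad] row would then discard the padding after the matmul, as the comment above suggests.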
