Commit "CUDA: Quantized matrix matrix multiplication" causes assert "ggml-cuda.cu:4749: i01_high == rows_per_iter || g_device_count > 1" on Windows when vocab_size != 32000 #2484

When using CUDA on Windows with a model whose `vocab_size != 32000`, inference crashes immediately with this failed assertion:

`ggml-cuda.cu:4749: i01_high == rows_per_iter || g_device_count > 1`
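To make the failing condition easier to read, here is a minimal sketch that reproduces only the boolean expression from the assertion message. The function name `assert_condition_holds` is hypothetical, and what `i01_high` and `rows_per_iter` actually hold inside `ggml-cuda.cu` is an assumption on my part; only the expression itself comes from the message.

```cpp
#include <cstdint>

// Hypothetical helper: evaluates the same expression that appears in the
// assertion message. The parameter names are copied from that message; how
// ggml-cuda.cu computes them is NOT shown here and is not assumed.
bool assert_condition_holds(int64_t i01_high, int64_t rows_per_iter, int g_device_count) {
    // The check passes if the two row counts match, or if more than one
    // CUDA device is in use.
    return i01_high == rows_per_iter || g_device_count > 1;
}
```

Read this way, on a single-GPU setup (`g_device_count == 1`) the assertion can only pass when `i01_high == rows_per_iter`, so any mismatch between the two aborts the run.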

See https://github.com/ggerganov/llama.cpp/pull/2160#issuecomment-1660093406 for more details.  
Reverting to the commit before 11f3ca06b8c66b0427aab0a472479da22553b472 resolves the issue.  
Also, the workaround proposed in https://github.com/ggerganov/llama.cpp/pull/2160#issuecomment-1657203763 appears to work (at least for me).
