Description
I've recently compiled the 1.5.4 library with CUBLAS and I'm having an issue when running multiple whisper_full_with_state() calls concurrently.
I did not previously have this issue with the 1.5.1 library.
I re-compiled with DEBUG_CUDA_MALLOC and it produced the following output:
[29-01-2024 16:04:58:258] [INFO ] [cuda pool[0]: allocated 7680000 bytes at 302000000, used: 7680000]
[29-01-2024 16:04:58:263] [INFO ] [cuda pool[0]: allocated 7680000 bytes at 302753000, used: 15360000]
[29-01-2024 16:04:58:267] [INFO ] [cuda pool[0]: freed 7680000 bytes at 302000000]
[29-01-2024 16:04:58:272] [INFO ] [GGML_ASSERT: ggml-cuda.cu:6742: ptr == (void *) (g_cuda_pool_addr[device] + g_cuda_pool_used[device])]
I can see it failed on the assert because the buffers are not being de-allocated in reverse order. The pool is tracked "per device", but there is only one actual device (card). In this case, I had two instances running and the one that started first also finished first.
Is this some sort of "virtual" device, where each call of whisper_full_with_state() needs to specify a separate device, or is this an issue with the memory allocation itself?
I also noticed that libcuda.so is missing at link time if the driver isn't installed. The build host doesn't have a GPU, so I had to copy the library onto it manually.
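For what it's worth, the CUDA toolkit ships a linker-only stub of libcuda.so precisely for driverless build hosts, so copying the real driver library shouldn't be necessary (paths below assume a default toolkit install; the exact build invocation will depend on your setup):

```shell
# The toolkit provides a stub libcuda.so for linking on hosts
# without the NVIDIA driver or a GPU:
ls /usr/local/cuda/lib64/stubs/libcuda.so

# Point the linker at the stubs directory when building, e.g.:
export LDFLAGS="-L/usr/local/cuda/lib64/stubs"
```

The stub satisfies the link-time dependency only; at run time the real driver's libcuda.so still has to be present on the target machine.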