llama : add thread safety test #14035
Conversation
llama : ignore main_gpu <= 0 if there are no GPUs ggml-ci
Maybe we can use an even smaller model for this test:
The SYCL ggml-ci does not seem to have libcurl installed yet.
Should be installed now.
Branch updated from 2c5874e to a2a0289, then from a2a0289 to b046f0c.
There is some issue with this model (stories15M-q4_0.gguf) on CPU, but I don't think it is a threading issue. It only seems to happen on CPUs with AVX512.
I looked into it a bit and it does not seem to happen if OpenMP is disabled. I think it is related to the repacking, but I didn't confirm. I'll take another look now.
I'm pretty sure this is a data race, because the chunk counter is shared by all contexts: llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp, lines 395 to 398 in 487a5e0.
If I disable … @Djip007 could you take a look and propose a fix?
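For readers who have not looked at that code path, here is a minimal sketch of the kind of race being described; the names and structure are illustrative and are not the actual sgemm.cpp code:

```cpp
// Illustrative only: NOT the actual sgemm.cpp code, just the shape of the
// race described above. A single static chunk counter is reset at the start
// of every matmul and consumed by the worker threads; when two llama
// contexts run matmuls concurrently, both reset and advance the same
// counter, so chunks get skipped or processed twice.
#include <atomic>
#include <cstdint>

static std::atomic<int64_t> current_chunk;  // shared by every context in the process

void matmul_shared_counter(int64_t nchunks, int ith, int nth) {
    if (ith == 0) {
        current_chunk.store(nth);  // another context may overwrite this at any time
    }
    // ... barrier ...
    int64_t c = ith;
    while (c < nchunks) {
        // ... compute chunk c ...
        c = current_chunk.fetch_add(1);
    }
}

// One possible direction for a fix: make the counter part of per-call (or
// per-context) state instead of a static, so independent matmuls never
// share scheduling state.
struct mm_state {
    std::atomic<int64_t> current_chunk{0};
};

void matmul_local_counter(mm_state & st, int64_t nchunks, int ith, int nth) {
    if (ith == 0) {
        st.current_chunk.store(nth);
    }
    // ... barrier ...
    int64_t c = ith;
    while (c < nchunks) {
        // ... compute chunk c ...
        c = st.current_chunk.fetch_add(1);
    }
}
```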
ggml-ci
429 is "too many requests". @ngxson do you know if it is a temporary issue with Hugging Face, or are we being throttled?
The HF backend currently has a problem; the team is investigating. It should be back very soon.
@0cc4m @jeffbolznv The Vulkan backend is crashing on this test. It happens even with a single context per model (…).
It is known that the Vulkan backend is not thread-safe yet, yes.
Are you planning to disable all Vulkan CI coverage due to this one failing test?
I don't think that disabling the tests is the best option, but if I don't do that, people are going to complain that the CI is failing on every PR. I guess I could disable just this test on the Vulkan CI, but that would just make it easier to ignore this bug.
@0cc4m as a short-term fix, would it be crazy to just grab a mutex in most/all of the ggml backend entry points? I tried that and it sort of works... I still get corruption sometimes in the output. But then again, I also get corruption sometimes using the CUDA backend, so I'm not sure if this is the fault of ggml-vulkan. With the CUDA backend I sometimes get errors like: … or …
My command line is …
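A rough illustration of the stop-gap being suggested, assuming nothing about the real ggml-vulkan entry points (the function names below are placeholders):

```cpp
// Placeholder names only -- these are not the real ggml-vulkan entry points.
// The idea is simply to serialize every call into a backend that is not yet
// thread-safe behind one process-wide mutex.
#include <mutex>

static std::mutex g_backend_mutex;

void backend_graph_compute(/* backend, graph */) {
    std::lock_guard<std::mutex> lock(g_backend_mutex);
    // ... existing single-threaded compute path ...
}

void backend_set_tensor(/* buffer, tensor, data, offset, size */) {
    std::lock_guard<std::mutex> lock(g_backend_mutex);
    // ... existing single-threaded upload path ...
}
```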
The CUDA errors that I could reproduce should have been fixed in #14033. There may be other issues still, but none that I could reproduce on my system.
Are those fixes included in this branch? I just fetched …
It should be merged now; it wasn't before.
Strange, I rebuilt and I'm still seeing the same failures at about the same rate (maybe 1 in 4 attempts). Which operation it reports as failing looks random.
If that helps, we could do that, but the problem is that not all relevant resources of the backend are stored in relation to the backend context yet, so multiple contexts can use the same descriptors, for example. It's annoying to shift these resources around in a way that enables this, but maybe it is time to look at it.
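To make that point concrete, here is a hypothetical before/after sketch of what "storing resources in relation to the backend context" could look like; the types and names are invented for illustration and do not match ggml-vulkan's actual structures:

```cpp
// Hypothetical sketch only; not ggml-vulkan's actual structures.
struct descriptor_pool;  // stand-in for a descriptor pool wrapper

// before: one pool effectively global to the backend, so two contexts can
// end up handing the same descriptors to the GPU concurrently
static descriptor_pool * g_shared_pool;

// after: each backend context owns its own pool (and, eventually, its own
// queues, staging buffers, pipelines, ...), created and destroyed with the
// context
struct backend_context {
    descriptor_pool * pool = nullptr;
};
```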
I was able to reproduce the CUDA issue. It only happens with the additional instance of the model that is (intended to be) run on the CPU only. I had done most of my testing before adding that instance, and I didn't expect it to cause issues with CUDA since the goal was mainly to test llama.cpp, so I didn't catch it earlier. I tried a few things, but even with …
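For context, a "CPU-only instance" here is assumed to mean an extra model copy loaded with no layers offloaded; a minimal sketch against a recent llama.h (this is not the test's actual code):

```cpp
// Assumption: a CPU-only instance is just an extra model copy loaded with
// no layers offloaded. Sketch against a recent llama.h; not the test's code.
#include "llama.h"

llama_model * load_cpu_only_copy(const char * path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0;  // keep all layers on the CPU backend
    return llama_model_load_from_file(path, mparams);
}
```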
After a hundred test runs, I can put in my two penny worth: I have commented out two lines to reduce the number of simultaneously running threads and avoid thread starvation:
Run test-thread-safety on:
...with args:
... to finally abort:
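One way to picture the "fewer simultaneous threads" adjustment, as a hedged sketch rather than the test's real code: cap how many inference jobs run at once so each context's worker threads are not starved of cores.

```cpp
// Hedged sketch, not the test's real code: run the inference jobs in waves
// no larger than the number of hardware threads.
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

void run_in_waves(int n_jobs, const std::function<void(int)> & job) {
    const int cap = std::max(1u, std::thread::hardware_concurrency());
    for (int start = 0; start < n_jobs; start += cap) {
        std::vector<std::thread> wave;
        for (int i = start; i < std::min(n_jobs, start + cap); ++i) {
            wave.emplace_back(job, i);  // job(i) runs on its own thread
        }
        for (auto & t : wave) {
            t.join();  // wait for the wave before starting the next one
        }
    }
}
```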
OK, I'll take a look at this, as a start, and see how far it gets us. |
It seems to be working now with Vulkan, in my tests. |
This reverts commit 9381f4e.
when main_gpu < 0 GPU devices are not used
A basic thread safety test that loads a copy of the model on each GPU and on the CPU, and runs inference with multiple contexts in different threads.
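A minimal sketch of the shape of such a test, assuming a recent llama.h (function names differ between revisions, and this is not the actual test source): each thread creates its own llama_context on a shared model and runs a short decode loop; the real test additionally loads one model copy per device.

```cpp
// Sketch only: several threads decode concurrently, each with its own
// llama_context on a shared model.
#include <thread>
#include <vector>
#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        return 1;
    }
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (model == nullptr) {
        return 1;
    }

    const int n_threads = 4;  // number of concurrent contexts to exercise
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([model] {
            llama_context_params cparams = llama_context_default_params();
            llama_context * ctx = llama_init_from_model(model, cparams);

            // decode a tiny fixed input a few times; the content does not
            // matter, only that several contexts decode concurrently
            llama_token tok = llama_vocab_bos(llama_model_get_vocab(model));
            for (int iter = 0; iter < 8; ++iter) {
                llama_batch batch = llama_batch_get_one(&tok, 1);
                llama_decode(ctx, batch);
            }
            llama_free(ctx);
        });
    }
    for (auto & t : workers) {
        t.join();
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```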