"error":{"code":500,"message":"rpc error: code = Unknown desc = unimplemented","type":""}} #1909
Comments
Hi, I can confirm I'm getting the same issue on master (it was pulled after the v2.11 cuda cublas12-ffmpeg images became available).
I confirm the same issue. It's critical.
Can you please share the logs with
Hello @mudler, I posted some of the logs above. Would you like to see more?
@Anto79-ops your log looks incomplete; it seems it failed initially in a way that made the subsequent calls fail. Can you share the full log from the beginning of the session?
@mudler is it OK if I email/DM a text file of the logs?
I just pulled the latest master image and the problem is solved (for me, at least). Thank you!
#1981 is related. You get this error because the llama-cpp backend tries to offload the whole model to the GPU and fails because you don't have enough VRAM. A workaround might be to offload only part of your model's layers to the GPU. You need to create a model config file; you should play around with gpu_layers there and check, as in the sketch below.
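A minimal sketch of such a model config, assuming the usual LocalAI YAML layout; the filename, model file, and layer count below are placeholders to adapt:

```yaml
# models/luna-ai-llama2.yaml — hypothetical example, adjust to your setup
name: luna-ai-llama2
backend: llama-cpp
parameters:
  model: luna-ai-llama2.Q4_K_M.gguf  # placeholder model file name
gpu_layers: 20  # offload only 20 layers; lower this if VRAM still runs out
f16: true
```

Lowering gpu_layers trades speed for memory: each layer kept on the CPU frees VRAM but slows inference.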
I have this error with a custom model, NeuralHermes. I have asked for help in #1992.
Have you checked that your VRAM is enough to offload all layers? You can try to split it; a quick check is sketched below.
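A quick way to see whether VRAM is actually exhausted, assuming an NVIDIA GPU with the driver installed (run it again while the model is loading):

```sh
# Report total/used/free GPU memory in CSV form
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
```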
@JackBekket it's running on my preprod server (an NVIDIA L4); the models that come with the distro are running perfectly.
@mudler I have the answer: the file I downloaded from the raw link is just plain text 🤦
You're welcome! I'm glad you found the issue and managed to resolve it. If you need any further assistance, don't hesitate to reach out. Have a great day!
I'm having a similar issue. The following log:
I've highlighted the lines that sort of stood out to me. It would be good to have customized model files with examples using different backends.
I'm using an all-in-one container with GPU support, and when I try to generate an image, I get the following error:
2:53PM INF Success ip=10.0.32.20 latency="35.826µs" method=GET status=200 url=/static/general.css
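For reference, a sketch of the kind of request that triggers this, using the OpenAI-compatible images endpoint; the host, port, prompt, and size are placeholders:

```sh
curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a cute baby sea otter", "size": "256x256"}'
```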
LocalAI version:
Latest
Environment, CPU architecture, OS, and Version:
EC2
Describe the bug
Getting the gRPC connection error when running with the cuda12 image, but when running with the vanilla/CPU image it works fine.
Using docker-compose to start the server.
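A minimal docker-compose sketch of such a CUDA 12 setup, assuming the quay.io image and Compose's NVIDIA device-reservation syntax; the tag, port, and volume path are placeholders:

```yaml
# docker-compose.yaml — hypothetical sketch, pin the tag to your release
services:
  local-ai:
    image: quay.io/go-skynet/local-ai:v2.11.0-cublas-cuda12-ffmpeg  # placeholder tag
    ports:
      - "8080:8080"
    environment:
      - DEBUG=true              # verbose logs help diagnose backend load failures
    volumes:
      - ./models:/build/models  # models path inside the image
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```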
To Reproduce
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "luna-ai-llama2", "prompt": "A long time ago in a galaxy far, far away","temperature": 0.7}'
Expected behavior
I need to run an LLM on the GPU for inference. I tried all the available images, but the same error persists.
Logs
12:08PM INF Trying to load the model 'luna-ai-llama2' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/diffusers/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/vall-e-x/run.sh
12:08PM INF [llama-cpp] Attempting to load
12:08PM INF Loading model 'luna-ai-llama2' with backend llama-cpp
12:09PM ERR Failed starting/connecting to the gRPC service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37313:connect: connection refused"
12:09PM INF [llama-cpp] Fails: grpc service not ready
12:09PM INF [llama-ggml] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend llama-ggml
12:09PM INF [llama-ggml] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:09PM INF [gpt4all] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend gpt4all
12:09PM INF [gpt4all] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:09PM INF [bert-embeddings] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend bert-embeddings
12:09PM INF [bert-embeddings] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:09PM INF [rwkv] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend rwkv
12:09PM INF [rwkv] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:09PM INF [whisper] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend whisper
12:09PM ERR Failed starting/connecting to the gRPC service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35143:connect: connection refused"
12:09PM INF [whisper] Fails: grpc service not ready
12:09PM INF [stablediffusion] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend
Additional context
I think people have faced a similar problem before as well, but I couldn't find any solution. Kindly let me know if anyone has any workarounds!