"error":{"code":500,"message":"rpc error: code = Unknown desc = unimplemented","type":""}} #1909


Open
rohan902 opened this issue Mar 27, 2024 · 15 comments
Labels
bug (Something isn't working), unconfirmed

Comments

@rohan902

LocalAI version:
Latest

Environment, CPU architecture, OS, and Version:
AWS EC2

Describe the bug
Getting the gRPC connection error when running with the cuda12 image. But when running with the vanilla/CPU image, it works fine.
Using docker-compose to start the server.
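For reference, a minimal docker-compose sketch of this kind of setup (the image tag, model path, and GPU reservation below are illustrative assumptions, not the exact file in use):

version: "3.6"
services:
  api:
    image: quay.io/go-skynet/local-ai:v2.11.0-cublas-cuda12-ffmpeg # cuda12 tag is an assumption
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            # standard Compose syntax for exposing an NVIDIA GPU to the container
            - driver: nvidia
              count: 1
              capabilities: [gpu]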

To Reproduce
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "luna-ai-llama2", "prompt": "A long time ago in a galaxy far, far away","temperature": 0.7}'

Expected behavior
I need to run the LLM on a GPU for inference; I have tried all the available images, but the same error persists.

Logs

12:08PM INF Trying to load the model 'luna-ai-llama2' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/diffusers/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/vall-e-x/run.sh
12:08PM INF [llama-cpp] Attempting to load
12:08PM INF Loading model 'luna-ai-llama2' with backend llama-cpp
12:09PM ERR Failed starting/connecting to the gRPC service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37313: connect: connection refused"
12:09PM INF [llama-cpp] Fails: grpc service not ready
12:09PM INF [llama-ggml] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend llama-ggml
12:09PM INF [llama-ggml] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:09PM INF [gpt4all] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend gpt4all
12:09PM INF [gpt4all] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:09PM INF [bert-embeddings] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend bert-embeddings
12:09PM INF [bert-embeddings] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:09PM INF [rwkv] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend rwkv
12:09PM INF [rwkv] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:09PM INF [whisper] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend whisper
12:09PM ERR Failed starting/connecting to the gRPC service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35143: connect: connection refused"
12:09PM INF [whisper] Fails: grpc service not ready
12:09PM INF [stablediffusion] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend

Additional context
I think people have faced a similar problem before, but I couldn't find any solution. Kindly let me know if anyone has a workaround!

rohan902 added the bug (Something isn't working) and unconfirmed labels on Mar 27, 2024
@Anto79-ops

Anto79-ops commented Mar 27, 2024

Hi, I can confirm I'm getting the same issue on master (pulled after the v2.11 cuda cublas12-ffmpeg images became available).

2:46PM DBG Model already loaded in memory: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM WRN GRPC Model not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38737: connect: connection refused"
2:46PM WRN Deleting the process in order to recreate it
2:46PM DBG GRPC Process is not responding: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM DBG Stopping all backends except 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf'
2:46PM INF Trying to load the model 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/exllama2/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/vall-e-x/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/diffusers/run.sh
2:46PM INF [llama-cpp] Attempting to load
2:46PM INF Loading model 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf' with backend llama-cpp
2:46PM DBG Model already loaded in memory: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM WRN GRPC Model not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38737: connect: connection refused"
2:46PM WRN Deleting the process in order to recreate it
2:46PM DBG GRPC Process is not responding: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf

@mr-v-v-v

I can confirm the same issue. It's critical.

@mudler
Owner

mudler commented Mar 28, 2024

Can you please share the logs with DEBUG=true? Also, how are you running the image? With a GPU, I suppose?
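For reference, debug logging is controlled by the DEBUG=true environment variable; a minimal sketch of starting the container with it and a GPU via docker run (the image tag is an assumption):

# --gpus all requires the NVIDIA Container Toolkit on the host
docker run -ti --rm --gpus all -p 8080:8080 \
  -e DEBUG=true \
  -v $PWD/models:/models \
  quay.io/go-skynet/local-ai:v2.11.0-cublas-cuda12-ffmpeg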

@Anto79-ops

Hello @mudler, I posted some of the logs above; would you like to see more?

@mudler
Owner

mudler commented Mar 28, 2024

@Anto79-ops your log looks incomplete; it seems something failed early on in a way that made the subsequent calls fail. Can you share the full log from the beginning of the session?

@Anto79-ops

Anto79-ops commented Mar 29, 2024

@mudler is it OK if I email/DM you a text file of the logs?

@Anto79-ops

Anto79-ops commented Apr 2, 2024

I just pulled the latest master image and the problem is solved (for me, at least).

Thank you!

@JackBekket
Contributor

#1981 is related

You get this error because the llama-cpp backend tries to offload the whole model to the GPU and fails when there is not enough VRAM.

A workaround is to offload only part of your model's layers to the GPU.

You need to create a .yaml config file for your model, like this:

name: wizard-uncensored-13b
f16: false # set to true for GPU acceleration
cuda: false # set to true for GPU acceleration
gpu_layers: 10 # this model has at most 40 layers; 15-20 is recommended for a half-load on an NVIDIA 4060 Ti (more layers means more VRAM required); presumably 0 means no GPU offload
parameters:
  model: wizard-uncensored-13b.gguf
#backend: diffusers
template:

  chat: &template |
    Instruct: {{.Input}}
    Output:
  # Modify the prompt template here ^^^ as per your requirements
  completion: *template

You should play around with gpu_layers here and check nvidia-smi.
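For example, to watch VRAM usage while tuning gpu_layers (standard nvidia-smi flags; re-send a request after each change):

# print used/total VRAM every second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1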

@DavidGOrtega

I have this error with a custom model, NeuralHermes. I have asked for help in #1992.

@JackBekket
Contributor

I have this error with a custom model, NeuralHermes. I have asked for help in #1992.

Have you checked that your VRAM is enough to offload all the layers? You can try splitting them.

@DavidGOrtega

DavidGOrtega commented Apr 12, 2024

@JackBekket it's running on my preprod server:

NVIDIA L4
32 cores
90 GB

The models that come with the distro are running perfectly.

@DavidGOrtega

@mudler I have the answer: I downloaded the raw link file, which is just plain text 🤦
Thanks for your help!

@localai-bot
Contributor

You're welcome! I'm glad you found the issue and managed to resolve it. If you need any further assistance, don't hesitate to reach out. Have a great day!

@ytjhai

ytjhai commented Jul 17, 2024

I'm having a similar issue. Here is the log:

api-1  | 9:50PM DBG Extracting backend assets files to /tmp/localai/backend_data
api-1  | 9:50PM DBG processing api keys runtime update
api-1  | 9:50PM DBG processing external_backends.json
api-1  | 9:50PM DBG external backends loaded from external_backends.json
api-1  | 9:50PM INF core/startup process completed!
api-1  | 9:50PM DBG No configuration file found at /tmp/localai/upload/uploadedFiles.json
api-1  | 9:50PM DBG No configuration file found at /tmp/localai/config/assistants.json
api-1  | 9:50PM DBG No configuration file found at /tmp/localai/config/assistantsFile.json
api-1  | 9:50PM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080
api-1  | 9:50PM DBG Request received: {"model":"gte-qwen","language":"","translate":false,"n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"repeat_last_n":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","size":"","prompt":null,"instruction":"","input":"Your text string goes here","stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"grammar_json_name":null,"backend":"","model_base_name":""}
api-1  | 9:50PM DBG guessDefaultsFromFile: not a GGUF file
api-1  | 9:50PM DBG Parameter Config: &{PredictionOptions:{Model:Alibaba-NLP/gte-Qwen2-7B-instruct Language: Translate:false N:0 TopP:0x4000630b90 TopK:0x4000630b68 Temperature:0x4000630a18 Maxtokens:0x4000630fc8 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0x4000630fc0 TypicalP:0x4000630f08 Seed:0x40006310a0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gte-qwen F16:0x4000630cb0 Threads:0x4000630cb8 Debug:0x4000585ab0 Roles:map[] Embeddings:0x4000630fe9 Backend:huggingface-embeddings TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:<nil>} PromptStrings:[] InputStrings:[Your text string goes here] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionName:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0x4000630f00 MirostatTAU:0x4000630ee8 Mirostat:0x4000630ee0 NGPULayers:0x4000630fe0 MMap:0x4000630a17 MMlock:0x4000630fe9 LowVRAM:0x4000630fe9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0x4000630c30 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:}
api-1  | 9:50PM INF Loading model 'Alibaba-NLP/gte-Qwen2-7B-instruct' with backend huggingface-embeddings
api-1  | 9:50PM DBG Loading model in memory from file: /models/Alibaba-NLP/gte-Qwen2-7B-instruct
api-1  | 9:50PM DBG Loading Model Alibaba-NLP/gte-Qwen2-7B-instruct with gRPC (file: /models/Alibaba-NLP/gte-Qwen2-7B-instruct) (backend: huggingface-embeddings): {backendString:huggingface-embeddings model:Alibaba-NLP/gte-Qwen2-7B-instruct threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0x4000239b08 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
api-1  | 9:50PM DBG Loading external backend: /build/backend/python/sentencetransformers/run.sh
api-1  | 9:50PM DBG Loading GRPC Process: /build/backend/python/sentencetransformers/run.sh
api-1  | 9:50PM DBG GRPC Service for Alibaba-NLP/gte-Qwen2-7B-instruct will be running at: '127.0.0.1:33329'
api-1  | 9:50PM DBG GRPC Service state dir: /tmp/go-processmanager1272549319
api-1  | 9:50PM DBG GRPC Service Started
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout Initializing libbackend for build
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout virtualenv created
**api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 78: uv: command not found**
**api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 83: /build/backend/python/sentencetransformers/venv/bin/activate: No such file or directory**
**api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 155: exec: python: not found**
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout virtualenv activated
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout activated virtualenv has been ensured
api-1  | 9:51PM ERR failed starting/connecting to the gRPC service error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:33329: connect: connection refused\""
api-1  | 9:51PM DBG GRPC Service NOT ready
api-1  | 9:51PM ERR Server error error="grpc service not ready" ip=192.168.65.1 latency=40.12671406s method=POST status=500 url=/embeddings

I've highlighted (in bold) the lines that stood out to me. It would be good to have customized model config files with examples for the different backends.
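The uv: command not found and python: not found lines suggest the virtualenv bootstrap in libbackend.sh never succeeded. A quick sanity check, assuming the compose service is named api as in the log prefix:

# verify the tools libbackend.sh expects actually exist in the image
docker compose exec api /bin/sh -c 'command -v uv; command -v python; command -v python3'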

@haema5

haema5 commented Sep 12, 2024

I'm using an all-in-one container with GPU support, and when I try to generate an image I get the following error:
2:53PM INF Success ip=10.0.32.20 latency="408.897µs" method=GET status=200 url=/text2image/

2:53PM INF Success ip=10.0.32.20 latency="35.826µs" method=GET status=200 url=/static/general.css
2:53PM INF Success ip=10.0.32.20 latency="37.071µs" method=GET status=200 url=/static/assets/highlightjs.css
2:53PM INF Success ip=10.0.32.20 latency="36.818µs" method=GET status=200 url=/static/assets/highlightjs.js
2:53PM INF Success ip=10.0.32.20 latency="34.684µs" method=GET status=200 url=/static/assets/font1.css
2:53PM INF Success ip=10.0.32.20 latency="30.093µs" method=GET status=200 url=/static/assets/font2.css
2:53PM INF Success ip=10.0.32.20 latency="26.333µs" method=GET status=200 url=/static/assets/tw-elements.css
2:53PM INF Success ip=10.0.32.20 latency="28.7µs" method=GET status=200 url=/static/assets/fontawesome/css/fontawesome.css
2:53PM INF Success ip=10.0.32.20 latency="24.974µs" method=GET status=200 url=/static/assets/fontawesome/css/brands.css
2:53PM INF Success ip=10.0.32.20 latency="29.899µs" method=GET status=200 url=/static/assets/fontawesome/css/solid.css
2:53PM INF Success ip=10.0.32.20 latency="34.853µs" method=GET status=200 url=/static/assets/tailwindcss.js
2:53PM INF Success ip=10.0.32.20 latency="61.305µs" method=GET status=200 url=/static/assets/htmx.js
2:53PM INF Success ip=10.0.32.20 latency="16.214µs" method=GET status=200 url=/static/assets/tw-elements.js
2:53PM INF Success ip=10.0.32.20 latency="38.009µs" method=GET status=200 url=/static/assets/marked.js
2:53PM INF Success ip=10.0.32.20 latency="31.57µs" method=GET status=200 url=/static/assets/alpine.js
2:53PM INF Success ip=10.0.32.20 latency="45.17µs" method=GET status=200 url=/static/assets/purify.js
2:53PM INF Success ip=10.0.32.20 latency="30.934µs" method=GET status=200 url=/static/image.js
2:53PM INF Success ip=10.0.32.20 latency="32.577µs" method=GET status=200 url=/static/assets/UcCO3FwrK3iLTeHuS_fvQtMwCp50KnMw2boKoduKmMEVuFuYMZg.ttf
2:53PM INF Success ip=10.0.32.20 latency="27.916µs" method=GET status=200 url=/static/assets/fontawesome/webfonts/fa-solid-900.woff2
2:53PM INF Success ip=10.0.32.20 latency="30.196µs" method=GET status=200 url=/static/assets/UcCO3FwrK3iLTeHuS_fvQtMwCp50KnMw2boKoduKmMEVuLyfMZg.ttf
2:53PM INF Success ip=10.0.32.20 latency="29.258µs" method=GET status=200 url=/static/assets/UcCO3FwrK3iLTeHuS_fvQtMwCp50KnMw2boKoduKmMEVuGKYMZg.ttf
2:54PM INF Success ip=127.0.0.1 latency="11.14µs" method=GET status=200 url=/readyz
2:55PM INF Success ip=127.0.0.1 latency="9.156µs" method=GET status=200 url=/readyz
2:55PM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend stablediffusion
2:55PM ERR Server error error="rpc error: code = Unimplemented desc = " ip=10.0.32.20 latency=8.927992ms method=POST status=500 url=/v1/images/generations
I have tried different versions of the containers.
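For reference, the failing path is the OpenAI-style images endpoint; a minimal request sketch (prompt and size are placeholders):

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" \
  -d '{"prompt": "a cute cat", "size": "256x256"}'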
