"error":{"code":500,"message":"rpc error: code = Unknown desc = unimplemented","type":""}} #1909


Open
rohan902 opened this issue Mar 27, 2024 · 15 comments
Labels
bug (Something isn't working), unconfirmed

Comments

@rohan902

LocalAI version:
Latest

Environment, CPU architecture, OS, and Version:
AWS EC2

Describe the bug
Getting the gRPC connection error when running with the cuda12 image. But when running with the vanilla/CPU image, it works fine.
Using docker-compose to start the server.
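For reference, a minimal docker-compose sketch of this kind of setup (the image tag, model path, and GPU reservation below are illustrative assumptions, not the exact file in use):

version: "3.6"
services:
  api:
    image: quay.io/go-skynet/local-ai:v2.11.0-cublas-cuda12-ffmpeg # cuda12 tag is an assumption
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            # standard Compose syntax for exposing an NVIDIA GPU to the container
            - driver: nvidia
              count: 1
              capabilities: [gpu]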

To Reproduce
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "luna-ai-llama2", "prompt": "A long time ago in a galaxy far, far away","temperature": 0.7}'

Expected behavior
I need to run the LLM on a GPU for inference; I have tried all the available images, but the same error persists.

Logs

12:08PM INF Trying to load the model 'luna-ai-llama2' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/diffusers/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/vall-e-x/run.sh
12:08PM INF [llama-cpp] Attempting to load
12:08PM INF Loading model 'luna-ai-llama2' with backend llama-cpp
12:09PM ERR Failed starting/connecting to the gRPC service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37313: connect: connection refused"
12:09PM INF [llama-cpp] Fails: grpc service not ready
12:09PM INF [llama-ggml] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend llama-ggml
12:09PM INF [llama-ggml] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:09PM INF [gpt4all] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend gpt4all
12:09PM INF [gpt4all] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:09PM INF [bert-embeddings] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend bert-embeddings
12:09PM INF [bert-embeddings] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:09PM INF [rwkv] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend rwkv
12:09PM INF [rwkv] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:09PM INF [whisper] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend whisper
12:09PM ERR Failed starting/connecting to the gRPC service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35143: connect: connection refused"
12:09PM INF [whisper] Fails: grpc service not ready
12:09PM INF [stablediffusion] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend

Additional context
I think people have faced a similar problem before, but I couldn't find any solution. Kindly let me know if anyone has a workaround!

rohan902 added the bug (Something isn't working) and unconfirmed labels on Mar 27, 2024
@Anto79-ops

Anto79-ops commented Mar 27, 2024

Hi, I can confirm I'm getting the same issue on master (pulled after the v2.11 cuda cublas12-ffmpeg images became available).

2:46PM DBG Model already loaded in memory: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM WRN GRPC Model not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38737: connect: connection refused"
2:46PM WRN Deleting the process in order to recreate it
2:46PM DBG GRPC Process is not responding: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM DBG Stopping all backends except 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf'
2:46PM INF Trying to load the model 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/exllama2/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/vall-e-x/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/diffusers/run.sh
2:46PM INF [llama-cpp] Attempting to load
2:46PM INF Loading model 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf' with backend llama-cpp
2:46PM DBG Model already loaded in memory: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM WRN GRPC Model not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38737: connect: connection refused"
2:46PM WRN Deleting the process in order to recreate it
2:46PM DBG GRPC Process is not responding: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf

@mr-v-v-v

I can confirm the same issue. It's critical.

@mudler
Owner

mudler commented Mar 28, 2024

Can you please share the logs with DEBUG=true? Also, how are you running the image? With a GPU, I suppose?
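For reference, debug logging is controlled by the DEBUG=true environment variable; a minimal sketch of starting the container with it and a GPU via docker run (the image tag is an assumption):

# --gpus all requires the NVIDIA Container Toolkit on the host
docker run -ti --rm --gpus all -p 8080:8080 \
  -e DEBUG=true \
  -v $PWD/models:/models \
  quay.io/go-skynet/local-ai:v2.11.0-cublas-cuda12-ffmpeg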

@Anto79-ops

Hello @mudler, I posted some of the logs above; would you like to see more?

@mudler
Owner

mudler commented Mar 28, 2024

@Anto79-ops your log looks incomplete; it seems something failed early on in a way that made the subsequent calls fail. Can you share the full log from the beginning of the session?

@Anto79-ops

Anto79-ops commented Mar 29, 2024

@mudler is it OK if I email/DM you a text file of the logs?

@Anto79-ops

Anto79-ops commented Apr 2, 2024

I just pulled the latest master image and the problem is solved (for me, at least).

Thank you!

@JackBekket
Contributor

#1981 is related

You get this error because the llama-cpp backend tries to offload the whole model to the GPU and fails when there is not enough VRAM.

A workaround is to offload only part of your model's layers to the GPU.

You need to create a .yaml config file for your model, like this:

name: wizard-uncensored-13b
f16: false # set to true for GPU acceleration
cuda: false # set to true for GPU acceleration
gpu_layers: 10 # this model has at most 40 layers; 15-20 is recommended for a half-load on an NVIDIA 4060 Ti (more layers means more VRAM required); presumably 0 means no GPU offload
parameters:
  model: wizard-uncensored-13b.gguf
#backend: diffusers
template:

  chat: &template |
    Instruct: {{.Input}}
    Output:
  # Modify the prompt template here ^^^ as per your requirements
  completion: *template

You should play around with gpu_layers here and check nvidia-smi.
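For example, to watch VRAM usage while tuning gpu_layers (standard nvidia-smi flags; re-send a request after each change):

# print used/total VRAM every second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1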

@DavidGOrtega

I have this error with a custom model, NeuralHermes. I have asked for help in #1992.

@JackBekket
Contributor

I have this error with a custom model, NeuralHermes. I have asked for help in #1992.

Have you checked that your VRAM is enough to offload all the layers? You can try splitting them.

@DavidGOrtega

DavidGOrtega commented Apr 12, 2024

@JackBekket it's running on my preprod server:

NVIDIA L4
32 cores
90 GB

The models that come with the distro are running perfectly.

@DavidGOrtega

@mudler I have the answer: I downloaded the raw link file, which is just plain text 🤦
Thanks for your help!

@localai-bot
Contributor

You're welcome! I'm glad you found the issue and managed to resolve it. If you need any further assistance, don't hesitate to reach out. Have a great day!

@ytjhai

ytjhai commented Jul 17, 2024

I'm having a similar issue. Here is the log:

api-1  | 9:50PM DBG Extracting backend assets files to /tmp/localai/backend_data
api-1  | 9:50PM DBG processing api keys runtime update
api-1  | 9:50PM DBG processing external_backends.json
api-1  | 9:50PM DBG external backends loaded from external_backends.json
api-1  | 9:50PM INF core/startup process completed!
api-1  | 9:50PM DBG No configuration file found at /tmp/localai/upload/uploadedFiles.json
api-1  | 9:50PM DBG No configuration file found at /tmp/localai/config/assistants.json
api-1  | 9:50PM DBG No configuration file found at /tmp/localai/config/assistantsFile.json
api-1  | 9:50PM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080
api-1  | 9:50PM DBG Request received: {"model":"gte-qwen","language":"","translate":false,"n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"repeat_last_n":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","size":"","prompt":null,"instruction":"","input":"Your text string goes here","stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"grammar_json_name":null,"backend":"","model_base_name":""}
api-1  | 9:50PM DBG guessDefaultsFromFile: not a GGUF file
api-1  | 9:50PM DBG Parameter Config: &{PredictionOptions:{Model:Alibaba-NLP/gte-Qwen2-7B-instruct Language: Translate:false N:0 TopP:0x4000630b90 TopK:0x4000630b68 Temperature:0x4000630a18 Maxtokens:0x4000630fc8 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0x4000630fc0 TypicalP:0x4000630f08 Seed:0x40006310a0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gte-qwen F16:0x4000630cb0 Threads:0x4000630cb8 Debug:0x4000585ab0 Roles:map[] Embeddings:0x4000630fe9 Backend:huggingface-embeddings TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:<nil>} PromptStrings:[] InputStrings:[Your text string goes here] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionName:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0x4000630f00 MirostatTAU:0x4000630ee8 Mirostat:0x4000630ee0 NGPULayers:0x4000630fe0 MMap:0x4000630a17 MMlock:0x4000630fe9 LowVRAM:0x4000630fe9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0x4000630c30 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:}
api-1  | 9:50PM INF Loading model 'Alibaba-NLP/gte-Qwen2-7B-instruct' with backend huggingface-embeddings
api-1  | 9:50PM DBG Loading model in memory from file: /models/Alibaba-NLP/gte-Qwen2-7B-instruct
api-1  | 9:50PM DBG Loading Model Alibaba-NLP/gte-Qwen2-7B-instruct with gRPC (file: /models/Alibaba-NLP/gte-Qwen2-7B-instruct) (backend: huggingface-embeddings): {backendString:huggingface-embeddings model:Alibaba-NLP/gte-Qwen2-7B-instruct threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0x4000239b08 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
api-1  | 9:50PM DBG Loading external backend: /build/backend/python/sentencetransformers/run.sh
api-1  | 9:50PM DBG Loading GRPC Process: /build/backend/python/sentencetransformers/run.sh
api-1  | 9:50PM DBG GRPC Service for Alibaba-NLP/gte-Qwen2-7B-instruct will be running at: '127.0.0.1:33329'
api-1  | 9:50PM DBG GRPC Service state dir: /tmp/go-processmanager1272549319
api-1  | 9:50PM DBG GRPC Service Started
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout Initializing libbackend for build
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout virtualenv created
**api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 78: uv: command not found**
**api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 83: /build/backend/python/sentencetransformers/venv/bin/activate: No such file or directory**
**api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 155: exec: python: not found**
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout virtualenv activated
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout activated virtualenv has been ensured
api-1  | 9:51PM ERR failed starting/connecting to the gRPC service error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:33329: connect: connection refused\""
api-1  | 9:51PM DBG GRPC Service NOT ready
api-1  | 9:51PM ERR Server error error="grpc service not ready" ip=192.168.65.1 latency=40.12671406s method=POST status=500 url=/embeddings

I've highlighted (in bold) the lines that stood out to me. It would be good to have customized model config files with examples for the different backends.
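The uv: command not found and python: not found lines suggest the virtualenv bootstrap in libbackend.sh never succeeded. A quick sanity check, assuming the compose service is named api as in the log prefix:

# verify the tools libbackend.sh expects actually exist in the image
docker compose exec api /bin/sh -c 'command -v uv; command -v python; command -v python3'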

@haema5

haema5 commented Sep 12, 2024

I'm using an all-in-one container with GPU support, and when I try to generate an image I get the following error:
2:53PM INF Success ip=10.0.32.20 latency="408.897µs" method=GET status=200 url=/text2image/

2:53PM INF Success ip=10.0.32.20 latency="35.826µs" method=GET status=200 url=/static/general.css
2:53PM INF Success ip=10.0.32.20 latency="37.071µs" method=GET status=200 url=/static/assets/highlightjs.css
2:53PM INF Success ip=10.0.32.20 latency="36.818µs" method=GET status=200 url=/static/assets/highlightjs.js
2:53PM INF Success ip=10.0.32.20 latency="34.684µs" method=GET status=200 url=/static/assets/font1.css
2:53PM INF Success ip=10.0.32.20 latency="30.093µs" method=GET status=200 url=/static/assets/font2.css
2:53PM INF Success ip=10.0.32.20 latency="26.333µs" method=GET status=200 url=/static/assets/tw-elements.css
2:53PM INF Success ip=10.0.32.20 latency="28.7µs" method=GET status=200 url=/static/assets/fontawesome/css/fontawesome.css
2:53PM INF Success ip=10.0.32.20 latency="24.974µs" method=GET status=200 url=/static/assets/fontawesome/css/brands.css
2:53PM INF Success ip=10.0.32.20 latency="29.899µs" method=GET status=200 url=/static/assets/fontawesome/css/solid.css
2:53PM INF Success ip=10.0.32.20 latency="34.853µs" method=GET status=200 url=/static/assets/tailwindcss.js
2:53PM INF Success ip=10.0.32.20 latency="61.305µs" method=GET status=200 url=/static/assets/htmx.js
2:53PM INF Success ip=10.0.32.20 latency="16.214µs" method=GET status=200 url=/static/assets/tw-elements.js
2:53PM INF Success ip=10.0.32.20 latency="38.009µs" method=GET status=200 url=/static/assets/marked.js
2:53PM INF Success ip=10.0.32.20 latency="31.57µs" method=GET status=200 url=/static/assets/alpine.js
2:53PM INF Success ip=10.0.32.20 latency="45.17µs" method=GET status=200 url=/static/assets/purify.js
2:53PM INF Success ip=10.0.32.20 latency="30.934µs" method=GET status=200 url=/static/image.js
2:53PM INF Success ip=10.0.32.20 latency="32.577µs" method=GET status=200 url=/static/assets/UcCO3FwrK3iLTeHuS_fvQtMwCp50KnMw2boKoduKmMEVuFuYMZg.ttf
2:53PM INF Success ip=10.0.32.20 latency="27.916µs" method=GET status=200 url=/static/assets/fontawesome/webfonts/fa-solid-900.woff2
2:53PM INF Success ip=10.0.32.20 latency="30.196µs" method=GET status=200 url=/static/assets/UcCO3FwrK3iLTeHuS_fvQtMwCp50KnMw2boKoduKmMEVuLyfMZg.ttf
2:53PM INF Success ip=10.0.32.20 latency="29.258µs" method=GET status=200 url=/static/assets/UcCO3FwrK3iLTeHuS_fvQtMwCp50KnMw2boKoduKmMEVuGKYMZg.ttf
2:54PM INF Success ip=127.0.0.1 latency="11.14µs" method=GET status=200 url=/readyz
2:55PM INF Success ip=127.0.0.1 latency="9.156µs" method=GET status=200 url=/readyz
2:55PM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend stablediffusion
2:55PM ERR Server error error="rpc error: code = Unimplemented desc = " ip=10.0.32.20 latency=8.927992ms method=POST status=500 url=/v1/images/generations
I have tried different versions of the containers.
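For reference, the failing path is the OpenAI-style images endpoint; a minimal request sketch (prompt and size are placeholders):

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" \
  -d '{"prompt": "a cute cat", "size": "256x256"}'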
