
Could not find nvcc, please set CUDAToolkit_ROOT #409

Open · opened by @EugeoSynthesisThirtyTwo

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [X] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

--n-gpu-layers 36 is supposed to fill my VRAM and use my GPU. It should also print llama_model_load_internal: [cublas] offloading 36 layers to GPU in the console, and I suppose it should report BLAS = 1.

Current Behavior

Nothing about offloading appears in the console, my GPU stays idle, and my VRAM stays empty. It prints BLAS = 0.
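To double-check this outside the webui, the flags the installed llama-cpp-python build was compiled with can be queried directly. A minimal sketch, assuming the low-level llama_print_system_info binding is exposed by the installed version (recent releases do expose it):

import llama_cpp

# Print the compile-time feature flags of the llama.cpp library that is
# actually loaded; a cuBLAS build should report "BLAS = 1" in this string.
print(llama_cpp.llama_print_system_info().decode("utf-8", errors="replace"))

# Show which installation is being imported, in case an older CPU-only
# wheel is still shadowing the rebuilt one.
print("llama_cpp loaded from:", llama_cpp.__file__)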

Environment and Context

Windows 10
CPU with 20 threads
64 GB RAM
RTX 3080 Ti Laptop GPU
16 GB VRAM

Python 3.10.9
fastapi 0.97.0
numpy 1.25.0
starlette 0.27.0
uvicorn 0.22.0

I can't verify the make and g++ versions because Windows does not find these commands.
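Since the build error in the title says "Could not find nvcc", a quick way to see which build tools Windows can actually find on PATH is a check like the one below. This is only a sketch using the standard library; the tool names are just the usual executables:

import shutil

# Report which build tools are visible on PATH. If nvcc shows "not found"
# here, CMake's CUDA detection will most likely fail the same way during
# the cuBLAS build of llama-cpp-python.
for tool in ("nvcc", "cmake", "cl", "g++", "make"):
    print(f"{tool:6} -> {shutil.which(tool) or 'not found'}")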

Steps to Reproduce

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui

conda create -n textgen python=3.10.9
conda activate textgen
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt

pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir
python server.py --model WizardLM-30B-Uncensored.ggmlv3.q4_0.bin  --n-gpu-layers 36 --auto-devices
  • open the gradio webui
  • generate something in the chat-box to load the model
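For completeness, the rebuild I would try next, written as a small Python script rather than cmd so that the environment variables are passed exactly as intended. Two assumptions here: in cmd, set CMAKE_ARGS="-DLLAMA_CUBLAS=on" keeps the surrounding quotes as part of the value, which can break the flag, and the CUDAToolkit_ROOT path below is the default CUDA 11.7 install location and has to be adjusted to whatever version is actually installed:

import os
import subprocess
import sys

# Rebuild llama-cpp-python with cuBLAS, pointing CMake at the CUDA toolkit
# explicitly (CUDAToolkit_ROOT is honored by CMake's FindCUDAToolkit).
# The path below assumes the default CUDA 11.7 install location; adjust it.
env = dict(os.environ)
env["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=on"
env["FORCE_CMAKE"] = "1"
env["CUDAToolkit_ROOT"] = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7"

subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "llama-cpp-python",
     "--no-cache-dir", "--force-reinstall"],
    env=env,
)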

Failure Logs

bin C:\Users\Armaguedin\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.dll
C:\Users\Armaguedin\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
function 'cadam32bit_grad_fp32' not found
2023-06-20 23:40:24 INFO:Loading WizardLM-30B-Uncensored.ggmlv3.q4_0.bin...
2023-06-20 23:40:24 INFO:llama.cpp weights detected: models\WizardLM-30B-Uncensored.ggmlv3.q4_0.bin

2023-06-20 23:40:24 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models\WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 19756.67 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  = 3120.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
2023-06-20 23:40:25 INFO:Loaded the model in 0.97 seconds.

2023-06-20 23:40:25 INFO:Loading the extension "gallery"...
Running on local URL:  http://0.0.0.0:7861

To create a public link, set `share=True` in `launch()`.

llama_print_timings:        load time = 60209.82 ms
llama_print_timings:      sample time =     1.47 ms /    10 runs   (    0.15 ms per token,  6807.35 tokens per second)
llama_print_timings: prompt eval time = 60209.69 ms /     8 tokens ( 7526.21 ms per token,     0.13 tokens per second)
llama_print_timings:        eval time =  8880.62 ms /     9 runs   (  986.74 ms per token,     1.01 tokens per second)
llama_print_timings:       total time = 69109.63 ms
Output generated in 69.34 seconds (0.13 tokens/s, 9 tokens, context 8, seed 1976462288)

Labels

build, duplicate (This issue or pull request already exists), hardware (Hardware specific issue), llama.cpp (Problem with llama.cpp shared lib), windows (A Windoze-specific issue)
