Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
--n-gpu-layers 36 is supposed to fill my VRAM and use my GPU. It should also print `llama_model_load_internal: [cublas] offloading 36 layers to GPU` in the console, and I suppose it should report BLAS = 1.
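As far as I understand, the webui flag maps to llama-cpp-python's `n_gpu_layers` parameter, so loading the model directly should show the same offload line. A minimal sketch of that check (the model path is just my local file; the parameter mapping is my assumption):

```python
# Minimal sketch: load the same GGML model directly with llama-cpp-python
# and request 36 offloaded layers. With a working cuBLAS build I would
# expect the "[cublas] offloading 36 layers to GPU" line and BLAS = 1.
from llama_cpp import Llama

llm = Llama(
    model_path="models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin",  # my local model file
    n_gpu_layers=36,  # same value as --n-gpu-layers 36 in the webui
)
print(llm("Hello,", max_tokens=16)["choices"][0]["text"])
```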
Current Behavior
Nothing about offloading appears in the console, my GPU stays idle, and my VRAM stays empty. The console prints BLAS = 0.
Environment and Context
- Windows 10
- CPU with 20 threads
- 64 GB RAM
- RTX 3080 Ti Laptop GPU
- 16 GB VRAM
- Python 3.10.9
- fastapi 0.97.0
- numpy 1.25.0
- starlette 0.27.0
- uvicorn 0.22.0

I can't verify the `make` and `g++` versions because Windows does not recognize these commands.
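To rule out a missing CUDA runtime on my side, this is the kind of sanity check I can run with the torch cu117 wheel from the steps below (plain torch calls only, nothing llama.cpp-specific):

```python
# Quick sanity check that a CUDA device is visible from Python at all,
# using the torch cu117 wheel installed in the textgen environment.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA runtime:", torch.version.cuda)
    print("device:", torch.cuda.get_device_name(0))
```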
Steps to Reproduce
- have my GPU (maybe it's an issue with my GPU)
- install oobabooga's text-generation-webui (https://github.com/oobabooga/text-generation-webui) and reinstall llama-cpp-python with cuBLAS enabled (a sketch for checking the resulting build follows this list):

```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
conda create -n textgen python=3.10.9
conda activate textgen
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir
```

- download a ggml model (I used WizardLM-30B-Uncensored.ggmlv3.q4_0.bin) and put it in the "models" folder
- start the webui:

```
python server.py --model WizardLM-30B-Uncensored.ggmlv3.q4_0.bin --n-gpu-layers 36 --auto-devices
```

- open the gradio webui
- generate something in the chat box to load the model
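To narrow down whether the problem is the webui or the llama-cpp-python build itself, this is how I would check the compile-time features of the installed wheel. This is a hedged sketch: I am assuming the low-level `llama_print_system_info()` binding is exposed by the version pip installed; it should print the same feature line (AVX, BLAS, ...) that appears in the failure log below.

```python
# Sketch: ask the installed llama-cpp-python build for its compile-time
# feature string. "BLAS = 1" would mean the cuBLAS build really got
# installed; "BLAS = 0" would mean pip reused a CPU-only build.
import llama_cpp

info = llama_cpp.llama_print_system_info()  # assumed binding; returns bytes
print(info.decode("utf-8"))
```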
Failure Logs
```
bin C:\Users\Armaguedin\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.dll
C:\Users\Armaguedin\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
function 'cadam32bit_grad_fp32' not found
2023-06-20 23:40:24 INFO:Loading WizardLM-30B-Uncensored.ggmlv3.q4_0.bin...
2023-06-20 23:40:24 INFO:llama.cpp weights detected: models\WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
2023-06-20 23:40:24 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models\WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: mem required = 19756.67 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 3120.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
2023-06-20 23:40:25 INFO:Loaded the model in 0.97 seconds.
2023-06-20 23:40:25 INFO:Loading the extension "gallery"...
Running on local URL: http://0.0.0.0:7861
To create a public link, set `share=True` in `launch()`.
llama_print_timings: load time = 60209.82 ms
llama_print_timings: sample time = 1.47 ms / 10 runs ( 0.15 ms per token, 6807.35 tokens per second)
llama_print_timings: prompt eval time = 60209.69 ms / 8 tokens ( 7526.21 ms per token, 0.13 tokens per second)
llama_print_timings: eval time = 8880.62 ms / 9 runs ( 986.74 ms per token, 1.01 tokens per second)
llama_print_timings: total time = 69109.63 ms
Output generated in 69.34 seconds (0.13 tokens/s, 9 tokens, context 8, seed 1976462288)
```