Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
--n-gpu-layers 36 is supposed to fill my VRAM and use my GPU. It should also print `llama_model_load_internal: [cublas] offloading 36 layers to GPU` in the console, and I suppose it should report BLAS = 1.
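As far as I understand, the webui flag maps to llama-cpp-python's `n_gpu_layers` parameter, so loading the model directly should show the same offload line. A minimal sketch of that check (the model path is just my local file; the parameter mapping is my assumption):

```python
# Minimal sketch: load the same GGML model directly with llama-cpp-python
# and request 36 offloaded layers. With a working cuBLAS build I would
# expect the "[cublas] offloading 36 layers to GPU" line and BLAS = 1.
from llama_cpp import Llama

llm = Llama(
    model_path="models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin",  # my local model file
    n_gpu_layers=36,  # same value as --n-gpu-layers 36 in the webui
)
print(llm("Hello,", max_tokens=16)["choices"][0]["text"])
```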
Current Behavior
Nothing about offloading appears in the console, my GPU stays idle, and my VRAM stays empty. The console prints BLAS = 0.
Environment and Context
- Windows 10
- CPU with 20 threads
- 64 GB RAM
- RTX 3080 Ti Laptop GPU
- 16 GB VRAM
- Python 3.10.9
- fastapi 0.97.0
- numpy 1.25.0
- starlette 0.27.0
- uvicorn 0.22.0

I can't verify the `make` and `g++` versions because Windows does not recognize these commands.
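To rule out a missing CUDA runtime on my side, this is the kind of sanity check I can run with the torch cu117 wheel from the steps below (plain torch calls only, nothing llama.cpp-specific):

```python
# Quick sanity check that a CUDA device is visible from Python at all,
# using the torch cu117 wheel installed in the textgen environment.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA runtime:", torch.version.cuda)
    print("device:", torch.cuda.get_device_name(0))
```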
Steps to Reproduce
- have my GPU (maybe it's an issue with my GPU)
- install oobabooga's text-generation-webui (https://github.com/oobabooga/text-generation-webui) and reinstall llama-cpp-python with cuBLAS enabled (a sketch for checking the resulting build follows this list):

```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
conda create -n textgen python=3.10.9
conda activate textgen
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir
```

- download a ggml model (I used WizardLM-30B-Uncensored.ggmlv3.q4_0.bin) and put it in the "models" folder
- start the webui:

```
python server.py --model WizardLM-30B-Uncensored.ggmlv3.q4_0.bin --n-gpu-layers 36 --auto-devices
```

- open the gradio webui
- generate something in the chat box to load the model
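To narrow down whether the problem is the webui or the llama-cpp-python build itself, this is how I would check the compile-time features of the installed wheel. This is a hedged sketch: I am assuming the low-level `llama_print_system_info()` binding is exposed by the version pip installed; it should print the same feature line (AVX, BLAS, ...) that appears in the failure log below.

```python
# Sketch: ask the installed llama-cpp-python build for its compile-time
# feature string. "BLAS = 1" would mean the cuBLAS build really got
# installed; "BLAS = 0" would mean pip reused a CPU-only build.
import llama_cpp

info = llama_cpp.llama_print_system_info()  # assumed binding; returns bytes
print(info.decode("utf-8"))
```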
Failure Logs
```
bin C:\Users\Armaguedin\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.dll
C:\Users\Armaguedin\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
function 'cadam32bit_grad_fp32' not found
2023-06-20 23:40:24 INFO:Loading WizardLM-30B-Uncensored.ggmlv3.q4_0.bin...
2023-06-20 23:40:24 INFO:llama.cpp weights detected: models\WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
2023-06-20 23:40:24 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models\WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: mem required = 19756.67 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 3120.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
2023-06-20 23:40:25 INFO:Loaded the model in 0.97 seconds.
2023-06-20 23:40:25 INFO:Loading the extension "gallery"...
Running on local URL: http://0.0.0.0:7861
To create a public link, set `share=True` in `launch()`.
llama_print_timings: load time = 60209.82 ms
llama_print_timings: sample time = 1.47 ms / 10 runs ( 0.15 ms per token, 6807.35 tokens per second)
llama_print_timings: prompt eval time = 60209.69 ms / 8 tokens ( 7526.21 ms per token, 0.13 tokens per second)
llama_print_timings: eval time = 8880.62 ms / 9 runs ( 986.74 ms per token, 1.01 tokens per second)
llama_print_timings: total time = 69109.63 ms
Output generated in 69.34 seconds (0.13 tokens/s, 9 tokens, context 8, seed 1976462288)
```