
CUBLAS compiled but not working with batch_size = 512 #113


Closed
gjmulder opened this issue Apr 25, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@gjmulder
Contributor

I am trying to run the server with CUBLAS compiled in. I've upped n_batch to 512 and reduced n_ctx to 512:

CUBLAS v12:

$ ldd _skbuild/linux-x86_64-3.10/cmake-install/llama_cpp/libllama.so | grep cuda
	libcublas.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 (0x00007fb90c600000)
	libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x00007fb90c200000)
	libcublasLt.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12 (0x00007fb8e9c00000)
$ python --version
Python 3.10.10
$ git log | head -3
commit 996e31d861ce5c8cfbefe6af52a3da25cf484454
Author: Andrei Betlen <[email protected]>
Date:   Tue Apr 25 01:37:07 2023 -0400

Both are set to 512 for optimal performance with alpaca-lora-65B-GGML:

$ grep 512 ./llama_cpp/server/__main__.py 
    n_ctx: int = 512
    n_batch: int = 512
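
For reference, the same settings can also be passed straight to the high-level API rather than editing __main__.py; a minimal sketch (the model path is the one from the server log further down):

from llama_cpp import Llama

# Minimal sketch: pass n_ctx/n_batch directly instead of editing
# llama_cpp/server/__main__.py. Model path taken from the server log below.
llm = Llama(
    model_path="/data/llama/alpaca-lora-65B-GGML/alpaca-lora-65B.GGML.q4_0.bin",
    n_ctx=512,
    n_batch=512,
)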

Build seems fine:

$ LLAMA_CUBLAS=1 python3 setup.py develop
/home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
CMake Warning (dev) in CMakeLists.txt:
  A logical block opening on the line

    /home/mulderg/Work/llama-cpp-python.dist/CMakeLists.txt:9 (if)

  closes on the line

    /home/mulderg/Work/llama-cpp-python.dist/CMakeLists.txt:31 (endif)

  with mis-matching arguments.
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Configuring done
-- Generating done
-- Build files have been written to: /home/mulderg/Work/llama-cpp-python.dist/_skbuild/linux-x86_64-3.10/cmake-build
[100%] Built target run
Install the project...
-- Install configuration: "Release"
-- Up-to-date: /home/mulderg/Work/llama-cpp-python.dist/_skbuild/linux-x86_64-3.10/cmake-install/llama_cpp/libllama.so
copying _skbuild/linux-x86_64-3.10/cmake-install/llama_cpp/libllama.so -> llama_cpp/libllama.so

running develop
/home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running egg_info
writing llama_cpp_python.egg-info/PKG-INFO
writing dependency_links to llama_cpp_python.egg-info/dependency_links.txt
writing requirements to llama_cpp_python.egg-info/requires.txt
writing top-level names to llama_cpp_python.egg-info/top_level.txt
reading manifest file 'llama_cpp_python.egg-info/SOURCES.txt'
adding license file 'LICENSE.md'
writing manifest file 'llama_cpp_python.egg-info/SOURCES.txt'
running build_ext
Creating /home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages/llama-cpp-python.egg-link (link to .)
Adding llama-cpp-python 0.1.38 to easy-install.pth file

Installed /home/mulderg/Work/llama-cpp-python.dist
Processing dependencies for llama-cpp-python==0.1.38
Searching for typing-extensions==4.5.0
Best match: typing-extensions 4.5.0
Processing typing_extensions-4.5.0-py3.10.egg
typing-extensions 4.5.0 is already the active version in easy-install.pth

Using /home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages/typing_extensions-4.5.0-py3.10.egg
Finished processing dependencies for llama-cpp-python==0.1.38

Looks installed correctly:

$ pip list | grep llama
llama-cpp-python         0.1.38      /home/mulderg/Work/llama-cpp-python.dist

BLAS is enabled and n_ctx = 512, so it looks like the server has picked up my changes to __main__.py:

$ python3 -m llama_cpp.server
llama.cpp: loading model from /data/llama/alpaca-lora-65B-GGML/alpaca-lora-65B.GGML.q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 146.86 KB
llama_model_load_internal: mem required  = 41477.67 MB (+ 5120.00 MB per state)
llama_init_from_file: kv self size  = 1280.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
INFO:     Started server process [105410]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://192.168.1.73:8000 (Press CTRL+C to quit)

nvidia-smi reports Python PID 105410, the same as the server process above:

$ nvidia-smi 
Tue Apr 25 06:46:15 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti      On | 00000000:09:00.0 Off |                  N/A |
| 25%   38C    P5               12W / 250W|    226MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    105410      C   python3                                     222MiB |
+---------------------------------------------------------------------------------------+

But no GPU utilisation when I hit the API 😞

The sun sets on our plans,
A complex web of delusions,
We cannot comprehend (0m23s)
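
A request along these lines reproduces the test; a sketch assuming the OpenAI-compatible /v1/completions route of llama_cpp.server (the prompt is just an example, the server address is from the log above):

import requests

# Sketch of a completion request against the server started above.
# The route and prompt are illustrative.
resp = requests.post(
    "http://192.168.1.73:8000/v1/completions",
    json={"prompt": "Write a haiku about failed plans.", "max_tokens": 64},
)
print(resp.json()["choices"][0]["text"])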
@abetlen
Owner

abetlen commented Apr 25, 2023

@gjmulder what does the utilisation look like when running cuBLAS with the llama.cpp examples?

@gjmulder
Contributor Author

gjmulder commented Apr 25, 2023

Currently exploring different batch sizes using ./perplexity to see if I can squeeze out some more performance. ./perplexity was compiled by running make in vendor/llama.cpp/:

vendor/llama.cpp$ ldd ./perplexity | grep cublas
	libcublas.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 (0x00007fa2e2800000)
	libcublasLt.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12 (0x00007fa2bfe00000)

nvidia-smi output every 1 second:

Card,Instance,GPU Util %, Mem Util %, Mem Used
NVIDIA GeForce GTX 1080 Ti, 0, 81 %, 8 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 93 %, 10 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 99 %, 13 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 57 %, 3 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 69 %, 6 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 93 %, 8 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 99 %, 12 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 64 %, 3 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 96 %, 11 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 99 %, 8 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 57 %, 3 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 75 %, 7 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 93 %, 11 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 95 %, 10 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 81 %, 6 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 93 %, 10 %, 2552 MiB
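
The per-second CSV above can be reproduced with nvidia-smi's query mode; a small polling sketch, with the query fields inferred from the columns shown:

import subprocess

# Poll GPU name, index, utilization and memory once per second as CSV,
# roughly matching the columns logged above.
subprocess.run([
    "nvidia-smi",
    "--query-gpu=name,index,utilization.gpu,utilization.memory,memory.used",
    "--format=csv,noheader",
    "-l", "1",
])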

@gjmulder
Contributor Author

gjmulder commented May 2, 2023

See #131. It looks to be an issue with libllama.so.

@abetlen abetlen added the bug Something isn't working label May 5, 2023
@Free-Radical

Hi, an update: I compiled per the instructions for GPU support:

LLAMA_CUBLAS=1 FORCE_CMAKE=1 pip install llama-cpp-python

No errors. But when running the model in a Jupyter notebook, I clearly see the model load with BLAS = 0, and it runs a lot slower compared to llama.cpp (cuBLAS enabled).

Did I do something wrong, or do we have an issue?

@joelkurian
Contributor

@Free-Radical Try with CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python.

@Free-Radical

@Free-Radical Try with CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python.

Did not work, BLAS = 0
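
If a reinstall still reports BLAS = 0, one quick check of what the installed wheel was actually built with is to print the backend's system-info string, which should contain "BLAS = 1" for a cuBLAS build; a sketch, assuming the low-level llama_print_system_info binding is exposed by the package:

import llama_cpp

# Print the ggml/llama.cpp system info string; a cuBLAS-enabled build
# should report "BLAS = 1". (Assumes the low-level binding is exposed.)
info = llama_cpp.llama_print_system_info()
print(info.decode() if isinstance(info, bytes) else info)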
