
CUBLAS compiled but not working with batch_size = 512 #113


Closed
gjmulder opened this issue Apr 25, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@gjmulder
Contributor

I am trying to run the server with CUBLAS compiled in. I've upped n_batch to 512 and reduced n_ctx to 512:

CUBLAS v12:

$ ldd _skbuild/linux-x86_64-3.10/cmake-install/llama_cpp/libllama.so | grep cuda
	libcublas.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 (0x00007fb90c600000)
	libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x00007fb90c200000)
	libcublasLt.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12 (0x00007fb8e9c00000)
$ python --version
Python 3.10.10
$ git log | head -3
commit 996e31d861ce5c8cfbefe6af52a3da25cf484454
Author: Andrei Betlen <[email protected]>
Date:   Tue Apr 25 01:37:07 2023 -0400

Both are set to 512 for optimal performance with alpaca-lora-65B-GGML:

$ grep 512 ./llama_cpp/server/__main__.py 
    n_ctx: int = 512
    n_batch: int = 512
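
For reference, the same settings can also be passed straight to the high-level API rather than editing __main__.py; a minimal sketch (the model path is the one from the server log further down):

from llama_cpp import Llama

# Minimal sketch: pass n_ctx/n_batch directly instead of editing
# llama_cpp/server/__main__.py. Model path taken from the server log below.
llm = Llama(
    model_path="/data/llama/alpaca-lora-65B-GGML/alpaca-lora-65B.GGML.q4_0.bin",
    n_ctx=512,
    n_batch=512,
)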

Build seems fine:

$ LLAMA_CUBLAS=1 python3 setup.py develop
/home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
CMake Warning (dev) in CMakeLists.txt:
  A logical block opening on the line

    /home/mulderg/Work/llama-cpp-python.dist/CMakeLists.txt:9 (if)

  closes on the line

    /home/mulderg/Work/llama-cpp-python.dist/CMakeLists.txt:31 (endif)

  with mis-matching arguments.
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Configuring done
-- Generating done
-- Build files have been written to: /home/mulderg/Work/llama-cpp-python.dist/_skbuild/linux-x86_64-3.10/cmake-build
[100%] Built target run
Install the project...
-- Install configuration: "Release"
-- Up-to-date: /home/mulderg/Work/llama-cpp-python.dist/_skbuild/linux-x86_64-3.10/cmake-install/llama_cpp/libllama.so
copying _skbuild/linux-x86_64-3.10/cmake-install/llama_cpp/libllama.so -> llama_cpp/libllama.so

running develop
/home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running egg_info
writing llama_cpp_python.egg-info/PKG-INFO
writing dependency_links to llama_cpp_python.egg-info/dependency_links.txt
writing requirements to llama_cpp_python.egg-info/requires.txt
writing top-level names to llama_cpp_python.egg-info/top_level.txt
reading manifest file 'llama_cpp_python.egg-info/SOURCES.txt'
adding license file 'LICENSE.md'
writing manifest file 'llama_cpp_python.egg-info/SOURCES.txt'
running build_ext
Creating /home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages/llama-cpp-python.egg-link (link to .)
Adding llama-cpp-python 0.1.38 to easy-install.pth file

Installed /home/mulderg/Work/llama-cpp-python.dist
Processing dependencies for llama-cpp-python==0.1.38
Searching for typing-extensions==4.5.0
Best match: typing-extensions 4.5.0
Processing typing_extensions-4.5.0-py3.10.egg
typing-extensions 4.5.0 is already the active version in easy-install.pth

Using /home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages/typing_extensions-4.5.0-py3.10.egg
Finished processing dependencies for llama-cpp-python==0.1.38

Looks installed correctly:

$ pip list | grep llama
llama-cpp-python         0.1.38      /home/mulderg/Work/llama-cpp-python.dist

BLAS is enabled and n_ctx = 512, so it looks like the server has picked up my changes to __main__.py:

$ python3 -m llama_cpp.server
llama.cpp: loading model from /data/llama/alpaca-lora-65B-GGML/alpaca-lora-65B.GGML.q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 146.86 KB
llama_model_load_internal: mem required  = 41477.67 MB (+ 5120.00 MB per state)
llama_init_from_file: kv self size  = 1280.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
INFO:     Started server process [105410]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://192.168.1.73:8000 (Press CTRL+C to quit)

nvidia-smi reports Python PID 105410, the same as the server process above:

$ nvidia-smi 
Tue Apr 25 06:46:15 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti      On | 00000000:09:00.0 Off |                  N/A |
| 25%   38C    P5               12W / 250W|    226MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    105410      C   python3                                     222MiB |
+---------------------------------------------------------------------------------------+

But no GPU utilisation when I hit the API 😞

The sun sets on our plans,
A complex web of delusions,
We cannot comprehend (0m23s)
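
A request along these lines reproduces the test; a sketch assuming the OpenAI-compatible /v1/completions route of llama_cpp.server (the prompt is just an example, the server address is from the log above):

import requests

# Sketch of a completion request against the server started above.
# The route and prompt are illustrative.
resp = requests.post(
    "http://192.168.1.73:8000/v1/completions",
    json={"prompt": "Write a haiku about failed plans.", "max_tokens": 64},
)
print(resp.json()["choices"][0]["text"])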
@abetlen
Owner

abetlen commented Apr 25, 2023

@gjmulder what does the utilisation look like when running cuBLAS with the llama.cpp examples?

@gjmulder
Contributor Author

gjmulder commented Apr 25, 2023

Currently exploring different batch sizes using ./perplexity to see if I can squeeze out some more performance. ./perplexity was compiled by running make in vendor/llama.cpp/:

vendor/llama.cpp$ ldd ./perplexity | grep cublas
	libcublas.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 (0x00007fa2e2800000)
	libcublasLt.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12 (0x00007fa2bfe00000)

nvidia-smi output every 1 second:

Card,Instance,GPU Util %, Mem Util %, Mem Used
NVIDIA GeForce GTX 1080 Ti, 0, 81 %, 8 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 93 %, 10 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 99 %, 13 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 57 %, 3 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 69 %, 6 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 93 %, 8 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 99 %, 12 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 64 %, 3 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 96 %, 11 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 99 %, 8 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 57 %, 3 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 75 %, 7 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 93 %, 11 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 95 %, 10 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 81 %, 6 %, 2552 MiB
NVIDIA GeForce GTX 1080 Ti, 0, 93 %, 10 %, 2552 MiB
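
The per-second CSV above can be reproduced with nvidia-smi's query mode; a small polling sketch, with the query fields inferred from the columns shown:

import subprocess

# Poll GPU name, index, utilization and memory once per second as CSV,
# roughly matching the columns logged above.
subprocess.run([
    "nvidia-smi",
    "--query-gpu=name,index,utilization.gpu,utilization.memory,memory.used",
    "--format=csv,noheader",
    "-l", "1",
])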

@gjmulder
Contributor Author

gjmulder commented May 2, 2023

See #131. It looks to be an issue with libllama.so.

@abetlen abetlen added the bug Something isn't working label May 5, 2023
@Free-Radical

Hi, an update: I compiled per the instructions for GPU support:

LLAMA_CUBLAS=1 FORCE_CMAKE=1 pip install llama-cpp-python

No errors. But when running the model in a Jupyter notebook, I clearly see the model load with BLAS = 0, and it runs a lot slower compared to llama.cpp (cuBLAS enabled).

Did I do something wrong, or do we have an issue?

@joelkurian
Contributor

@Free-Radical Try with CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python.

@Free-Radical

@Free-Radical Try with CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python.

Did not work, BLAS = 0
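
If a reinstall still reports BLAS = 0, one quick check of what the installed wheel was actually built with is to print the backend's system-info string, which should contain "BLAS = 1" for a cuBLAS build; a sketch, assuming the low-level llama_print_system_info binding is exposed by the package:

import llama_cpp

# Print the ggml/llama.cpp system info string; a cuBLAS-enabled build
# should report "BLAS = 1". (Assumes the low-level binding is exposed.)
info = llama_cpp.llama_print_system_info()
print(info.decode() if isinstance(info, bytes) else info)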
