Error - not enough space in the context's memory pool #2404

Closed
omarelanis opened this issue Jul 26, 2023 · 4 comments

Comments

@omarelanis

Expected Behavior

Type in a question and answer is retrieved from LLM model

Current Behavior

Instantly receive the following error:
ggml_new_object: not enough space in the context's memory pool (needed 10882896, available 10650320)

Environment and Context

I've tried a combination of settings but keep getting the memory error, even though both system RAM and GPU VRAM are under 50% utilization.

I had to follow the guide below to build llama-cpp-python with GPU support, as it wasn't working previously, but I was getting the same error even before that (side note: GPU support does work natively in oobabooga on Windows):
abetlen/llama-cpp-python#182

HW:
Windows 11
Intel i9-10900K OC @5.3GHz
64GB DDR4-2400 / PC4-19200
12GB Nvidia GeForce RTX 3060
Python 3.10.0

Using embedded DuckDB with persistence: data will be stored in: db
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama.cpp: loading model from models/llama7b/llama-deus-7b-v3.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2927.79 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1470 MB
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

What would you like to know about the policies?

test

ggml_new_object: not enough space in the context's memory pool (needed 10882896, available 10650320)
Traceback (most recent call last):
File "H:\AI_Projects\Indexer_Plus_GPT\chat.py", line 84, in
main()
File "H:\AI_Projects\Indexer_Plus_GPT\chat.py", line 55, in main
res = qa(query)
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 243, in call
raise e
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 237, in call
self._call(inputs, run_manager=run_manager)
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\retrieval_qa\base.py", line 133, in _call
answer = self.combine_documents_chain.run(
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 441, in run
return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 243, in call
raise e
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 237, in call
self._call(inputs, run_manager=run_manager)
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\combine_documents\base.py", line 106, in _call
output, extra_return_dict = self.combine_docs(
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\combine_documents\stuff.py", line 165, in combine_docs
return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\llm.py", line 252, in predict
return self(kwargs, callbacks=callbacks)[self.output_key]
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 243, in call
raise e
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 237, in call
self._call(inputs, run_manager=run_manager)
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\llm.py", line 92, in _call
response = self.generate([inputs], run_manager=run_manager)
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\llm.py", line 102, in generate
return self.llm.generate_prompt(
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 188, in generate_prompt
return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 281, in generate
output = self._generate_helper(
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 225, in _generate_helper
raise e
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 212, in _generate_helper
self._generate(
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 604, in _generate
self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\llamacpp.py", line 229, in _call
for token in self.stream(prompt=prompt, stop=stop, run_manager=run_manager):
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\llamacpp.py", line 279, in stream
for chunk in result:
File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama.py", line 899, in _create_completion
for token in self.generate(
File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama.py", line 721, in generate
self.eval(tokens)
File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama.py", line 461, in eval
return_code = llama_cpp.llama_eval(
File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama_cpp.py", line 678, in llama_eval
return _lib.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)
OSError: exception: access violation reading 0x0000000000000000

@slaren
Member

slaren commented Jul 26, 2023

@omarelanis I cannot reproduce this with main, so I can only guess that it is something related to the python bindings that you are using. Can you give me a command line that reproduces this issue using the tools in this repository, like the main example?

@omarelanis
Author

@slaren firstly thank you so much for the quick response!

I've followed your advice and tested with a build from the main repo, and running it from the command line it loads everything correctly, including GPU support:

PS H:\AI_Projects\llamaCppCudaBuild\llama.cpp\build\bin\Release> .\main.exe -m H:\AI_Projects\Indexer_Plus_GPT\models\llama7b\llama-deus-7b-v3.ggmlv3.q4_0.bin -n -1 --color -r "User:" --in-prefix " " -e --prompt "User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser: how far is the sun?"
main: build = 914 (5488fb7)
main: seed = 1690388183
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama.cpp: loading model from H:\AI_Projects\Indexer_Plus_GPT\models\llama7b\llama-deus-7b-v3.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 3917.73 MB (+ 256.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/35 layers to GPU
llama_model_load_internal: total VRAM used: 288 MB
llama_new_context_with_model: kv self size = 256.00 MB

system_info: n_threads = 10 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

User: Hi
AI: Hello. I am an AI chatbot. Would you like to talk?
User: Sure!
AI: What would you like to talk about?
User: how far is the sun?
AI: The Sun is approximately 93 million miles away from Earth. [end of text]

llama_print_timings: load time = 680.95 ms
llama_print_timings: sample time = 2.78 ms / 17 runs ( 0.16 ms per token, 6110.71 tokens per second)
llama_print_timings: prompt eval time = 1084.22 ms / 48 tokens ( 22.59 ms per token, 44.27 tokens per second)
llama_print_timings: eval time = 2312.13 ms / 16 runs ( 144.51 ms per token, 6.92 tokens per second)
llama_print_timings: total time = 3405.05 ms

However, when installing via pip with the command pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir (which is the build being used with langchain), it doesn't include GPU support, and for some reason calling it through langchain fails with the error I logged in the original issue at the top.

For the most part the langchain code is just running this:

from langchain.llms import LlamaCpp
llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=True, n_gpu_layers=n_gpu_layers)
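
For reference, a minimal sketch that exercises the bindings directly, without langchain, would look something like this (assuming llama-cpp-python's high-level Llama class; the parameter values are guesses taken from the log above):

from llama_cpp import Llama

# Load the same GGML model with roughly the same context/batch settings as in the log above.
llm = Llama(
    model_path="models/llama7b/llama-deus-7b-v3.ggmlv3.q4_0.bin",
    n_ctx=2048,
    n_batch=512,
    n_gpu_layers=10,
)

# A single completion call; if this also crashes, langchain is not the culprit.
print(llm("What would you like to know about the policies?", max_tokens=64))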

Could you point me in the direction of what is likely causing this error? Is it langchain related?

@slaren
Member

slaren commented Jul 26, 2023

You could look into what parameters are being passed to llama_eval; this could be caused by a batch size (n_tokens) larger than n_batch, or a value of n_past higher than n_ctx. We should probably add some checks.
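
Roughly the kind of check I mean, on the binding side (a sketch only; checked_eval is a hypothetical wrapper, not an existing function in the bindings):

import llama_cpp  # low-level ctypes bindings from llama-cpp-python

def checked_eval(ctx, tokens, n_tokens, n_past, n_threads, n_ctx, n_batch):
    # Reject the two conditions that can overflow the scratch buffers:
    # a batch larger than the n_batch the context was created with,
    # and a position that runs past the end of the context window.
    if n_tokens > n_batch:
        raise ValueError(f"n_tokens={n_tokens} exceeds n_batch={n_batch}")
    if n_past + n_tokens > n_ctx:
        raise ValueError(f"n_past={n_past} + n_tokens={n_tokens} exceeds n_ctx={n_ctx}")
    return llama_cpp.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)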

@omarelanis
Author

I think I've found the issue: the workaround in the link I provided before (abetlen/llama-cpp-python#182) uses the latest version of llama_cpp_python, 0.1.77, which is what causes the error. Reverting to 0.1.68 fixes the issue, but it removes the BLAS CUDA support for the GPU.

Thank you for your help so far, much appreciated.
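
For anyone hitting the same thing, a quick way to confirm which binding version is actually loaded (a small sketch; pkg_resources ships with setuptools, and the distribution name here is an assumption based on the pip package name):

import pkg_resources

# Print the installed llama-cpp-python version before loading the model;
# 0.1.77 reproduces the crash here, while 0.1.68 does not.
print("llama-cpp-python:", pkg_resources.get_distribution("llama-cpp-python").version)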
