Bug: CUDA error: out of memory - Phi-3 Mini 128k prompted with 20k+ tokens on 4GB GPU #7885

Closed
@kozuch

Description

What happened?

I get a CUDA out-of-memory error when sending a large prompt (about 20k+ tokens) to the Phi-3 Mini 128k model on a laptop with an Nvidia A2000 with 4 GB of VRAM. At first Ollama uses about 3.3 GB of GPU RAM and 8 GB of CPU RAM, then GPU RAM usage slowly rises (3.4 GB, 3.5 GB, etc.) and after about a minute it throws the error, probably when GPU RAM is exhausted (3.9 GB is the last value shown in Task Manager). Inference does not return any tokens (as an answer) before crashing. Attaching the server log. Setup: Windows 11 + Ollama 0.1.42 + VS Code 1.90.0 + Continue plugin v0.8.40.

The expected behavior would be not to crash, and perhaps to reallocate memory somehow so that GPU memory does not get exhausted. I would like to disable GPU usage in Ollama entirely (to test CPU-only inference; I have 64 GB of system RAM), but I cannot find how to turn the GPU off (I recently saw there is an option for it, but I am unable to find it again).
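For reference, a minimal sketch of how I could test CPU-only inference, assuming Ollama's REST API on its default port and the documented "num_gpu" option (0 = offload no layers to the GPU); the model tag and prompt below are placeholders:

import json
import urllib.request

# Minimal sketch (assumption): ask Ollama to run the model with no layers
# offloaded to the GPU by setting the documented "num_gpu" option to 0.
url = "http://localhost:11434/api/generate"
payload = {
    "model": "phi3:mini-128k",  # placeholder tag; use whichever Phi-3 Mini 128k tag is installed
    "prompt": "Summarize this 20k-token document ...",  # placeholder prompt
    "stream": False,
    "options": {
        "num_gpu": 0,      # 0 layers on the GPU -> CPU-only inference
        "num_ctx": 32768,  # context window large enough for a 20k+ token prompt
    },
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])

Alternatively, hiding the GPU from the CUDA runtime (e.g. set CUDA_VISIBLE_DEVICES=-1 before starting the Ollama server) should also force CPU-only inference, though I have not verified this on Windows.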

Actual error:

CUDA error: out of memory
  current device: 0, in function alloc at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:375
  cuMemSetAccess(pool_addr + pool_size, reserve_size, &access, 1)
GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:100: !"CUDA error"

This was originally reported against Ollama; full logs are in the issue there: ollama/ollama#4985

Name and Version

See the linked ollama issue.

What operating system are you seeing the problem on?

Windows

Relevant log output

See the linked ollama issue.
