
[Feature] Non-blocking fastAPI server #183

Closed
tikikun opened this issue May 11, 2023 · 23 comments


tikikun commented May 11, 2023

Issue:
The server is currently blocking: it only accepts a new request after it has finished generating the response to the previous API call. This is not expected behaviour from an API server.

Suggestion:
Maybe enable multi-processing, since generation is CPU-bound.


tikikun commented May 11, 2023

Need an example of multiple workers using uvicorn.


abetlen commented May 11, 2023

This is possible but would require the user to load multiple copies of the same model (could share a cache though), will look into this.


tikikun commented May 11, 2023

This is possible but would require the user to load multiple copies of the same model (could share a cache though), will look into this.

I was able to make it async by specifying multiple workers. Will this reuse the same copy of the model under the hood? I'm not good with low-level stuff.

jmtatsch commented May 11, 2023 via email

With memory mapping, multiple llama.cpp instances are able to share the same weights. Should be possible for multiple parallel API requests too.


tikikun commented May 11, 2023

With memory mapping, multiple llama.cpp instances are able to share the same weights. Should be possible for multiple parallel API requests too.

Hi, what do you mean? Is this automatic? If you specify the same weights (ggml bin file) for each process, will it automatically recognize that they are the same?


tikikun commented May 11, 2023

I use a simple server like this

from llama_cpp.server.app import create_app, Settings
import tomllib

# Read the server settings (model path, n_ctx, ...) from a TOML config file
with open("config.toml", "rb") as f:
    settings = tomllib.load(f)

settings = Settings(**settings)
app = create_app(settings=settings)

where the settings are just n_ctx and the model path,

and run it with uvicorn like this:

uvicorn main:app --host 0.0.0.0 --port 3000 --workers 2

I do not see a significant increase in RAM usage and it can handle requests concurrently.
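
For reference, a programmatic equivalent of that CLI invocation (a minimal sketch, assuming the snippet above is saved as main.py so that "main:app" resolves; host, port and worker count are just the values used in this thread):

# Minimal sketch: start the same app from Python instead of the uvicorn CLI.
import uvicorn

if __name__ == "__main__":
    # With workers > 1 the app must be given as an import string so that each
    # worker process can re-import the module and build its own app instance.
    uvicorn.run("main:app", host="0.0.0.0", port=3000, workers=2)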


jmtatsch commented May 11, 2023 via email

@gjmulder

Hi, what do you mean? Is this automatic? If you specify the same weights (ggml bin file) for each process, will it automatically recognize that they are the same?

When you mmap a file from disk, the OS handles shared access to it through its page cache. So if two processes mmap the same file, the OS uses the same shared pages for each. You can see this in the SHR column when running top on Linux.
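
To illustrate (a minimal sketch, not project code; the model path is a placeholder): run this in two terminals on Linux and both processes report the mapped file under SHR in top, because the read-only pages are backed by the same page-cache entries.

# Minimal sketch: map a file read-only and keep it mapped so the shared,
# file-backed pages are visible in top's SHR column.
import mmap

MODEL_PATH = "models/ggml-model.bin"  # placeholder path

with open(MODEL_PATH, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    _ = mm[: min(mm.size(), 64 * 1024 * 1024)]  # touch some pages to fault them in
    input(f"Mapped {mm.size()} bytes read-only; check SHR in top, then press Enter")
    mm.close()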


abetlen commented May 11, 2023

I use a simple server like this [...] I do not see a significant increase in RAM usage and it can handle requests concurrently

@jmtatsch and @gjmulder are correct about the mmap behaviour. What doesn't work with this approach, and will need some extra thought, is the caching behaviour: while the weights are the same, the state internal to the llama.cpp context object is not, and neither is the cache shared between processes. Ideally we would have a shared prompt cache for multiple copies of the same model and (probably much more difficult) a way to match requests to the best available model to minimize token processing.


gjmulder commented May 11, 2023

@abetlen GPT-4 suggested:

  1. Using multiprocessing: the multiprocessing module has a Manager class that can be used to create a server process which holds Python objects and allows other processes to manipulate them using proxies (see the sketch after this list). This can serve as a shared in-memory object store, but it's not as efficient as Memcached.
  2. Using sqlite3 with an in-memory (:memory:) database.

EDIT: Removed shelve as it does disk I/O.
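
A minimal sketch of the Manager idea, assuming the uvicorn workers are started as independent processes so the shared dict has to be served over a local socket; the class, function names, address and authkey below are illustrative, not llama-cpp-python APIs:

# Minimal sketch of a Manager-backed shared prompt cache for multiple workers.
from multiprocessing.managers import BaseManager, DictProxy

_store = {}  # lives only in the cache-server process

class CacheManager(BaseManager):
    pass

def serve_cache(address=("127.0.0.1", 50000), authkey=b"llama-cache"):
    # Run once, in its own process, before starting the uvicorn workers.
    CacheManager.register("get_cache", callable=lambda: _store, proxytype=DictProxy)
    CacheManager(address=address, authkey=authkey).get_server().serve_forever()

def connect_cache(address=("127.0.0.1", 50000), authkey=b"llama-cache"):
    # Each uvicorn worker connects and receives a proxy to the same dict.
    CacheManager.register("get_cache", proxytype=DictProxy)
    manager = CacheManager(address=address, authkey=authkey)
    manager.connect()
    return manager.get_cache()

A worker would look a prompt up through the proxy before calling the model and write new completions back; the diskcache approach in #279 discussed further down gets the same cross-process sharing through the filesystem instead.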


tikikun commented May 11, 2023

@jmtatsch and @gjmulder are correct about the mmap behaviour. What doesn't work with this approach, and will need some extra thought, is the caching behaviour [...]

Hi, so what I understand here is that the weights themselves will be loaded once, but the inference cache will be duplicated? So in a sense this temporary approach is still somewhat usable, just not optimal?


abetlen commented May 11, 2023

@tikikun yup that's correct


tikikun commented May 11, 2023

@tikikun yup that's correct

So if we had a proper cache method for multiple processes, the server would be very usable and even production ready. Very interesting.

gjmulder added the enhancement, server, performance labels May 12, 2023
gjmulder pinned this issue May 26, 2023
@oceanplexian

I know this issue is a couple weeks old, but some way to handle parallel requests would help a lot.

For example, if your query returns a large number of tokens, but you want to stop generating, you have to wait for the previous generation to finish. A way to "cancel" inference would be fantastic, maybe even automatically if >1 request comes in and a special flag is set.


abetlen commented May 27, 2023

I've converted the route handlers to async for the server, which handles disconnects slightly better. The server still can't process multiple requests concurrently unless you set --workers > 1; however, this will multiply your RAM / VRAM requirements, as each worker spawns another process with its own memory space.
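
To make that concrete (a minimal sketch, not the actual llama_cpp.server implementation; the route shape and the blocking_generate stand-in are simplified assumptions):

# Minimal sketch: an async route that keeps the event loop responsive by running
# the blocking generation in a worker thread, serialized by a lock because there
# is only one shared model instance.
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model_lock = asyncio.Lock()  # one model => one generation at a time

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

def blocking_generate(prompt: str, max_tokens: int) -> dict:
    # Stand-in for the real llama_cpp call, e.g. llama(prompt, max_tokens=...).
    return {"choices": [{"text": prompt[:max_tokens]}]}

@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
    async with model_lock:
        # The event loop stays free to accept connections and notice disconnects
        # while generation runs in a thread.
        return await asyncio.to_thread(
            blocking_generate, request.prompt, request.max_tokens
        )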

@oceanplexian I've looked into this, unfortunately I don't think there's a way for uvicorn to cancel running requests like this.

@gjmulder

however this will multiply your RAM / VRAM requirements as each worker spawns another process with its own memory space.

Won't mmap allow at least the CPU RAM for the weights to be shared?


abetlen commented May 27, 2023

@gjmulder you're probably right actually, so yeah, that's not a bad solution. Additionally, #279 should allow multiple processes to easily share a cache via the filesystem.


gjmulder commented May 27, 2023

additionally #279 should allow multiple processes to easily share a cache via the filesystem

That would be sweet. Can we make the KV cache persistent on disk?

So we'd have:

  • mmap shared CPU RAM models
  • KV cache shared via disk

With a 7B model and sufficient VRAM (e.g. 11GB) you could run multiple instances on the GPU. You'd just have to tune n_gpu_layers so that together they don't exhaust the total VRAM.

With the 3B Open Llama model this is doable even with, say, 6GB of VRAM. 🤔


abetlen commented May 27, 2023

@gjmulder that's the idea, diskcache can keep the LlamaState stored on disk and then the server loads the state when there's a prefix match.
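
A rough sketch of how that could look (not the actual #279 implementation; the directory, key scheme and function names are assumptions): diskcache persists entries on disk and handles cross-process locking, so every worker sees states saved by the others.

# Minimal sketch: persist saved model state keyed by the evaluated prompt tokens
# and reload the longest cached prefix of a new prompt.
import diskcache

cache = diskcache.Cache("./llama_state_cache")  # placeholder directory

def save_state(prompt_tokens: tuple, state: bytes) -> None:
    cache[prompt_tokens] = state

def load_longest_prefix(prompt_tokens: tuple):
    # Linear scan from the longest prefix down, kept simple for the sketch.
    for end in range(len(prompt_tokens), 0, -1):
        key = prompt_tokens[:end]
        if key in cache:
            return key, cache[key]
    return None, None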


generalsvr commented Jun 1, 2023

Are you planning any updates in the near future? I am very interested in the ability to use this API as a drop-in replacement for OpenAI. With multiple concurrent connections and the Falcon LLM we could probably scale that to production.


gjmulder commented Jun 1, 2023

What makes falcon LLM special?

@generalsvr

What makes falcon LLM special?

Apache 2.0

@xx-zhang

Intercepting token output: how can this problem be solved currently? Can anyone suggest a state-of-the-art approach?

abetlen closed this as completed in 4c7cdcc Jul 7, 2023
abetlen unpinned this issue Jul 8, 2023