
[Feature] Non-blocking fastAPI server #183

Closed
tikikun opened this issue May 11, 2023 · 23 comments


tikikun commented May 11, 2023

Issue:
The server is currently blocking: it only accepts a new request after it has finished generating the response to the previous API call. This is not expected behaviour from an API server.

Suggestion:
Maybe enable multi-processing, since generation is CPU-bound.


tikikun commented May 11, 2023

Need an example of multiple workers using uvicorn.


abetlen commented May 11, 2023

This is possible but would require the user to load multiple copies of the same model (could share a cache though), will look into this.


tikikun commented May 11, 2023

This is possible but would require the user to load multiple copies of the same model (could share a cache though), will look into this.

I was able to make it async by specifying multiple workers. Will this reuse the same copy of the model under the hood? I'm not good with low-level stuff.

jmtatsch commented May 11, 2023 via email

With memory mapping, multiple llama.cpp instances are able to share the same weights. Should be possible for multiple parallel API requests too.


tikikun commented May 11, 2023

With memory mapping, multiple llama.cpp instances are able to share the same weights. Should be possible for multiple parallel API requests too.

Hi, what do you mean? Is this automatic? If you specify the same weights (ggml bin file) for each process, will it automatically recognize that they are the same?


tikikun commented May 11, 2023

I use a simple server like this

from llama_cpp.server.app import create_app, Settings
import tomllib

# Read the server settings (model path, n_ctx, ...) from a TOML config file
with open("config.toml", "rb") as f:
    settings = tomllib.load(f)

settings = Settings(**settings)
app = create_app(settings=settings)

where the settings are just n_ctx and the model path,

and run it with uvicorn like this:

uvicorn main:app --host 0.0.0.0 --port 3000 --workers 2

I do not see a significant increase in RAM usage and it can handle requests concurrently.
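
For reference, a programmatic equivalent of that CLI invocation (a minimal sketch, assuming the snippet above is saved as main.py so that "main:app" resolves; host, port and worker count are just the values used in this thread):

# Minimal sketch: start the same app from Python instead of the uvicorn CLI.
import uvicorn

if __name__ == "__main__":
    # With workers > 1 the app must be given as an import string so that each
    # worker process can re-import the module and build its own app instance.
    uvicorn.run("main:app", host="0.0.0.0", port=3000, workers=2)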


jmtatsch commented May 11, 2023 via email

@gjmulder

Hi, what do you mean? Is this automatic? If you specify the same weights (ggml bin file) for each process, will it automatically recognize that they are the same?

When you mmap a file from disk, the OS handles shared access to it through its page cache. So if two processes mmap the same file, the OS uses the same shared pages for each. You can see this in the SHR column when running top on Linux.
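
To illustrate (a minimal sketch, not project code; the model path is a placeholder): run this in two terminals on Linux and both processes report the mapped file under SHR in top, because the read-only pages are backed by the same page-cache entries.

# Minimal sketch: map a file read-only and keep it mapped so the shared,
# file-backed pages are visible in top's SHR column.
import mmap

MODEL_PATH = "models/ggml-model.bin"  # placeholder path

with open(MODEL_PATH, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    _ = mm[: min(mm.size(), 64 * 1024 * 1024)]  # touch some pages to fault them in
    input(f"Mapped {mm.size()} bytes read-only; check SHR in top, then press Enter")
    mm.close()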


abetlen commented May 11, 2023

I use a simple server like this [...] I do not see a significant increase in RAM usage and it can handle requests concurrently

@jmtatsch and @gjmulder are correct about the mmap behaviour. What doesn't work with this approach, and will need some extra thought, is the caching behaviour: while the weights are the same, the state internal to the llama.cpp context object is not, and neither is the cache shared between processes. Ideally we would have a shared prompt cache for multiple copies of the same model and (probably much more difficult) a way to match requests to the best available model to minimize token processing.


gjmulder commented May 11, 2023

@abetlen GPT-4 suggested:

  1. Using multiprocessing: the multiprocessing module has a Manager class that can be used to create a server process which holds Python objects and allows other processes to manipulate them using proxies (see the sketch after this list). This can serve as a shared in-memory object store, but it's not as efficient as Memcached.
  2. Using sqlite3 with an in-memory (:memory:) database.

EDIT: Removed shelve as it does disk I/O.
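
A minimal sketch of the Manager idea, assuming the uvicorn workers are started as independent processes so the shared dict has to be served over a local socket; the class, function names, address and authkey below are illustrative, not llama-cpp-python APIs:

# Minimal sketch of a Manager-backed shared prompt cache for multiple workers.
from multiprocessing.managers import BaseManager, DictProxy

_store = {}  # lives only in the cache-server process

class CacheManager(BaseManager):
    pass

def serve_cache(address=("127.0.0.1", 50000), authkey=b"llama-cache"):
    # Run once, in its own process, before starting the uvicorn workers.
    CacheManager.register("get_cache", callable=lambda: _store, proxytype=DictProxy)
    CacheManager(address=address, authkey=authkey).get_server().serve_forever()

def connect_cache(address=("127.0.0.1", 50000), authkey=b"llama-cache"):
    # Each uvicorn worker connects and receives a proxy to the same dict.
    CacheManager.register("get_cache", proxytype=DictProxy)
    manager = CacheManager(address=address, authkey=authkey)
    manager.connect()
    return manager.get_cache()

A worker would look a prompt up through the proxy before calling the model and write new completions back; the diskcache approach in #279 discussed further down gets the same cross-process sharing through the filesystem instead.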


tikikun commented May 11, 2023

@jmtatsch and @gjmulder are correct about the mmap behaviour. What doesn't work with this approach, and will need some extra thought, is the caching behaviour [...]

Hi, so what I understand here is that the weights themselves will be loaded once, but the inference cache will be duplicated? So in a sense this temporary approach is still somewhat usable, just not optimal?


abetlen commented May 11, 2023

@tikikun yup that's correct


tikikun commented May 11, 2023

@tikikun yup that's correct

So if we had a proper cache method for multiple processes, the server would be very usable and even production ready. Very interesting.

gjmulder added the enhancement, server, performance labels May 12, 2023
gjmulder pinned this issue May 26, 2023
@oceanplexian

I know this issue is a couple weeks old, but some way to handle parallel requests would help a lot.

For example, if your query returns a large number of tokens, but you want to stop generating, you have to wait for the previous generation to finish. A way to "cancel" inference would be fantastic, maybe even automatically if >1 request comes in and a special flag is set.


abetlen commented May 27, 2023

I've converted the route handlers to async for the server, which handles disconnects slightly better. The server still can't process multiple requests concurrently unless you set --workers > 1; however, this will multiply your RAM / VRAM requirements, as each worker spawns another process with its own memory space.
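
To make that concrete (a minimal sketch, not the actual llama_cpp.server implementation; the route shape and the blocking_generate stand-in are simplified assumptions):

# Minimal sketch: an async route that keeps the event loop responsive by running
# the blocking generation in a worker thread, serialized by a lock because there
# is only one shared model instance.
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model_lock = asyncio.Lock()  # one model => one generation at a time

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

def blocking_generate(prompt: str, max_tokens: int) -> dict:
    # Stand-in for the real llama_cpp call, e.g. llama(prompt, max_tokens=...).
    return {"choices": [{"text": prompt[:max_tokens]}]}

@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
    async with model_lock:
        # The event loop stays free to accept connections and notice disconnects
        # while generation runs in a thread.
        return await asyncio.to_thread(
            blocking_generate, request.prompt, request.max_tokens
        )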

@oceanplexian I've looked into this, unfortunately I don't think there's a way for uvicorn to cancel running requests like this.

@gjmulder

however this will multiply your RAM / VRAM requirements as each worker spawns another process with its own memory space.

Won't mmap allow at least the CPU RAM for the weights to be shared?


abetlen commented May 27, 2023

@gjmulder you're probably right actually, so yeah, that's not a bad solution. Additionally, #279 should allow multiple processes to easily share a cache via the filesystem.


gjmulder commented May 27, 2023

additionally #279 should allow multiple processes to easily share a cache via the filesystem

That would be sweet. Can we make the KV cache persistent on disk?

So we'd have:

  • mmap shared CPU RAM models
  • KV cache shared via disk

With a 7B model and sufficient VRAM (e.g. 11GB) you could run multiple instances on the GPU. You'd just have to tune n_gpu_layers so that together they don't exhaust the total VRAM.

With the 3B Open Llama model this is doable even with, say, 6GB of VRAM. 🤔


abetlen commented May 27, 2023

@gjmulder that's the idea, diskcache can keep the LlamaState stored on disk and then the server loads the state when there's a prefix match.
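
A rough sketch of how that could look (not the actual #279 implementation; the directory, key scheme and function names are assumptions): diskcache persists entries on disk and handles cross-process locking, so every worker sees states saved by the others.

# Minimal sketch: persist saved model state keyed by the evaluated prompt tokens
# and reload the longest cached prefix of a new prompt.
import diskcache

cache = diskcache.Cache("./llama_state_cache")  # placeholder directory

def save_state(prompt_tokens: tuple, state: bytes) -> None:
    cache[prompt_tokens] = state

def load_longest_prefix(prompt_tokens: tuple):
    # Linear scan from the longest prefix down, kept simple for the sketch.
    for end in range(len(prompt_tokens), 0, -1):
        key = prompt_tokens[:end]
        if key in cache:
            return key, cache[key]
    return None, None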


generalsvr commented Jun 1, 2023

Are you planning any updates in the near future? I am very interested in the ability to use this API as a drop-in replacement for OpenAI. With multiple concurrent connections and the Falcon LLM we could probably scale that to production.


gjmulder commented Jun 1, 2023

What makes falcon LLM special?

@generalsvr

What makes falcon LLM special?

Apache 2.0

@xx-zhang

Intercepting token output: how can this problem be solved currently? Can anyone suggest a state-of-the-art approach?

abetlen closed this as completed in 4c7cdcc Jul 7, 2023
abetlen unpinned this issue Jul 8, 2023