[Feature] Non-blocking fastAPI server #183
Comments
Need an example of multiple workers using uvicorn.
This is possible but would require the user to load multiple copies of the same model (they could share a cache though); will look into this.
I was able to make it handle requests concurrently by specifying multiple workers. Will this reuse the same copy of the model under the hood? I'm not good with low-level stuff.
With memory mapping, multiple llama.cpp instances are able to share the same weights. It should be possible for multiple parallel API requests too.
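To illustrate the mmap behaviour, here is a minimal sketch (not from this project) that loads the same model file in two processes with `use_mmap=True`; the model path is hypothetical, and the memory sharing comes from the OS page cache backing both mappings:

```python
# Sketch: two processes open the same GGML file with mmap enabled.
# The OS page cache backs both mappings, so the weights are only read
# into physical memory once (each process still has its own KV cache).
import multiprocessing as mp
from llama_cpp import Llama

MODEL_PATH = "./models/ggml-model-q4_0.bin"  # hypothetical path

def worker(prompt: str) -> None:
    llm = Llama(model_path=MODEL_PATH, use_mmap=True)
    out = llm(prompt, max_tokens=32)
    print(out["choices"][0]["text"])

if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(p,)) for p in ("Hello", "Bonjour")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```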
Hi, what do you mean? Is this automatic? If you specify the same weights (ggml bin file) for each process, will it automatically recognize that they are the same?
I use a simple server like this:

```python
from llama_cpp.server.app import create_app, Settings
import tomllib

with open("config.toml", "rb") as f:
    settings = tomllib.load(f)

settings = Settings(**settings)
app = create_app(settings=settings)
```

The settings are just n_ctx and the model name, and I run it with uvicorn like this:

```
uvicorn main:app --host 0.0.0.0 --port 3000 --workers 2
```

I do not see a significant increase in RAM usage and it can handle requests concurrently.
As long as the use_mmap parameter is set and the model file is the same, it could work.
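For illustration, a minimal sketch of enabling memory mapping explicitly when building the server app; the exact Settings field names (model, n_ctx, use_mmap) are assumptions and may differ between versions:

```python
# Minimal sketch: build the FastAPI app with mmap enabled so that
# multiple uvicorn workers can share the same model pages in memory.
# Field names are assumptions; check your llama-cpp-python version.
from llama_cpp.server.app import create_app, Settings

settings = Settings(
    model="./models/ggml-model-q4_0.bin",  # hypothetical path
    n_ctx=2048,
    use_mmap=True,  # weights are memory-mapped and shared between processes
)
app = create_app(settings=settings)

# Run with multiple workers, e.g.:
#   uvicorn main:app --host 0.0.0.0 --port 3000 --workers 2
```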
@jmtatsch and @gjmulder are correct about the mmap behaviour. What doesn't work with this approach, and will need some extra thought, is the caching behaviour: while the weights are the same, the state internal to the llama.cpp context object is not, and the cache isn't shared between processes either. Ideally we would have a shared prompt cache for multiple copies of the same model and (probably much more difficult) a way to match each request to the best available model to minimize token processing.
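For context, the built-in cache in llama-cpp-python is attached to the Llama instance inside each process, so with several uvicorn workers each one keeps its own cache. A minimal sketch of that per-process setup (the model path is hypothetical):

```python
# Sketch: each uvicorn worker builds its own Llama instance and its own
# in-memory cache, so a prompt cached in worker A does not help worker B.
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/ggml-model-q4_0.bin", use_mmap=True)
llm.set_cache(LlamaCache())  # per-process prompt/state cache, not shared

def complete(prompt: str) -> str:
    # Repeated prompts with a shared prefix are only fast within this process.
    out = llm(prompt, max_tokens=64)
    return out["choices"][0]["text"]
```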
@abetlen GPT-4 suggested:
EDIT: Removed shelve as it does disk I/O.
Hi, so what I understand here is that the weights themselves will be loaded once, but the inference cache will be duplicated? So in a sense this temporary approach will still be somewhat usable, just not optimal?
@tikikun yup, that's correct.
So if we had a proper cache method for multiple processes, the server would be very usable and even production-ready. Very interesting.
I know this issue is a couple of weeks old, but some way to handle parallel requests would help a lot. For example, if your query returns a large number of tokens but you want to stop generating, you have to wait for the previous generation to finish. A way to "cancel" inference would be fantastic, maybe even automatically if more than one request comes in and a special flag is set.
I've converted the route handlers to …
@oceanplexian I've looked into this; unfortunately I don't think there's a way for uvicorn to cancel running requests like this.
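As a partial workaround when streaming, the handler can check whether the client has disconnected between chunks and stop generating. A minimal sketch using Starlette's request.is_disconnected(); the route path, model path, and parameters are illustrative, not the project's actual server code:

```python
# Sketch: stop a streaming completion early if the client disconnects.
# This only breaks between generated chunks; it cannot interrupt a single
# long evaluation inside llama.cpp.
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="./models/ggml-model-q4_0.bin")  # hypothetical path

@app.post("/v1/completions")
async def completions(request: Request, prompt: str):
    async def stream():
        # Note: the synchronous llama.cpp call still blocks the event loop
        # while each chunk is generated; a real server would run it in a
        # thread pool.
        for chunk in llm(prompt, max_tokens=256, stream=True):
            if await request.is_disconnected():
                # Client gave up: stop generating instead of finishing
                # the whole completion.
                break
            yield chunk["choices"][0]["text"]

    return StreamingResponse(stream(), media_type="text/plain")
```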
That would be sweet. Can we make the KV cache persistent on disk? So we'd have: …
With a 7B model and sufficient VRAM (e.g. 11GB) you could run multiple instances on the GPU. You'd just have to tune … With the 3B Open Llama model this is doable even with, say, 6GB of VRAM. 🤔
@gjmulder that's the idea: diskcache can keep the LlamaState stored on disk, and then the server loads the state when there's a prefix match.
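A rough sketch of that idea, assuming Llama.save_state()/load_state() are available and the stored state is picklable; the lookup below is an exact prompt match for brevity, whereas the real goal is a longest-prefix match, and whether evaluated tokens are actually reused depends on the library version:

```python
# Sketch: persist the llama state to disk with diskcache and reload it
# when a new request matches a previously seen prompt. Multiple worker
# processes can open the same cache directory; diskcache handles locking.
import diskcache
from llama_cpp import Llama

cache = diskcache.Cache("./llama_state_cache")  # hypothetical shared directory
llm = Llama(model_path="./models/ggml-model-q4_0.bin", use_mmap=True)

def complete(prompt: str) -> str:
    state = cache.get(prompt)
    if state is not None:
        llm.load_state(state)  # restore previously evaluated context
    out = llm(prompt, max_tokens=128)
    cache.set(prompt, llm.save_state())  # make the state visible to other workers
    return out["choices"][0]["text"]
```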
Are you planning any updates in the near future? I am very interested in the ability to use this API as a drop-in replacement for OpenAI. With multiple concurrent connections and the Falcon LLM we could probably scale that to production.
What makes the Falcon LLM special?
Intercepting token output: how can this be solved currently? Can anyone give a SOTA idea?
Issue:
The server is currently blocking: it only accepts another request after it has finished generating the response to the previous API call, which is not what you would expect from an API server.
Suggestion:
Maybe enable multi-processing, since this is CPU-bound.