GitHub - kth8/llama-server: llama.cpp server + small language model in Docker container

llama.cpp server and a small language model bundled together in a Docker image for easy deployment, similar to a llamafile. Inference runs on the CPU and requires AVX2 support (Intel Haswell, AMD Excavator, or later generations). Override server settings with environment variables.

docker run -d --name llama1b --init -p 8001:8080/tcp ghcr.io/kth8/llama-server:llama-3.2-1b-instruct
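As a rough sketch of overriding server settings, the helper below wraps the command above and passes two settings as environment variables. It assumes the image hands `LLAMA_ARG_*` variables through to llama-server unchanged (llama.cpp's server reads variables of that form); the function name `run_llama`, the `DOCKER`, `CTX_SIZE` and `THREADS` shell variables, and the default values are all illustrative and do not come from this repository.

```shell
#!/bin/sh
# Hypothetical helper: start a container with a few llama-server settings overridden.
# Assumption: the image forwards LLAMA_ARG_* environment variables to llama-server.
run_llama() {
    name="$1"   # container name, e.g. llama1b
    port="$2"   # host port mapped to the container's 8080
    tag="$3"    # image tag, e.g. llama-3.2-1b-instruct

    # DOCKER can be set to another compatible CLI (for example podman).
    "${DOCKER:-docker}" run -d --name "$name" --init \
        -p "${port}:8080/tcp" \
        -e LLAMA_ARG_CTX_SIZE="${CTX_SIZE:-4096}" \
        -e LLAMA_ARG_THREADS="${THREADS:-4}" \
        "ghcr.io/kth8/llama-server:${tag}"
}
```

Usage: `run_llama llama1b 8001 llama-3.2-1b-instruct`, optionally prefixed with `CTX_SIZE=8192 THREADS=8` to change the illustrative defaults.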

Verify that the server is running by opening http://127.0.0.1:8001 in your web browser or by sending a request from the terminal:

curl http://127.0.0.1:8001/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'
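The model can take a few seconds to load after the container starts, so a request sent immediately may fail. One possible way to wait is to poll the `/health` endpoint that llama.cpp's server exposes, assuming the bundled build includes it. The function name `wait_for_server` and the default of 30 one-second attempts are illustrative choices.

```shell
#!/bin/sh
# Hypothetical helper: poll the server until it reports healthy or attempts run out.
# Assumption: the bundled llama-server answers GET /health with HTTP 200 once the model is loaded.
wait_for_server() {
    url="${1:-http://127.0.0.1:8001}"
    tries="${2:-30}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl exit non-zero on HTTP errors such as 503 while the model loads.
        if curl -fsS "$url/health" >/dev/null 2>&1; then
            echo "ready"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "not ready after $tries attempts" >&2
    return 1
}
```

Usage: `wait_for_server http://127.0.0.1:8001 60 && curl http://127.0.0.1:8001/v1/chat/completions ...` sends the chat request only after the server reports ready.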

Check if your CPU supports AVX2 on Linux:

grep -o 'avx2' /proc/cpuinfo
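The command above prints `avx2` once for every match in `/proc/cpuinfo`, so on a multi-core machine it produces many identical lines. A small variant that prints a single yes/no answer could look like the sketch below; the function name `has_avx2` and its optional file argument are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given cpuinfo file lists the avx2 flag.
has_avx2() {
    # -q: no output, exit status only; -w: match avx2 as a whole word.
    grep -qw 'avx2' "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_avx2; then
    echo "AVX2 supported"
else
    echo "AVX2 not found"
fi
```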

Available tags:

ghcr.io/kth8/llama-server:llama-3.2-1b-instruct

ghcr.io/kth8/llama-server:llama-3.2-3b-instruct

ghcr.io/kth8/llama-server:amd-olmo-1b-sft-dpo

ghcr.io/kth8/llama-server:qwen2.5-coder-0.5b-instruct

ghcr.io/kth8/llama-server:qwen2.5.1-coder-1.5b-instruct

ghcr.io/kth8/llama-server:qwen2.5-coder-3b-instruct

ghcr.io/kth8/llama-server:hermes-3-llama-3.2-3b

ghcr.io/kth8/llama-server:granite-3.1-1b-a400m-instruct

ghcr.io/kth8/llama-server:granite-3.1-3b-a800m-instruct

ghcr.io/kth8/llama-server:fastllama-3.2-1b-instruct

ghcr.io/kth8/llama-server:deepseek-r1-distill-qwen-1.5b

ghcr.io/kth8/llama-server:deepseek-r1-redistill-qwen-1.5b-v1.0

ghcr.io/kth8/llama-server:nvidia_aceinstruct-1.5b

ghcr.io/kth8/llama-server:agentica-org_deepscaler-1.5b-preview

ghcr.io/kth8/llama-server:microsoft_phi-4-mini-instruct

ghcr.io/kth8/llama-server:google_gemma-3-1b-it

ghcr.io/kth8/llama-server:all-hands_openhands-lm-1.5b-v0.1

ghcr.io/kth8/llama-server:google_gemma-3-4b-it

ghcr.io/kth8/llama-server:deepcogito_cogito-v1-preview-llama-3b

ghcr.io/kth8/llama-server:zyphra_zr1-1.5b

ghcr.io/kth8/llama-server:agentica-org_deepcoder-1.5b-preview

ghcr.io/kth8/llama-server:ibm-granite_granite-3.3-2b-instruct

ghcr.io/kth8/llama-server:qwen_qwen3-0.6b

ghcr.io/kth8/llama-server:qwen_qwen3-1.7b

ghcr.io/kth8/llama-server:microsoft_phi-4-mini-reasoning

ghcr.io/kth8/llama-server:baidu_ernie-4.5-0.3b-pt

ghcr.io/kth8/llama-server:huggingfacetb_smollm3-3b

ghcr.io/kth8/llama-server:menlo_lucy

ghcr.io/kth8/llama-server:menlo_jan-nano

ghcr.io/kth8/llama-server:nvidia_openreasoning-nemotron-1.5b

ghcr.io/kth8/llama-server:lgai-exaone_exaone-4.0-1.2b

ghcr.io/kth8/llama-server:qwen_qwen3-4b-instruct-2507

ghcr.io/kth8/llama-server:google_gemma-3-270m-it

ghcr.io/kth8/llama-server:janhq_jan-v1-4b

ghcr.io/kth8/llama-server:qwen_qwen3-4b-thinking-2507

All model GGUF files are provided by bartowski.

