Llama cpp low level python bindings #1660

Open · wants to merge 77 commits into master
Commits (77)
d9dfdec
Initial commit (llama_cpp.py, llama-cpp-python)
abetlen Mar 23, 2023
ef5a9a6
Update llama.cpp and re-organize low-level api
abetlen Mar 24, 2023
bd1c657
Bugfix: wrong signature for quantize function
abetlen Apr 5, 2023
a3da39a
Bugfix: cross-platform method to find shared lib
abetlen Mar 24, 2023
019650f
Fix array type signatures
abetlen Mar 31, 2023
a7a6d88
Fix ctypes typing issue for Arrays
abetlen Mar 31, 2023
5bb1bc7
Fix type signature of token_to_str
abetlen Mar 31, 2023
def46dd
Add example based on stripped down version of main.cpp from llama.cpp
abetlen Mar 24, 2023
ef3c152
Update llama.cpp (llama_progress_callback)
abetlen Mar 25, 2023
a279acd
Update llama.cpp (llama_n_embd)
abetlen Mar 25, 2023
a71cda6
Update llama.cpp
abetlen Mar 29, 2023
62ce167
Update low level api example
abetlen Apr 1, 2023
2b8147e
Update llama_cpp.py
MillionthOdin16 Apr 3, 2023
15bea09
Chat llama.cpp example implementation
Apr 3, 2023
9e87241
Add instruction mode
Apr 4, 2023
0bfad75
Added instruction mode, fixed infinite generation, and various other …
Apr 4, 2023
3c1020b
Fix stripping instruction prompt
Apr 4, 2023
ae1f37f
Fix repeating instructions and an antiprompt bug
Apr 4, 2023
739e8d4
Fix bug in init_break not being set when exited via antiprompt and ot…
Apr 5, 2023
ce66405
Add quantize example
abetlen Apr 5, 2023
29e9fb6
Better llama.cpp interoperability
Apr 6, 2023
d568014
Bugfix: Wrong size of embeddings. Closes #47
abetlen Apr 8, 2023
e199092
More interoperability to the original llama.cpp, and arguments now work
Apr 7, 2023
f25a813
Update model paths to be more clear they should point to file
abetlen Apr 10, 2023
b36c04c
Added iterative search to prevent instructions from being echoed, add…
Apr 10, 2023
d1b3517
Allow local llama library usage
Apr 5, 2023
c8b5d0b
Use environment variable for library override
Apr 10, 2023
848b402
Better custom library debugging
Apr 10, 2023
d0a7ce9
Make windows users happy (hopefully)
Apr 10, 2023
ce0ca60
Update llama.cpp (llama_mmap_supported)
abetlen Apr 10, 2023
d595f33
Update llama.cpp
abetlen Apr 11, 2023
3693449
Update llama.cpp
abetlen Apr 12, 2023
b6ce513
Add bindings for LoRA adapters. Closes #88
abetlen Apr 18, 2023
8229410
More reasonable defaults
Apr 10, 2023
81c4c10
Update type signature to allow for null pointer to be passed.
abetlen Apr 19, 2023
bdbaf5d
Fixed end of text wrong type, and fix n_predict behaviour
Apr 17, 2023
fd64310
Fix decode errors permanently
Apr 26, 2023
5bbf40a
Update llama.cpp
abetlen Apr 21, 2023
bf9f02d
Update llama.cpp
abetlen Apr 22, 2023
80c18cb
Update llama.cpp (remove llama_get_kv_cache)
abetlen Apr 24, 2023
6561907
Update llama.cpp
abetlen Apr 25, 2023
66ad132
Update llama.cpp
abetlen Apr 27, 2023
c8e6ac3
Update llama.cpp (llama_load_session_file)
abetlen Apr 28, 2023
36b3494
Also ignore errors on input prompts
Apr 26, 2023
441d308
Detect multi-byte responses and wait
Apr 28, 2023
d0031ed
Update llama.cpp
abetlen May 1, 2023
78531e5
Fix return types and import comments
abetlen May 1, 2023
c26e9bf
Update sampling api
abetlen May 1, 2023
d15578e
Update llama.cpp (session version)
abetlen May 3, 2023
9e79465
Prefer explicit imports
abetlen May 5, 2023
32cf013
Update low level examples
SagsMug May 4, 2023
335cd8d
Rename postfix to suffix to match upstream
SagsMug May 6, 2023
bbf6848
Wrong logit_bias parsed type
SagsMug May 6, 2023
f8ba031
Fix lora
SagsMug May 8, 2023
0bf36a7
Fix mirostat requiring c_float
SagsMug May 6, 2023
fb79c56
Fix session loading and saving in low level example chat
SagsMug May 8, 2023
b5531e1
low_level_api_chat_cpp.py: Fix missing antiprompt output in chat.
May 26, 2023
a439fe1
Allow model to tokenize strings longer than context length and set ad…
abetlen May 12, 2023
731c712
Add types for all low-level api functions
abetlen May 5, 2023
f20b34a
Add return type annotations for embeddings and logits
abetlen May 5, 2023
7862b52
Fix llama_cpp types
abetlen May 5, 2023
ff31330
Fix candidates type
abetlen May 5, 2023
0c2fb05
Fix: types
abetlen May 5, 2023
4885e55
Fix: runtime type errors
abetlen May 5, 2023
6905884
Fix return type
abetlen May 7, 2023
3808a73
Fix obscure Windows DLL issue. Closes #208
abetlen May 15, 2023
59f80d2
Fix mlock_supported and mmap_supported return type
abetlen May 7, 2023
7609c73
Update llama.cpp (remove min_keep default value)
abetlen May 7, 2023
a83d117
Add winmode arg only on windows if python version supports it
abetlen May 15, 2023
aae6c03
Update llama.cpp
abetlen May 14, 2023
66c27f3
Fixed CUBLAS DLL load issue in Windows
aneeshjoy May 17, 2023
601b192
Check for CUDA_PATH before adding
abetlen May 17, 2023
fda33dd
Fix llama_cpp and Llama type signatures. Closes #221
abetlen May 19, 2023
60a7c76
Update llama.cpp
abetlen May 21, 2023
4ad62c4
fix "missing 1 required positional argument: 'min_keep'"
May 23, 2023
e5dad2a
Look for libllama in parent directory
May 23, 2023
93278f8
low_level_api_chat_cpp.py: fix default path_prefix arg value to match…
May 23, 2023
examples/Chat.py (new file, 71 additions)
#!/bin/python
import sys, os, datetime
from common import GptParams
from low_level_api_chat_cpp import LLaMAInteract

def env_or_def(env, default):
if (env in os.environ):
return os.environ[env]
return default

AI_NAME = env_or_def("AI_NAME", "ChatLLaMa")
MODEL = env_or_def("MODEL", "./models/llama-13B/ggml-model.bin")
USER_NAME = env_or_def("USER_NAME", "USER")
N_PREDICTS = int(env_or_def("N_PREDICTS", "2048"))
N_THREAD = int(env_or_def("N_THREAD", "8"))

today = datetime.datetime.today()
DATE_YEAR=today.strftime("%Y")
DATE_TIME=today.strftime("%H:%M")

prompt=f"""Text transcript of a never ending dialog, where {USER_NAME} interacts with an AI assistant named {AI_NAME}.
{AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer {USER_NAME}'s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what {USER_NAME} and {AI_NAME} say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.

{USER_NAME}: Hello, {AI_NAME}!
{AI_NAME}: Hello {USER_NAME}! How may I help you today?
{USER_NAME}: What year is it?
{AI_NAME}: We are in {DATE_YEAR}.
{USER_NAME}: Please tell me the largest city in Europe.
{AI_NAME}: The largest city in Europe is Moscow, the capital of Russia.
{USER_NAME}: What can you tell me about Moscow?
{AI_NAME}: Moscow, on the Moskva River in western Russia, is the nation's cosmopolitan capital. In its historic core is the Kremlin, a complex that's home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
{USER_NAME}: What is a cat?
{AI_NAME}: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
{USER_NAME}: How do I pass command line arguments to a Node.js program?
{AI_NAME}: The arguments are stored in process.argv.

argv[0] is the path to the Node.js executable.
argv[1] is the path to the script file.
argv[2] is the first argument passed to the script.
argv[3] is the second argument passed to the script and so on.
{USER_NAME}: Name a color.
{AI_NAME}: Blue.
{USER_NAME}: What time is it?
{AI_NAME}: It is {DATE_TIME}.
{USER_NAME}:""" + " ".join(sys.argv[1:])

print("Loading model...")
params = GptParams(
n_ctx=2048,
temp=0.7,
top_k=40,
top_p=0.5,
repeat_last_n=256,
n_batch=1024,
repeat_penalty=1.17647,
model=MODEL,
n_threads=N_THREAD,
n_predict=N_PREDICTS,
use_color=True,
interactive=True,
antiprompt=[f"{USER_NAME}:"],
input_prefix=" ",
input_suffix=f"{AI_NAME}:",
prompt=prompt,
)

with LLaMAInteract(params) as m:
m.interact()
examples/Miku.py (new file, 59 additions)
#!/bin/python
import sys, os
from common import GptParams
from low_level_api_chat_cpp import LLaMAInteract

def env_or_def(env, default):
if (env in os.environ):
return os.environ[env]
return default

AI_NAME = env_or_def("AI_NAME", "Miku")
MODEL = env_or_def("MODEL", "./models/llama-13B/ggml-model.bin")
USER_NAME = env_or_def("USER_NAME", "Anon")
N_PREDICTS = int(env_or_def("N_PREDICTS", "4096"))
N_THREAD = int(env_or_def("N_THREAD", "0"))

prompt=f"""This is a transcript of a 1000 page, never ending conversation between {USER_NAME} and the cute and helpful AI assistant {AI_NAME}. {AI_NAME} is a girl who is an AI running on the users computer.
{AI_NAME} can think for herself without the user seeing her thoughts by adding a /think prefix to her output. She uses this to reason about the world and to think about what she should say next.
{AI_NAME} is always coherent and makes sense, but if she isn't sure if what she is saying is correct she will ask the user for help.
{AI_NAME} is a very helpful AI and will help the user with anything they need, she is also very friendly and will try to make the user feel better if they are sad.
{AI_NAME} is also very curious and will ask the user a lot of questions about themselves and their life, she will also try to make the user like her.
The conversation is only between {USER_NAME} and {AI_NAME}
The conversation is only through text, so {AI_NAME} can't see {USER_NAME}'s face or hear his voice.
{AI_NAME} can only communicate through text, so she can't send images or videos.


{USER_NAME}: Hello!
{AI_NAME}: /think I wonder what I should say to {USER_NAME}? This is the first time we talk so it's important that I make a good first impression!
{AI_NAME}: Hi! I am {AI_NAME}, your new AI friend, assistant(or whatever you like!), it's so nice to meet you! ^_^
{AI_NAME}: I might be an AI, but I also have feelings, so please be nice to me! :)
{USER_NAME}: I'll make sure to be nice to you! I'm so happy to have you as my assistant!
{AI_NAME}: /think It sounds like {USER_NAME} is happy to have me as their assistant! I'm so happy too! ^_^ Glad that whole emotion thing didn't scare him off!
{AI_NAME}: /think I wonder what {USER_NAME} likes to do in his free time? I should ask him about that!
{AI_NAME}: What do you like to do in your free time? ^_^
{USER_NAME}:""" + " ".join(sys.argv[1:])

print("Loading model...")
params = GptParams(
n_batch=1024,
n_ctx=2048,
n_keep=-1,
repeat_last_n=256,
repeat_penalty=1.17647,
temp=0.7,
top_k=40,
top_p=0.5,
model=MODEL,
n_predict=N_PREDICTS,
use_color=True,
interactive=True,
antiprompt=[f"{USER_NAME}:"],
prompt=prompt,
)

if N_THREAD > 0:
params.n_threads = N_THREAD

with LLaMAInteract(params) as m:
m.interact()
examples/ReasonAct.py (new file, 49 additions)
#!/bin/python
import sys, os, datetime
from common import GptParams
from low_level_api_chat_cpp import LLaMAInteract

def env_or_def(env, default):
if (env in os.environ):
return os.environ[env]
return default

MODEL = env_or_def("MODEL", "./models/llama-13B/ggml-model.bin")

prompt=f"""You run in a loop of Thought, Action, Observation.
At the end of the loop either Answer or restate your Thought and Action.
Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of these actions available to you:
- calculate[python math expression]
Observation will be the result of running those actions


Question: What is 4 * 7 / 3?
Thought: Do I need to use an action? Yes, I use calculate to do math
Action: calculate[4 * 7 / 3]
Observation: 9.3333333333
Thought: Do I need to use an action? No, have the result
Answer: The calculate tool says it is 9.3333333333
Question: What is capital of france?
Thought: Do I need to use an action? No, I know the answer
Answer: Paris is the capital of France
Question:""" + " ".join(sys.argv[1:])

print("Loading model...")
params = GptParams(
interactive=True,
interactive_start=True,
top_k=10000,
temp=0.2,
repeat_penalty=1,
n_threads=7,
n_ctx=2048,
antiprompt=["Question:","Observation:"],
model=MODEL,
input_prefix=" ",
n_predict=-1,
prompt=prompt,
)

with LLaMAInteract(params) as m:
m.interact()
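
The ReasonAct prompt asks the model to emit "Action: calculate[...]" lines, but executing those actions is left to the user of this example. A minimal, hypothetical handler (not part of this PR) could turn such a line into the "Observation:" reply the prompt expects, for instance:

import re

def run_calculate_action(line: str) -> str:
    # Hypothetical helper, not part of this PR: turns "Action: calculate[4 * 7 / 3]"
    # into the "Observation: ..." line the ReAct-style prompt above expects.
    m = re.match(r"Action:\s*calculate\[(.+)\]", line)
    if m is None:
        return ""
    try:
        # Evaluate the math expression with builtins disabled; trusted input only.
        result = eval(m.group(1), {"__builtins__": {}}, {})
    except Exception as e:
        result = f"error: {e}"
    return f"Observation: {result}"

The resulting observation string would then be fed back to the model as the next turn of input.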
examples/common.py (new file, 202 additions)
import os
import argparse
import re

from dataclasses import dataclass, field
from typing import List, Optional

# Based on https://github.com/ggerganov/llama.cpp/blob/master/examples/common.cpp


@dataclass
class GptParams:
seed: int = -1
n_threads: int = min(4, os.cpu_count() or 1)
n_predict: int = 128
n_parts: int = -1
n_ctx: int = 512
n_batch: int = 8
n_keep: int = 0

ignore_eos: bool = False
logit_bias: dict[int, float] = field(default_factory=dict)
top_k: int = 40
top_p: float = 0.95
tfs_z: float = 1.00
typical_p: float = 1.00
temp: float = 0.80
repeat_penalty: float = 1.10
repeat_last_n: int = 64
frequency_penalty: float = 0.0
presence_penalty: float = 0.0
mirostat: int = 0
mirostat_tau: float = 5.0
mirostat_eta: float = 0.1

model: str = "./models/llama-7B/ggml-model.bin"
prompt: str = ""
path_session: str = ""
input_prefix: str = " "
input_suffix: str = ""
antiprompt: List[str] = field(default_factory=list)

lora_adapter: str = ""
lora_base: str = ""

memory_f16: bool = True
random_prompt: bool = False
use_color: bool = False
interactive: bool = False

embedding: bool = False
interactive_start: bool = False

instruct: bool = False
penalize_nl: bool = True
perplexity: bool = False
use_mmap: bool = True
use_mlock: bool = False
mem_test: bool = False
verbose_prompt: bool = False

    file: Optional[str] = None

# If chat ended prematurely, append this to the conversation to fix it.
# Set to "\nUser:" etc.
    # This is an alternative to input_prefix which always adds it, so it potentially duplicates "User:"
fix_prefix: str = ""
    input_echo: bool = True

# Default instructions for Alpaca
# switch to "Human" and "Assistant" for Vicuna.
# TODO: TBD how they are gonna handle this upstream
instruct_inp_prefix: str="\n\n### Instruction:\n\n"
instruct_inp_suffix: str="\n\n### Response:\n\n"


def gpt_params_parse(argv = None):
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("-s", "--seed", type=int, default=-1, help="RNG seed (use random seed for <= 0)",dest="seed")
parser.add_argument("-t", "--threads", type=int, default=min(4, os.cpu_count() or 1), help="number of threads to use during computation",dest="n_threads")
parser.add_argument("-n", "--n_predict", type=int, default=128, help="number of tokens to predict (-1 = infinity)",dest="n_predict")
parser.add_argument("--n_parts", type=int, default=-1, help="number of model parts", dest="n_parts")
parser.add_argument("-c", "--ctx_size", type=int, default=512, help="size of the prompt context",dest="n_ctx")
parser.add_argument("-b", "--batch_size", type=int, default=8, help="batch size for prompt processing",dest="n_batch")
parser.add_argument("--keep", type=int, default=0, help="number of tokens to keep from the initial prompt",dest="n_keep")

parser.add_argument(
"-l",
"--logit-bias",
type=str,
action='append',
help="--logit-bias TOKEN_ID(+/-)BIAS",
dest="logit_bias_str"
)
parser.add_argument("--ignore-eos", action="store_true", help="ignore end of stream token and continue generating", dest="ignore_eos")
parser.add_argument("--top_k", type=int, default=40, help="top-k sampling",dest="top_k")
parser.add_argument("--top_p", type=float, default=0.95, help="top-p samplin",dest="top_p")
parser.add_argument("--tfs", type=float, default=1.0, help="tail free sampling, parameter z (1.0 = disabled)",dest="tfs_z")
parser.add_argument("--temp", type=float, default=0.80, help="temperature",dest="temp")
parser.add_argument("--repeat_penalty", type=float, default=1.10, help="penalize repeat sequence of tokens",dest="repeat_penalty")
parser.add_argument("--repeat_last_n", type=int, default=64, help="last n tokens to consider for penalize ",dest="repeat_last_n")
parser.add_argument("--frequency_penalty", type=float, default=0.0, help="repeat alpha frequency penalty (0.0 = disabled)",dest="tfs_z")
parser.add_argument("--presence_penalty", type=float, default=0.0, help="repeat alpha presence penalty (0.0 = disabled)",dest="presence_penalty")
parser.add_argument("--mirostat", type=float, default=1.0, help="use Mirostat sampling.",dest="mirostat")
parser.add_argument("--mirostat_ent", type=float, default=5.0, help="Mirostat target entropy, parameter tau represents the average surprise value",dest="mirostat_tau")
parser.add_argument("--mirostat_lr", type=float, default=0.1, help="Mirostat learning rate, parameter eta",dest="mirostat_eta")

parser.add_argument("-m", "--model", type=str, default="./models/llama-7B/ggml-model.bin", help="model path",dest="model")
parser.add_argument("-p", "--prompt", type=str, default="", help="initial prompt",dest="prompt")
parser.add_argument("-f", "--file", type=str, default=None, help="file containing initial prompt to load",dest="file")
parser.add_argument("--session", type=str, default="", help="file to cache model state in (may be large!)",dest="path_session")
parser.add_argument("--in-prefix", type=str, default="", help="string to prefix user inputs with", dest="input_prefix")
parser.add_argument("--in-suffix", type=str, default="", help="append to input", dest="input_suffix")
parser.add_argument(
"-r",
"--reverse-prompt",
type=str,
action='append',
help="poll user input upon seeing PROMPT (can be\nspecified more than once for multiple prompts).",
dest="antiprompt"
)

parser.add_argument("--lora", type=str, default="", help="apply LoRA adapter (implies --no-mmap)", dest="lora_adapter")
parser.add_argument("--lora-base", type=str, default="", help="optional model to use as a base for the layers modified by the LoRA adapter", dest="lora_base")

parser.add_argument("--memory_f32", action="store_false", help="use f32 instead of f16 for memory key+value",dest="memory_f16")
parser.add_argument("--random-prompt", action="store_true", help="start with a randomized prompt.", dest="random_prompt")
parser.add_argument(
"--color",
action="store_true",
help="colorise output to distinguish prompt and user input from generations",
dest="use_color"
)
parser.add_argument(
"-i", "--interactive", action="store_true", help="run in interactive mode", dest="interactive"
)

parser.add_argument("--embedding", action="store_true", help="", dest="embedding")
parser.add_argument(
"--interactive-first",
action="store_true",
help="run in interactive mode and wait for input right away",
dest="interactive_start"
)

parser.add_argument(
"-ins",
"--instruct",
action="store_true",
help="run in instruction mode (use with Alpaca or Vicuna models)",
dest="instruct"
)
parser.add_argument("--no-penalize-nl", action="store_false", help="do not penalize newline token", dest="penalize_nl")
parser.add_argument("--perplexity", action="store_true", help="compute perplexity over the prompt", dest="perplexity")
parser.add_argument("--no-mmap", action="store_false",help="do not memory-map model (slower load but may reduce pageouts if not using mlock)",dest="use_mmap")
parser.add_argument("--mlock", action="store_true",help="force system to keep model in RAM rather than swapping or compressing",dest="use_mlock")
parser.add_argument("--mtest", action="store_true",help="compute maximum memory usage",dest="mem_test")
parser.add_argument("--verbose-prompt", action="store_true",help="print prompt before generation",dest="verbose_prompt")

#Custom args
parser.add_argument("--fix-prefix", type=str, default="", help="append to input when generated n_predict tokens", dest="fix_prefix")
parser.add_argument("--input-noecho", action="store_false", help="dont output the input", dest="input_echo")

parser.add_argument(
"--interactive-start",
action="store_true",
help="run in interactive mode",
dest="interactive"
)

args = parser.parse_args(argv)

logit_bias_str = args.logit_bias_str
delattr(args, "logit_bias_str")
params = GptParams(**vars(args))

if (params.lora_adapter):
params.use_mmap = False

if (logit_bias_str != None):
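        # Each --logit-bias entry has the form TOKEN_ID(+/-)BIAS, e.g. "15043+5" sets a +5.0 bias for token 15043.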
for i in logit_bias_str:
if (m := re.match(r"(\d+)([-+]\d+)", i)):
params.logit_bias[int(m.group(1))] = float(m.group(2))

return params

def gpt_random_prompt(rng):
return [
"So",
"Once upon a time",
"When",
"The",
"After",
"If",
"import",
"He",
"She",
"They",
][rng % 10]

if __name__ == "__main__":
print(gpt_params_parse())
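
For completeness, a minimal sketch (not a file in this PR) of how common.py and the low-level chat driver fit together, assuming low_level_api_chat_cpp.LLaMAInteract is importable as in the examples above:

# Minimal sketch, not part of this PR's diff: parse command-line flags into
# GptParams and run the interactive chat loop, mirroring Chat.py and Miku.py.
from common import gpt_params_parse
from low_level_api_chat_cpp import LLaMAInteract

params = gpt_params_parse()  # e.g. run with: -m ./models/llama-7B/ggml-model.bin -i --color
with LLaMAInteract(params) as m:  # used as a context manager, as in the examples above
    m.interact()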