Merged
23 commits
842a68f
add wordstopper function
YellowRoseCx Apr 14, 2023
544567e
add wordstopper function
YellowRoseCx Apr 14, 2023
80d7f86
Create wordstoppers.txt
YellowRoseCx Apr 14, 2023
106faaf
cmake : add finding the OpenBLAS header file (#992)
katsu560 Apr 15, 2023
c12b14b
benchmark : fix result validation in benchmark-q4_0-matmult (#987)
dfyz Apr 15, 2023
7df4922
Delete make_pyinstaller.bat
YellowRoseCx Apr 15, 2023
d2c7f2d
Update koboldcpp.py
YellowRoseCx Apr 15, 2023
be913aa
Add files via upload
YellowRoseCx Apr 15, 2023
aa485ce
ggml : use posix_memalign on non-Windows env
ggerganov Apr 15, 2023
4f07ad8
Update README.md
YellowRoseCx Apr 15, 2023
50d2815
Update README.md
YellowRoseCx Apr 15, 2023
e6756eb
Update wordstoppers.txt
YellowRoseCx Apr 15, 2023
e95b655
ggml : add Q8_0 quantization for intermediate results (#951)
ggerganov Apr 15, 2023
0ad9646
Refactor ggml.c for future tensor types (#1001)
sw Apr 15, 2023
2f7c8e0
Fix potential int8 overflow in non-SIMD vec_dot (#986)
sw Apr 15, 2023
74f5899
convert.py: Fix loading safetensors and ggml format on Windows (#991)
comex Apr 15, 2023
472ee3f
Update README.md
YellowRoseCx Apr 15, 2023
4399c55
added wordstopper
YellowRoseCx Apr 15, 2023
032180b
add wordstopper function
YellowRoseCx Apr 15, 2023
71dc2a5
Merge branch 'base' into upstreamchanges
YellowRoseCx Apr 16, 2023
8acd559
Merge pull request #3 from ggerganov/master
YellowRoseCx Apr 16, 2023
2bad163
Delete CMakeLists.txt
YellowRoseCx Apr 16, 2023
1a43981
Merge branch 'upstreamchanges' into ggerganov
YellowRoseCx Apr 16, 2023
19 changes: 9 additions & 10 deletions README.md
@@ -1,4 +1,4 @@
# koboldcpp (formerly llamacpp-for-kobold)
# koboldcpp (wordstopper fork)

A self contained distributable from Concedo that exposes llama.cpp function bindings, allowing it to be used via a simulated Kobold API endpoint.

@@ -8,22 +8,22 @@ What does it mean? You get llama.cpp with a fancy UI, persistent stories, editin

# Highlights
- Now has experimental CLBlast support.
- Now has a token stopper to speed up generation by reducing wasted tokens

## Usage
- [Download the latest release here](https://github.com/LostRuins/koboldcpp/releases/latest) or clone the repo.
- Windows binaries are provided in the form of **koboldcpp.exe**, which is a pyinstaller wrapper for a few **.dll** files and **koboldcpp.py**. If you feel concerned, you may prefer to rebuild it yourself with the provided makefiles and scripts.
- Weights are not included, you can use the official llama.cpp `quantize.exe` to generate them from your official weight files (or download them from other places).
- To run, execute **koboldcpp.exe** or drag and drop your quantized `ggml_model.bin` file onto the .exe, and then connect with Kobold or Kobold Lite.
- Download or clone the repo https://github.com/YellowRoseCx/koboldcpp-wordstopper
- Windows binaries are provided in the form of a few **.dll** files and **koboldcpp.py**. Linux and OSX binaries need to be built from source.
- To run, open a command prompt or terminal in the koboldcpp-wordstopper directory and launch with `python koboldcpp.py models/ggml_model_name.bin`, then connect with Kobold or Kobold Lite. Check `python koboldcpp.py --help` for more info.
- By default, you can connect to http://localhost:5001
- You can also run it using the command line `koboldcpp.exe [ggml_model.bin] [port]`. For info, please check `koboldcpp.exe --help`
- If you are having crashes or issues with OpenBLAS, please try the `--noblas` flag.
- Add the names or tokens you wish to use as stoppers to the wordstoppers.txt file. If it's a chat name, make sure to add the colon afterwards.


## OSX and Linux
- You will have to compile your binaries from source. A makefile is provided, simply run `make`
- If you want you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
- To link your own install of OpenBLAS manually, use `make LLAMA_OPENBLAS=1`
- Alternatively, if you want you can also link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`, for this you will need to obtain and link OpenCL and CLBlast libraries.
- For Arch Linux: Install `cblas` and `openblas`. In the makefile, find the `ifdef LLAMA_OPENBLAS` conditional and add `-lcblas` to `LDFLAGS`.
- For Debian: Install `libclblast-dev` and `libopenblas-dev`.
- For Debian: Install `libclblast-dev` and `libopenblas-dev`. If you get a cblas_sgemm error, add `-lcblas` to `LDFLAGS` as in the Arch Linux instructions.
- After all binaries are built, you can run the python script with the command `koboldcpp.py [ggml_model.bin] [port]`

## Considerations
@@ -41,4 +41,3 @@ What does it mean? You get llama.cpp with a fancy UI, persistent stories, editin

## Notes
- Generation delay scales linearly with original prompt length. If OpenBLAS is enabled then prompt ingestion becomes about 2-3x faster. This is automatic on windows, but will require linking on OSX and Linux.
- I have heard of someone claiming a false AV positive report. The exe is a simple pyinstaller bundle that includes the necessary python scripts and dlls to run. If this still concerns you, you might wish to rebuild everything from source code using the makefile, and you can rebuild the exe yourself with pyinstaller by using `make_pyinstaller.bat`
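The stop-word behavior the updated README describes (reading stoppers from wordstoppers.txt and cutting generation when one appears) might look roughly like the following minimal sketch; the function names and the way it hooks into the generation loop are hypothetical, and the fork's real logic in koboldcpp.py may differ:

```python
# Hypothetical sketch of a file-based word stopper; the fork's actual
# implementation lives in koboldcpp.py and may differ.
from pathlib import Path

def load_stoppers(path: str = "wordstoppers.txt") -> list[str]:
    """One stopper per line, e.g. a chat name with its trailing colon ('Alice:')."""
    p = Path(path)
    if not p.exists():
        return []
    return [line.strip() for line in p.read_text(encoding="utf-8").splitlines() if line.strip()]

def should_stop(text_so_far: str, stoppers: list[str]) -> bool:
    """True once the generated text ends with any configured stopper."""
    return any(text_so_far.endswith(s) for s in stoppers)

# Usage inside a token-by-token generation loop (illustrative):
# stoppers = load_stoppers()
# if should_stop(output, stoppers):
#     break  # stop early instead of wasting tokens
```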
Binary file modified clblast.dll
6 changes: 4 additions & 2 deletions convert.py
@@ -735,7 +735,7 @@ def lazy_load_safetensors_file(fp: IO[bytes], path: Path) -> ModelPlus:
header: Dict[str, Dict[str, Any]] = json.loads(fp.read(header_size))
# Use mmap for the actual data to avoid race conditions with the file offset.
mapped = memoryview(mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ))
byte_buf = mapped[fp.tell():]
byte_buf = mapped[8 + header_size:]

def convert(info: Dict[str, Any]) -> LazyTensor:
data_type = SAFETENSORS_DATA_TYPES[info['dtype']]
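For context on the `8 + header_size` fix above: a safetensors file begins with an 8-byte little-endian header length, followed by the JSON header, so the start of the tensor data can be computed from the layout instead of from `fp.tell()`. A minimal sketch, assuming only that documented layout:

```python
# Sketch: compute where tensor data starts in a safetensors file from its
# on-disk layout (8-byte little-endian header length, then the JSON header),
# instead of relying on the file object's current position.
import struct

def safetensors_data_offset(path: str) -> int:
    with open(path, "rb") as fp:
        header_size, = struct.unpack("<Q", fp.read(8))  # unsigned 64-bit, little-endian
    return 8 + header_size  # length prefix + JSON header = start of data
```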
@@ -761,7 +761,7 @@ def must_read(fp: IO[bytes], length: int) -> bytes:
return ret


def lazy_load_ggml_file(fp: IO[bytes], path: Path) -> ModelPlus:
def lazy_load_ggml_file(fp: io.BufferedReader, path: Path) -> ModelPlus:
magic = must_read(fp, 4)[::-1]
if magic in (b'ggmf', b'ggjt'):
version, = struct.unpack("i", must_read(fp, 4))
@@ -795,7 +795,9 @@ def lazy_load_ggml_file(fp: IO[bytes], path: Path) -> ModelPlus:

model: LazyModel = {}
# Use mmap for the actual data to avoid race conditions with the file offset.
off = fp.raw.tell()
mapped = memoryview(mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ))
fp.raw.seek(off) # needed on Windows

def read_tensor() -> None: # this is a function so that variables captured in `load` don't change
shape_len, name_len, ftype = struct.unpack("iii", must_read(fp, 12))
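The two added lines save and restore the underlying file offset around the `mmap` call; on Windows, creating the mapping can move the raw file position and desync the buffered reader. A standalone sketch of that pattern, with an illustrative helper name:

```python
# Sketch of the save/seek-back pattern used above: mmap.mmap() on Windows
# may move the OS-level file offset, so remember and restore it to keep
# later buffered reads (struct.unpack, must_read, ...) in sync.
import mmap

def mmap_readonly_keep_offset(fp):
    """fp: an io.BufferedReader opened in binary mode."""
    off = fp.raw.tell()                                   # remember raw position
    mapped = memoryview(mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ))
    fp.raw.seek(off)                                      # restore it (needed on Windows)
    return mapped
```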
2 changes: 1 addition & 1 deletion examples/benchmark/benchmark-q4_0-matmult.c
@@ -24,7 +24,7 @@

float tensor_sum_elements(struct ggml_tensor * tensor) {
float sum = 0;
if (tensor->type==6) {
if (tensor->type==GGML_TYPE_F32) {
for (int j = 0; j < tensor->ne[1]; j++) {
for (int k = 0; k < tensor->ne[0]; k++) {
sum += ((float *) tensor->data)[j*tensor->ne[0]+k];