Merged
23 commits
842a68f
add wordstopper function
YellowRoseCx Apr 14, 2023
544567e
add wordstopper function
YellowRoseCx Apr 14, 2023
80d7f86
Create wordstoppers.txt
YellowRoseCx Apr 14, 2023
106faaf
cmake : add finding the OpenBLAS header file (#992)
katsu560 Apr 15, 2023
c12b14b
benchmark : fix result validation in benchmark-q4_0-matmult (#987)
dfyz Apr 15, 2023
7df4922
Delete make_pyinstaller.bat
YellowRoseCx Apr 15, 2023
d2c7f2d
Update koboldcpp.py
YellowRoseCx Apr 15, 2023
be913aa
Add files via upload
YellowRoseCx Apr 15, 2023
aa485ce
ggml : use posix_memalign on non-Windows env
ggerganov Apr 15, 2023
4f07ad8
Update README.md
YellowRoseCx Apr 15, 2023
50d2815
Update README.md
YellowRoseCx Apr 15, 2023
e6756eb
Update wordstoppers.txt
YellowRoseCx Apr 15, 2023
e95b655
ggml : add Q8_0 quantization for intermediate results (#951)
ggerganov Apr 15, 2023
0ad9646
Refactor ggml.c for future tensor types (#1001)
sw Apr 15, 2023
2f7c8e0
Fix potential int8 overflow in non-SIMD vec_dot (#986)
sw Apr 15, 2023
74f5899
convert.py: Fix loading safetensors and ggml format on Windows (#991)
comex Apr 15, 2023
472ee3f
Update README.md
YellowRoseCx Apr 15, 2023
4399c55
added wordstopper
YellowRoseCx Apr 15, 2023
032180b
add wordstopper function
YellowRoseCx Apr 15, 2023
71dc2a5
Merge branch 'base' into upstreamchanges
YellowRoseCx Apr 16, 2023
8acd559
Merge pull request #3 from ggerganov/master
YellowRoseCx Apr 16, 2023
2bad163
Delete CMakeLists.txt
YellowRoseCx Apr 16, 2023
1a43981
Merge branch 'upstreamchanges' into ggerganov
YellowRoseCx Apr 16, 2023
19 changes: 9 additions & 10 deletions README.md
@@ -1,4 +1,4 @@
# koboldcpp (formerly llamacpp-for-kobold)
# koboldcpp (wordstopper fork)

A self contained distributable from Concedo that exposes llama.cpp function bindings, allowing it to be used via a simulated Kobold API endpoint.

@@ -8,22 +8,22 @@ What does it mean? You get llama.cpp with a fancy UI, persistent stories, editin

# Highlights
- Now has experimental CLBlast support.
- Now has a token stopper to speed up generation by reducing wasted tokens

## Usage
- [Download the latest release here](https://github.com/LostRuins/koboldcpp/releases/latest) or clone the repo.
- Windows binaries are provided in the form of **koboldcpp.exe**, which is a pyinstaller wrapper for a few **.dll** files and **koboldcpp.py**. If you feel concerned, you may prefer to rebuild it yourself with the provided makefiles and scripts.
- Weights are not included, you can use the official llama.cpp `quantize.exe` to generate them from your official weight files (or download them from other places).
- To run, execute **koboldcpp.exe** or drag and drop your quantized `ggml_model.bin` file onto the .exe, and then connect with Kobold or Kobold Lite.
- Download or clone the repo https://github.com/YellowRoseCx/koboldcpp-wordstopper
- Windows binaries are provided in the form of a few **.dll** files and **koboldcpp.py**. Linux and OSX binaries need to be built from source.
- To run, open a command prompt or terminal in the koboldcpp-wordstopper directory and launch with `python koboldcpp.py models/ggml_model_name.bin`, then connect with Kobold or Kobold Lite. Check `python koboldcpp.py --help` for more info.
- By default, you can connect to http://localhost:5001
- You can also run it using the command line `koboldcpp.exe [ggml_model.bin] [port]`. For info, please check `koboldcpp.exe --help`
- If you are having crashes or issues with OpenBLAS, please try the `--noblas` flag.
- Add the names or tokens you wish to use as stoppers to the wordstoppers.txt file. If it's a chat name, make sure to add the colon afterwards.


## OSX and Linux
- You will have to compile your binaries from source. A makefile is provided, simply run `make`
- If you want you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
- To link your own install of OpenBLAS manually, use `make LLAMA_OPENBLAS=1`
- Alternatively, if you want you can also link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`, for this you will need to obtain and link OpenCL and CLBlast libraries.
- For Arch Linux: Install `cblas` and `openblas`. In the makefile, find the `ifdef LLAMA_OPENBLAS` conditional and add `-lcblas` to `LDFLAGS`.
- For Debian: Install `libclblast-dev` and `libopenblas-dev`.
- For Debian: Install `libclblast-dev` and `libopenblas-dev`. If you get a cblas_sgemm error, add `-lcblas` to `LDFLAGS` as in the Arch Linux instructions.
- After all binaries are built, you can run the python script with the command `koboldcpp.py [ggml_model.bin] [port]`

## Considerations
@@ -41,4 +41,3 @@ What does it mean? You get llama.cpp with a fancy UI, persistent stories, editin

## Notes
- Generation delay scales linearly with original prompt length. If OpenBLAS is enabled then prompt ingestion becomes about 2-3x faster. This is automatic on windows, but will require linking on OSX and Linux.
- I have heard of someone claiming a false AV positive report. The exe is a simple pyinstaller bundle that includes the necessary python scripts and dlls to run. If this still concerns you, you might wish to rebuild everything from source code using the makefile, and you can rebuild the exe yourself with pyinstaller by using `make_pyinstaller.bat`
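The stop-word behavior the updated README describes (reading stoppers from wordstoppers.txt and cutting generation when one appears) might look roughly like the following minimal sketch; the function names and the way it hooks into the generation loop are hypothetical, and the fork's real logic in koboldcpp.py may differ:

```python
# Hypothetical sketch of a file-based word stopper; the fork's actual
# implementation lives in koboldcpp.py and may differ.
from pathlib import Path

def load_stoppers(path: str = "wordstoppers.txt") -> list[str]:
    """One stopper per line, e.g. a chat name with its trailing colon ('Alice:')."""
    p = Path(path)
    if not p.exists():
        return []
    return [line.strip() for line in p.read_text(encoding="utf-8").splitlines() if line.strip()]

def should_stop(text_so_far: str, stoppers: list[str]) -> bool:
    """True once the generated text ends with any configured stopper."""
    return any(text_so_far.endswith(s) for s in stoppers)

# Usage inside a token-by-token generation loop (illustrative):
# stoppers = load_stoppers()
# if should_stop(output, stoppers):
#     break  # stop early instead of wasting tokens
```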
Binary file modified clblast.dll
6 changes: 4 additions & 2 deletions convert.py
@@ -735,7 +735,7 @@ def lazy_load_safetensors_file(fp: IO[bytes], path: Path) -> ModelPlus:
header: Dict[str, Dict[str, Any]] = json.loads(fp.read(header_size))
# Use mmap for the actual data to avoid race conditions with the file offset.
mapped = memoryview(mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ))
byte_buf = mapped[fp.tell():]
byte_buf = mapped[8 + header_size:]

def convert(info: Dict[str, Any]) -> LazyTensor:
data_type = SAFETENSORS_DATA_TYPES[info['dtype']]
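For context on the `8 + header_size` fix above: a safetensors file begins with an 8-byte little-endian header length, followed by the JSON header, so the start of the tensor data can be computed from the layout instead of from `fp.tell()`. A minimal sketch, assuming only that documented layout:

```python
# Sketch: compute where tensor data starts in a safetensors file from its
# on-disk layout (8-byte little-endian header length, then the JSON header),
# instead of relying on the file object's current position.
import struct

def safetensors_data_offset(path: str) -> int:
    with open(path, "rb") as fp:
        header_size, = struct.unpack("<Q", fp.read(8))  # unsigned 64-bit, little-endian
    return 8 + header_size  # length prefix + JSON header = start of data
```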
@@ -761,7 +761,7 @@ def must_read(fp: IO[bytes], length: int) -> bytes:
return ret


def lazy_load_ggml_file(fp: IO[bytes], path: Path) -> ModelPlus:
def lazy_load_ggml_file(fp: io.BufferedReader, path: Path) -> ModelPlus:
magic = must_read(fp, 4)[::-1]
if magic in (b'ggmf', b'ggjt'):
version, = struct.unpack("i", must_read(fp, 4))
@@ -795,7 +795,9 @@ def lazy_load_ggml_file(fp: IO[bytes], path: Path) -> ModelPlus:

model: LazyModel = {}
# Use mmap for the actual data to avoid race conditions with the file offset.
off = fp.raw.tell()
mapped = memoryview(mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ))
fp.raw.seek(off) # needed on Windows

def read_tensor() -> None: # this is a function so that variables captured in `load` don't change
shape_len, name_len, ftype = struct.unpack("iii", must_read(fp, 12))
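The two added lines save and restore the underlying file offset around the `mmap` call; on Windows, creating the mapping can move the raw file position and desync the buffered reader. A standalone sketch of that pattern, with an illustrative helper name:

```python
# Sketch of the save/seek-back pattern used above: mmap.mmap() on Windows
# may move the OS-level file offset, so remember and restore it to keep
# later buffered reads (struct.unpack, must_read, ...) in sync.
import mmap

def mmap_readonly_keep_offset(fp):
    """fp: an io.BufferedReader opened in binary mode."""
    off = fp.raw.tell()                                   # remember raw position
    mapped = memoryview(mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ))
    fp.raw.seek(off)                                      # restore it (needed on Windows)
    return mapped
```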
2 changes: 1 addition & 1 deletion examples/benchmark/benchmark-q4_0-matmult.c
@@ -24,7 +24,7 @@

float tensor_sum_elements(struct ggml_tensor * tensor) {
float sum = 0;
if (tensor->type==6) {
if (tensor->type==GGML_TYPE_F32) {
for (int j = 0; j < tensor->ne[1]; j++) {
for (int k = 0; k < tensor->ne[0]; k++) {
sum += ((float *) tensor->data)[j*tensor->ne[0]+k];