"Hey, look! A regex free tokenizer!"
A tokenizer that uses entropy calculations to identify natural boundaries in text and create tokens. This tokenizer is particularly effective for processing large text files as it uses memory mapping (mmap) and chunked processing.
Based on Andrej Karpathy's minBPE tokenizer and inspired by META's 'Byte Latent Transformer: Patches Scale Better Than Tokens' paper.
- Entropy-based token discovery: Uses byte-level entropy calculations to find natural word and subword boundaries
- Memory efficient: Uses memory mapping (mmap) to process large files without loading them entirely into memory
- Flexible thresholds: Two configurable thresholds that can be used independently or in combination:
- Global threshold: Identifies boundaries based on absolute entropy values
- Relative threshold: Identifies boundaries where there are significant jumps in entropy between consecutive positions
- Progress monitoring: Real-time progress updates during processing
- UTF-8 support: Properly handles UTF-8 encoded text and character boundaries
- Model persistence: Save and load trained tokenizer models
numpy
Clone this repository:
git clone https://github.com/exploringweirdmachines/Entropy-based-Tokenizer.git
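Then install the only dependency listed above (this assumes a standard Python environment with pip available):
pip install numpy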
The tokenizer can be used in two modes: training and inference. Both modes are accessible through the command line interface.
usage: main.py train [-h] [--global_threshold GLOBAL_THRESHOLD] [--relative_threshold RELATIVE_THRESHOLD] [--window_size WINDOW_SIZE] [--chunk_size CHUNK_SIZE] [train_file] model_name
positional arguments:
train_file Path to the text file to train on (default: shakespeare.txt).
model_name Name of the model to save.
options:
-h, --help show this help message and exit
--global_threshold GLOBAL_THRESHOLD
Global threshold for the tokenizer.
--relative_threshold RELATIVE_THRESHOLD
Relative threshold for the tokenizer.
--window_size WINDOW_SIZE
Window size for the tokenizer.
--chunk_size CHUNK_SIZE
Chunk size for the tokenizer (default: 1MB).
python main.py train shakespeare.txt
Building transition matrix...
Processed 1115394/1115394 bytes for transition matrix
Calculating entropies...
Finding boundaries...
Identified 656231 text chunks
Added token 256: "Fir"
Processed chunks: 2, Unique tokens: 1
Added token 257: "t "
Processed chunks: 4, Unique tokens: 2
Added token 258: "iti"
Processed chunks: 6, Unique tokens: 3
Added token 259: "en"
...
Added token 7810: "nat"
Processed chunks: 655993, Unique tokens: 7555
Added token 7811: "s a "
Processed chunks: 656060, Unique tokens: 7556
Added token 7812: "surel"
Processed chunks: 656152, Unique tokens: 7557
Added token 7813: "din"
Processed chunks: 656171, Unique tokens: 7558
Added token 7814: "st a"
Processed chunks: 656206, Unique tokens: 7559
Added token 7815: "p--di"
Processed chunks: 656230, Unique tokens: 7560
Added token 7816: "king.\n"
Processed chunks: 656231, Unique tokens: 7561
Training took 34.81 seconds
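The thresholds, window size and chunk size can also be set explicitly on the command line. In the example below the flag values are simply the documented defaults, and my_model is a placeholder name:
python main.py train shakespeare.txt my_model --global_threshold 0.5 --relative_threshold 0.03 --window_size 1000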
usage: main.py inference [-h] [--input_text INPUT_TEXT] [--input_file INPUT_FILE] [model_name]
positional arguments:
model_name Path to the tokenizer model to load (default: shakespear.model).
options:
-h, --help show this help message and exit
--input_text INPUT_TEXT
Text to encode and decode (default: "Three Laws of Robotics").
--input_file INPUT_FILE
Path to file that has the input text to encode and decode.
python main.py inference
Loaded text: "Isaac Asimov's "Three Laws of Robotics"
1.A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2.A robot must obey orders given it by human beings except where such orders would conflict with the First Law.
3.A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
"
Encoded text to tokens: [73, 2896, 97, 99, 32, 65, 115, 105, 492, 118, 39, 115, 32, 34, 84, 6449, 101, 6717, 6322, 115, 526, 102, 3367, 111, 98, 1062, 105, 99, 115, 34, 10, 49, 46, 65, 32, 114, 111, 98, 6316, 109, 786, 32, 110, 6316, 105, 110, 106, 117, 114, 101, 32, 1133, 104, 117, 109, 97, 110, 32, 98, 101, 105, 110, 103, 32, 111, 114, 44, 32, 6114, 114, 111, 117, 103, 104, 32, 105, 110, 97, 99, 901, 2736, 295, 108, 108, 111, 313, 1133, 104, 117, 109, 97, 110, 32, 98, 101, 105, 110, 103, 32, 116, 111, 32, 99, 111, 109, 101, 32, 116, 111, 32, 104, 97, 114, 109, 822, 50, 46, 65, 32, 114, 111, 98, 6316, 2878, 115, 258, 111, 98, 747, 32, 111, 114, 100, 101, 6151, 32, 465, 298, 110, 32, 6136, 1008, 32, 104, 117, 109, 97, 110, 32, 98, 101, 105, 715, 115, 32, 101, 120, 99, 101, 112, 258, 119, 6105, 114, 101, 32, 115, 996, 104, 32, 111, 114, 100, 101, 6151, 32, 659, 117, 108, 100, 32, 99, 353, 102, 108, 105, 99, 258, 1907, 104, 32, 116, 104, 101, 32, 70, 105, 6151, 116, 32, 76, 97, 119, 46, 10, 51, 46, 65, 32, 114, 111, 98, 6316, 2878, 115, 258, 112, 114, 2202, 99, 258, 381, 115, 526, 119, 110, 32, 101, 1758, 115, 6561, 99, 101, 32, 97, 115, 32, 108, 111, 110, 103, 32, 97, 115, 32, 115, 996, 104, 32, 112, 114, 2202, 99, 901, 111, 110, 32, 453, 101, 115, 32, 110, 6316, 99, 353, 102, 108, 105, 99, 258, 1907, 104, 32, 116, 104, 101, 32, 70, 105, 6151, 116, 32, 111, 114, 32, 83, 101, 99, 353, 100, 6717, 97, 119, 46, 10]
Decoded tokens back to text: "Isaac Asimov's "Three Laws of Robotics"
1.A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2.A robot must obey orders given it by human beings except where such orders would conflict with the First Law.
3.A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
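A saved model can also be run against arbitrary input via --input_text or --input_file; the model path and file name below are placeholders:
python main.py inference my_model.model --input_file notes.txt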
from entropy import EntropyTokenizer
# Initialize the tokenizer with default settings
tokenizer = EntropyTokenizer(
global_threshold=0.5, # Default threshold for absolute entropy values
relative_threshold=0.03, # Default threshold for entropy changes
window_size=1000, # Default window size
chunk_size=1024*1024 # Default chunk size (1MB)
)
# Train on text file (defaults to shakespeare.txt)
tokenizer.train("input.txt", verbose=True)
# Save the trained model (will create .model and .vocab files)
tokenizer.save("my_model")
# Load a model
loaded_tokenizer = EntropyTokenizer()
loaded_tokenizer.load("my_model.model") # Defaults to shakespeare.model
# Encode text
text = "Example text to encode"
tokens = loaded_tokenizer.encode(text)
# Decode tokens
decoded_text = loaded_tokenizer.decode(tokens)
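As the inference example above shows, decoding reproduces the original text exactly, so a round-trip check makes a quick sanity test:

```python
# Round trip: decoding the encoded tokens should reproduce the original string
assert loaded_tokenizer.decode(loaded_tokenizer.encode(text)) == text
```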
The tokenizer supports several configuration parameters that can be adjusted based on your needs:
tokenizer = EntropyTokenizer(
global_threshold=0.5, # Controls absolute entropy threshold for token boundaries
relative_threshold=0.03, # Controls relative entropy change threshold
window_size=1000, # Size of sliding window for entropy calculation
chunk_size=1024*1024 # Size of chunks for memory-mapped processing (1MB)
)
- global_threshold: Higher values create fewer tokens (default: 0.5)
- relative_threshold: Higher values make the tokenizer less sensitive to entropy changes (default: 0.03)
- window_size: Controls the context window for entropy calculations (default: 1000)
- chunk_size: Controls memory usage during processing (default: 1MB)
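One way to get a feel for the thresholds is to train with different settings and compare how a sample string is tokenized. The sketch below uses only the constructor arguments and methods shown above; the threshold values, file name and sample text are illustrative:

```python
from entropy import EntropyTokenizer

sample = "Entropy finds natural boundaries in text."

# Train with different global thresholds and compare how the sample is tokenized.
for g in (0.3, 0.5, 0.7):
    tok = EntropyTokenizer(
        global_threshold=g,
        relative_threshold=0.03,   # documented default
        window_size=1000,          # documented default
        chunk_size=1024 * 1024,    # documented default (1MB)
    )
    tok.train("input.txt", verbose=False)
    print(f"global_threshold={g}: {len(tok.encode(sample))} tokens for the sample")
```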
- Transition Matrix Building
  - Creates a 256x256 matrix representing byte transition probabilities
  - Processes the input file in chunks using memory mapping
  - Counts byte occurrences and transitions
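The actual logic lives in the entropy module used above; the following is only a rough, self-contained sketch of the idea of counting adjacent byte pairs over a memory-mapped file, chunk by chunk (the function name and details are illustrative, not the repository's code):

```python
import mmap

import numpy as np

def build_transition_counts(path, chunk_size=1024 * 1024):
    """Count how often each byte value (0-255) is followed by each other byte value."""
    counts = np.zeros((256, 256), dtype=np.int64)
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        prev = None
        for start in range(0, len(mm), chunk_size):
            chunk = np.frombuffer(mm[start:start + chunk_size], dtype=np.uint8)
            if prev is not None:
                counts[prev, chunk[0]] += 1                    # bridge the chunk boundary
            if len(chunk) > 1:
                np.add.at(counts, (chunk[:-1], chunk[1:]), 1)  # count adjacent byte pairs
            prev = chunk[-1]
    return counts
```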
- Entropy Calculation
  - Uses the transition matrix to calculate byte-level entropies
  - Identifies potential token boundaries based on:
    - High absolute entropy values (global threshold)
    - Significant changes in entropy (relative threshold)
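Conceptually, the count matrix yields an entropy value for what tends to follow each byte, and a boundary is proposed where that value is high or rises sharply. The sketch below is a simplified illustration under two assumptions that may differ from the real implementation: it ignores the window_size smoothing and combines the two thresholds with a logical OR:

```python
import numpy as np

def transition_entropies(counts):
    """Entropy of the next-byte distribution for each of the 256 byte values."""
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    safe = np.where(probs > 0, probs, 1.0)   # log2(1) = 0, so empty cells contribute nothing
    return -(probs * np.log2(safe)).sum(axis=1)

def find_boundaries(data, entropies, global_threshold=0.5, relative_threshold=0.03):
    """Propose a boundary after position i when entropy is high or jumps sharply."""
    boundaries = []
    prev = entropies[data[0]]
    for i in range(1, len(data)):
        cur = entropies[data[i]]
        if cur > global_threshold or (cur - prev) > relative_threshold:
            boundaries.append(i)
        prev = cur
    return boundaries
```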
- Token Extraction
  - Extracts text chunks based on identified boundaries
  - Ensures proper UTF-8 character boundary handling
  - Builds a vocabulary of unique tokens
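Keeping boundaries on valid UTF-8 character starts can be done by stepping back over continuation bytes (bit pattern 10xxxxxx). This is a minimal illustration, not the repository's actual routine:

```python
def snap_to_utf8_boundary(data: bytes, pos: int) -> int:
    """Move pos backwards until it points at the start of a UTF-8 character."""
    # Continuation bytes have the bit pattern 10xxxxxx and never start a character.
    while 0 < pos < len(data) and (data[pos] & 0b11000000) == 0b10000000:
        pos -= 1
    return pos

# The euro sign is three bytes long, so a cut inside it is pulled back to its first byte.
raw = "10€".encode("utf-8")           # b'1' b'0' 0xE2 0x82 0xAC
print(snap_to_utf8_boundary(raw, 3))  # prints 2 (the 0xE2 start byte)
```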
- Model Storage
  - Saves the model in two formats:
    - .model: Binary model file for loading
    - .vocab: Human-readable vocabulary file
The .model file contains:
- Version identifier
- Global and relative thresholds
- Special tokens (if any)
- Vocabulary entries (token index and byte representation)
The .vocab file shows:
- Human-readable representation of each token
- Token indices
- Special characters properly escaped
- Memory usage remains relatively constant regardless of input file size
- Processing speed scales linearly with file size
- Optimal chunk_size depends on available system memory
- Adjust window_size to balance between precision and processing speed
- Add a limit on the number of tokens to be generated
- Add support for special tokens
Duplicate tokens
Apache License 2.0
If you use this tokenizer in your research, please cite:
@software{entropy_tokenizer,
author = {exploringweirdmachines},
title = {Entropy-based-Tokenizer},
year = {2024},
url = {https://github.com/exploringweirdmachines/Entropy-based-Tokenizer}
}