TinyTorch is a lightweight deep learning training framework implemented from scratch in C++.
For more details, please refer to my blog post: Write a nn training framework from scratch
- PyTorch-Style API: Naming conventions similar to PyTorch's (`Tensor`, `Functions`, `nn.Module`, `Optimizer`).
- Pure C++ Implementation: No dependency on external deep learning libraries.
- CPU & CUDA Support: Runs on both CPU and CUDA-enabled GPUs.
- Mixed Precision: Supports FP16, FP32, BF16.
- Distributed: Multi-machine, multi-GPU training & inference.
- LLM Inference: Supports inference for llama/qwen/mistral models: https://github.com/keith2018/TinyGPT
Supported operators, modules, optimizers, and data utilities:

- Activations: relu, gelu, silu, softmax, logSoftmax
- Basic math: add, sub, mul, div, matmul, sin, cos, sqrt, pow, maximum, minimum
- Comparison: lt, le, gt, ge, eq, ne
- Logical: logicNot, logicAnd, logicOr
- Reduction: min, argmin, max, argmax, sum, mean, var
- Shape: reshape, view, permute, transpose, flatten, unflatten, squeeze, unsqueeze, split, concat, stack, hstack, vstack, narrow
- Indexing & sorting: topk, sort, cumsum, gather, scatter
- NN modules: linear, dropout, maxPool2d, conv2d, embedding, layerNorm, rmsNorm, sdpAttention
- Losses: mseLoss, nllLoss
- Optimizers: SGD, Adagrad, RMSprop, AdaDelta, Adam, AdamW
- Data: Dataset, DataLoader, data.Transform
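To give a feel for how these pieces fit together, here is a hypothetical sketch of a minimal training step in a PyTorch-style C++ API. The header name, namespaces, and signatures used below (`TinyTorch.h`, `tinytorch`, `nn::Linear`, `optim::SGD`, `Function::mseLoss`, `zeroGrad`, `randn`) are assumptions made for illustration and may not match TinyTorch's actual headers; the demo sources are the authoritative reference.

```cpp
// Hypothetical usage sketch -- class/namespace names are assumptions,
// not necessarily TinyTorch's real API. Shown only to convey the
// PyTorch-style flow: forward -> loss -> backward -> optimizer step.
#include "TinyTorch.h"      // assumed umbrella header

using namespace tinytorch;  // assumed namespace

int main() {
  // A tiny linear regression: y = W x + b
  nn::Linear model(/*inFeatures=*/4, /*outFeatures=*/1);
  optim::SGD optimizer(model.parameters(), /*lr=*/0.01f);

  Tensor x = Tensor::randn({32, 4});       // batch of 32 samples
  Tensor target = Tensor::randn({32, 1});  // regression targets

  for (int step = 0; step < 100; step++) {
    Tensor pred = model(x);                        // forward pass
    Tensor loss = Function::mseLoss(pred, target); // build the loss node
    optimizer.zeroGrad();                          // clear old gradients
    loss.backward();                               // reverse-mode autodiff
    optimizer.step();                              // update parameters
  }
  return 0;
}
```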
TinyTorch's automatic differentiation (AD) is implemented by building a computation graph. Each operation on a `Tensor` is represented by a `Function` object, which is responsible for both the forward and backward passes. The `Function` nodes are connected via a `nextFunctions` field, forming the dependency graph. During the `backward()` call, the framework traverses this graph in reverse order, computing and propagating gradients using the chain rule.
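The idea can be summarized with a small, self-contained scalar sketch (illustrative only, not TinyTorch's actual code; the `Node`, `leaf`, `mul`, `add`, and `backward` names are invented for this example): each operation records a backward lambda plus links to the nodes it was computed from, and `backward()` visits the graph in reverse topological order, accumulating gradients via the chain rule.

```cpp
// Minimal scalar autograd sketch (illustrative only, not TinyTorch's code).
// Each node stores its value, its gradient, a backward lambda, and links
// to the nodes it was computed from ("nextFunctions" in TinyTorch terms).
#include <functional>
#include <iostream>
#include <memory>
#include <unordered_set>
#include <vector>

struct Node {
  float value = 0.f;
  float grad = 0.f;
  std::function<void()> backwardFn;         // applies the local chain rule
  std::vector<std::shared_ptr<Node>> next;  // dependency graph edges
};

using NodePtr = std::shared_ptr<Node>;

NodePtr leaf(float v) {
  auto n = std::make_shared<Node>();
  n->value = v;
  return n;
}

NodePtr mul(const NodePtr& a, const NodePtr& b) {
  auto out = std::make_shared<Node>();
  out->value = a->value * b->value;
  out->next = {a, b};
  // d(a*b)/da = b, d(a*b)/db = a
  out->backwardFn = [a, b, outRaw = out.get()] {
    a->grad += b->value * outRaw->grad;
    b->grad += a->value * outRaw->grad;
  };
  return out;
}

NodePtr add(const NodePtr& a, const NodePtr& b) {
  auto out = std::make_shared<Node>();
  out->value = a->value + b->value;
  out->next = {a, b};
  // d(a+b)/da = d(a+b)/db = 1
  out->backwardFn = [a, b, outRaw = out.get()] {
    a->grad += outRaw->grad;
    b->grad += outRaw->grad;
  };
  return out;
}

// Collect nodes in post-order (dependencies first), then propagate
// gradients from the root back to the leaves.
void backward(const NodePtr& root) {
  std::vector<Node*> order;
  std::unordered_set<Node*> visited;
  std::function<void(Node*)> dfs = [&](Node* n) {
    if (!visited.insert(n).second) return;
    for (auto& child : n->next) dfs(child.get());
    order.push_back(n);
  };
  dfs(root.get());

  root->grad = 1.f;  // d(root)/d(root) = 1
  for (auto it = order.rbegin(); it != order.rend(); ++it) {
    if ((*it)->backwardFn) (*it)->backwardFn();
  }
}

int main() {
  // y = (a * b) + a, with a = 2, b = 3  =>  dy/da = b + 1 = 4, dy/db = a = 2
  auto a = leaf(2.f), b = leaf(3.f);
  auto y = add(mul(a, b), a);
  backward(y);
  std::cout << "y = " << y->value << ", da = " << a->grad
            << ", db = " << b->grad << "\n";  // y = 8, da = 4, db = 2
  return 0;
}
```

TinyTorch applies the same pattern at the tensor level: the `nextFunctions` links play the role of `next` above, and each `Function` supplies the operation-specific backward rule.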
- CMake
- A compiler with C++17 support (or newer)
- CUDA Toolkit 11.0+ (optional)
Build:

```bash
mkdir build
cmake -B ./build -DCMAKE_BUILD_TYPE=Release
cmake --build ./build --config Release
```

Run the demo:

```bash
cd demo/bin
./TinyTorch_demo
```

Run the tests:

```bash
cd build
ctest
```
This code is licensed under the MIT License (see LICENSE).