Add gpt2 implementations in python and c++ #1

ChinmayK0607 · 2025-07-08T14:44:22Z

Adds two files:

train.cpp -> gpt2 implementation in c++
train.py -> gpt2 implementation in python

Also adds a dataloader and a requirements file.

kamatajinkya2 · 2025-07-08T14:58:24Z

General

Prefer pyproject.toml over requirements.txt. Refer to "Why Should I Choose pyproject.toml over requirements.txt for managing dependencies?"
Use a nested folder structure, i.e. train.py and dataloader.py go inside a src, llm101, or scripts folder. This will make it easier to add a test folder.
Implement unit tests (not applicable in this instance, but noting it for completeness).
Leave appropriate blank lines after a class or a function ends. You can use an autoformatter like Black to achieve this.

To be continued...

kamatajinkya2 · 2025-07-08T15:00:49Z

GPT2/dataloader.py

+import numpy as np
+
+# download the tiny shakespeare dataset
+input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')


Use the if __name__ == "__main__": pattern to make this module reusable and to prevent wonky variable scoping.

kamatajinkya2 · 2025-07-08T15:02:14Z

GPT2/dataloader.py

+input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
+if not os.path.exists(input_file_path):
+    data_url = 'https://github.com/raw/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
+    with open(input_file_path, 'w', encoding='utf-8') as f:
+        f.write(requests.get(data_url).text)
+
+with open(input_file_path, 'r', encoding='utf-8') as f:
+    data = f.read()
+n = len(data)
+train_data = data[:int(n*0.9)]
+val_data = data[int(n*0.9):]
+
+# encode with tiktoken gpt2 bpe
+enc = tiktoken.get_encoding("gpt2")
+train_ids = enc.encode_ordinary(train_data)
+val_ids = enc.encode_ordinary(val_data)
+print(f"train has {len(train_ids):,} tokens")
+print(f"val has {len(val_ids):,} tokens")
+
+# export to bin files
+train_ids = np.array(train_ids, dtype=np.uint16)
+val_ids = np.array(val_ids, dtype=np.uint16)
+train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
+val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))


Suggested change

-input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
-if not os.path.exists(input_file_path):
-    data_url = 'https://github.com/raw/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
-    with open(input_file_path, 'w', encoding='utf-8') as f:
-        f.write(requests.get(data_url).text)
-
-with open(input_file_path, 'r', encoding='utf-8') as f:
-    data = f.read()
-n = len(data)
-train_data = data[:int(n*0.9)]
-val_data = data[int(n*0.9):]
-
-# encode with tiktoken gpt2 bpe
-enc = tiktoken.get_encoding("gpt2")
-train_ids = enc.encode_ordinary(train_data)
-val_ids = enc.encode_ordinary(val_data)
-print(f"train has {len(train_ids):,} tokens")
-print(f"val has {len(val_ids):,} tokens")
-
-# export to bin files
-train_ids = np.array(train_ids, dtype=np.uint16)
-val_ids = np.array(val_ids, dtype=np.uint16)
-train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
-val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))
+def main():
+    input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
+    if not os.path.exists(input_file_path):
+        data_url = 'https://github.com/raw/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
+        with open(input_file_path, 'w', encoding='utf-8') as f:
+            f.write(requests.get(data_url).text)
+
+    with open(input_file_path, 'r', encoding='utf-8') as f:
+        data = f.read()
+    n = len(data)
+    train_data = data[:int(n*0.9)]
+    val_data = data[int(n*0.9):]
+
+    # encode with tiktoken gpt2 bpe
+    enc = tiktoken.get_encoding("gpt2")
+    train_ids = enc.encode_ordinary(train_data)
+    val_ids = enc.encode_ordinary(val_data)
+    print(f"train has {len(train_ids):,} tokens")
+    print(f"val has {len(val_ids):,} tokens")
+
+    # export to bin files
+    train_ids = np.array(train_ids, dtype=np.uint16)
+    val_ids = np.array(val_ids, dtype=np.uint16)
+    train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
+    val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))
+
+
+if __name__ == '__main__':
+    main()

kamatajinkya2 · 2025-07-09T11:14:49Z

GPT2/train.py

+from torch.nn import functional as F 
+import tiktoken
+
+batch_size = 64 


Don't use global variables.
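One way to act on this is to collect the hyperparameters in a small config object and pass it to the functions that need it. The following is only a minimal sketch: `batch_size = 64` comes from the diff above, while `TrainConfig`, `block_size`, `learning_rate`, and `tokens_per_step` are hypothetical names and values chosen for illustration, not taken from train.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters grouped in one immutable object instead of module globals."""

    batch_size: int = 64          # value taken from the diff
    block_size: int = 256         # hypothetical, for illustration only
    learning_rate: float = 3e-4   # hypothetical, for illustration only


def tokens_per_step(cfg: TrainConfig) -> int:
    """Tokens processed in one step; reads only from the config passed in."""
    return cfg.batch_size * cfg.block_size


def main() -> None:
    cfg = TrainConfig()
    print(f"tokens per step: {tokens_per_step(cfg):,}")


if __name__ == "__main__":
    main()
```

Because every function receives the config explicitly, a test can build its own `TrainConfig(batch_size=8, block_size=16)` without touching module state.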

kamatajinkya2 · 2025-07-09T11:21:26Z

GPT2/train.py

+
+        return logits,loss
+
+


Use an if __name__ == "__main__": guard here.

kamatajinkya2 · 2025-07-09T11:40:36Z

GPT2/train.py

+
+    def __init__weights(self, module):
+        if isinstance(module, nn.Linear):
+            std = 0.02


Move this into an else block. As written, it is confusing.
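The diff context shows only the `std = 0.02` line, so the rest of the method is not visible here. Assuming the default is later overridden by a condition (for example a scaled value for residual projections, which is a guess and not confirmed by the diff), the suggestion could look roughly like this. The function name `linear_init_std`, the `is_residual_projection` flag, and the scaling formula are all hypothetical.

```python
def linear_init_std(is_residual_projection: bool, n_layer: int) -> float:
    """Pick the init std for a linear layer with both cases spelled out.

    Hypothetical sketch: the flag and the scaling rule are assumptions,
    not taken from train.py.
    """
    if is_residual_projection:
        # Assumed special case: shrink the std with network depth.
        std = 0.02 * (2 * n_layer) ** -0.5
    else:
        # The default now lives in the else branch instead of being
        # assigned first and silently overwritten afterwards.
        std = 0.02
    return std
```

With the default in the else branch, a reader sees at a glance which value applies in which case.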

kamatajinkya2 · 2025-07-09T12:42:33Z

GPT2/train.cpp

+};
+
+struct CausalSelfAttentionImpl : torch::nn::Module {
+    CausalSelfAttentionImpl(const Config& cfg) {


Prefer a member initializer list. This way your compiler can warn you if there are uninitialized variables.

kamatajinkya2 · 2025-07-09T12:43:59Z

GPT2/train.cpp

+        mask = m;
+        register_buffer("mask", mask);
+    }
+    torch::Tensor forward(const torch::Tensor& x) {


I like east const. Check out https://hackingcpp.com/cpp/design/east_vs_west_const.html

kamatajinkya2 · 2025-07-09T12:47:24Z

GPT2/train.cpp

+        auto out = y.permute({0, 2, 1, 3}).contiguous().view({B, T, n_embed});
+        return proj->forward(out);
+    }
+    int64_t n_embed, n_head, head_dim;


A common C++ convention is to give member variables an m_ prefix. This helps prevent variable shadowing.

kamatajinkya2 · 2025-07-09T12:48:13Z

GPT2/train.cpp

+};
+TORCH_MODULE(CausalSelfAttention);
+
+struct MLPImpl : torch::nn::Module {


Mark this as final so no one accidentally inherits from it.

kamatajinkya2 · 2025-07-09T12:48:38Z

GPT2/train.cpp

+TORCH_MODULE(GPT);
+
+int main() {
+    Config cfg;


Prefer auto-style initialization, e.g. auto cfg = Config{};, since a variable declared with auto must have an initializer. This prevents uninitialized garbage values.

ChinmayK0607 added 5 commits July 8, 2025 20:07

Create train.cpp

b046526

Create train.py

ccef61f

Create dataloader.py

d6973ae

Update README.md

33954a9

Create requirements.txt

ecc0284

kamatajinkya2 reviewed Jul 9, 2025
