GPT-2 From Scratch: A First-Principles Transformer

Tags: LLM · Transformers · PyTorch

A from-scratch implementation of a GPT-2-style transformer in PyTorch, built one component at a time to understand the architecture from first principles rather than by calling into a library. Each component is implemented by hand and tested against a reference GPT-2 (via TransformerLens) to confirm that the activations match:

LayerNorm — normalization with learned scale and bias
Token & positional embeddings — lookup tables mapping tokens and positions into the residual stream
Multi-head self-attention — Q/K/V projections, masked attention scores, and per-head mixing
MLP / feed-forward block with GELU
Transformer block — residual connections wiring attention and MLP together
Unembedding to logits, assembled into the full transformer
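The components listed above can be sketched in a few dozen lines. This is a minimal illustration, not the notebook's actual code: the class names (LayerNorm, CausalSelfAttention, MLP, Block, TinyGPT), the hyperparameter names, and the fused Q/K/V projection are assumptions made for brevity. Dropout and GPT-2's weight initialization are omitted, and the lookup tables and projections use nn.Embedding and nn.Linear rather than hand-written weight matrices. The last line checks the hand-written LayerNorm against PyTorch's built-in F.layer_norm, a much smaller version of the reference comparison described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm(nn.Module):
    """Normalize over the last dimension, then apply a learned scale and bias."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.w = nn.Parameter(torch.ones(d_model))
        self.b = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return (x - mean) / torch.sqrt(var + self.eps) * self.w + self.b


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention where each position sees only itself and earlier positions."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # assumption: fused Q/K/V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: [batch, n_heads, seq, d_head]
        q, k, v = (t.view(batch, seq, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        # Mask out future positions (strictly above the diagonal) before the softmax
        future = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), diagonal=1)
        pattern = scores.masked_fill(future, float("-inf")).softmax(dim=-1)
        z = (pattern @ v).transpose(1, 2).reshape(batch, seq, d_model)  # merge heads
        return self.out(z)


class MLP(nn.Module):
    """Feed-forward block: expand to 4 * d_model, apply GELU, project back."""
    def __init__(self, d_model: int):
        super().__init__()
        self.fc_in = nn.Linear(d_model, 4 * d_model)
        self.fc_out = nn.Linear(4 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GPT-2 uses the tanh approximation of GELU
        return self.fc_out(F.gelu(self.fc_in(x), approximate="tanh"))


class Block(nn.Module):
    """Pre-LayerNorm block: attention and MLP each add their output to the residual stream."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1, self.attn = LayerNorm(d_model), CausalSelfAttention(d_model, n_heads)
        self.ln2, self.mlp = LayerNorm(d_model), MLP(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))


class TinyGPT(nn.Module):
    """Embeddings -> stack of blocks -> final LayerNorm -> unembedding to logits."""
    def __init__(self, d_vocab: int, n_ctx: int, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        self.embed = nn.Embedding(d_vocab, d_model)
        self.pos_embed = nn.Embedding(n_ctx, d_model)
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_layers))
        self.ln_final = LayerNorm(d_model)
        self.unembed = nn.Linear(d_model, d_vocab, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(tokens.shape[1], device=tokens.device)
        x = self.embed(tokens) + self.pos_embed(positions)  # residual stream
        for block in self.blocks:
            x = block(x)
        return self.unembed(self.ln_final(x))  # [batch, seq, d_vocab]


# Sanity check: the hand-written LayerNorm should agree with PyTorch's built-in one
x = torch.randn(2, 5, 16)
assert torch.allclose(LayerNorm(16)(x), F.layer_norm(x, (16,)), atol=1e-5)
```

Calling TinyGPT(d_vocab=50, n_ctx=16, d_model=16, n_heads=4, n_layers=2) on a [batch, seq] tensor of token ids returns logits of shape [batch, seq, 50]. The sizes here are arbitrary toy values, far smaller than GPT-2's.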

A small training loop then trains the assembled model end-to-end on text, demonstrating next-token prediction with the architecture built above.
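Next-token prediction comes down to shifting the targets by one position: the logits at position i are scored against the token at position i + 1. The sketch below shows that loss and a bare-bones optimization loop. It is an illustration under stated assumptions rather than the notebook's training code: the function name next_token_loss is hypothetical, the model is a deliberately tiny stand-in (an embedding followed by a linear layer) so the example runs on its own, and the data is a toy counting sequence rather than real text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the prediction at position i and the token at i + 1."""
    pred = logits[:, :-1, :]   # the last position has no next token to predict
    target = tokens[:, 1:]     # the first token is never a prediction target
    return F.cross_entropy(pred.reshape(-1, pred.shape[-1]), target.reshape(-1))


torch.manual_seed(0)
d_vocab, d_model = 50, 32

# Stand-in model (assumption): any module mapping [batch, seq] token ids to
# [batch, seq, d_vocab] logits could be dropped in here instead.
model = nn.Sequential(nn.Embedding(d_vocab, d_model), nn.Linear(d_model, d_vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# Toy batch (assumption): 4 identical sequences counting 0, 1, ..., 19
tokens = torch.arange(20).repeat(4, 1)

losses = []
for step in range(200):
    loss = next_token_loss(model(tokens), tokens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

A useful baseline when reading the loss curve: a model that outputs uniform logits scores ln(d_vocab), so a freshly initialized model should start near that value and fall from there.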

Built while following Neel Nanda’s transformer implementation tutorial.

Colab notebook · Reference video


Shariar Kabir
