GPT-2 From Scratch: A First-Principles Transformer

Tags: LLM · Transformers · PyTorch

A from-scratch implementation of a GPT-2 style transformer in PyTorch, built up one component at a time to understand the architecture from first principles rather than calling into a library. Each block is implemented by hand and tested against a reference GPT-2 (via TransformerLens) to confirm the activations match:

  • LayerNorm — normalization with learned scale and bias
  • Token & positional embeddings — lookup tables mapping tokens and positions into the residual stream
  • Multi-head self-attention — Q/K/V projections, masked attention scores, and per-head mixing
  • MLP / feed-forward block with GELU
  • Transformer block — residual connections wiring attention and MLP together
  • Unembedding to logits, assembled into the full transformer

A small training loop then trains the model end-to-end on text, demonstrating next-token prediction on the architecture built above.

Built while following Neel Nanda’s transformer implementation tutorial.

Colab notebook · Reference video