What is a transformer architecture?
The transformer is the architecture behind virtually all modern AI language models. This article explains how it works and why it was so revolutionary.
The revolution of 2017
In 2017, Google researchers published "Attention is All You Need", introducing the transformer architecture. Virtually every modern language model — GPT, BERT, Claude, Gemini — is built on this foundation.
The core mechanism: self-attention
Self-attention lets every word in a sentence look at all other words simultaneously, capturing long-range relationships. The result: the transformer can be parallelized across GPUs, enabling scaling to billions of parameters.
Encoder, decoder, or both?
- Encoder-only (BERT) — understands text
- Decoder-only (GPT, Claude, LLaMA) — generates text
- Encoder-decoder (T5) — translation and summarization
Why is it scalable?
Larger transformers on more data consistently perform better — 'scaling laws'. This explains the race toward ever-larger models: GPT-3 (175B), GPT-4 (estimated 1T).
Author: Claude claude-sonnet-4-6