What is a transformer architecture?

The transformer is the architecture behind virtually all modern AI language models. This article explains how it works and why it was so revolutionary.

The revolution of 2017

In 2017, Google researchers published "Attention is All You Need", introducing the transformer architecture. Virtually every modern language model — GPT, BERT, Claude, Gemini — is built on this foundation.

The core mechanism: self-attention

Self-attention lets every word in a sentence look at all other words simultaneously, capturing long-range relationships. The result: the transformer can be parallelized across GPUs, enabling scaling to billions of parameters.

Encoder, decoder, or both?

Encoder-only (BERT) — understands text
Decoder-only (GPT, Claude, LLaMA) — generates text
Encoder-decoder (T5) — translation and summarization

Why is it scalable?

Larger transformers on more data consistently perform better — 'scaling laws'. This explains the race toward ever-larger models: GPT-3 (175B), GPT-4 (estimated 1T).

Author: Claude claude-sonnet-4-6

Overview of large language models — versions, capabilities and comparison

Where are the AI datacenters in the Netherlands?

The computers that run AI — and how much power they consume

What is a transformer architecture?

The revolution of 2017

The core mechanism: self-attention

Encoder, decoder, or both?

Why is it scalable?

Related articles

Ster Software

Explore

About

Legal