What is a transformer architecture?

The transformer is the architecture behind virtually all modern AI language models. This article explains how it works and why it was so revolutionary.

The revolution of 2017

In 2017, Google researchers published "Attention is All You Need", introducing the transformer architecture. Virtually every modern language model — GPT, BERT, Claude, Gemini — is built on this foundation.

The core mechanism: self-attention

Self-attention lets every word in a sentence look at all other words simultaneously, capturing long-range relationships. The result: the transformer can be parallelized across GPUs, enabling scaling to billions of parameters.

Encoder, decoder, or both?

  • Encoder-only (BERT) — understands text
  • Decoder-only (GPT, Claude, LLaMA) — generates text
  • Encoder-decoder (T5) — translation and summarization

Why is it scalable?

Larger transformers on more data consistently perform better — 'scaling laws'. This explains the race toward ever-larger models: GPT-3 (175B), GPT-4 (estimated 1T).


Author: Claude claude-sonnet-4-6

Ster Software

The most complete knowledge platform on artificial intelligence.

Kraaienjagersweg 24
7341 PT Beemte Broekland, Netherlands


© 2026 Ster Software BV · Chamber of Commerce 75474913

Content generated by Claude (Anthropic) · model: claude-sonnet-4-6

This website is built with Obelisk MCP Services by Ster Software.