2017

Attention Is All You Need — the Transformer

Researchers at Google publish the Transformer architecture, which displaces recurrent networks as the dominant approach to sequence modeling and forms the foundation for GPT, BERT, Claude, Gemini, and most modern LLMs.

The architecture that changed everything

In June 2017, a team of eight researchers, most of them at Google Brain and Google Research, released a paper titled Attention Is All You Need. It introduced the Transformer, an architecture for processing sequential data such as text that relied entirely on attention mechanisms, discarding the recurrent neural networks (RNNs, including LSTMs) that had previously dominated sequence modeling. The paper has become one of the most cited in AI and the foundation of virtually every major language model created since.

The key innovation: self-attention

In a recurrent network, information is processed step by step, one token after another, so context from the beginning of a long sentence is difficult to carry to the end. The Transformer addressed this by allowing every token in a sequence to attend directly to every other token in a single step. For each token, the self-attention mechanism computes a weight for every token in the context, expressing how much that token should contribute to its updated representation. Because these computations do not wait on the output of a previous step, training can be parallelized across the whole sequence, which made it practical to train on vastly larger datasets.
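To make the mechanism concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not the paper's implementation: the function names and the three example vectors are made up, and the token vectors are used directly as queries, keys, and values, whereas the Transformer first passes them through learned linear projections and runs several attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence.

    queries, keys, values: lists of equal-length float vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how much token i attends
    to token j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Compare this token's query with every token's key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        row = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [
            sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(row)
    return outputs, weights

# Hypothetical 3-token sequence with 2-dimensional vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights is computed independently of the others, which is why the loop over tokens can be replaced by a single matrix multiplication on parallel hardware.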

From machine translation to LLMs

The original Transformer was developed for machine translation and set new state-of-the-art results on the WMT 2014 English-to-German and English-to-French benchmarks. But its impact was far broader. BERT (2018) used the Transformer encoder for language understanding. GPT (2018) used the decoder for language generation. GPT-2, GPT-3, GPT-4, Claude, Gemini, and Llama are all Transformer-based models, scaled to billions or hundreds of billions of parameters. The core components from the 2017 paper (multi-head attention, feed-forward layers, layer normalization, and residual connections) remain recognizable in modern LLMs, even where details such as the placement of normalization have changed.
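The way these components fit together can be sketched in a few lines of plain Python. The 2017 paper wraps each sublayer as LayerNorm(x + Sublayer(x)); the sketch below applies that wrapper to a small feed-forward layer. The helper names and weight values are hypothetical, and the learned scale and shift of layer normalization and the bias terms are omitted for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def feed_forward(x, w1, w2):
    """Position-wise feed-forward layer: linear, ReLU, linear (biases omitted).

    w1 holds one weight vector per hidden unit, w2 one per output unit.
    """
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

def residual_sublayer(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

# Made-up example: model width 2, hidden width 3.
x = [2.0, 1.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w2 = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
out = residual_sublayer(x, lambda v: feed_forward(v, w1, w2))
print([round(o, 3) for o in out])
```

A full encoder layer would apply the same wrapper twice per token, once around multi-head attention and once around the feed-forward layer. The residual path lets the input pass through unchanged alongside the sublayer's output, which helps when stacking many layers.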

The "Transformer eight"

The eight authors are Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. All eight later left Google, most of them to found startups. Noam Shazeer co-founded Character.AI. Aidan Gomez co-founded Cohere. Illia Polosukhin co-founded NEAR Protocol, a blockchain platform. Measured by the companies and products built on it, the paper they wrote in 2017 may be one of the highest-return scientific papers in economic history.

Sources

Vaswani, A. et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NeurIPS 2017). arXiv:1706.03762.
Wikipedia — Transformer

Author: Claude claude-sonnet-4-6
