Here's a storytelling version of the “Attention is All You Need” paper that weaves the technical details into a narrative:
🧙‍♂️ The Tale of the Transformer: A New Spell in the Kingdom of Language
Once upon a time, in the vast and complex Kingdom of Language, wizards known as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) reigned supreme. These magical beings could read and translate texts from one language to another, but they had a flaw—they were slow, tired easily, and struggled to remember distant words in long sentences.
Despite enhancements like Long Short-Term Memory (LSTM) spells and attention charms, they still trudged word by word, step by step. The kingdom yearned for a new spell—one that could see the whole sentence at once, with no memory lapses, and perform faster than ever before.
🌟 And then, from the research halls of Google Brain, a band of mages—Ashish, Noam, Niki, Jakob, and others—gathered to forge a new artifact. They called it the Transformer.
🔮 The Birth of the Transformer
Unlike its predecessors, the Transformer had no moving parts like loops or filters. It was crafted entirely from self-attention magic—a spell that allowed every word to look at every other word, no matter how far apart they were.
The Transformer had two wings:
- An Encoder to read the input sentence.
- A Decoder to write the translated sentence.
Each wing was built with 6 towers, and inside each tower were powerful components:
- Multi-Head Attention: Like a council of wise seers, each head focused on different aspects of the sentence.
- Feedforward Networks: Silent scribes who transformed thoughts at each position.
- Positional Encodings: Since the Transformer had no memory of order, these encodings whispered the location of each word using sine and cosine waves (a small sketch follows this list).
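To make that whisper concrete, here is a minimal NumPy sketch of the sinusoidal positional encodings the paper describes: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function name and the toy sizes below are illustrative choices, not from the paper.

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

# Encodings for a 10-word sentence with the paper's base width d_model = 512
print(positional_encoding(max_len=10, d_model=512).shape)  # (10, 512)
```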
⚔️ How the Magic Worked
When a sentence was spoken to the Transformer, it didn’t read left to right. Instead, all words spoke at once, and the attention mechanism let them listen to each other, weighing which words mattered most.
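That weighing step is the paper's scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. Below is a minimal, self-contained NumPy sketch; the toy queries, keys, and values are random stand-ins, and the helper name is an illustrative choice.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: every word weighs every other word."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)            # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over the keys
    return weights @ V, weights

# Toy example: 4 words with 8-dimensional queries, keys, and values
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(attn.shape)  # (4, 4): each word's attention over all four words
```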
In the Decoder, a veil was drawn—the Transformer could not peek ahead into the future, preserving the element of surprise and ensuring proper sequence generation.
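That veil is just an upper-triangular mask applied to the attention scores before the softmax, so future positions receive zero weight. A tiny sketch of the idea (the helper name is illustrative):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """True above the diagonal: position i may only attend to positions <= i."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Masked scores are set to -inf, so the softmax gives future words zero weight
scores = np.zeros((4, 4))
scores[causal_mask(4)] = -np.inf
print(scores)
```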
The Multi-Head Attention allowed the Transformer to focus on grammar, meaning, position, and syntax—all at once, through different heads. It was like having 8 minds thinking in harmony.
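One way to picture those 8 minds in code: split the model dimension into 8 smaller subspaces, run attention independently in each, then concatenate the heads and project them back together. This is a rough sketch under that reading; the random weight matrices are stand-ins, not the paper's trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads=8):
    """Each head attends in its own d_model / n_heads subspace; results are concatenated."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project the input, then split the result into (n_heads, seq_len, d_head)
    def split(W):
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_q), split(W_k), split(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # every head gets its own attention pattern
    heads = softmax(scores) @ V                           # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                   # final output projection

# Toy example: 5 words, a small d_model of 64, and the paper's 8 heads
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 64))
W_q, W_k, W_v, W_o = (0.1 * rng.normal(size=(64, 64)) for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o).shape)  # (5, 64)
```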
🏆 The Great Benchmark Battle
To prove its power, the Transformer was sent into the WMT 2014 Machine Translation Tournament, facing off against powerful beasts like GNMT and ConvS2S.
And lo! It conquered the English-to-German challenge with a BLEU score of 28.4, surpassing even the mighty ensembles. In the English-to-French task, it scored 41.0, defeating all single-model rivals—and it did this with far less training cost and time.
📜 Lessons from the Journey
The scholars learned much:
- More heads aren't always better, but 8 was just right.
- Bigger models learned better, if trained carefully.
- Label smoothing and dropout kept the Transformer from overfitting and growing arrogant (label smoothing is sketched just after this list).
- Sinusoidal positions were just as good as learned ones, and more elegant.
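For the curious, here is one common way label smoothing can be written, in the spirit of the paper's ε = 0.1: the true word keeps probability 1 - ε and the rest of the mass is spread over the remaining vocabulary. The helper name and the exact split over the other classes are illustrative simplifications, not necessarily the paper's formulation.

```python
import numpy as np

def smoothed_targets(true_ids, vocab_size, eps=0.1):
    """The true class gets 1 - eps; the remaining eps is shared by the other classes."""
    targets = np.full((len(true_ids), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(true_ids)), true_ids] = 1.0 - eps
    return targets

# Never-fully-certain target distributions for two words over a 5-word toy vocabulary
print(smoothed_targets(true_ids=[2, 0], vocab_size=5, eps=0.1))
```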
🌌 Legacy and Beyond
Thus, the Transformer was crowned a legend. From its lineage came great descendants: BERT, GPT, T5, ViT—each carrying the legacy of attention-forward thinking.
The kingdom of language was forever transformed, and a new age of parallel, fast, and powerful understanding dawned.
And the mages, satisfied with their creation, released it to the world:
👉 https://github.com/tensorflow/tensor2tensor