Manual Walkthrough of Attention Mechanism in Sequence-to-Sequence Models
The attention mechanism revolutionized sequence-to-sequence (seq2seq) models by allowing the decoder to dynamically focus on relevant parts of the input sequence. This blog post manually walks through a small example using dot-product attention, illustrating every computation step with numbers.
Setup
Let’s define the hidden states from the encoder and the current hidden state from the decoder:
- Encoder hidden states (\( h_1, h_2, h_3 \)) are vectors in \( \mathbb{R}^2 \)
- Decoder hidden state at time \( t \), denoted \( s_t \in \mathbb{R}^2 \)
Given:
\[ h_1 = \begin{bmatrix}1 \\ 0\end{bmatrix}, \quad h_2 = \begin{bmatrix}0 \\ 1\end{bmatrix}, \quad h_3 = \begin{bmatrix}1 \\ 1\end{bmatrix}, \quad s_t = \begin{bmatrix}1 \\ 2\end{bmatrix} \]
Step 1: Compute Attention Scores \( e^t \)
Each attention score is the dot product between the decoder state \( s_t \) and encoder hidden states:
\[ e_1 = s_t^T h_1 = [1\ 2] \cdot [1\ 0] = 1 \]
\[ e_2 = s_t^T h_2 = [1\ 2] \cdot [0\ 1] = 2 \]
\[ e_3 = s_t^T h_3 = [1\ 2] \cdot [1\ 1] = 3 \]
Thus, the score vector is:
\[ e^t = \begin{bmatrix}1 \\ 2 \\ 3\end{bmatrix} \]
Step 2: Compute Attention Weights via Softmax
We convert scores to a probability distribution:
\[ \alpha^t_i = \frac{\exp(e_i)}{\sum_{j=1}^3 \exp(e_j)} \]
Computing the exponentials:
- \( \exp(1) \approx 2.718 \)
- \( \exp(2) \approx 7.389 \)
- \( \exp(3) \approx 20.085 \)
Partition function:
\[ Z = 2.718 + 7.389 + 20.085 = 30.192 \]
Now compute each attention weight:
\[ \alpha_1^t = \frac{2.718}{30.192} \approx 0.090,\quad \alpha_2^t = \frac{7.389}{30.192} \approx 0.245,\quad \alpha_3^t = \frac{20.085}{30.192} \approx 0.665 \]
Note that the weights sum to 1, as a softmax output must.
Step 3: Compute Attention Output \( a_t \)
We take the weighted sum of the encoder hidden states using the attention weights:
\[ a_t = \sum_{i=1}^3 \alpha_i^t h_i = 0.090 \cdot h_1 + 0.245 \cdot h_2 + 0.665 \cdot h_3 \]
Computing each term:
\[ 0.090 \cdot \begin{bmatrix}1 \\ 0\end{bmatrix} = \begin{bmatrix}0.090 \\ 0\end{bmatrix},\quad 0.245 \cdot \begin{bmatrix}0 \\ 1\end{bmatrix} = \begin{bmatrix}0 \\ 0.245\end{bmatrix},\quad 0.665 \cdot \begin{bmatrix}1 \\ 1\end{bmatrix} = \begin{bmatrix}0.665 \\ 0.665\end{bmatrix} \]
Summing them up:
\[ a_t = \begin{bmatrix}0.090 + 0 + 0.665 \\ 0 + 0.245 + 0.665\end{bmatrix} = \begin{bmatrix}0.755 \\ 0.910\end{bmatrix} \]
Step 4: Concatenate with Decoder State
We concatenate the context vector \( a_t \) with the decoder hidden state \( s_t \):
\[ [s_t; a_t] = \begin{bmatrix}1 \\ 2 \\ 0.755 \\ 0.910\end{bmatrix} \in \mathbb{R}^4 \]
Summary Table
| Component | Value |
|---|---|
| Encoder Hidden States | \( h_1 = [1, 0],\ h_2 = [0, 1],\ h_3 = [1, 1] \) |
| Decoder State | \( s_t = [1, 2] \) |
| Attention Scores | \( e^t = [1, 2, 3] \) |
| Attention Weights | \( \alpha^t = [0.090, 0.245, 0.665] \) |
| Context Vector | \( a_t = [0.755, 0.910] \) |
| Final Vector | \( [s_t; a_t] = [1, 2, 0.755, 0.910] \) |
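The entire walkthrough above can be reproduced in a few lines of NumPy. This is a minimal sketch, with variable names chosen to mirror the notation in the text (they are not from any particular library):

```python
import numpy as np

# Encoder hidden states h_1, h_2, h_3 stacked as rows, and decoder state s_t
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
s_t = np.array([1.0, 2.0])

# Step 1: attention scores e_i = s_t^T h_i (one dot product per encoder state)
e = H @ s_t                            # [1. 2. 3.]

# Step 2: softmax turns the scores into a probability distribution
alpha = np.exp(e) / np.exp(e).sum()    # ~[0.090 0.245 0.665]

# Step 3: context vector = weighted sum of encoder states
a_t = alpha @ H                        # ~[0.755 0.910]

# Step 4: concatenate decoder state and context vector
final = np.concatenate([s_t, a_t])     # ~[1. 2. 0.755 0.910]
```

Running this confirms every number in the table, up to rounding.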
Conclusion
This example demonstrates the intuition and calculation behind the attention mechanism. By aligning the decoder state with encoder states via softmax-weighted dot products, attention helps models focus on the most relevant inputs dynamically. This mechanism has become a cornerstone of modern NLP architectures like Transformers.
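To make the Transformer connection concrete: the same dot-product-plus-softmax recipe, with an extra \( 1/\sqrt{d} \) scaling, is scaled dot-product attention, \( \mathrm{softmax}(QK^T/\sqrt{d})V \). Below is a minimal NumPy sketch (the function name is our own, not a library API); note that because of the scaling, the weights it produces differ slightly from the unscaled walkthrough above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, Transformer-style.

    Q: (n_q, d) queries; K, V: (n_k, d) keys and values.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Our seq2seq example in this framing: the decoder state is the single
# query, and the encoder states serve as both keys and values.
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
s_t = np.array([[1.0, 2.0]])
context = scaled_dot_product_attention(s_t, H, H)
```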