Manual Walkthrough of Attention Mechanism in Sequence-to-Sequence Models
The attention mechanism revolutionized sequence-to-sequence (seq2seq) models by allowing the decoder to dynamically focus on relevant parts of the input sequence. This blog post manually walks through a small example using dot-product attention, illustrating every computation step with numbers.
Setup
Let’s define the hidden states from the encoder and the current hidden state from the decoder:
- Encoder hidden states (\( h_1, h_2, h_3 \)) are vectors in \( \mathbb{R}^2 \)
- Decoder hidden state at time \( t \), denoted \( s_t \in \mathbb{R}^2 \)
Given:
\[ h_1 = \begin{bmatrix}1 \\ 0\end{bmatrix}, \quad h_2 = \begin{bmatrix}0 \\ 1\end{bmatrix}, \quad h_3 = \begin{bmatrix}1 \\ 1\end{bmatrix}, \quad s_t = \begin{bmatrix}1 \\ 2\end{bmatrix} \]
Step 1: Compute Attention Scores \( e^t \)
Each attention score is the dot product between the decoder state \( s_t \) and encoder hidden states:
\[ e_1 = s_t^T h_1 = [1\ 2] \cdot [1\ 0] = 1 \]
\[ e_2 = s_t^T h_2 = [1\ 2] \cdot [0\ 1] = 2 \]
\[ e_3 = s_t^T h_3 = [1\ 2] \cdot [1\ 1] = 3 \]
Thus, the score vector is:
\[ e^t = \begin{bmatrix}1 \\ 2 \\ 3\end{bmatrix} \]
Step 2: Compute Attention Weights via Softmax
We convert scores to a probability distribution:
\[ \alpha^t_i = \frac{\exp(e_i)}{\sum_{j=1}^3 \exp(e_j)} \]
Computing the exponentials:
- \( \exp(1) \approx 2.718 \)
- \( \exp(2) \approx 7.389 \)
- \( \exp(3) \approx 20.085 \)
Partition function:
\[ Z = 2.718 + 7.389 + 20.085 = 30.192 \]
Now compute each attention weight:
\[ \alpha_1^t = \frac{2.718}{30.192} \approx 0.090,\quad \alpha_2^t = \frac{7.389}{30.192} \approx 0.245,\quad \alpha_3^t = \frac{20.085}{30.192} \approx 0.665 \]
Note that the weights sum to 1, as a softmax output must.
Step 3: Compute Attention Output \( a_t \)
We take the weighted sum of the encoder hidden states using the attention weights:
\[ a_t = \sum_{i=1}^3 \alpha_i^t h_i = 0.090 \cdot h_1 + 0.245 \cdot h_2 + 0.665 \cdot h_3 \]
Computing each term:
\[ 0.090 \cdot \begin{bmatrix}1 \\ 0\end{bmatrix} = \begin{bmatrix}0.090 \\ 0\end{bmatrix},\quad 0.245 \cdot \begin{bmatrix}0 \\ 1\end{bmatrix} = \begin{bmatrix}0 \\ 0.245\end{bmatrix},\quad 0.665 \cdot \begin{bmatrix}1 \\ 1\end{bmatrix} = \begin{bmatrix}0.665 \\ 0.665\end{bmatrix} \]
Summing them up:
\[ a_t = \begin{bmatrix}0.090 + 0 + 0.665 \\ 0 + 0.245 + 0.665\end{bmatrix} = \begin{bmatrix}0.755 \\ 0.910\end{bmatrix} \]
Step 4: Concatenate with Decoder State
We concatenate the context vector \( a_t \) with the decoder hidden state \( s_t \):
\[ [s_t; a_t] = \begin{bmatrix}1 \\ 2 \\ 0.755 \\ 0.910\end{bmatrix} \in \mathbb{R}^4 \]
Summary Table
| Component | Value |
|---|---|
| Encoder Hidden States | \( h_1 = [1, 0],\ h_2 = [0, 1],\ h_3 = [1, 1] \) |
| Decoder State | \( s_t = [1, 2] \) |
| Attention Scores | \( e^t = [1, 2, 3] \) |
| Attention Weights | \( \alpha^t = [0.090, 0.245, 0.665] \) |
| Context Vector | \( a_t = [0.755, 0.910] \) |
| Final Vector | \( [s_t; a_t] = [1, 2, 0.755, 0.910] \) |
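The entire walkthrough above can be reproduced in a few lines of NumPy. This is a minimal sketch, with variable names chosen to mirror the notation in the text (they are not from any particular library):

```python
import numpy as np

# Encoder hidden states h_1, h_2, h_3 stacked as rows, and decoder state s_t
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
s_t = np.array([1.0, 2.0])

# Step 1: attention scores e_i = s_t^T h_i (one dot product per encoder state)
e = H @ s_t                            # [1. 2. 3.]

# Step 2: softmax turns the scores into a probability distribution
alpha = np.exp(e) / np.exp(e).sum()    # ~[0.090 0.245 0.665]

# Step 3: context vector = weighted sum of encoder states
a_t = alpha @ H                        # ~[0.755 0.910]

# Step 4: concatenate decoder state and context vector
final = np.concatenate([s_t, a_t])     # ~[1. 2. 0.755 0.910]
```

Running this confirms every number in the table, up to rounding.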
Conclusion
This example demonstrates the intuition and calculation behind the attention mechanism. By aligning the decoder state with encoder states via softmax-weighted dot products, attention helps models focus on the most relevant inputs dynamically. This mechanism has become a cornerstone of modern NLP architectures like Transformers.
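To make the Transformer connection concrete: the same dot-product-plus-softmax recipe, with an extra \( 1/\sqrt{d} \) scaling, is scaled dot-product attention, \( \mathrm{softmax}(QK^T/\sqrt{d})V \). Below is a minimal NumPy sketch (the function name is our own, not a library API); note that because of the scaling, the weights it produces differ slightly from the unscaled walkthrough above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, Transformer-style.

    Q: (n_q, d) queries; K, V: (n_k, d) keys and values.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Our seq2seq example in this framing: the decoder state is the single
# query, and the encoder states serve as both keys and values.
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
s_t = np.array([[1.0, 2.0]])
context = scaled_dot_product_attention(s_t, H, H)
```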