Understanding BLEU and Perplexity for Comprehensive Text Generation Evaluation
Evaluating machine-generated text requires more than one metric to capture both fluency and correctness. A common combination pairs perplexity with BLEU or ROUGE: perplexity measures how fluent or likely a sentence is under a language model, while BLEU measures how closely the output resembles a human-written reference. In this post, we take a deep dive into BLEU: its meaning, its formula, a worked-out example, and an implementation in Python.
🔍 What is BLEU?
BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-generated text, especially in tasks like machine translation, text summarization, or language generation.
- Precision-based: BLEU measures how many n-grams in the candidate also appear in the reference. It checks n-gram overlap, not semantic meaning.
- N-gram matching: Typically calculated for 1-gram to 4-gram (BLEU-1 to BLEU-4).
- Brevity Penalty: Penalizes overly short outputs to avoid gaming the score.
- Score Range: 0 to 1 (or 0% to 100%). Higher is better.
📘 Example
Reference: "The cat is on the mat."
Candidate: "The cat sat on the mat."
BLEU-1 (unigram): 5 of 6 words match → high score
BLEU-2 (bigram): "the cat", "on the", "the mat" match → 3 of 5 bigrams → lower score
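To make the overlap concrete, here is a minimal sketch (the helper names are our own, purely illustrative) that counts clipped n-gram matches for this example:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(cand, ref, n):
    """Count candidate n-grams, clipping each count at its count in the reference."""
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matched, sum(cand_counts.values())

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(clipped_matches(candidate, reference, 1))  # (5, 6) -> 5 of 6 unigrams
print(clipped_matches(candidate, reference, 2))  # (3, 5) -> 3 of 5 bigrams
```

Clipping matters when the candidate repeats an n-gram more often than the reference contains it: "the" appears twice in both sentences here, so both occurrences count.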
🔗 Why Combine with Perplexity?
- Perplexity measures fluency: How likely is the sentence under the language model?
- BLEU measures fidelity: How closely does it match a reference?
Together, they offer a comprehensive evaluation:
- BLEU/ROUGE → correctness
- Perplexity → naturalness
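As a quick illustration of the perplexity side, here is a minimal sketch that computes perplexity from per-token probabilities. The probability values below are made-up numbers for demonstration, not the output of a real language model:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical probabilities a language model might assign to each token
probs = [0.25, 0.1, 0.5, 0.3]
print(f"Perplexity: {perplexity(probs):.2f}")  # lower = more fluent under the model
```

Note the complementary failure modes: a fluent but unfaithful output can have low perplexity and low BLEU, while a faithful but stilted output can show the opposite.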
🔢 BLEU Score Formula
The BLEU score is calculated using a modified n-gram precision with a brevity penalty:
\[ \text{BLEU} = \text{BP} \cdot \exp \left( \sum_{n=1}^N w_n \cdot \log p_n \right) \]
Where:
- \( N \): Maximum n-gram order (e.g., 4 for BLEU-4)
- \( w_n \): Weight for n-gram precision (usually \( w_n = \frac{1}{N} \))
- \( p_n \): Modified precision for n-grams of size \( n \)
- \( \text{BP} \): Brevity Penalty
✅ Modified n-gram precision
\[ p_n = \frac{\sum_{\text{n-gram} \in \text{candidate}} \min(\text{count}_{\text{candidate}}, \text{max\_ref\_count})}{\sum_{\text{n-gram} \in \text{candidate}} \text{count}_{\text{candidate}}} \]
✂️ Brevity Penalty (BP)
\[ \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - \frac{r}{c})} & \text{if } c \leq r \end{cases} \]
where \( c \) is the candidate length and \( r \) is the reference length.
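These formulas translate almost line for line into code. A minimal sketch, with function names of our own choosing:

```python
import math

def brevity_penalty(c, r):
    """BP = 1 when the candidate (length c) is longer than the reference (length r),
    otherwise exp(1 - r/c)."""
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu_from_precisions(precisions, c, r):
    """Combine modified n-gram precisions with uniform weights w_n = 1/N."""
    n = len(precisions)
    log_avg = sum(math.log(p) for p in precisions) / n  # assumes every p_n > 0
    return brevity_penalty(c, r) * math.exp(log_avg)

# Example: p1 = 5/6, p2 = 2/5, candidate length 6, reference length 7
print(f"{bleu_from_precisions([5/6, 2/5], 6, 7):.4f}")
```

The geometric mean of the precisions (the `exp` of the averaged logs) means a single zero precision drives the whole score to zero, which is why smoothing is used in practice.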
🧮 Worked-out BLEU Example (BLEU-2)
📝 Setup
Candidate: "the cat is on the mat"
Reference: "there is a cat on the mat"
🔹 Step 1: Extract n-grams
Unigrams
Candidate: ["the", "cat", "is", "on", "the", "mat"]
Reference: ["there", "is", "a", "cat", "on", "the", "mat"]
Bigrams
Candidate: ["the cat", "cat is", "is on", "on the", "the mat"]
Reference: ["there is", "is a", "a cat", "cat on", "on the", "the mat"]
🔹 Step 2: Modified Precision
Unigram: Matches = 5, Total = 6 → \( p_1 = \frac{5}{6} \)
Bigram: Matches = 2, Total = 5 → \( p_2 = \frac{2}{5} \)
🔹 Step 3: Brevity Penalty
\[ c = 6, \quad r = 7 \]
\[ \text{BP} = e^{1 - 7/6} = e^{-1/6} \approx 0.8465 \]
🔹 Step 4: Compute BLEU
\[ \text{BLEU} = 0.8465 \cdot \exp \left( 0.5 \cdot \log\left(\frac{5}{6}\right) + 0.5 \cdot \log\left(\frac{2}{5}\right) \right) = 0.8465 \cdot \exp(-0.5493) \approx 0.8465 \cdot 0.5774 \approx 0.4887 \]
Final BLEU-2 Score ≈ 0.4887 (or 48.87%)
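You can check the arithmetic above in a few lines of Python:

```python
import math

p1, p2 = 5 / 6, 2 / 5      # modified unigram and bigram precisions
c, r = 6, 7                # candidate and reference lengths
bp = math.exp(1 - r / c)   # c <= r, so BP = e^(1 - 7/6)
bleu2 = bp * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))
print(f"BP = {bp:.4f}, BLEU-2 = {bleu2:.4f}")  # BP = 0.8465, BLEU-2 = 0.4887
```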
💻 How to Implement BLEU in Python
✅ 1. Using NLTK
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# sentence_bleu expects a list of references (each a token list) and a tokenized candidate
reference = [['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'is', 'on', 'the', 'mat']

# Smoothing avoids a zero score when a higher-order n-gram has no matches
smoothie = SmoothingFunction().method1

# The weights select which n-gram orders contribute to the score
score1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoothie)
score2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0), smoothing_function=smoothie)
score4 = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smoothie)

print(f"BLEU-1 Score: {score1:.4f}")
print(f"BLEU-2 Score: {score2:.4f}")
print(f"BLEU-4 Score: {score4:.4f}")
```
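When evaluating many sentences, NLTK's `corpus_bleu` is usually preferable to averaging per-sentence scores, because it pools n-gram counts and lengths across the whole corpus before computing one score. A sketch, where the second sentence pair is made up for illustration:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of references per candidate sentence (each reference is a token list)
references = [
    [['there', 'is', 'a', 'cat', 'on', 'the', 'mat']],  # refs for sentence 1
    [['the', 'dog', 'ran', 'away']],                    # refs for sentence 2
]
candidates = [
    ['the', 'cat', 'is', 'on', 'the', 'mat'],
    ['the', 'dog', 'ran', 'off'],
]

smoothie = SmoothingFunction().method1
score = corpus_bleu(references, candidates,
                    weights=(0.5, 0.5, 0, 0),
                    smoothing_function=smoothie)
print(f"Corpus BLEU-2: {score:.4f}")
```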
📦 To install NLTK:

```shell
pip install nltk
```

The `punkt` tokenizer data is only needed if you use NLTK to tokenize raw text; `sentence_bleu` itself takes pre-tokenized lists:

```python
import nltk
nltk.download('punkt')
```
⚙️ 2. Manual Implementation
If you want to build BLEU from scratch (e.g., BLEU-2), the formulas above translate directly into a short, self-contained function, which makes for a worthwhile educational exercise.
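As a sketch of what such an implementation might look like, the following follows the formulas above for a single reference, without smoothing (the helper names are our own):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, ref, n):
    """Clipped n-gram precision against a single reference."""
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

def bleu(cand, ref, max_n=2):
    """BLEU with uniform weights 1/max_n; returns 0 if any precision is 0."""
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat is on the mat".split()
ref = "there is a cat on the mat".split()
print(f"Manual BLEU-2: {bleu(cand, ref):.4f}")  # matches the worked example above
```

Unlike NLTK's version, this sketch handles only one reference and applies no smoothing, so any unmatched n-gram order zeroes the score.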
🔚 Conclusion
BLEU offers a powerful, quantitative way to assess the quality of generated text in tasks like translation and summarization. When used together with perplexity, you can evaluate both fidelity and fluency of your model outputs — a combination that’s essential for any serious NLP evaluation pipeline.