Monday, 2 June 2025

Understanding BLEU and Perplexity for Comprehensive Text Generation Evaluation

Evaluating machine-generated text requires more than one metric to capture both fluency and correctness. A common combination is using Perplexity with BLEU or ROUGE. While perplexity measures how fluent or likely a sentence is under a language model, BLEU measures how closely the output resembles a human-written reference. In this post, we deep-dive into BLEU: its meaning, formula, a worked-out example, and implementation in Python.

🔍 What is BLEU?

BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-generated text, especially in tasks like machine translation, text summarization, or language generation.

  • Precision-based: BLEU measures how many n-grams in the candidate also appear in the reference. It checks n-gram overlap, not semantic meaning.
  • N-gram matching: Typically calculated for 1-gram to 4-gram (BLEU-1 to BLEU-4).
  • Brevity Penalty: Penalizes overly short outputs to avoid gaming the score.
  • Score Range: 0 to 1 (or 0% to 100%). Higher is better.

📘 Example

Reference: "The cat is on the mat."
Candidate: "The cat sat on the mat."

BLEU-1 (unigram): 5 of 6 candidate words appear in the reference ("the", "cat", "on", "the", "mat") → high score
BLEU-2 (bigram): only "the cat", "on the", and "the mat" match (3 of 5 bigrams) → lower score

🔗 Why Combine with Perplexity?

  • Perplexity measures fluency: How likely is the sentence under the language model?
  • BLEU measures fidelity: How closely does it match a reference?

Together, they offer a comprehensive evaluation:

  • BLEU/ROUGE → correctness
  • Perplexity → naturalness
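Perplexity itself is simply the exponential of the average negative log-likelihood per token. A minimal sketch — the per-token probabilities below are made-up placeholders, not output from a real language model:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token probabilities assigned by some language model
probs = [0.2, 0.1, 0.4, 0.25]
print(round(perplexity([math.log(p) for p in probs]), 2))  # ≈ 4.73
```

A lower perplexity means the model found the sentence more likely, i.e. more fluent under that model.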

🔢 BLEU Score Formula

The BLEU score is calculated using a modified n-gram precision with a brevity penalty:

\[ \text{BLEU} = \text{BP} \cdot \exp \left( \sum_{n=1}^N w_n \cdot \log p_n \right) \]

Where:

  • \( N \): Maximum n-gram order (e.g., 4 for BLEU-4)
  • \( w_n \): Weight for n-gram precision (usually \( w_n = \frac{1}{N} \))
  • \( p_n \): Modified precision for n-grams of size \( n \)
  • \( \text{BP} \): Brevity Penalty

✅ Modified n-gram precision

\[ p_n = \frac{\sum_{\text{n-gram} \in \text{candidate}} \min(\text{count}_\text{candidate}, \text{max\_ref\_count})}{\sum_{\text{n-gram} \in \text{candidate}} \text{count}_\text{candidate}} \]
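This clipped-count formula maps directly onto a few lines of Python with collections.Counter. A minimal sketch (ngrams and modified_precision are illustrative helpers, not from any library):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clip each candidate n-gram count by its count in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

cand = "the cat is on the mat".split()
ref = "there is a cat on the mat".split()
print(modified_precision(cand, ref, 1))  # 5/6: "the" is clipped from 2 to 1
print(modified_precision(cand, ref, 2))  # 2/5: only "on the" and "the mat" match
```

The clipping step is what stops a candidate like "the the the the" from scoring a perfect unigram precision.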

✂️ Brevity Penalty (BP)

\[ \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - \frac{r}{c})} & \text{if } c \leq r \end{cases} \]

\( c \): candidate length, \( r \): reference length
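The brevity penalty is a one-liner. A small sketch under the definition above:

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if the candidate is longer than the reference, else e^(1 - r/c)."""
    return 1.0 if c > r else math.exp(1 - r / c)

print(round(brevity_penalty(6, 7), 4))  # e^(-1/6) ≈ 0.8465
print(brevity_penalty(8, 7))            # 1.0 — no penalty for longer candidates
```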

🧮 Worked-out BLEU Example (BLEU-2)

📝 Setup

Candidate: "the cat is on the mat"
Reference: "there is a cat on the mat"

🔹 Step 1: Extract n-grams

Unigrams
Candidate: ["the", "cat", "is", "on", "the", "mat"]
Reference: ["there", "is", "a", "cat", "on", "the", "mat"]

Bigrams
Candidate: ["the cat", "cat is", "is on", "on the", "the mat"]
Reference: ["there is", "is a", "a cat", "cat on", "on the", "the mat"]

🔹 Step 2: Modified Precision

Unigram: Matches = 5, Total = 6 → \( p_1 = \frac{5}{6} \)
Bigram: Matches = 2, Total = 5 → \( p_2 = \frac{2}{5} \)

🔹 Step 3: Brevity Penalty

\[ c = 6, \quad r = 7 \]
\[ \text{BP} = e^{1 - 7/6} = e^{-1/6} \approx 0.8465 \]

🔹 Step 4: Compute BLEU

\[ \text{BLEU} = 0.8465 \cdot \exp \left( 0.5 \cdot \log\left(\frac{5}{6}\right) + 0.5 \cdot \log\left(\frac{2}{5}\right) \right) = 0.8465 \cdot \exp(-0.5493) \approx 0.8465 \cdot 0.5774 \approx 0.4887 \]

Final BLEU-2 Score ≈ 0.4887 (or 48.87%)
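The arithmetic above is easy to double-check in a few lines of Python:

```python
import math

p1, p2 = 5 / 6, 2 / 5          # modified unigram and bigram precisions
bp = math.exp(1 - 7 / 6)       # brevity penalty with c = 6, r = 7
bleu2 = bp * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))
print(round(bleu2, 4))  # ≈ 0.4887
```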

💻 How to Implement BLEU in Python

✅ 1. Using NLTK

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# sentence_bleu takes a list of tokenized references and one tokenized candidate
reference = [['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'is', 'on', 'the', 'mat']

# Smoothing avoids log(0) when some n-gram order has no matches
smoothie = SmoothingFunction().method1

# The weights tuple selects the n-gram orders and their relative weights
score1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoothie)              # BLEU-1
score2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0), smoothing_function=smoothie)          # BLEU-2
score4 = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smoothie)  # BLEU-4

print(f"BLEU-1 Score: {score1:.4f}")
print(f"BLEU-2 Score: {score2:.4f}")
print(f"BLEU-4 Score: {score4:.4f}")

📦 To install NLTK

pip install nltk

The punkt tokenizer is only needed if you tokenize raw text yourself (e.g., with nltk.word_tokenize); sentence_bleu works directly on pre-tokenized lists:

import nltk
nltk.download('punkt')

⚙️ 2. Manual Implementation

If you want to build BLEU from scratch (e.g., BLEU-2), you can do so for educational purposes. Reach out for a step-by-step implementation if you're interested.
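As a starting point, here is a minimal from-scratch BLEU (single reference, uniform weights) consistent with the formulas above — an educational sketch, not a replacement for NLTK's implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Single-reference BLEU with uniform weights w_n = 1/max_n."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0:          # any zero precision drives BLEU to 0
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

cand = "the cat is on the mat".split()
ref = "there is a cat on the mat".split()
print(round(bleu(cand, ref), 4))  # ≈ 0.4887, matching the worked example
```

Note the early return: because the geometric mean uses log p_n, a single n-gram order with zero matches sends unsmoothed BLEU straight to 0 — which is exactly why NLTK offers SmoothingFunction.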

🔚 Conclusion

BLEU offers a powerful, quantitative way to assess the quality of generated text in tasks like translation and summarization. When used together with perplexity, you can evaluate both fidelity and fluency of your model outputs — a combination that’s essential for any serious NLP evaluation pipeline.
