Understanding ROUGE and Its Role in Evaluating Language Generation Models
In natural language generation tasks like summarization and translation, evaluating the quality of generated text is essential. While perplexity measures fluency and BLEU focuses on precision, ROUGE introduces a recall-oriented perspective, capturing how much of the reference text is represented in the generated output. In this post, we'll dive into ROUGE, work through a step-by-step example, and implement it in Python.
What is ROUGE?
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation systems by comparing the generated output (candidate) with one or more reference outputs. Unlike BLEU, which is more precision-oriented, ROUGE is recall-oriented—it measures how much of the reference text is captured by the generated output.
Key ROUGE Variants
- ROUGE-N: Measures n-gram overlap between the candidate and the reference.
  - ROUGE-1: Unigram (single-word) overlap
  - ROUGE-2: Bigram (two-word sequence) overlap
- ROUGE-L: Based on the Longest Common Subsequence (LCS). Captures sentence-level structure similarity.
- ROUGE-W: Weighted version of ROUGE-L, emphasizing consecutive matches more.
- ROUGE-S / ROUGE-SU: Uses skip-bigrams (word pairs in order but not necessarily adjacent); SU adds unigrams as well.
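To make the skip-bigram idea behind ROUGE-S concrete, here is a minimal sketch (the helper name `skip_bigrams` is our own, and we use a set for simplicity, ignoring duplicate pairs):

```python
from itertools import combinations

def skip_bigrams(tokens):
    """All in-order word pairs, adjacent or not (skip-bigrams)."""
    # combinations() preserves the original left-to-right order of the tokens
    return set(combinations(tokens, 2))

ref = "the cat sat on the mat".split()
pairs = skip_bigrams(ref)
print(('cat', 'mat') in pairs)  # True: gapped pair still counts as a skip-bigram
print(('mat', 'cat') in pairs)  # False: reversed order does not count
```

Skip-bigrams let ROUGE-S reward word pairs that appear in the same order even when other words intervene, which plain bigram matching would miss.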
ROUGE-N Formula
For ROUGE-N, the recall is computed as:
\[
\text{ROUGE-N} = \frac{\sum_{\text{reference}} \text{Count}_{\text{match}}(n\text{-grams})}{\sum_{\text{reference}} \text{Count}_{\text{reference}}(n\text{-grams})}
\]
Where:
- \(\text{Count}_{\text{match}}\): number of n-grams in both candidate and reference
- \(\text{Count}_{\text{reference}}\): total number of n-grams in the reference
Why Use ROUGE with Perplexity and BLEU?
Combining multiple metrics provides a more holistic view of a model's performance:
- Perplexity: Measures fluency—how well the model predicts the next word
- BLEU: Focuses on precision—how much of the candidate matches the reference
- ROUGE: Focuses on recall—how much of the reference is captured by the candidate
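To make the precision/recall distinction concrete, here is a minimal sketch using clipped unigram counts (the function name is our own):

```python
from collections import Counter

def unigram_overlap(reference, candidate):
    """Clipped unigram matches: each shared word counts at most min(ref, cand) times."""
    return sum((Counter(reference) & Counter(candidate)).values())

ref = "the cat sat on the mat".split()    # 6 tokens
cand = "the cat is on the mat".split()    # 6 tokens

matches = unigram_overlap(ref, cand)      # 5 shared unigrams
precision = matches / len(cand)           # BLEU-style: fraction of the candidate that matches
recall = matches / len(ref)               # ROUGE-style: fraction of the reference that is covered
print(precision, recall)                  # both 5/6 here because the lengths happen to match
```

With a longer candidate, precision would drop while recall stayed the same, which is exactly why the two metrics complement each other.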
Worked-Out Example of ROUGE-N
Reference Summary:
The cat sat on the mat
Generated Summary (Candidate):
The cat is on the mat
Tokenization (lowercased, so "The" and "the" match):
- Reference Tokens: ["the", "cat", "sat", "on", "the", "mat"]
- Candidate Tokens: ["the", "cat", "is", "on", "the", "mat"]
ROUGE-1 (Unigram Overlap)
| Word | In Reference | In Candidate | Match |
|---|---|---|---|
| the | 2 | 2 | 2 |
| cat | 1 | 1 | 1 |
| sat | 1 | 0 | 0 |
| on | 1 | 1 | 1 |
| mat | 1 | 1 | 1 |
| is | 0 | 1 | 0 |
Total Matches: 5
Total Unigrams in Reference: 6
\[
\text{ROUGE-1 (Recall)} = \frac{5}{6} \approx 0.833 \Rightarrow 83.3\%
\]
ROUGE-2 (Bigram Overlap)
- Reference Bigrams: ["the cat", "cat sat", "sat on", "on the", "the mat"]
- Candidate Bigrams: ["the cat", "cat is", "is on", "on the", "the mat"]
Common Bigrams: "the cat", "on the", "the mat"
Total Matches: 3
Total Bigrams in Reference: 5
\[
\text{ROUGE-2 (Recall)} = \frac{3}{5} = 0.6 \Rightarrow 60\%
\]
Final Scores:
| Metric | Score |
|---|---|
| ROUGE-1 | 83.3% |
| ROUGE-2 | 60% |
Python Implementation of ROUGE-1 and ROUGE-2
```python
from collections import Counter

def get_ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token list."""
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference, candidate, n=1):
    """ROUGE-N recall: matched n-grams divided by n-grams in the reference."""
    ref_counter = Counter(get_ngrams(reference, n))
    cand_counter = Counter(get_ngrams(candidate, n))
    # Counter intersection clips each n-gram to its minimum count in the two texts
    overlap = sum((ref_counter & cand_counter).values())
    total_ref = sum(ref_counter.values())
    return overlap / total_ref if total_ref > 0 else 0.0

# Sample texts
reference_text = "The cat sat on the mat"
candidate_text = "The cat is on the mat"

# Tokenize (lowercasing makes matching case-insensitive)
ref_tokens = reference_text.lower().split()
cand_tokens = candidate_text.lower().split()

# Compute ROUGE-1 and ROUGE-2
rouge_1 = rouge_n(ref_tokens, cand_tokens, n=1)
rouge_2 = rouge_n(ref_tokens, cand_tokens, n=2)

print(f"ROUGE-1 Recall: {rouge_1:.2%}")
print(f"ROUGE-2 Recall: {rouge_2:.2%}")
```
Expected Output:
```
ROUGE-1 Recall: 83.33%
ROUGE-2 Recall: 60.00%
```
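The snippet above covers ROUGE-N only. ROUGE-L, mentioned earlier, can be sketched with a standard dynamic-programming longest common subsequence; this is an illustrative implementation, not the official scoring package:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length divided by reference length."""
    return lcs_length(reference, candidate) / len(reference) if reference else 0.0

ref = "the cat sat on the mat".split()
cand = "the cat is on the mat".split()
# The LCS is "the cat on the mat" (length 5), so recall is 5/6
print(f"ROUGE-L Recall: {rouge_l_recall(ref, cand):.2%}")  # ROUGE-L Recall: 83.33%
```

Unlike ROUGE-2, the LCS does not require the matched words to be adjacent, only in the same order, so ROUGE-L rewards sentence-level word-order similarity.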
Conclusion
ROUGE is an essential metric when you want to evaluate how much of the original or reference content is preserved in the generated summary or translation. When combined with perplexity (fluency) and BLEU (precision), it provides a well-rounded assessment of language model quality.