Understanding ROUGE and Its Role in Evaluating Language Generation Models
In natural language generation tasks like summarization and translation, evaluating the quality of generated text is essential. While perplexity measures fluency and BLEU focuses on precision, ROUGE introduces a recall-oriented perspective, capturing how much of the reference text is represented in the generated output. In this post, we'll dive into ROUGE, work through a step-by-step example, and implement it in Python.
What is ROUGE?
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation systems by comparing the generated output (candidate) with one or more reference outputs. Unlike BLEU, which is more precision-oriented, ROUGE is recall-oriented—it measures how much of the reference text is captured by the generated output.
Key ROUGE Variants
- ROUGE-N: Measures n-gram overlap between the candidate and the reference.
  - ROUGE-1: Unigram (single-word) overlap
  - ROUGE-2: Bigram (two-word sequence) overlap
- ROUGE-L: Based on the Longest Common Subsequence (LCS). Captures sentence-level structure similarity.
- ROUGE-W: Weighted version of ROUGE-L, emphasizing consecutive matches more.
- ROUGE-S / ROUGE-SU: Uses skip-bigrams (word pairs in order but not necessarily adjacent); SU adds unigrams as well.
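To make the skip-bigram idea behind ROUGE-S concrete, here is a minimal sketch (the helper name `skip_bigrams` is our own, and we use a set for simplicity, ignoring duplicate pairs):

```python
from itertools import combinations

def skip_bigrams(tokens):
    """All in-order word pairs, adjacent or not (skip-bigrams)."""
    # combinations() preserves the original left-to-right order of the tokens
    return set(combinations(tokens, 2))

ref = "the cat sat on the mat".split()
pairs = skip_bigrams(ref)
print(('cat', 'mat') in pairs)  # True: gapped pair still counts as a skip-bigram
print(('mat', 'cat') in pairs)  # False: reversed order does not count
```

Skip-bigrams let ROUGE-S reward word pairs that appear in the same order even when other words intervene, which plain bigram matching would miss.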
ROUGE-N Formula
For ROUGE-N, the recall is computed as:
\[
\text{ROUGE-N} = \frac{\sum_{\text{reference}} \text{Count}_{\text{match}}(n\text{-grams})}{\sum_{\text{reference}} \text{Count}_{\text{reference}}(n\text{-grams})}
\]
Where:
- \(\text{Count}_{\text{match}}\): number of n-grams in both candidate and reference
- \(\text{Count}_{\text{reference}}\): total number of n-grams in the reference
Why Use ROUGE with Perplexity and BLEU?
Combining multiple metrics provides a more holistic view of a model's performance:
- Perplexity: Measures fluency—how well the model predicts the next word
- BLEU: Focuses on precision—how much of the candidate matches the reference
- ROUGE: Focuses on recall—how much of the reference is captured by the candidate
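To make the precision/recall distinction concrete, here is a minimal sketch using clipped unigram counts (the function name is our own):

```python
from collections import Counter

def unigram_overlap(reference, candidate):
    """Clipped unigram matches: each shared word counts at most min(ref, cand) times."""
    return sum((Counter(reference) & Counter(candidate)).values())

ref = "the cat sat on the mat".split()    # 6 tokens
cand = "the cat is on the mat".split()    # 6 tokens

matches = unigram_overlap(ref, cand)      # 5 shared unigrams
precision = matches / len(cand)           # BLEU-style: fraction of the candidate that matches
recall = matches / len(ref)               # ROUGE-style: fraction of the reference that is covered
print(precision, recall)                  # both 5/6 here because the lengths happen to match
```

With a longer candidate, precision would drop while recall stayed the same, which is exactly why the two metrics complement each other.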
Worked-Out Example of ROUGE-N
Reference Summary:
The cat sat on the mat
Generated Summary (Candidate):
The cat is on the mat
Tokenization (lowercased, so "The" and "the" match):
- Reference Tokens: ["the", "cat", "sat", "on", "the", "mat"]
- Candidate Tokens: ["the", "cat", "is", "on", "the", "mat"]
ROUGE-1 (Unigram Overlap)
| Word | In Reference | In Candidate | Match |
|---|---|---|---|
| the | 2 | 2 | 2 |
| cat | 1 | 1 | 1 |
| sat | 1 | 0 | 0 |
| on | 1 | 1 | 1 |
| mat | 1 | 1 | 1 |
| is | 0 | 1 | 0 |
Total Matches: 5
Total Unigrams in Reference: 6
\[
\text{ROUGE-1 (Recall)} = \frac{5}{6} \approx 0.833 \Rightarrow 83.3\%
\]
ROUGE-2 (Bigram Overlap)
- Reference Bigrams: ["the cat", "cat sat", "sat on", "on the", "the mat"]
- Candidate Bigrams: ["the cat", "cat is", "is on", "on the", "the mat"]
Common Bigrams: "the cat", "on the", "the mat"
Total Matches: 3
Total Bigrams in Reference: 5
\[
\text{ROUGE-2 (Recall)} = \frac{3}{5} = 0.6 \Rightarrow 60\%
\]
Final Scores:
| Metric | Score |
|---|---|
| ROUGE-1 | 83.3% |
| ROUGE-2 | 60% |
Python Implementation of ROUGE-1 and ROUGE-2
```python
from collections import Counter

def get_ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token list."""
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference, candidate, n=1):
    """ROUGE-N recall: matched n-grams divided by n-grams in the reference."""
    ref_counter = Counter(get_ngrams(reference, n))
    cand_counter = Counter(get_ngrams(candidate, n))
    # Counter intersection clips each n-gram to its minimum count in the two texts
    overlap = sum((ref_counter & cand_counter).values())
    total_ref = sum(ref_counter.values())
    return overlap / total_ref if total_ref > 0 else 0.0

# Sample texts
reference_text = "The cat sat on the mat"
candidate_text = "The cat is on the mat"

# Tokenize (lowercasing makes matching case-insensitive)
ref_tokens = reference_text.lower().split()
cand_tokens = candidate_text.lower().split()

# Compute ROUGE-1 and ROUGE-2
rouge_1 = rouge_n(ref_tokens, cand_tokens, n=1)
rouge_2 = rouge_n(ref_tokens, cand_tokens, n=2)

print(f"ROUGE-1 Recall: {rouge_1:.2%}")
print(f"ROUGE-2 Recall: {rouge_2:.2%}")
```
Expected Output:
```
ROUGE-1 Recall: 83.33%
ROUGE-2 Recall: 60.00%
```
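The snippet above covers ROUGE-N only. ROUGE-L, mentioned earlier, can be sketched with a standard dynamic-programming longest common subsequence; this is an illustrative implementation, not the official scoring package:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length divided by reference length."""
    return lcs_length(reference, candidate) / len(reference) if reference else 0.0

ref = "the cat sat on the mat".split()
cand = "the cat is on the mat".split()
# The LCS is "the cat on the mat" (length 5), so recall is 5/6
print(f"ROUGE-L Recall: {rouge_l_recall(ref, cand):.2%}")  # ROUGE-L Recall: 83.33%
```

Unlike ROUGE-2, the LCS does not require the matched words to be adjacent, only in the same order, so ROUGE-L rewards sentence-level word-order similarity.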
Conclusion
ROUGE is an essential metric when you want to evaluate how much of the original or reference content is preserved in the generated summary or translation. When combined with perplexity (fluency) and BLEU (precision), it provides a well-rounded assessment of language model quality.