Wednesday, 21 May 2025

Understanding Global Statistics in GloVe: A Deep Dive with Examples


GloVe (Global Vectors for Word Representation) is a widely used algorithm in natural language processing for learning word embeddings. What sets GloVe apart from models like Word2Vec is its use of global statistics — co-occurrence information aggregated over the entire corpus. This article explores what “global statistics” mean in GloVe, why they matter, and how they manifest in practical examples.

What Are Global Statistics in GloVe?

In the context of GloVe, global statistics refer to the comprehensive, corpus-wide counts of how often word pairs co-occur. Instead of examining a narrow window of neighboring words, GloVe analyzes the entire corpus to learn relationships between words based on co-occurrence frequencies. These statistics are organized in a co-occurrence matrix, where each entry \( X_{ij} \) indicates how often word j appears in the context of word i.

The heart of the GloVe algorithm lies in this equation:

\( w_i^T \cdot \tilde{w}_j + b_i + \tilde{b}_j \approx \log(X_{ij}) \)

Here, \( w_i \) and \( \tilde{w}_j \) are the word and context word vectors respectively, and \( X_{ij} \) is the co-occurrence count. The model tries to find word vectors such that their dot product (plus biases) approximates the logarithm of how frequently the words co-occur in the corpus.
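To make the objective concrete, here is a minimal NumPy sketch of GloVe's weighted least-squares loss over the nonzero entries of the co-occurrence matrix. The weighting function \( f(X_{ij}) = \min((X_{ij}/x_{\max})^{\alpha}, 1) \) with \( x_{\max} = 100 \), \( \alpha = 0.75 \) follows the published GloVe recipe; the random vectors and the tiny matrix are toy values for illustration, not trained embeddings.

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective, summed over observed pairs.

    W, W_ctx : (V, d) word and context embedding matrices
    b, b_ctx : (V,) word and context bias vectors
    X        : (V, V) co-occurrence count matrix
    """
    loss = 0.0
    rows, cols = np.nonzero(X)  # only pairs that actually co-occur
    for i, j in zip(rows, cols):
        # f(X_ij): damps the influence of very frequent pairs
        weight = min((X[i, j] / x_max) ** alpha, 1.0)
        diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        loss += weight * diff ** 2
    return loss

# Toy usage with random vectors and the 4x4 matrix from this article
rng = np.random.default_rng(0)
V, d = 4, 5
X = np.array([[0, 1, 3, 0],
              [1, 0, 1, 3],
              [3, 1, 0, 1],
              [0, 3, 1, 0]], dtype=float)
W, W_ctx = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_ctx = rng.normal(size=V), rng.normal(size=V)
print(glove_loss(W, W_ctx, b, b_ctx, X))
```

Training then consists of minimizing this loss with respect to the vectors and biases, typically via AdaGrad in the original implementation.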

Example Corpus

To understand how global statistics manifest, let’s consider a small sample corpus:

“I enjoy ice cream. I enjoy cold drinks. Ice is cold. Steam is hot.”

We focus on four target words: ice, steam, cold, and hot. Suppose we define a relatively large context window or treat each sentence as the context unit. We tally the number of times each word appears in the context of the others throughout the corpus. This yields a simplified co-occurrence matrix:

        ice   steam   cold   hot
ice      0      1       3     0
steam    1      0       1     3
cold     3      1       0     1
hot      0      3       1     0

These counts are global — they are accumulated over the entire dataset, not restricted to a single sentence or small context window.

Why Ratios Matter

GloVe emphasizes ratios of co-occurrence probabilities, which are more meaningful than raw counts. Consider the ratio of probabilities that a given context word (e.g., “cold” or “hot”) appears with “ice” versus “steam”:

For “ice”:

\( \frac{P(\text{cold} \mid \text{ice})}{P(\text{hot} \mid \text{ice})} = \frac{X_{\text{ice}, \text{cold}}}{X_{\text{ice}, \text{hot}}} = \frac{3}{\varepsilon} \to \infty \)

(Here \( \varepsilon \) stands in for the zero count \( X_{\text{ice}, \text{hot}} = 0 \) in our toy matrix, since a raw zero would make the ratio undefined; in a large real corpus this count would be small but nonzero. Note also that the row totals cancel, so the ratio of conditional probabilities reduces to a ratio of raw counts.)

For “steam”:

\( \frac{P(\text{cold} \mid \text{steam})}{P(\text{hot} \mid \text{steam})} = \frac{1}{3} \)

This sharp contrast in ratios reveals the temperature association of the words: “ice” relates more strongly to “cold” than “hot”, whereas “steam” relates more strongly to “hot” than “cold.” GloVe embeds these patterns into the geometry of the learned vector space.
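These ratios can be read straight off the co-occurrence matrix. The following sketch computes them with NumPy, adding a tiny epsilon so the zero count for (ice, hot) does not cause a division by zero (a common smoothing trick, not part of the GloVe objective itself):

```python
import numpy as np

words = ["ice", "steam", "cold", "hot"]
idx = {w: i for i, w in enumerate(words)}

# Co-occurrence matrix from the article; rows and columns follow `words`
X = np.array([[0, 1, 3, 0],
              [1, 0, 1, 3],
              [3, 1, 0, 1],
              [0, 3, 1, 0]], dtype=float)
eps = 1e-8  # smoothing so a zero count yields a huge (not undefined) ratio

def ratio(target, probe_a, probe_b):
    """P(probe_a | target) / P(probe_b | target): row totals cancel,
    so this is just a ratio of co-occurrence counts."""
    t = idx[target]
    return (X[t, idx[probe_a]] + eps) / (X[t, idx[probe_b]] + eps)

print(ratio("ice", "cold", "hot"))    # very large: "ice" strongly prefers "cold"
print(ratio("steam", "cold", "hot"))  # about 1/3: "steam" prefers "hot"
```

A ratio far above 1 signals a "cold-like" word, far below 1 a "hot-like" word, and near 1 a word (such as "water" in the original GloVe paper) that is neutral between the two probes.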

How Global Statistics Are Computed

Let’s consider how we might construct the co-occurrence matrix in practice. Each sentence or context window is scanned, and for every pair of words, we increment their count. For example, “ice is cold” increments counts for:

  • \( X_{\text{ice}, \text{cold}} \)
  • \( X_{\text{cold}, \text{ice}} \)

When this process is applied to an entire corpus (possibly millions of sentences), the co-occurrence matrix captures global relationships between all word pairs — regardless of where in the text they appear. This is the key distinction between GloVe and Word2Vec.
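The counting procedure described above can be sketched in a few lines of Python, treating each sentence as one context unit. Note that the counts produced from this four-sentence corpus will not match the simplified illustrative matrix shown earlier, which was chosen to make the ratios vivid; the scanning logic, however, is the same one a real pipeline would apply to millions of sentences.

```python
from collections import defaultdict
from itertools import combinations

corpus = "I enjoy ice cream. I enjoy cold drinks. Ice is cold. Steam is hot."

# Split into sentences, lowercase, tokenize on whitespace
sentences = [s.strip().lower().split() for s in corpus.split(".") if s.strip()]

counts = defaultdict(int)
for sent in sentences:
    for w1, w2 in combinations(sent, 2):  # every unordered pair in the sentence
        if w1 != w2:
            counts[(w1, w2)] += 1
            counts[(w2, w1)] += 1  # keep the matrix symmetric

print(counts[("ice", "cold")])  # 1: they share only the sentence "Ice is cold"
```

In practice, implementations also weight each pair by the inverse of the distance between the two words within the window, so that nearby words contribute more than distant ones.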

GloVe vs Word2Vec: Local vs Global

Word2Vec uses a local window (typically ±5 words) to train its model using either CBOW or Skip-gram. It learns embeddings by predicting a word from its context or vice versa. In contrast, GloVe directly builds a global matrix and factorizes it using optimization techniques.

Aspect              GloVe                                    Word2Vec
Statistics used     Global co-occurrence                     Local context window
Learning objective  Factorizes log co-occurrence matrix      Predicts surrounding words or targets
Strength            Captures global semantic relationships   Captures local syntactic patterns well

Semantic Meaning Through Vector Arithmetic

Because GloVe encodes global relationships, it supports rich vector operations such as:

\( \vec{\text{ice}} - \vec{\text{cold}} \approx \vec{\text{steam}} - \vec{\text{hot}} \)

This indicates that the difference between “ice” and “cold” is semantically similar to the difference between “steam” and “hot” — both reflecting the concept of “state of matter and its associated temperature.”
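A toy demonstration of this arithmetic, using hand-crafted two-dimensional vectors rather than actual trained GloVe embeddings (real vectors are learned and high-dimensional; here dimension 0 is imagined as "water state" and dimension 1 as "temperature" purely for illustration):

```python
import numpy as np

# Hypothetical toy vectors: dim 0 ~ "water state", dim 1 ~ "temperature"
vecs = {
    "ice":   np.array([1.0, -1.0]),
    "steam": np.array([1.0,  1.0]),
    "cold":  np.array([0.0, -1.0]),
    "hot":   np.array([0.0,  1.0]),
}

# Both differences isolate the same "water state" direction
print(vecs["ice"] - vecs["cold"])    # [1. 0.]
print(vecs["steam"] - vecs["hot"])   # [1. 0.]

# Analogy completion: ice - cold + hot should land nearest to steam
query = vecs["ice"] - vecs["cold"] + vecs["hot"]
nearest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - query))
print(nearest)  # steam
```

With real GloVe embeddings the same nearest-neighbor search (usually by cosine similarity, excluding the query words themselves) recovers the famous analogy results.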

Conclusion

GloVe’s use of global statistics provides a powerful alternative to models based solely on local context. By building a co-occurrence matrix and learning embeddings through matrix factorization, GloVe captures rich, nuanced relationships between words. The example of “ice”, “steam”, “cold”, and “hot” demonstrates how these relationships emerge naturally from the data.

Global statistics matter because they reveal patterns that small, local windows might miss. In an age of deep learning, GloVe’s elegant use of global frequency ratios offers both theoretical clarity and practical power for understanding language.
