Wednesday, 21 May 2025

Matrix Factorization and Dimensionality Reduction in Word Embeddings: From LSA to GloVe


Abstract: This blog explores the theoretical connection between Latent Semantic Analysis (LSA) and the GloVe algorithm through the lens of matrix factorization and dimensionality reduction. We examine how high-dimensional word relationships are compressed into dense semantic vector spaces and highlight the importance of co-occurrence statistics in shaping word meaning.

1. Introduction

Word embeddings have revolutionized natural language processing by enabling algorithms to understand and manipulate language in vector space. Two well-known approaches for generating embeddings are Latent Semantic Analysis (LSA) and GloVe (Global Vectors for Word Representation). Although developed in different contexts, both can be understood as forms of matrix factorization applied to linguistic data. This article offers a comparative analysis of LSA and GloVe, emphasizing how each reduces high-dimensional co-occurrence information into low-dimensional semantic representations.

2. Understanding Matrix Factorization in NLP

Matrix factorization techniques in NLP aim to uncover latent semantic structures hidden in large text corpora. Typically, we begin with a sparse matrix \( X \) of high dimensionality, where each entry represents the relationship between words and contexts—such as word-document frequencies or word-word co-occurrence counts.

The goal is to approximate this matrix with a product of lower-dimensional matrices. In mathematical terms, this is written as:

\( X \approx U \Sigma V^T \)

Here:

  • \( X \in \mathbb{R}^{m \times n} \): Original high-dimensional matrix
  • \( U \in \mathbb{R}^{m \times k} \), \( V \in \mathbb{R}^{n \times k} \): Lower-dimensional factor matrices
  • \( \Sigma \in \mathbb{R}^{k \times k} \): Diagonal matrix of singular values

This decomposition forms the basis of LSA and inspires models like GloVe, albeit with modifications to the objective and input matrix.
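As a concrete sketch, the truncated SVD can be computed with NumPy on a small toy matrix (the values below are made up purely for illustration):

```python
import numpy as np

# Toy 4x3 "word-context" count matrix (values made up for illustration).
X = np.array([
    [3.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
    [0.0, 4.0, 2.0],
    [0.0, 3.0, 3.0],
])

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top k singular values for a rank-k approximation.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By the Eckart-Young theorem this is the best rank-k approximation in
# the Frobenius norm; the error equals the discarded singular values.
error = np.linalg.norm(X - X_k, "fro")
print(f"rank-{k} reconstruction error: {error:.4f}")
```

The rows of \( U_k \Sigma_k \) then serve as k-dimensional representations of the words.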

3. LSA: Latent Semantic Analysis

LSA is a classical method based on Singular Value Decomposition (SVD). It starts with a term-document matrix or a term-term matrix. The values are often weighted using techniques like TF-IDF to better capture importance.

The SVD of this matrix yields a set of lower-dimensional word vectors that retain the dominant patterns of word usage across documents. These vectors can be interpreted as coordinates in a "semantic space," where words with similar meanings appear closer together.
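A minimal LSA-style sketch, using a hypothetical hand-built term-document count matrix rather than a real corpus (and raw counts instead of TF-IDF, for brevity):

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (terms are rows).
terms = ["cat", "dog", "mat", "stock", "market"]
X = np.array([
    [2.0, 0.0, 0.0, 0.0],   # cat
    [0.0, 2.0, 0.0, 0.0],   # dog
    [1.0, 1.0, 0.0, 0.0],   # mat
    [0.0, 0.0, 2.0, 1.0],   # stock
    [0.0, 0.0, 1.0, 2.0],   # market
])

# SVD of the term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# k-dimensional word vectors: rows of U scaled by the singular values.
k = 2
word_vecs = U[:, :k] * s[:k]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words that appear in similar documents end up close together.
print(cosine(word_vecs[0], word_vecs[1]))   # cat vs dog: high
print(cosine(word_vecs[0], word_vecs[3]))   # cat vs stock: low
```

Even though "cat" and "dog" never appear in the same document here, they land close together because they share documents with "mat" — the kind of latent structure LSA is designed to surface.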

Strengths:

  • Captures linear semantic relationships
  • Uses global corpus statistics
  • Unsupervised and interpretable

Limitations:

  • Computationally expensive for large corpora
  • Embeddings may be sensitive to rare words
  • Does not model nonlinear relationships

4. GloVe: Global Vectors for Word Representation

GloVe extends the idea of matrix factorization by focusing on the co-occurrence matrix \( X_{ij} \), where each entry represents the number of times word \( j \) appears in the context of word \( i \).

The key insight of GloVe is to model the logarithm of the co-occurrence count as the inner product of word vectors:

\( w_i^T \tilde{w}_j + b_i + \tilde{b}_j \approx \log(X_{ij}) \)

Where:

  • \( w_i \), \( \tilde{w}_j \): Word and context vectors
  • \( b_i \), \( \tilde{b}_j \): Bias terms

The loss function minimized by GloVe is a weighted least squares objective:

\( J = \sum_{i,j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2 \)

The weighting function \( f(X_{ij}) \) ensures that very frequent or very rare co-occurrences don't dominate the loss. Pairs with \( X_{ij} = 0 \) are dropped from the sum entirely, since \( f(0) = 0 \) and \( \log 0 \) is undefined.
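The GloVe paper defines \( f(x) = (x/x_{\max})^\alpha \) for \( x < x_{\max} \) and \( f(x) = 1 \) otherwise, with defaults \( x_{\max} = 100 \) and \( \alpha = 3/4 \). A sketch of a single loss term in NumPy (the random vectors are placeholders, not trained embeddings):

```python
import numpy as np

# GloVe's weighting function with the paper's defaults:
# f(x) = (x / x_max)^alpha for x < x_max, else 1.
def f(x, x_max=100.0, alpha=0.75):
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

# One term of the weighted least-squares objective for a pair (i, j).
def glove_loss_term(w_i, w_ctx_j, b_i, b_j, x_ij):
    residual = w_i @ w_ctx_j + b_i + b_j - np.log(x_ij)
    return float(f(x_ij)) * residual ** 2

# Placeholder 50-d vectors (random, not trained embeddings).
rng = np.random.default_rng(0)
w_i, w_ctx_j = rng.normal(size=50), rng.normal(size=50)
print(glove_loss_term(w_i, w_ctx_j, 0.0, 0.0, x_ij=50.0))
```

Training minimizes the sum of such terms over all co-occurring pairs, typically with AdaGrad-style stochastic gradient descent.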

Benefits of GloVe:

  • Explicitly leverages global co-occurrence statistics
  • Captures linear substructures in word relationships
  • Generates dense word vectors that excel in analogical reasoning

5. High-Dimensional to Low-Dimensional Embeddings

In raw form, each word is associated with a sparse, high-dimensional vector, e.g. its co-occurrence counts with every other word. For a vocabulary of 100,000 words, that is a 100,000-dimensional vector per word. This is both computationally intensive and semantically noisy.

Dimensionality reduction aims to find a much smaller vector—say, 100 or 300 dimensions—that still captures the word's most important relationships. This process compresses redundant or irrelevant information while preserving semantic structure.

In the resulting lower-dimensional space:

  • Semantically similar words are located nearby (high cosine similarity)
  • Analogical relationships are represented by vector arithmetic

For instance, the famous analogy:

\( \text{king} - \text{man} + \text{woman} \approx \text{queen} \)

is a direct result of the geometric properties of embeddings learned via GloVe.
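This nearest-neighbour arithmetic can be illustrated with hand-crafted toy vectors (real GloVe embeddings are learned and far higher-dimensional; the values and axis meanings below are hypothetical):

```python
import numpy as np

# Hypothetical 3-d vectors where one axis loosely encodes gender and
# another royalty (real GloVe vectors are 100-300 dimensional).
vecs = {
    "king":  np.array([ 1.0, 1.0, 0.2]),
    "man":   np.array([ 1.0, 0.0, 0.1]),
    "woman": np.array([-1.0, 0.0, 0.1]),
    "queen": np.array([-1.0, 1.0, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land nearest to queen (query words are
# excluded from the candidates, as is standard in analogy evaluation).
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```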

6. Comparative View: LSA vs. GloVe

| Feature | LSA | GloVe |
| --- | --- | --- |
| Input matrix | Term-document or term-term | Word-word co-occurrence |
| Factorization | Singular Value Decomposition (SVD) | Weighted least squares on log co-occurrence |
| Use of logarithm | Optional pre-processing | Built into the model |
| Objective | Minimize reconstruction error | Approximate log(co-occurrence) |
| Scalability | Less scalable to large corpora | Highly scalable and efficient |

7. Example: Co-occurrence Counts and Embeddings

Consider a small vocabulary where the word “ice” co-occurs with “cold” far more frequently than with “steam”. GloVe would model this as:

  • \( X_{\text{ice,cold}} = 50 \Rightarrow \log(50) \approx 3.91 \)
  • \( X_{\text{ice,steam}} = 1 \Rightarrow \log(1) = 0 \)

The learned vectors are optimized so that:

\( w_{\text{ice}}^T \tilde{w}_{\text{cold}} \approx 3.91 \quad \text{and} \quad w_{\text{ice}}^T \tilde{w}_{\text{steam}} \approx 0 \)

This ensures semantic relationships are preserved geometrically.
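A quick numerical check of this example, with hand-picked (hypothetical) 2-d vectors and the bias terms set to zero for simplicity:

```python
import numpy as np

# Log targets from the co-occurrence counts in the example above.
X_ice_cold, X_ice_steam = 50.0, 1.0
print(round(float(np.log(X_ice_cold)), 2))   # 3.91
print(float(np.log(X_ice_steam)))            # 0.0

# Vectors hand-picked (hypothetically) to satisfy the constraints;
# trained GloVe vectors would reach a similar fit by gradient descent.
w_ice   = np.array([3.91, 0.0])
w_cold  = np.array([1.0, 0.0])   # context vector for "cold"
w_steam = np.array([0.0, 1.0])   # context vector for "steam"

print(float(w_ice @ w_cold))     # close to 3.91
print(float(w_ice @ w_steam))    # 0.0
```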

8. Conclusion

Matrix factorization methods like LSA and GloVe provide powerful tools for learning meaningful word representations. While LSA relies on linear algebra and SVD, GloVe builds upon global co-occurrence statistics to learn vectors through a weighted regression model. Both methods exemplify the principle of reducing high-dimensional word relationships into low-dimensional vector spaces, a cornerstone of modern NLP.

As NLP continues to evolve, understanding these foundational techniques helps us appreciate the strengths and limitations of newer models like BERT and GPT that build upon these semantic representations in more complex ways.
