Saturday, 17 May 2025

Understanding Word Embeddings in Distributional Models of NLP

In the field of Natural Language Processing (NLP), word embeddings have become foundational for representing text in a form that machines can understand. They are particularly central to distributional models, which learn word meaning based on J.R. Firth's principle that "you shall know a word by the company it keeps." This article unpacks what word embeddings are, how their dimensionality is chosen, what the dimensions represent, and how the values in each dimension are learned.

What Are Word Embeddings?

A word embedding is a dense vector representation of a word in a continuous vector space. Unlike one-hot vectors, which are sparse and lack relational meaning, word embeddings encode semantic and syntactic similarity between words. Words that appear in similar contexts — such as “cat” and “dog” — are located close together in the vector space.

These embeddings are at the core of distributional models, which learn from large corpora by analyzing word co-occurrence statistics. The embeddings capture distributional semantics, i.e., the meaning of a word is derived from the words it co-occurs with.

Why Not Use One-Hot Vectors?

Before embeddings, each word was represented as a one-hot vector — a binary vector where only one position is 1 and the rest are 0. This approach has serious limitations:

  • It treats all words as equally unrelated.
  • It creates extremely sparse vectors (most entries are zero).
  • It does not capture any semantic or syntactic relationships.

Word embeddings solve these problems by learning dense, low-dimensional vectors where distances and directions encode linguistic properties.
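The contrast is easy to see numerically. In the sketch below, the dense vectors are hand-picked toy values (not learned embeddings) chosen so that "cat" and "dog" point in similar directions:

```python
import numpy as np

vocab = ["cat", "dog", "car"]

# One-hot: every pair of distinct words has dot product 0,
# so all words look equally unrelated.
one_hot = np.eye(len(vocab))
print(one_hot[0] @ one_hot[1])  # cat . dog -> 0.0

# Dense embeddings (hypothetical, hand-picked for illustration):
dense = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.0, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(dense["cat"], dense["dog"]))  # high similarity
print(cosine(dense["cat"], dense["car"]))  # low similarity
```

With learned embeddings the same pattern emerges automatically from co-occurrence statistics rather than from hand-picked coordinates.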

How Do We Choose the Number of Dimensions?

The number of dimensions in a word embedding is a hyperparameter. It is not learned from data but selected in advance based on task complexity, data size, and computational resources. Common choices range from 100 to 300 dimensions for general NLP tasks. Here are some considerations:

  • Small datasets: 50–100 dimensions are often sufficient.
  • Large corpora and complex tasks: 200–300 dimensions are more effective.
  • Empirical tuning: Try multiple values (e.g., 100, 200, 300) and evaluate downstream performance.

A rough heuristic suggested in the literature is:

\[ \text{Embedding dimension} \approx \log_2(V) \]

where \( V \) is the vocabulary size. However, this tends to underestimate the size in practice. Word2Vec and GloVe, for instance, commonly use 300 dimensions.
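A quick computation shows why the heuristic underestimates: even a million-word vocabulary yields only about 20 dimensions, far below the 300 typically used in practice.

```python
import math

# log2(V) heuristic for a few vocabulary sizes (illustration only).
heuristic = {V: round(math.log2(V)) for V in (10_000, 100_000, 1_000_000)}
for V, dim in heuristic.items():
    print(f"V = {V:>9,} -> log2(V) suggests about {dim} dimensions")
```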

What Do the Dimensions Capture?

Each dimension in an embedding vector captures a latent aspect of meaning. These aspects are not human-interpretable individually, but they emerge from patterns in how words appear in context.

For example, some dimensions may loosely relate to:

  • Gender (king vs. queen)
  • Verb tense (run vs. ran)
  • Sentiment (good vs. bad)
  • Domain (bank in "river bank" vs. "savings bank")

These meanings are distributed across the entire vector. You cannot point to a specific dimension and say “this is the gender feature,” but relationships like the following demonstrate that meaning is encoded:

\[ \text{vec("king")} - \text{vec("man")} + \text{vec("woman")} \approx \text{vec("queen")} \]

Thus, dimensions work together to form a rich, geometry-based representation of word semantics.
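The analogy can be reproduced with hypothetical toy vectors whose coordinates were chosen by hand so that the arithmetic works out exactly; learned embeddings exhibit this relationship only approximately:

```python
import numpy as np

# Hypothetical 3-d embeddings: dimension 1 loosely tracks "royalty",
# dimension 2 "female". Real embeddings are learned, not designed.
emb = {
    "king":  np.array([0.9, 0.1, 0.3]),
    "queen": np.array([0.9, 0.9, 0.3]),
    "man":   np.array([0.1, 0.1, 0.3]),
    "woman": np.array([0.1, 0.9, 0.3]),
    "apple": np.array([0.0, 0.2, 0.9]),
    "car":   np.array([0.2, 0.1, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = emb["king"] - emb["man"] + emb["woman"]
# Nearest word to the analogy result, excluding the input words.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
print(best)  # queen
```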

How Are the Values in Each Dimension Calculated?

The actual values in each dimension of a word vector are learned through optimization over a massive corpus. This learning process depends on the specific model used. Two major types are:

1. Predictive Models (e.g., Word2Vec)

These models learn embeddings by predicting context words from a target word, or vice versa.

Skip-Gram Model: Given a center word \( w_c \), predict a context word \( w_o \). The probability of the context word is computed as:

\[ P(w_o \mid w_c) = \frac{\exp(\vec{v}_{w_o} \cdot \vec{v}_{w_c})}{\sum_{w \in V} \exp(\vec{v}_w \cdot \vec{v}_{w_c})} \]

This is optimized using gradient descent. Each training step updates the values in each dimension of the vectors to minimize prediction error, adjusting the geometry of the embedding space.
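The softmax above can be sketched directly in NumPy. This toy version uses a single vector per word, matching the formula's notation; actual Word2Vec implementations keep separate input and output matrices and avoid the full softmax for efficiency:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                  # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))  # one randomly initialized vector per word

def skip_gram_prob(center, context):
    """P(context | center) via softmax over the whole vocabulary."""
    scores = W @ W[center]        # dot product with every word vector
    scores -= scores.max()        # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context]

# The distribution over all possible context words sums to 1.
p = np.array([skip_gram_prob(0, o) for o in range(V)])
print(p.sum())
```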

2. Count-Based Models (e.g., GloVe)

These models start with a word co-occurrence matrix \( X \), where \( X_{ij} \) is the frequency with which word \( i \) appears in the context of word \( j \). GloVe tries to learn vectors such that:

\[ \vec{w}_i \cdot \vec{c}_j + b_i + \tilde{b}_j \approx \log(X_{ij}) \]

where \( b_i \) and \( \tilde{b}_j \) are word and context bias terms. It uses a weighted least squares objective function, with a weighting \( f \) that down-weights rare co-occurrences:

\[ J = \sum_{i,j} f(X_{ij})\left(\vec{w}_i \cdot \vec{c}_j + b_i + \tilde{b}_j - \log(X_{ij})\right)^2 \]

Again, gradient descent is used to iteratively update the values in each dimension to minimize the reconstruction error.
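The objective \( J \) can be evaluated directly on a toy co-occurrence matrix; the random counts and the standard weighting function below are illustrative stand-ins for statistics gathered from a real corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3
X = rng.integers(1, 50, size=(V, V)).astype(float)  # toy co-occurrence counts
W = rng.normal(scale=0.1, size=(V, d))   # word vectors
C = rng.normal(scale=0.1, size=(V, d))   # context vectors
bw = np.zeros(V)                          # word biases
bc = np.zeros(V)                          # context biases

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare pairs, caps frequent ones."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss():
    pred = W @ C.T + bw[:, None] + bc[None, :]
    err = pred - np.log(X)
    return float((f(X) * err ** 2).sum())

J = glove_loss()
print(J)  # scalar objective, driven toward zero by gradient descent
```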

Training Summary

  • Initialization: vectors are assigned small random values.
  • Context learning: the model predicts or reconstructs context relationships.
  • Optimization: vectors are adjusted via gradient descent.
  • Final embedding: each word has a vector capturing its meaning via geometry.
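The stages above can be combined into a minimal skip-gram training loop. This NumPy sketch uses the full softmax on a tiny corpus; real Word2Vec implementations rely on negative sampling or hierarchical softmax instead:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d, lr = len(vocab), 8, 0.1

# Initialization: small random values in every dimension.
W_in = rng.normal(scale=0.1, size=(V, d))   # center-word vectors
W_out = rng.normal(scale=0.1, size=(V, d))  # context-word vectors

# Context learning: (center, context) pairs with a window of 1.
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

# Optimization: gradient descent on the softmax cross-entropy.
losses = []
for epoch in range(30):
    total = 0.0
    for c, o in pairs:
        scores = W_out @ W_in[c]
        scores -= scores.max()                 # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        total -= np.log(probs[o])              # cross-entropy loss
        dscores = probs.copy()
        dscores[o] -= 1.0                      # softmax gradient
        grad_in = W_out.T @ dscores
        grad_out = np.outer(dscores, W_in[c])
        W_in[c] -= lr * grad_in                # adjust every dimension
        W_out -= lr * grad_out
    losses.append(total / len(pairs))

print(losses[0], "->", losses[-1])  # average loss falls as vectors align
```

After training, `W_in` is the final embedding matrix: each row is a vector whose geometry reflects the word's contexts in the toy corpus.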

Conclusion

Word embeddings are dense vector representations that allow machines to understand language in a relational, geometric way. The dimensions in these embeddings do not have fixed meanings, but they capture complex patterns in word usage through statistical learning. The number of dimensions is chosen empirically, and the values in each dimension are adjusted during training to optimize a specific objective function—either to predict context words or to reconstruct co-occurrence relationships.

By understanding how embeddings are structured, trained, and interpreted, we gain powerful insight into how modern NLP models represent and process human language.


Are Word Embedding Dimensions Interpretable? 

Word embeddings have become a cornerstone of modern natural language processing (NLP). They transform words into fixed-length, dense vectors in high-dimensional space, allowing machine learning models to operate on semantic features derived from text. A common question that arises is: Do embeddings for all words share the same internal structure or representation? More specifically, can we say that dimension 1 in the vector for word A represents the same concept as dimension 1 in the vector for word X? And if not, how do we justify the use of dot products and cosine similarity as measures of semantic similarity?

Do Word Embeddings Share the Same Internal Representations?

The short answer is no. While all word embeddings are of the same length (e.g., 300 dimensions), they do not share a standardized or interpretable internal structure. That is, dimension 1 in the vector for “apple” does not carry the same semantic meaning as dimension 1 in the vector for “engine.” This is a byproduct of how embeddings are trained: through unsupervised learning objectives that focus on co-occurrence prediction or matrix factorization rather than explicit semantic annotation.

What If We Compare Position 1 Across Two Words?

Does the first dimension of two different word vectors (call them A1 and X1) capture similar properties across all words? The answer is still, in general, no. Word embedding training is rotationally invariant: the learned geometry can be rotated in space without affecting relative distances or dot products, so the position of a specific value does not imply a consistent meaning across words. These models produce what are known as distributed representations, in which semantic meaning is encoded across all dimensions in combination, not isolated in any single position.
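Rotational invariance is easy to verify numerically: applying any orthogonal transformation to a set of embeddings changes every individual coordinate, yet leaves all pairwise dot products intact (toy example with random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 3))  # 4 toy word vectors in 3 dimensions

# Build a random orthogonal matrix (rotation/reflection) via QR.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
E_rot = E @ Q  # every coordinate of every vector changes

# All pairwise dot products are unchanged by the rotation.
same_gram = np.allclose(E @ E.T, E_rot @ E_rot.T)
print(same_gram)  # True
```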

So How Is the Dot Product Meaningful?

This leads to an important and subtle point. Even though individual dimensions are not interpretable, the dot product between two embedding vectors — or more commonly, the cosine similarity — is highly meaningful. Why? Because during training (e.g., using the Word2Vec Skip-Gram model), embeddings are explicitly optimized to increase the dot product between words that appear in similar contexts:

\[ P(w_o \mid w_c) = \frac{\exp(\vec{v}_{w_o} \cdot \vec{v}_{w_c})}{\sum_{w \in V} \exp(\vec{v}_w \cdot \vec{v}_{w_c})} \]

This means that similar words (e.g., “king” and “queen”) are embedded in such a way that their vectors point in nearly the same direction. The dot product measures this alignment, regardless of the meanings of individual dimensions. In practice, semantic similarity emerges from the geometry of the entire vector, not from any particular coordinate.

Conclusion

Although the individual positions in word embedding vectors are not semantically aligned across words, the dot product between vectors is still a valid and effective measure of similarity. This is because similarity in NLP embeddings arises from relative positions in vector space, not from a shared semantic labeling of individual dimensions. Understanding this distinction is critical to interpreting and applying word embeddings in research and downstream NLP tasks.
