Understanding Word2Vec with Negative Sampling: A Step-by-Step Research Walkthrough
Keywords: Word Embeddings, Skip-gram, Negative Sampling, Word2Vec, NLP, Gradient Descent, Context Window
Introduction
Word2Vec is a foundational technique in natural language processing (NLP) that learns dense vector representations for words, known as word embeddings, based on their context within a corpus. Among its two architectures, the Skip-gram model with Negative Sampling stands out for both computational efficiency and the quality of learned embeddings.
In this article, we will explore the mathematical formulation and mechanics of negative sampling, work through a small example by hand, follow two iterations of gradient updates, and demystify how positive and negative context words are chosen during training.
The Skip-gram with Negative Sampling Objective
In the Skip-gram model, the goal is to predict context words given a center word. With negative sampling, this becomes a binary classification task: determine whether a word pair (center, context) is a true pair or a noise sample.
Given a center word \( w_c \) and an actual context word \( w_o \), the loss function for a single training pair is defined as:

\[ L = -\log \sigma(w_o^\top w_c) \;-\; \sum_{k=1}^{K} \log \sigma(-w_k^\top w_c) \]
Here, \( \sigma(x) = \frac{1}{1 + e^{-x}} \) is the sigmoid function, \( w_o \) is the output (context) vector, and \( w_k \) are K negative samples drawn from a noise distribution \( P_n(w) \).
- \( \log \sigma(w_o^\top w_c) \): encourages the dot product between the center word and the real context word to be high (label = 1)
- \( \log \sigma(-w_k^\top w_c) \): encourages the dot products between the center word and the negative samples to be low (label = 0)
- \( P_n(w) \): the noise distribution (typically proportional to the unigram frequency raised to the 3/4 power)
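The objective above can be written in a few lines of NumPy. This is a minimal sketch, not an optimized implementation; the function name `sgns_loss` and its argument layout are our own choices:

```python
import numpy as np

def sigmoid(x):
    """The logistic function sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w_c, w_o, negatives):
    """Skip-gram negative-sampling loss for one (center, context) pair.

    w_c       : input (center word) vector
    w_o       : output vector of the true context word
    negatives : list of output vectors for the K negative samples
    """
    loss = -np.log(sigmoid(w_o @ w_c))            # positive pair: push dot product up
    for w_k in negatives:
        loss -= np.log(sigmoid(-(w_k @ w_c)))     # negatives: push dot product down
    return loss
```

Note that each call touches only 1 + K output vectors, which is what makes negative sampling so much cheaper than a full softmax over the vocabulary.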
A Toy Example You Can Work by Hand
Let’s work through a concrete example with a small vocabulary of five words:
- Vocabulary: ["cat", "dog", "apple", "banana", "run"]
- Center word: "cat"
- Context word: "dog" (positive)
- Negative samples: "apple", "run"
- Vector dimension: 2D embeddings for simplicity
Initial vectors (for input and output embeddings) are assumed:
w_in["cat"] = [1.0, 2.0]
w_out["dog"] = [1.0, 0.5]
w_out["apple"] = [1.0, 1.0]
w_out["run"] = [0.5, 0.0]
Dot products and sigmoid computations:
- \( \sigma(w_{dog}^\top w_{cat}) = \sigma(2.0) \approx 0.88 \) (positive pair "dog")
- \( \sigma(-w_{apple}^\top w_{cat}) = \sigma(-3.0) \approx 0.047 \) (negative sample "apple")
- \( \sigma(-w_{run}^\top w_{cat}) = \sigma(-0.5) \approx 0.38 \) (negative sample "run")
Loss:

\[ L = -\log(0.88) - \log(0.047) - \log(0.38) \approx 0.13 + 3.05 + 0.97 = 4.15 \]
This high loss signals that embeddings need updates. Let's continue with gradient descent.
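The hand computation is easy to check numerically. The short NumPy snippet below reproduces the dot products, sigmoid values, and loss for this toy setup (variable names are ours):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

w_in_cat = np.array([1.0, 2.0])
w_out = {
    "dog":   np.array([1.0, 0.5]),   # positive context word
    "apple": np.array([1.0, 1.0]),   # negative sample
    "run":   np.array([0.5, 0.0]),   # negative sample
}

# The positive term uses sigma(dot); the negative terms use sigma(-dot).
p_dog   = sigmoid(w_out["dog"] @ w_in_cat)       # ≈ 0.88
p_apple = sigmoid(-(w_out["apple"] @ w_in_cat))  # ≈ 0.047
p_run   = sigmoid(-(w_out["run"] @ w_in_cat))    # ≈ 0.38

loss = -np.log(p_dog) - np.log(p_apple) - np.log(p_run)
print(round(loss, 2))  # 4.15
```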
Gradient Descent: Two Full Iterations
Iteration 1:
- Learning rate \( \eta = 0.1 \)
Gradient w.r.t. the center word:

\[ \frac{\partial L}{\partial w_c} = \left(\sigma(w_o^\top w_c) - 1\right) w_o \;+\; \sum_{k=1}^{K} \sigma(w_k^\top w_c)\, w_k \]

Evaluated numerically, using \( \sigma(2.0) \approx 0.88 \), \( \sigma(3.0) \approx 0.95 \), and \( \sigma(0.5) \approx 0.62 \):

Gradient = [-0.12, -0.06] + [0.95, 0.95] + [0.31, 0] = [1.14, 0.89]
New w_in["cat"] = [1.0, 2.0] - 0.1 * [1.14, 0.89] = [0.886, 1.911]

The output vectors get the analogous gradients, \( (\sigma(w_o^\top w_c) - 1)\, w_c \) for the positive pair and \( \sigma(w_k^\top w_c)\, w_c \) for each negative sample, all evaluated at the old center vector:

w_out["dog"] = [1.0, 0.5] - 0.1 * [-0.12, -0.24] = [1.012, 0.524]
w_out["apple"] = [1.0, 1.0] - 0.1 * [0.95, 1.91] = [0.905, 0.809]
w_out["run"] = [0.5, 0.0] - 0.1 * [0.62, 1.24] = [0.438, -0.124]
Iteration 2:
The updated vectors are used again with the same center, context, and negative words.

New dot product (cat, dog): 1.90 → σ ≈ 0.87
New dot product (cat, apple): 2.35 → σ ≈ 0.91
New dot product (cat, run): 0.15 → σ ≈ 0.54

Loss:

\[ L = -\log(0.87) - \log \sigma(-2.35) - \log \sigma(-0.15) \approx 0.14 + 2.44 + 0.77 = 3.35 \]

The loss dropped from 4.15 to 3.35 after a single update: the model is already pulling "cat" toward "dog" and pushing it away from the negative samples.
How Are Positive and Negative Context Words Chosen?
Positive Context Words
- For each center word, a context window of size \( w \) is used.
- Words within this window (before and after the center) are positive samples.
Example: In the sentence “the cat sat on the mat” with window size = 2, the context of "cat" is ["the", "sat", "on"] (the window is truncated at the start of the sentence).
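Extracting positive pairs is a few lines of plain Python. This is a minimal sketch (the helper name `context_pairs` is ours); note how the window is clipped at sentence boundaries:

```python
def context_pairs(tokens, window=2):
    """Return all (center, context) positive pairs from a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # clip at the start of the sentence
        hi = min(len(tokens), i + window + 1)   # clip at the end
        for j in range(lo, hi):
            if j != i:                          # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = context_pairs("the cat sat on the mat".split(), window=2)
# contexts of "cat": ["the", "sat", "on"]
```

The original word2vec implementation additionally samples a random window size between 1 and the maximum for each center word, which this sketch omits for clarity.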
Negative Context Words
- Drawn from a noise distribution \( P_n(w) \propto f(w)^{3/4} \)
- This downweights very frequent words and slightly boosts rare ones
- Typically 5–20 negatives per positive sample are drawn
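The noise distribution is straightforward to build from word counts. Below is a minimal sketch (the helper name `make_noise_sampler` and the toy counts are our own illustration):

```python
import numpy as np

def make_noise_sampler(counts, power=0.75, seed=0):
    """Build a sampler over the unigram distribution raised to the 3/4 power."""
    words = list(counts)
    probs = np.array([counts[w] for w in words], dtype=float) ** power
    probs /= probs.sum()                      # normalize to a proper distribution
    rng = np.random.default_rng(seed)
    def sample(k):
        return list(rng.choice(words, size=k, p=probs))
    return sample, dict(zip(words, probs))

# Hypothetical counts: "the" is 100x more frequent than "mat", but after the
# 3/4 power its sampling advantage shrinks to about 100^0.75 ≈ 32x.
sample, probs = make_noise_sampler({"the": 1000, "cat": 50, "mat": 10})
negatives = sample(5)   # draw 5 negative samples
```

This flattening is the point of the 3/4 exponent: without it, extremely frequent words like "the" would dominate the negative samples.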
During training, the model maximizes \( \sigma(w_o^\top w_c) \) for positive pairs and \( \sigma(-w_k^\top w_c) \) for negative pairs; minimizing the loss defined earlier achieves exactly this.
Conclusion
Word2Vec with negative sampling turns the daunting task of predicting over the entire vocabulary into a series of simple, binary classification problems. Through dot products and sigmoid activations, and a clever sampling strategy, it learns meaningful embeddings that capture both syntactic and semantic regularities.
By walking through two complete training iterations and inspecting the mathematics behind positive and negative sampling, we've demystified how word embeddings evolve to bring semantically related words closer in vector space.
This method remains foundational in modern NLP, influencing later models like GloVe, FastText, and even deep transformer architectures that continue to rely on rich token embeddings.