Saturday, 17 May 2025

Understanding Word2Vec with Negative Sampling: A Step-by-Step Research Walkthrough

Keywords: Word Embeddings, Skip-gram, Negative Sampling, Word2Vec, NLP, Gradient Descent, Context Window


Introduction

Word2Vec is a foundational technique in natural language processing (NLP) that learns dense vector representations for words, known as word embeddings, based on their context within a corpus. Among its two architectures, the Skip-gram model with Negative Sampling stands out for both computational efficiency and the quality of learned embeddings.

In this article, we will explore the mathematical formulation and mechanics of negative sampling, work through a small example by hand, follow two iterations of gradient updates, and demystify how positive and negative context words are chosen during training.


The Skip-gram with Negative Sampling Objective

In the Skip-gram model, the goal is to predict context words given a center word. With negative sampling, this becomes a binary classification task: determine whether a word pair (center, context) is a true pair or a noise sample.

Given a center word \( w_c \) and an actual context word \( w_o \), the loss function is defined as:

\[ \mathcal{L} = - \left( \log \sigma(w_o^\top w_c) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \left[ \log \sigma(-w_k^\top w_c) \right] \right) \]

Here, \( \sigma(x) = \frac{1}{1 + e^{-x}} \) is the sigmoid function, \( w_o \) is the output (context) vector, and \( w_k \) are K negative samples drawn from a noise distribution \( P_n(w) \).

  • \( \log \sigma(w_o^\top w_c) \): encourages the dot product between the center word and the real context word to be high (label = 1)
  • \( \log \sigma(-w_k^\top w_c) \): encourages the dot product between the center word and each negative sample to be low (label = 0)
  • \( P_n(w) \): noise distribution, typically proportional to the unigram frequency raised to the 3/4 power
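The loss above is straightforward to compute directly. Below is a minimal NumPy sketch; the function name `neg_sampling_loss` and its argument layout are illustrative choices for this article, not part of any standard API:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(w_c, w_o, negatives):
    """Skip-gram negative-sampling loss for one training example.

    w_c       -- embedding of the center word
    w_o       -- output embedding of the true context word
    negatives -- output embeddings of the K negative samples
    """
    # Positive term: push sigma(w_o . w_c) toward 1.
    loss = -np.log(sigmoid(w_o @ w_c))
    # Negative terms: push sigma(w_k . w_c) toward 0.
    for w_k in negatives:
        loss -= np.log(sigmoid(-(w_k @ w_c)))
    return loss
```

In practice the expectation over \( P_n(w) \) is approximated by the K samples actually drawn, which is exactly what the loop over `negatives` does here.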


A Toy Example You Can Work by Hand

Let’s work through a concrete example with a small vocabulary of five words:

  • Vocabulary: ["cat", "dog", "apple", "banana", "run"]
  • Center word: "cat"
  • Context word: "dog" (positive)
  • Negative samples: "apple", "run"
  • Vector dimension: 2D embeddings for simplicity

Assume the following initial input and output embedding vectors:

w_in["cat"] = [1.0, 2.0]
w_out["dog"] = [1.0, 0.5]
w_out["apple"] = [1.0, 1.0]
w_out["run"] = [0.5, 0.0]

Dot products and sigmoid computations:

  • \( \sigma(w_o^\top w_c) = \sigma(2.0) \approx 0.88 \)
  • \( \sigma(-w_k^\top w_c) = \sigma(-3.0) \approx 0.047 \) (for "apple")
  • \( \sigma(-0.5) \approx 0.38 \) (for "run")

Loss:

\[ \mathcal{L} \approx -(\log 0.88 + \log 0.047 + \log 0.38) \approx 4.15 \]

This high loss signals that embeddings need updates. Let's continue with gradient descent.


Gradient Descent: Two Full Iterations

Iteration 1:

  • Learning rate \( \eta = 0.1 \)

Gradient w.r.t. center word:

\[ \frac{\partial \mathcal{L}}{\partial w_c} = (\sigma(w_o^\top w_c) - 1) \cdot w_o + \sum_k \sigma(w_k^\top w_c) \cdot w_k \]
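For readers who want the derivation, this follows in one step from \( \frac{d}{dx} \log \sigma(x) = 1 - \sigma(x) \) together with the identity \( 1 - \sigma(-x) = \sigma(x) \):

\[ \frac{\partial \mathcal{L}}{\partial w_c} = -\bigl(1 - \sigma(w_o^\top w_c)\bigr) w_o - \sum_{k=1}^{K} \bigl(1 - \sigma(-w_k^\top w_c)\bigr)(-w_k) = \bigl(\sigma(w_o^\top w_c) - 1\bigr) w_o + \sum_{k=1}^{K} \sigma(w_k^\top w_c)\, w_k \]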

Evaluated numerically (all gradients use the vectors from before any update):

Gradient = [-0.119, -0.060] + [0.953, 0.953] + [0.311, 0.0] = [1.145, 0.893]
New w_in["cat"] = [1.0, 2.0] - 0.1 * [1.145, 0.893] = [0.886, 1.911]

Now update the output vectors for dog, apple, run, whose gradients are \( (\sigma(w_o^\top w_c) - 1) \cdot w_c \) for the positive word and \( \sigma(w_k^\top w_c) \cdot w_c \) for each negative:

w_out["dog"]   = [1.0, 0.5] - 0.1 * [-0.119, -0.238] = [1.012, 0.524]
w_out["apple"] = [1.0, 1.0] - 0.1 * [0.953, 1.905] = [0.905, 0.810]
w_out["run"]   = [0.5, 0.0] - 0.1 * [0.622, 1.245] = [0.438, -0.124]

Iteration 2:

The updated vectors are reused with the same center word, context word, and negative samples.

New dot product (cat, dog): 1.90 → σ(1.90) ≈ 0.87
New dot product (cat, apple): 2.35 → σ(-2.35) ≈ 0.087
New dot product (cat, run): 0.15 → σ(-0.15) ≈ 0.46

Loss:

\[ \mathcal{L} \approx -(\log 0.87 + \log 0.087 + \log 0.46) \approx 3.35 \]

The loss dropped from 4.15 to 3.35 after a single update. The negative dot products (3.0 → 2.35 and 0.5 → 0.15) are shrinking while the positive one stays high, which is exactly what the objective asks for.
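The two iterations can be reproduced end to end with a short script. This is a sketch of plain SGD on the toy example; all variable names are chosen for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy embeddings from the walkthrough.
w_c = np.array([1.0, 2.0])             # w_in["cat"]
pos = np.array([1.0, 0.5])             # w_out["dog"]
negs = [np.array([1.0, 1.0]),          # w_out["apple"]
        np.array([0.5, 0.0])]          # w_out["run"]
eta = 0.1
losses = []

for _ in range(2):
    # Forward pass: loss for this (center, context, negatives) triple.
    loss = -np.log(sigmoid(pos @ w_c))
    loss -= sum(np.log(sigmoid(-(n @ w_c))) for n in negs)
    losses.append(loss)

    # Gradients, all computed from the pre-update vectors.
    g_c = (sigmoid(pos @ w_c) - 1.0) * pos \
        + sum(sigmoid(n @ w_c) * n for n in negs)
    g_pos = (sigmoid(pos @ w_c) - 1.0) * w_c
    g_negs = [sigmoid(n @ w_c) * w_c for n in negs]

    # SGD step.
    w_c -= eta * g_c
    pos -= eta * g_pos
    negs = [n - eta * g for n, g in zip(negs, g_negs)]
```

After running, `losses` holds roughly 4.15 before the first update and roughly 3.35 before the second, matching the hand computation.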


How Are Positive and Negative Context Words Chosen?

Positive Context Words

  • For each center word, a context window of size \( w \) is used.
  • Words within this window (before and after the center) are positive samples.

Example: in the sentence “the cat sat on the mat” with window size 2, the context of "cat" is ["the", "sat", "on"] (two words on each side, clipped at the start of the sentence).
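Enumerating positive pairs takes only a few lines. A sketch (the helper name `positive_pairs` is made up for this article):

```python
def positive_pairs(tokens, window):
    """All (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = positive_pairs("the cat sat on the mat".split(), 2)
```

The original word2vec implementation additionally samples the effective window size uniformly between 1 and \( w \) for each center word, which implicitly downweights distant context words; the fixed-window version above keeps the example simple.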

Negative Context Words

  • Drawn from a noise distribution \( P_n(w) \propto f(w)^{3/4} \)
  • This downweights very frequent words and slightly boosts rare ones
  • Typically 5–20 negatives per positive sample are drawn
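Sampling from \( P_n(w) \) reduces to building the smoothed distribution once and drawing from it, rejecting the words in the current positive pair. A NumPy sketch with illustrative function names:

```python
import numpy as np

def noise_distribution(counts):
    """P_n(w) proportional to f(w)^(3/4), normalized over the vocabulary."""
    probs = np.asarray(counts, dtype=float) ** 0.75
    return probs / probs.sum()

def draw_negatives(rng, probs, k, exclude):
    """Draw k negative word indices, rejecting any index in `exclude`
    (e.g. the center word and the true context word)."""
    out = []
    while len(out) < k:
        i = int(rng.choice(len(probs), p=probs))
        if i not in exclude:
            out.append(i)
    return out
```

The 3/4 exponent flattens the distribution: a word 100 times more frequent than another becomes only about 31.6 times more likely to be drawn (since \( 100^{3/4} \approx 31.6 \)).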

During training, the model pushes the predicted probability of true (center, context) pairs toward 1 and that of sampled noise pairs toward 0.


Conclusion

Word2Vec with negative sampling turns the daunting task of predicting over the entire vocabulary into a series of simple, binary classification problems. Through dot products and sigmoid activations, and a clever sampling strategy, it learns meaningful embeddings that capture both syntactic and semantic regularities.

By walking through two complete training iterations and inspecting the mathematics behind positive and negative sampling, we've demystified how word embeddings evolve to bring semantically related words closer in vector space.

This method remains foundational in modern NLP, influencing later models like GloVe, FastText, and even deep transformer architectures that continue to rely on rich token embeddings.

