Why Word2Vec Uses \( f(w)^{3/4} \) for Negative Sampling
Author: Priyank Goyal
Keywords: Word2Vec, Negative Sampling, Word Frequencies, NLP, Noise Distribution
Introduction
One of the key innovations that made Word2Vec fast and scalable was the introduction of Negative Sampling. Instead of computing a softmax across the entire vocabulary, Word2Vec simplifies training to a binary classification problem: distinguishing real context words from randomly sampled noise words. But how these negative words are sampled is not arbitrary — it is governed by a carefully shaped probability distribution:

\[ P(w) = \frac{f(w)^{3/4}}{\sum_{w'} f(w')^{3/4}} \]
Here, \( f(w) \) is the raw frequency of word \( w \) in the corpus, and the sum in the denominator normalizes over the vocabulary. This article explores the reasoning behind the 3/4 power, how it affects training, and why it helps produce better word embeddings.
Why Not Use Raw Frequency?
In theory, one could sample negative words in proportion to their raw frequencies \( f(w) \), but in practice, this approach causes problems. Highly frequent words such as "the", "is", and "and" dominate the negative samples. Since these words are common in most contexts, they do not provide a meaningful learning signal when used as negatives. This can dilute the model’s ability to learn nuanced word relationships.
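To make the problem concrete, here is a minimal sketch with a hypothetical toy vocabulary (the frequency counts are illustrative, not from a real corpus). Under raw-frequency sampling, "the" alone would account for almost half of all negative samples:

```python
# Hypothetical toy corpus frequencies (illustrative numbers only)
freqs = {"the": 50000, "is": 30000, "and": 25000, "neural": 40, "embedding": 25}

total = sum(freqs.values())
# Raw-frequency sampling: P(w) proportional to f(w)
raw_probs = {w: f / total for w, f in freqs.items()}

# "the" dominates the noise distribution (~48% of all draws)
print(raw_probs["the"])
```

Rare but informative words like "embedding" would almost never be drawn as negatives under this scheme.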
To mitigate this, the raw frequencies are transformed using a sublinear function:

\[ f(w) \rightarrow f(w)^{3/4} \]
This transformation flattens the frequency distribution: frequent words are downweighted while rare words are slightly boosted. It strikes a balance between:
- avoiding over-representation of stop words, and
- still reflecting natural word usage patterns.
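A short sketch of how this noise distribution can be built and sampled from, again using illustrative toy frequencies (NumPy's `Generator.choice` stands in for Word2Vec's internal unigram table):

```python
import numpy as np

# Hypothetical toy frequencies (illustrative numbers only)
words = ["the", "is", "and", "neural", "embedding"]
freqs = np.array([50000, 30000, 25000, 40, 25], dtype=float)

# Word2Vec-style noise distribution: raise counts to the 3/4 power, normalize
weights = freqs ** 0.75
noise_dist = weights / weights.sum()

# Draw k negative samples for one training example
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=noise_dist)
print(negatives)
```

Comparing `noise_dist` against the raw proportions shows the intended effect: the probability of "the" drops, while the rare words' probabilities rise by several multiples.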
The Intuition Behind \( f(w)^{3/4} \)
Mathematically, \( x^{3/4} \) grows slower than \( x \), especially as \( x \) increases. Consider the following comparison:
| \( f(w) \) | \( f(w)^{3/4} \) |
|---|---|
| 1 | 1.0 |
| 10 | 5.6 |
| 100 | 31.6 |
| 1000 | 177.8 |
This softens the influence of extremely common words, but still gives them a chance to appear in negative samples. As a result, the model gets better contrastive learning examples, which improves the quality of the embeddings.
In visual terms, comparing \( y = x \) and \( y = x^{3/4} \) clearly shows how the latter grows more slowly. This “compression” of frequency is what makes the noise distribution more balanced and effective.
Conclusion
The use of \( f(w)^{3/4} \) in Word2Vec’s negative sampling isn’t just a mathematical trick — it’s a principled decision that enhances the model’s learning capability. By reshaping the word frequency distribution, Word2Vec ensures that negative samples are informative, diverse, and better suited to help the model learn meaningful word representations.
This clever adjustment helps Word2Vec remain efficient while maintaining high-quality results, and has since inspired similar strategies in other embedding techniques and contrastive learning frameworks.