What Are Logits? Understanding the Raw Scores Behind Probabilities
In the world of machine learning, especially deep learning, you will often come across the term "logits". Whether you're working with classification models, interpreting neural network outputs, or dealing with contrastive learning (such as CLIP), you'll find phrases like: "Use these similarities as logits." But what exactly are logits, and why are they so central to predictive modeling?
Definition: What Are Logits?
In simple terms, logits are the real-valued output scores of a model just before they are passed into an activation function like
softmax or sigmoid. They are unnormalized scores — meaning they are not probabilities yet.
You can think of logits as the model's raw "votes" or "confidence levels" for each class.
For example, consider a model output:
[3.5, -1.2, 0.0]
These values are called logits. When passed through a softmax function, they are converted into a valid probability distribution that sums to 1.
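To make this concrete, here is a minimal softmax in plain Python applied to those example logits. The max-subtraction step is a standard numerical trick, not part of the mathematical definition:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution.

    Subtracting the max logit before exponentiating prevents
    overflow in exp() without changing the result.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([3.5, -1.2, 0.0])
print(probs)       # the largest logit (3.5) gets most of the mass
print(sum(probs))  # sums to 1.0 (up to floating-point error)
```

Notice that the logit 3.5 ends up with the overwhelming majority of the probability mass, even though the raw numbers are not dramatically far apart.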
Origin of the Term: Logits and Log-Odds
The word logit originates from statistics, specifically from logistic regression. In logistic regression, the model outputs a probability between 0 and 1 using the sigmoid function. However, before applying sigmoid, we compute the log-odds, which are given by:
\[ \text{logit}(p) = \log\left(\frac{p}{1 - p}\right) \]
This function maps probabilities from the interval (0, 1) to the entire real number line \((-\infty, \infty)\). The resulting score is also called a logit. This mapping provides a nice mathematical foundation for binary classification.
In deep learning, the terminology stuck: any real-valued number that is the input to a sigmoid or softmax is now called a logit — even if it didn’t literally come from a log-odds calculation.
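The log-odds relationship above is easy to verify directly: the logit function and the sigmoid are exact inverses of each other. A short sketch:

```python
import math

def logit(p):
    """Log-odds: maps a probability in (0, 1) to the full real line."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps any real number back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

print(logit(0.5))            # 0.0: even odds
print(logit(0.9))            # ~2.197: strong positive evidence
print(sigmoid(logit(0.73)))  # ~0.73: the two functions round-trip
```

This round-trip is exactly why sigmoid turns a raw score back into a probability in binary classification.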
Why Are Logits Useful?
One of the key advantages of logits is that they preserve the full range of the real number line. This means a model is free to express any level of confidence without constraints:
- A large positive logit → high confidence in that class
- A large negative logit → strong evidence against that class
- Zero logit → neutral score
Once logits are passed through softmax, they become probabilities. The softmax output depends only on the differences between logits, and it is sensitive to their overall scale — a property we exploit in many model designs (for example, via temperature scaling).
“Use Those Similarities as Logits” – What Does That Mean?
This phrase often arises in contrastive learning setups like CLIP (Contrastive Language-Image Pretraining), where the model compares the similarity between images and text captions.
Suppose you compute cosine similarity between an image and a set of captions, and get:
[0.85, 0.3, -0.1]
These are similarity scores — not probabilities. But you can use them as logits by feeding them into a softmax function:
\[ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \]
This transforms the similarities into a probability distribution across captions — indicating which one best matches the image. In this context, the raw similarities are treated as if they were logits.
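A quick sketch of this in Python. Note that cosine similarities live in [-1, 1], so the raw distribution is fairly flat; CLIP-style models therefore multiply the similarities by a learned temperature before the softmax (the scale of 10 below is a hypothetical value for illustration):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

similarities = [0.85, 0.3, -0.1]  # cosine similarities from the example

# Used directly as logits, the resulting distribution is fairly flat:
print(softmax(similarities))

# Scaling by a temperature (hypothetical value of 10 here, as in
# CLIP-style models with a learned logit scale) sharpens it:
scale = 10.0
print(softmax([scale * s for s in similarities]))
```

After scaling, the best-matching caption receives nearly all of the probability mass, which is usually the behavior you want when picking a single match.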
Can Any Number Be a Logit?
Yes — and this is key. Any real number can be treated as a logit. Logits have no constraints:
- They can be negative or positive
- They don’t have to be normalized
- They are used before any probability computation
This flexibility makes logits very convenient for learning, optimization, and expressing uncertainty in neural networks.
Why Not Just Use Probabilities Directly?
You might wonder: Why not skip logits and directly output probabilities? Here's why:
- Better for learning: Gradients from loss functions like cross-entropy behave better when applied to logits instead of probabilities.
- Numerical stability: Libraries like PyTorch and TensorFlow compute cross-entropy directly from logits to avoid instability.
- Flexibility: Logits allow a model to express raw confidence before normalizing, giving richer learning signals.
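The numerical-stability point deserves a concrete illustration. Computing cross-entropy as log(softmax(z)) can underflow when logits are extreme; computing it directly from logits via the log-sum-exp identity does not. This is a minimal sketch of the idea (real libraries like PyTorch fuse the operations similarly, though their implementations differ):

```python
import math

def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one example, computed directly from logits.

    Uses the identity log(softmax(z)[t]) = z[t] - logsumexp(z),
    which avoids ever taking the log of a tiny probability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# With extreme logits, a naive softmax would underflow to 0.0 and
# log(0) would blow up; the fused version stays finite:
logits = [1000.0, 0.0, -1000.0]
print(cross_entropy_from_logits(logits, 0))  # ~0.0: confident and correct
print(cross_entropy_from_logits(logits, 1))  # ~1000.0: confident and wrong
```

This is why framework losses such as cross-entropy typically ask for logits rather than probabilities.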
Visualizing Softmax on Logits
Let’s visualize how softmax transforms logits. Suppose our model gives:
logits = [2.0, 1.0, 0.1]
After softmax:
probs = softmax(logits) ≈ [0.66, 0.24, 0.10]
The class with the highest logit gets the highest probability. The transformation is smooth, differentiable, and retains ranking.
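The example above can be reproduced in a few lines, which also demonstrates another property worth knowing: softmax is shift-invariant, so adding a constant to every logit leaves the probabilities unchanged.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]

# Shift-invariance: only the differences between logits matter.
shifted = softmax([z + 100.0 for z in logits])
print(all(abs(a - b) < 1e-9 for a, b in zip(probs, shifted)))  # True
```

Shift-invariance is exactly why the max-subtraction trick used above is safe: it is just a shift of all logits by the same constant.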
Conclusion
Understanding logits is crucial for interpreting model behavior in classification and contrastive learning. Though the name “logit” comes from log-odds in logistic regression, in deep learning it refers to the raw scores before normalization. Whether your model outputs similarities, distances, or linear projections, as long as you pass them through softmax or sigmoid, you can treat them as logits.
So yes — any number can be treated as a logit. It’s the context (usually a softmax or sigmoid activation) that turns it into a meaningful probability.
And now, next time someone says "use those similarities as logits," you'll know exactly what they mean.