Why Likelihood and Probability Use the Same Notation (And Why It’s Confusing)
You might want to read this first: "Demystifying Likelihood: From Probability to Parameter Estimation"
In the study of statistics and machine learning, it's common to encounter expressions like:

\[ P(x \mid \theta) \]
This notation is used in both probability and likelihood contexts. However, despite being written identically, these two concepts have fundamentally different interpretations depending on what's considered known or unknown. This blog post explores why this notation is shared, what it means in each context, and how to avoid confusion.
Probability vs. Likelihood: A Functional Difference
Let’s begin with a simple but powerful distinction:
| Concept | Interpretation | Known Variable | Unknown Variable |
|---|---|---|---|
| Probability | How likely is the data, given a model? | \( \theta \) (model) | \( x \) (data) |
| Likelihood | How plausible is the model, given the data? | \( x \) (data) | \( \theta \) (model) |
The mathematical form is the same in both cases: \( P(x \mid \theta) \). But what we treat as constant and what we treat as variable is different.
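To make the "same function, two perspectives" point concrete, here is a small sketch (not from the original post) using a Gaussian density as \( P(x \mid \theta) \), where \( \theta \) is the mean. The same Python function is evaluated both ways: once holding the model fixed and varying the data, once holding the data fixed and varying the model.

```python
import math

def gaussian_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x -- the shared form P(x | theta), theta = mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Probability view: fix the model (mu = 0), ask how likely different data values are.
probs = [gaussian_pdf(x, mu=0.0) for x in (-1.0, 0.0, 1.0)]

# Likelihood view: fix the observed data (x = 1.0), ask how plausible different models are.
likes = [gaussian_pdf(1.0, mu=m) for m in (-1.0, 0.0, 1.0)]
```

Nothing about `gaussian_pdf` changes between the two lists; only which argument we sweep over does. That is exactly why the notation can be shared.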
Understanding the Likelihood Function
In likelihood-based inference, we treat the data as observed and fixed. Our goal is to find the value of \( \theta \) that best explains the data. This leads to the likelihood function:

\[ L(\theta) = P(x \mid \theta) \]
And if we have multiple independent data points \( x^{(1)}, x^{(2)}, \ldots, x^{(n)} \), the total likelihood becomes:

\[ L(\theta) = \prod_{i=1}^{n} P(x^{(i)} \mid \theta) \]
We use this function to perform Maximum Likelihood Estimation (MLE) by finding the value of \( \theta \) that maximizes \( L(\theta) \).
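As a worked example (a minimal sketch with made-up coin-flip data, not from the original post), here is MLE for a Bernoulli parameter by direct grid search over \( \theta \). The closed-form answer is the sample mean, so the grid search should land there.

```python
# Hypothetical coin-flip data: 1 = heads, 0 = tails (7 heads out of 10).
data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

def likelihood(theta, xs):
    """L(theta) = product of P(x | theta) over independent Bernoulli samples."""
    L = 1.0
    for x in xs:
        L *= theta if x == 1 else (1.0 - theta)
    return L

# Grid search over theta in (0, 1); for Bernoulli data the analytic MLE
# is simply the sample mean, so theta_hat should be 0.7.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: likelihood(t, data))
```

In practice one would maximize the log-likelihood with a numerical optimizer rather than a grid, but the grid makes the "pick the \( \theta \) that maximizes \( L(\theta) \)" idea explicit.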
Why Use the Same Notation?
This is where the confusion arises. Why do we use \( P(x \mid \theta) \) for both probability and likelihood if they represent different conceptual directions?
1. It's the Same Mathematical Function
Whether you are calculating the probability of the data or using it as a function of the model parameters, the mathematical form of the expression doesn’t change. What changes is the perspective and the role of the variables.
2. Historical Convention
Historically, early statisticians such as R.A. Fisher used the same expression for both purposes. Over time, this became convention, even though it can cause misunderstanding.
3. Notational Simplicity
Rather than introducing entirely new symbols (e.g., \( \mathcal{L}(\theta) \)), we stick to the familiar form. However, this does put the burden on the reader to understand the context clearly.
Common Misunderstanding: Likelihood is Not \( P(\theta \mid x) \)
One of the most common misconceptions is to confuse likelihood with the posterior probability in Bayesian inference. But:

\[ L(\theta) = P(x \mid \theta) \neq P(\theta \mid x) \]
To move from likelihood to posterior, we need to apply Bayes’ theorem:

\[ P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)} \]
Here, \( P(\theta) \) is the prior belief about \( \theta \), and \( P(x) \) is a normalizing constant.
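A grid approximation makes each term of Bayes' theorem tangible. The sketch below (illustrative only, with hypothetical data and a flat prior) computes an unnormalized posterior as likelihood times prior, then divides by their sum, which plays the role of \( P(x) \).

```python
# Grid approximation of the posterior P(theta | x) for a Bernoulli model.
data = [1, 1, 0, 1]  # hypothetical observations: 3 successes, 1 failure

thetas = [i / 100 for i in range(1, 100)]
prior = [1.0 / len(thetas)] * len(thetas)  # flat prior P(theta)

def likelihood(theta, xs):
    """P(x | theta) for independent Bernoulli samples."""
    L = 1.0
    for x in xs:
        L *= theta if x == 1 else (1.0 - theta)
    return L

unnorm = [likelihood(t, data) * p for t, p in zip(thetas, prior)]
evidence = sum(unnorm)  # plays the role of the normalizing constant P(x)
posterior = [u / evidence for u in unnorm]
```

With a flat prior the posterior peaks at the same \( \theta \) as the likelihood (here 0.75, the sample mean); an informative prior would pull the peak toward the prior's mass.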
Log-Likelihood: A Common Alternative
In practice, we often use the logarithm of the likelihood function to simplify computation:

\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(x^{(i)} \mid \theta) \]
This is because products of probabilities can become numerically unstable for large datasets. Taking the log turns products into sums, making optimization more tractable.
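The underflow problem is easy to demonstrate. In the sketch below (illustrative numbers, not from the original post), multiplying 2000 probabilities of 0.3 underflows double precision to exactly zero, while the corresponding sum of logs remains a perfectly ordinary finite number.

```python
import math

# 2000 hypothetical i.i.d. samples, each with P(x | theta) = 0.3 under some model.
probs = [0.3] * 2000

# The raw product is about 10^(-1046), far below the smallest
# representable double, so it underflows to 0.0...
product = 1.0
for p in probs:
    product *= p

# ...while the log-likelihood (a sum of logs) stays finite and usable.
log_likelihood = sum(math.log(p) for p in probs)
```

Once the product has underflowed to zero, every candidate \( \theta \) looks equally (im)plausible, so the optimizer has nothing to work with; the log-likelihood preserves the ordering.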
Better Notations (Sometimes Used)
To avoid confusion, some authors write:
- \( \mathcal{L}(\theta; x) \) for likelihood, to separate it from probability notation
- \( \ell(\theta) = \log \mathcal{L}(\theta) \) for the log-likelihood
However, in most literature, you will still see \( P(x \mid \theta) \) used for both interpretations, so understanding the context is critical.
Summary
The use of the same notation \( P(x \mid \theta) \) for both probability and likelihood arises from mathematical identity and historical convention. While the form is the same, the interpretations differ significantly based on what is treated as known and what is variable.
Key takeaway: The confusion is not in the math, but in the mindset. Probability asks, “What data will I see, given this model?” Likelihood asks, “What model best explains the data I have seen?”
To avoid misinterpretation, always clarify the role of the variables and consider using alternate notation or diagrams to reinforce the intended meaning.
If you found this article useful, consider exploring related topics such as Maximum Likelihood Estimation, Bayesian Inference, and Log-Likelihood Optimization in machine learning.