Demystifying Likelihood: From Probability to Parameter Estimation
In the world of statistics and machine learning, the term likelihood is often encountered but not always well understood. At first glance, it may seem like just another word for probability — after all, we talk about the "likelihood of rain" or the "likelihood of success" in everyday speech. But in statistical modeling, likelihood has a very specific and powerful meaning — one that lies at the heart of parameter estimation and learning from data.
In this article, we’ll unpack what likelihood really means, how it differs from probability, and why it plays a central role in statistical modeling and machine learning.
Probability vs. Likelihood: A Crucial Distinction
Let’s start with a fundamental distinction that often trips people up:
| Concept | What It Describes | Variable Treated as Known | Variable Treated as Unknown |
|---|---|---|---|
| Probability | How likely the data is, given a model | Model parameters (θ) | Data (x) |
| Likelihood | How plausible a model is, given the data | Data (x) | Model parameters (θ) |
In simple terms:
- Probability is used when the model is fixed and we want to know how likely a particular outcome is.
- Likelihood is used when the data is fixed, and we want to know how plausible different models (parameter settings) are given the observed data.
This distinction becomes especially important when we start estimating parameters from data.
A Coin Toss Example
Suppose you flip a coin 10 times and observe 7 heads and 3 tails. You might wonder: what’s the probability of observing this outcome?
If you assume the coin is fair (θ = 0.5), the probability of 7 heads in 10 tosses is:
\[
P(X = 7 \mid \theta = 0.5) = \binom{10}{7} (0.5)^7 (0.5)^3
\]
But if you don’t know the bias of the coin and want to estimate the value of θ (the probability of heads), you flip the question:
Given that I observed 7 heads, what value of θ makes this observation most likely?
This leads to the likelihood function:
\[
L(\theta) = \binom{10}{7} \theta^7 (1 - \theta)^3
\]
This is no longer just a number — it’s a function of θ. You can now plot this function or optimize it to find the maximum likelihood estimate (MLE) — the value of θ that makes the data most likely. In this case, the MLE is:
\[
\hat{\theta} = \frac{7}{10} = 0.7
\]
Why Take the Log? Enter Log-Likelihood
When you have many data points, the likelihood becomes a product of many probabilities, which can get very small and unstable:
\[
L(\theta) = \prod_{i=1}^{n} P(x^{(i)} \mid \theta)
\]
To make things numerically stable and easier to differentiate, we take the log of the likelihood:
\[
\log L(\theta) = \sum_{i=1}^{n} \log P(x^{(i)} \mid \theta)
\]
This is called the log-likelihood.
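The numerical-stability point is easy to demonstrate. In the sketch below (a hypothetical run of 1,100 fair-coin tosses), the raw product of probabilities underflows to zero in double precision, while the sum of logs stays perfectly well-behaved:

```python
import math

# 1,100 independent tosses, each with probability 0.5.
probs = [0.5] * 1100

# Naive product: 0.5^1100 is far below the smallest float64, so it underflows.
product = 1.0
for p in probs:
    product *= p          # ends up exactly 0.0

# Log-likelihood: a sum of moderate negative numbers, no underflow.
log_sum = sum(math.log(p) for p in probs)   # 1100 * log(0.5)
```

Once the product hits 0.0, all information about \( \theta \) is lost; the log-sum keeps it intact, which is why optimizers always work with the log-likelihood.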
In our coin example, the log-likelihood becomes:
\[
\log L(\theta) = \log \binom{10}{7} + 7 \log \theta + 3 \log (1 - \theta)
\]
We often drop constants like \( \log \binom{10}{7} \) when optimizing, since they don’t affect the maximum point. Why? Read "Why Constants Are Ignored in Log-Likelihood and the Role of Argmax".
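A quick numerical check makes this concrete: adding the constant \( \log \binom{10}{7} \) shifts the whole curve up but leaves the argmax untouched.

```python
import math
from math import comb

log_c = math.log(comb(10, 7))               # the constant term
thetas = [i / 1000 for i in range(1, 1000)] # grid of candidate parameters

def ll(theta):
    # Log-likelihood without the constant.
    return 7 * math.log(theta) + 3 * math.log(1 - theta)

argmax_without = max(thetas, key=ll)
argmax_with = max(thetas, key=lambda t: ll(t) + log_c)
# Both maximizers are 0.7: the constant changes the value, not the location.
```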
In Machine Learning: Assumed Models and Likelihood
Machine learning often begins with the assumption:
We don’t know how the data was generated, so we assume a model — usually a linear (or nonlinear) function with parameters.
For example, in logistic regression, we assume the probability of a binary outcome is:
\[
P(y = 1 \mid x; \theta) = \sigma(\theta^T x)
\]
where \( \sigma(z) = \frac{1}{1 + e^{-z}} \) is the sigmoid function.
Now, for a dataset \( \{(x^{(i)}, y^{(i)})\}_{i=1}^n \), we can write the likelihood function:
\[
L(\theta) = \prod_{i=1}^{n} \left[ \sigma(\theta^T x^{(i)}) \right]^{y^{(i)}} \left[1 - \sigma(\theta^T x^{(i)}) \right]^{1 - y^{(i)}}
\]
Taking logs gives us the log-likelihood (which we maximize):
\[
\log L(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log (1 - \sigma(\theta^T x^{(i)}))
\]
The negative of \(\log L(\theta) \) is also known as the cross-entropy loss, and minimizing it is equivalent to maximizing the likelihood.
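The sign-flip relationship can be seen in a few lines. The sketch below uses a toy dataset and arbitrary, hypothetical parameter values just to show that the cross-entropy loss and the log-likelihood differ only by sign and a \( 1/n \) constant:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: first column of X is a bias term; y holds binary labels.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1, 0, 1])
theta = np.array([0.1, 0.8])   # hypothetical parameters, not fitted

# Predicted probabilities P(y=1 | x; theta).
p = sigmoid(X @ theta)

# Log-likelihood (summed) and cross-entropy loss (averaged, negated).
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Maximizing `log_likelihood` and minimizing `cross_entropy` move `theta` in exactly the same direction, which is why logistic regression training is usually described either way.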
Likelihood Is a Lens: Viewing Probability as a Function
The key philosophical shift is this:
- In probability, we ask: “Given θ, how likely is the data?”
- In likelihood, we ask: “Given the data, how likely is θ?”
So the same probability formula — say a binomial PMF or a normal PDF — can be reused as a likelihood function. The difference lies in which variables are considered known vs. unknown.
This makes any PMF or PDF a potential likelihood function, if you switch your point of view.
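The "same formula, two readings" idea can be made concrete with the binomial PMF from the coin example. One function, evaluated two ways:

```python
from math import comb

def binom_pmf(k, n, theta):
    # P(X = k) for n trials with success probability theta.
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Probability view: theta fixed at 0.5, the data k varies.
probs = [binom_pmf(k, 10, 0.5) for k in range(11)]

# Likelihood view: data fixed at k = 7, theta varies over 0.1 ... 0.9.
likes = [binom_pmf(7, 10, t / 10) for t in range(1, 10)]

total_prob = sum(probs)   # sums to 1 over all outcomes
# likes need not sum to anything: a likelihood is not a distribution over theta.
```

Note the asymmetry: `probs` is a genuine probability distribution, while `likes` is just a curve whose peak (here at \( \theta = 0.7 \)) is what interests us.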
When Is a Log of Probability Not a Log-Likelihood?
Not all logs of probabilities are log-likelihoods.
If you simply say:
“The probability of rain is 0.7, so log(0.7) ≈ −0.357,”
you’re computing a log-probability, not a log-likelihood, because you’re not treating it as a function of parameters.
Log-likelihood is a specific statistical function:
\[
\log L(\theta) = \log P(\text{data} \mid \theta)
\]
used when estimating parameters, not just describing the chance of an event.
Final Thoughts: Why Likelihood Matters
Likelihood is the cornerstone of statistical inference and machine learning. It provides a principled way to:
- Choose model parameters
- Evaluate competing models
- Understand the uncertainty in predictions
Understanding likelihood helps demystify how models learn, why we use loss functions like cross-entropy, and how "learning" is fundamentally about making the observed data more “likely” under the assumed model.
So next time you hear "log-likelihood," remember — it's not just a logged number. It’s a window into how well your model explains the world you've observed.
Understanding Likelihood and MLE: Another Dimension
In statistical modeling, the concept of likelihood plays a central role in parameter estimation. While often confused with probability, likelihood has a distinct interpretation that underlies many modern machine learning and statistical inference methods.
Likelihood vs Probability
To begin with, consider a probability density function (PDF) or probability mass function (PMF) denoted as:
\[
P(x \mid \theta)
\]
Here, \( x \) represents the data and \( \theta \) denotes the parameters of the model. When we interpret this expression as a function of \( x \) for fixed \( \theta \), we are dealing with probability. However, when the data \( x \) is fixed and we consider this expression as a function of \( \theta \), we obtain the likelihood function:
\[
L(\theta) = P(x \mid \theta)
\]
Thus, likelihood is the same mathematical object as the probability distribution, but viewed differently: we treat the data as given and vary the parameters.
What Does MLE Do?
Maximum Likelihood Estimation (MLE) is a method for estimating the unknown parameters \( \theta \) of a statistical model. The goal is to find the parameter values that maximize the likelihood function. In other words, we search for the parameters that make the observed data most probable under our model.
Mathematically, MLE solves the following optimization problem:
\[
\hat{\theta} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} P(x \mid \theta)
\]
Unlike simulation-based methods, MLE does not involve generating new data or matching empirical distributions. Instead, it uses the fixed observed data and finds the parameter values that explain this data best according to the assumed model.
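For some models this optimization has a closed-form solution. A minimal sketch, assuming a normal model \( \mathcal{N}(\mu, \sigma^2) \) and a small hypothetical dataset: the MLEs are simply the sample mean and the biased sample variance.

```python
import numpy as np

# Hypothetical fixed observations; under MLE the data never changes,
# only the parameters do.
data = np.array([2.1, 1.9, 2.4, 2.0, 2.6])

# Closed-form MLEs for a normal model:
mu_mle = data.mean()                          # sample mean
sigma2_mle = ((data - mu_mle) ** 2).mean()    # divides by n, not n - 1
```

The divide-by-\( n \) variance (rather than \( n-1 \)) is a hallmark of MLE: it maximizes the likelihood exactly, at the cost of a small bias.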
A Subtle but Powerful Shift
This shift in perspective—from treating \( x \) as variable to treating \( \theta \) as variable—is subtle, but it is the foundation of likelihood-based inference. It allows us to treat the model itself as a hypothesis and test how well it explains what we observed.
Conclusion
To summarize:
- Likelihood is the probability density (or mass) function viewed as a function of parameters for fixed data.
- In MLE, we adjust the parameters to maximize this likelihood.
- The resulting parameter values, called Maximum Likelihood Estimates, are those under which the observed data is most probable.
This elegant idea underlies a large portion of classical statistics and modern machine learning, from logistic regression to deep learning loss functions.