Friday, 5 June 2026

Understanding the Paper: An Introduction to Variational Autoencoders

The paper “An Introduction to Variational Autoencoders” by Diederik P. Kingma and Max Welling is a tutorial-style introduction to Variational Autoencoders, commonly called VAEs. The paper explains how VAEs provide a principled framework for learning deep latent variable models and their corresponding inference models.

A VAE is a generative model. This means it tries to learn how data is generated, rather than only learning how to classify data. For example, instead of only learning whether an image belongs to class A or class B, a generative model tries to understand the hidden process that could have produced the image.

Core Idea: A Variational Autoencoder learns a compressed latent representation of data and also learns how to generate new data from that latent representation.

1. What Problem Is the Paper Solving?

The paper addresses a central problem in machine learning: how can we learn a model that captures the hidden structure behind observed data? In many real-world datasets, we observe only the final output, such as an image, a sound, a piece of text, or a transaction. We do not directly observe the hidden factors that generated it.

For example, in an image of a saree, we observe pixels. But behind those pixels there may be hidden factors such as motif, weave, yarn, border structure, camera angle, lighting, fabric fold, color family, and craft cluster. These hidden factors are called latent variables.

| Observed Data | Possible Hidden Latent Factors |
| --- | --- |
| Saree image | Motif, weave, border, pallu, material, lighting, fold, camera angle. |
| Artwork image | Style, genre, artist, period, subject, brushwork, medium. |
| Customer behavior | Taste, price sensitivity, occasion, loyalty, urgency, preference pattern. |

The VAE framework tries to learn these latent factors in a probabilistic way. It does not simply compress data like a normal autoencoder. It learns a probability distribution over the latent space and then learns to generate data from that space.

2. Generative vs Discriminative Modeling

The paper begins by distinguishing between discriminative modeling and generative modeling.

| Type of Model | Main Question | Example |
| --- | --- | --- |
| Discriminative model | Given the input, what is the label? | Given an image, classify whether it is Banarasi or Kanjivaram. |
| Generative model | How could this data have been generated? | Learn the hidden factors that can generate saree-like images. |

A discriminative model learns:

\[ p_{\theta}(y \mid x) \]

This means it learns the probability of label \(y\) given input \(x\).

A generative model learns:

\[ p_{\theta}(x) \]

or, when hidden variables are included:

\[ p_{\theta}(x,z) \]

Here, \(x\) is the observed data and \(z\) is the latent variable. The generative model tries to explain how \(x\) could arise from hidden causes \(z\).

Simple Explanation: A discriminative model asks, “What class is this?” A generative model asks, “What hidden process could have produced this?”

3. Latent Variables and Deep Latent Variable Models

A latent variable is a variable that is part of the model but is not directly observed in the dataset. In a VAE, the latent variable is usually written as:

\[ z \]

The observed data is written as:

\[ x \]

The model assumes that data is generated through a process like this:

\[ z \rightarrow x \]

First, a latent variable \(z\) is sampled from a prior distribution. Then the observed data \(x\) is generated from \(z\). The joint distribution is written as:

\[ p_{\theta}(x,z) = p_{\theta}(z)p_{\theta}(x \mid z) \]

| Term | Meaning |
| --- | --- |
| \(p_{\theta}(z)\) | Prior distribution over latent variables. |
| \(p_{\theta}(x \mid z)\) | Decoder or generative model that creates data from latent variables. |
| \(p_{\theta}(x,z)\) | Joint distribution over observed data and latent variables. |

A deep latent variable model is a latent variable model where one or more probability distributions are parameterized using neural networks. This gives the model high flexibility.
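The two-step generative process (sample \(z\) from the prior, then sample \(x\) given \(z\)) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's code: the prior is taken to be a standard normal, the decoder is a hypothetical one-layer network with random, untrained weights `W` and `b`, and the likelihood is assumed to be Bernoulli over six binary pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

latent_dim, data_dim = 2, 6

# Hypothetical, untrained decoder parameters (theta); chosen only for illustration.
W = rng.normal(size=(data_dim, latent_dim))
b = np.zeros(data_dim)

def sample_joint(rng):
    """Draw (z, x) from p(z) p(x | z) by sampling z first, then x given z."""
    z = rng.standard_normal(latent_dim)              # z ~ p(z), assumed N(0, I)
    logits = W @ z + b                               # decoder maps z to Bernoulli logits
    probs = 1.0 / (1.0 + np.exp(-logits))            # p(x_i = 1 | z)
    x = (rng.random(data_dim) < probs).astype(int)   # x ~ p(x | z)
    return z, x

z, x = sample_joint(rng)
print("z =", z)
print("x =", x)
```

In a deep latent variable model the single linear layer would be replaced by a deeper neural network, but the sampling order stays the same.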

4. What Is a Variational Autoencoder?

A Variational Autoencoder is a model that combines three ideas:

| Idea | Meaning in VAE |
| --- | --- |
| Autoencoder | It learns to compress data into a latent representation and reconstruct it. |
| Variational inference | It approximates an intractable posterior distribution using a simpler distribution. |
| Generative modeling | It learns how to generate new data from the latent space. |

The VAE has two main parts:

\[ \text{Encoder: } x \rightarrow z \]

\[ \text{Decoder: } z \rightarrow x \]

But unlike an ordinary autoencoder, the encoder does not output a single fixed vector. It outputs the parameters of a probability distribution over latent variables. A latent vector \(z\) is then sampled from this distribution and passed to the decoder, which reconstructs or generates data from it.

5. Encoder and Decoder

5.1 Encoder, or Approximate Posterior

The encoder is also called the recognition model or inference model. Its job is to approximate the true posterior distribution:

\[ p_{\theta}(z \mid x) \]

However, this true posterior is usually intractable. So the VAE introduces an approximate posterior:

\[ q_{\phi}(z \mid x) \]

Here, \(\phi\) represents the encoder parameters. The encoder looks at \(x\) and predicts a distribution over possible latent variables \(z\).

5.2 Decoder, or Generative Model

The decoder learns:

\[ p_{\theta}(x \mid z) \]

Here, \(\theta\) represents the decoder parameters. The decoder takes a latent sample \(z\) and generates or reconstructs \(x\).

| VAE Component | Mathematical Form | Simple Meaning |
| --- | --- | --- |
| Encoder | \(q_{\phi}(z \mid x)\) | Given data, infer the latent cause. |
| Decoder | \(p_{\theta}(x \mid z)\) | Given latent cause, generate data. |
| Prior | \(p(z)\) | Assumed distribution over latent space. |

6. Why Direct Learning Is Difficult

The ideal goal is to maximize the likelihood of observed data:

\[ p_{\theta}(x) \]

But in a latent variable model:

\[ p_{\theta}(x) = \int p_{\theta}(x,z)\,dz \]

This integral is usually difficult or impossible to compute exactly when neural networks are involved. This is the main intractability problem.
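A toy case helps show what the integral asks for. The sketch below assumes a deliberately simple one-dimensional model, \(z \sim \mathcal{N}(0,1)\) and \(x \mid z \sim \mathcal{N}(z,1)\), for which the marginal is known exactly: \(p(x) = \mathcal{N}(x;0,2)\). It then approximates the same integral by averaging \(p(x \mid z)\) over samples from the prior. The model and the helper `normal_pdf` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

rng = np.random.default_rng(0)
x = 1.5

# Naive Monte Carlo: p(x) = E_{z ~ p(z)}[ p(x | z) ], averaged over prior samples.
z = rng.standard_normal(200_000)
mc_estimate = normal_pdf(x, mean=z, var=1.0).mean()

# Closed form, available only because this toy model is linear-Gaussian.
exact = normal_pdf(x, mean=0.0, var=2.0)

print(f"Monte Carlo estimate: {mc_estimate:.4f}")
print(f"Exact marginal:       {exact:.4f}")
```

Here the two numbers agree closely. When \(p_{\theta}(x \mid z)\) is a neural network there is no closed form to compare against, and when \(z\) is high-dimensional most prior samples explain a given \(x\) poorly, so this naive estimator needs far more samples to be reliable.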

The posterior is also difficult:

\[ p_{\theta}(z \mid x) = \frac{p_{\theta}(x,z)}{p_{\theta}(x)} \]

Since \(p_{\theta}(x)\) is intractable, the posterior \(p_{\theta}(z \mid x)\) also becomes intractable.

Why VAE Is Needed: The VAE gives us a practical way to train deep latent variable models even when the true likelihood and posterior are difficult to compute directly.

7. Evidence Lower Bound, or ELBO

The central mathematical idea in VAEs is the Evidence Lower Bound, or ELBO. Instead of directly maximizing the intractable log-likelihood:

\[ \log p_{\theta}(x) \]

the VAE maximizes a lower bound on it:

\[ \mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_{\phi}(z \mid x)} [ \log p_{\theta}(x \mid z) ] - D_{KL} \left( q_{\phi}(z \mid x) \parallel p_{\theta}(z) \right) \]

This equation has two important parts:

| ELBO Term | Meaning | Intuition |
| --- | --- | --- |
| \(\mathbb{E}_{q_{\phi}(z \mid x)}[\log p_{\theta}(x \mid z)]\) | Reconstruction term | The decoder should reconstruct the input well from the latent variable. |
| \(D_{KL}(q_{\phi}(z \mid x) \parallel p_{\theta}(z))\) | Regularization term | The encoder’s latent distribution should stay close to the prior. |

The ELBO can be understood as:

\[ \text{ELBO} = \text{Reconstruction quality} - \text{Latent space penalty} \]

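The reconstruction term encourages the model to preserve information about the input. The KL divergence term penalizes encoder distributions that drift far from the prior, which keeps the latent space regular enough that points sampled from the prior are more likely to decode to plausible data.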
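Why this objective is a lower bound on \(\log p_{\theta}(x)\) follows from a short derivation. The sketch below uses only the definitions already introduced: Bayes' rule for the posterior and the factorization \(p_{\theta}(x,z) = p_{\theta}(z)p_{\theta}(x \mid z)\). Since \(\log p_{\theta}(x)\) does not depend on \(z\), it can be written as an expectation under \(q_{\phi}(z \mid x)\):

\[ \begin{aligned} \log p_{\theta}(x) &= \mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log \frac{p_{\theta}(x,z)}{p_{\theta}(z \mid x)} \right] \\ &= \mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log \frac{p_{\theta}(x,z)}{q_{\phi}(z \mid x)} \right] + \mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log \frac{q_{\phi}(z \mid x)}{p_{\theta}(z \mid x)} \right] \\ &= \mathcal{L}(\theta,\phi;x) + D_{KL} \left( q_{\phi}(z \mid x) \parallel p_{\theta}(z \mid x) \right) \end{aligned} \]

Because a KL divergence is never negative, \(\mathcal{L}(\theta,\phi;x) \le \log p_{\theta}(x)\), with equality when the approximate posterior matches the true posterior. Substituting \(p_{\theta}(x,z) = p_{\theta}(z)p_{\theta}(x \mid z)\) into the first expectation of the second line gives the reconstruction term minus \(D_{KL}(q_{\phi}(z \mid x) \parallel p_{\theta}(z))\).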

8. Reparameterization Trick

A key contribution of the VAE framework is the reparameterization trick. The difficulty is that the encoder samples:

\[ z \sim q_{\phi}(z \mid x) \]

A sampling step is not a differentiable function of the encoder parameters \(\phi\), so gradients cannot be backpropagated through it directly. But neural networks are trained using backpropagation, which requires those gradients.

The reparameterization trick rewrites the random sampling process as a deterministic transformation of noise:

\[ z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon \]

where:

\[ \epsilon \sim \mathcal{N}(0,I) \]

| Symbol | Meaning |
| --- | --- |
| \(\mu_{\phi}(x)\) | Mean predicted by the encoder. |
| \(\sigma_{\phi}(x)\) | Standard deviation predicted by the encoder. |
| \(\epsilon\) | Random noise sampled from a standard normal distribution. |
| \(\odot\) | Element-wise multiplication. |

This trick separates the randomness from the learnable parameters. As a result, gradients can flow through \(\mu_{\phi}(x)\) and \(\sigma_{\phi}(x)\), making stochastic gradient optimization possible.

Simple Explanation: Instead of sampling \(z\) directly from the encoder distribution, we sample noise separately and then transform it using the encoder’s mean and standard deviation.
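To see that gradients really do flow through the reparameterized sample, the following NumPy sketch uses a toy objective whose answer is known in advance. It assumes scalar values `mu` and `sigma` standing in for the encoder outputs, and a hypothetical objective \(f(z) = z^2\), so that \(\mathbb{E}[f(z)] = \mu^2 + \sigma^2\) and the true gradients are \(2\mu\) and \(2\sigma\). None of these choices come from the paper; they are picked so the estimate can be checked by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the encoder outputs mu_phi(x) and sigma_phi(x).
mu, sigma = 0.5, 2.0

# Noise is sampled independently of mu and sigma.
eps = rng.standard_normal(500_000)

# Reparameterized sample: a deterministic function of (mu, sigma, eps).
z = mu + sigma * eps

# Toy objective f(z) = z**2. By the chain rule through z:
#   df/dmu    = 2z * dz/dmu    = 2z
#   df/dsigma = 2z * dz/dsigma = 2z * eps
grad_mu = np.mean(2.0 * z)
grad_sigma = np.mean(2.0 * z * eps)

print(f"grad wrt mu:    {grad_mu:.3f} (analytic {2 * mu:.3f})")
print(f"grad wrt sigma: {grad_sigma:.3f} (analytic {2 * sigma:.3f})")
```

In a real VAE an automatic differentiation library applies the same chain rule, with the decoder and the ELBO in place of the toy objective.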

9. Factorized Gaussian Posterior

A common choice in VAEs is to use a factorized Gaussian posterior:

\[ q_{\phi}(z \mid x) = \mathcal{N} \left( z; \mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^{2}(x)) \right) \]

This means the encoder predicts a mean vector and a diagonal covariance matrix. The diagonal covariance assumption makes computation simpler because each latent dimension is treated as conditionally independent given \(x\).

The prior is usually chosen as a standard normal distribution:

\[ p(z) = \mathcal{N}(z;0,I) \]

This gives the latent space a smooth and regular structure. New samples can be generated by sampling:

\[ z \sim \mathcal{N}(0,I) \]

and then passing \(z\) through the decoder.
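With a factorized Gaussian posterior and a standard normal prior, the KL term of the ELBO has a well-known closed form, \(\frac{1}{2}\sum_j \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)\), so it does not need to be estimated by sampling. The sketch below implements that formula. The function name is hypothetical, and the choice to take the log-variance as input rather than \(\sigma\) is an assumption that mirrors a common implementation habit, not something stated in this summary.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

# The posterior equals the prior, so the divergence is zero.
kl_same = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one mean by 1 with unit variances costs 0.5 * 1**2 = 0.5.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0])

print(kl_same, kl_shifted)
```

The value grows as the predicted mean moves away from zero or as the predicted variance moves away from one, which is how the term pulls the encoder toward the prior.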

10. How a VAE Is Trained

The training process of a VAE can be summarized as follows:

| Step | What Happens |
| --- | --- |
| Step 1 | Input data \(x\) is passed into the encoder. |
| Step 2 | The encoder predicts \(\mu_{\phi}(x)\) and \(\sigma_{\phi}(x)\). |
| Step 3 | A latent sample \(z\) is created using the reparameterization trick. |
| Step 4 | The decoder generates or reconstructs \(x\) from \(z\). |
| Step 5 | The model maximizes the ELBO, balancing reconstruction and regularization. |

The training objective is:

\[ \max_{\theta,\phi} \mathcal{L}(\theta,\phi;x) \]

or equivalently, minimizing the negative ELBO:

\[ \min_{\theta,\phi} - \mathcal{L}(\theta,\phi;x) \]

In practice, this is optimized using stochastic gradient descent or variants such as Adam.
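The five training steps can be traced in a single forward pass. The sketch below computes a one-sample estimate of the negative ELBO for one binary input, under several assumptions that are not from the paper: the encoder and decoder are hypothetical single linear layers with random, untrained weights, the likelihood is Bernoulli, the prior is a standard normal, and the encoder outputs a log-variance. The gradient update itself is omitted, since in practice it would be handled by an automatic differentiation library.

```python
import numpy as np

rng = np.random.default_rng(0)
data_dim, latent_dim = 8, 2

# Hypothetical, randomly initialized parameters; nothing here is trained.
phi = {
    "W_mu": 0.1 * rng.standard_normal((latent_dim, data_dim)),
    "W_logvar": 0.1 * rng.standard_normal((latent_dim, data_dim)),
}
theta = {
    "W": 0.1 * rng.standard_normal((data_dim, latent_dim)),
    "b": np.zeros(data_dim),
}

def negative_elbo(x, phi, theta, rng):
    """One-sample estimate of -ELBO for a single binary input vector x."""
    # Steps 1-2: the encoder predicts the mean and log-variance of q(z | x).
    mu = phi["W_mu"] @ x
    log_var = phi["W_logvar"] @ x

    # Step 3: reparameterization trick, z = mu + sigma * eps.
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps

    # Step 4: the decoder maps z to Bernoulli logits over x.
    logits = theta["W"] @ z + theta["b"]

    # Reconstruction term log p(x | z): x * logit - softplus(logit), summed.
    log_px_given_z = np.sum(x * logits - np.logaddexp(0.0, logits))

    # Regularization term: closed-form KL to the standard normal prior.
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)

    # Step 5: training would minimize this quantity with respect to phi and theta.
    return -(log_px_given_z - kl), log_px_given_z, kl

x = (rng.random(data_dim) < 0.5).astype(float)   # one made-up binary example
loss, log_px_given_z, kl = negative_elbo(x, phi, theta, rng)
print(f"negative ELBO = {loss:.3f} (reconstruction {log_px_given_z:.3f}, KL {kl:.3f})")
```

Averaging this loss over a minibatch and taking a gradient step with an optimizer such as Adam gives one iteration of training.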

11. Extensions Beyond Basic VAEs

The paper also discusses important extensions beyond the basic VAE.

11.1 Beyond Gaussian Posteriors

A simple Gaussian posterior may be too limited. The true posterior may be complex, curved, or multimodal. To make the approximate posterior more flexible, researchers use transformations such as normalizing flows.

A flow transforms a simple random variable into a more complex one:

\[ z_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(z_0) \]

This allows the posterior distribution to become more expressive while remaining computationally tractable, provided each \(f_k\) is invertible and its Jacobian determinant is cheap to compute.
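The tractability requirement can be made concrete with the change-of-variables formula. Writing \(z_k = f_k(z_{k-1})\) and assuming each \(f_k\) is invertible and differentiable, the log-density of the final variable is the log-density of the initial sample minus the accumulated log-volume change:

\[ \log q_K(z_K \mid x) = \log q_0(z_0 \mid x) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right| \]

Each term in the sum is inexpensive when the Jacobian is triangular, because the determinant is then just the product of the diagonal entries.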

11.2 Inverse Autoregressive Flow

The paper discusses Inverse Autoregressive Flow, or IAF, as one method for improving posterior flexibility. IAF allows the encoder distribution to move beyond a simple diagonal Gaussian and better approximate complex posterior shapes.

11.3 Deeper Generative Models

The paper also discusses deeper generative models with multiple latent variables:

\[ p_{\theta}(x,z_1,z_2,\ldots,z_L) \]

Such models can represent hierarchical structure in data. For example, in images, higher-level latent variables may represent broad structure, while lower-level variables may represent local details.

12. Strengths of VAEs

The first strength of VAEs is that they provide a principled probabilistic framework for learning latent representations. Unlike ordinary autoencoders, VAEs explicitly model uncertainty.

The second strength is that VAEs can generate new samples. Once trained, the decoder can generate data by sampling from the latent prior:

\[ z \sim p(z), \quad x \sim p_{\theta}(x \mid z) \]

The third strength is that VAEs support representation learning. The latent space may capture meaningful factors of variation in the data.

The fourth strength is that VAEs are trainable using stochastic gradient descent, making them scalable to large datasets and neural network architectures.

13. Limitations and Challenges

VAEs also have limitations. One major challenge is that the approximate posterior \(q_{\phi}(z \mid x)\) may be too simple. A diagonal Gaussian may not capture a complex true posterior.

Another challenge is that generated samples from VAEs may sometimes look blurry, especially in image generation. This happens because likelihood-based reconstruction objectives often encourage averaging over possible outputs.

A third challenge is balancing reconstruction quality and latent regularization. If the KL term dominates the objective, the approximate posterior \(q_{\phi}(z \mid x)\) can become nearly identical to the prior for every input, so \(z\) carries little information about \(x\) and the decoder learns to ignore it. This problem is often called posterior collapse.

A fourth challenge is interpretability. Although VAEs are often used for representation learning, the learned latent dimensions do not automatically correspond to clean human-interpretable concepts unless additional constraints or objectives are introduced.

14. Connection with Saree and Textile Research

VAEs can be highly relevant for textile and saree research because saree images contain many hidden factors of variation. A saree image is not only a picture. It carries hidden structural and cultural information.

| Latent Factor | Saree Example |
| --- | --- |
| Motif structure | Peacock, mango, floral buta, temple, parrot. |
| Layout | Body, border, pallu, selvedge arrangement. |
| Material impression | Silk, cotton, tussar, zari-rich surface. |
| Craft identity | Kanchipuram, Banaras, Paithani, Gadwal, Ilkal. |
| Image condition | Lighting, fold, drape, camera angle, background. |

A VAE trained on saree images could learn a latent space where visually and structurally similar sarees are placed close together. This could support:

| Application | How VAE Could Help |
| --- | --- |
| Image retrieval | Find sarees with similar visual or structural patterns. |
| Data augmentation | Generate controlled variations of saree images. |
| Representation learning | Learn useful embeddings before classification. |
| Anomaly detection | Identify unusual or out-of-distribution saree images. |
| Design exploration | Explore smooth transitions between motif, color, or layout styles. |
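As one example, image retrieval reduces to a nearest-neighbor search once every image has a latent vector. The sketch below is hypothetical throughout: the `latents` array is random data standing in for the encoder means \(\mu_{\phi}(x)\) of 100 saree images, the 16-dimensional latent size is arbitrary, and Euclidean distance is only one possible similarity measure.

```python
import numpy as np

def nearest_neighbors(query, latents, k=3):
    """Return indices of the k latent vectors closest to `query` (Euclidean distance)."""
    distances = np.linalg.norm(latents - query, axis=1)
    return np.argsort(distances)[:k]

rng = np.random.default_rng(0)

# Stand-in for encoder means of 100 images; a trained VAE encoder would supply these.
latents = rng.standard_normal((100, 16))

# A query that is a slightly perturbed copy of image 7, e.g. a re-photographed saree.
query = latents[7] + 0.01 * rng.standard_normal(16)

top = nearest_neighbors(query, latents, k=3)
print("closest images:", top)
```

Whether neighbors in the latent space share motif or weave rather than lighting or background depends on what the trained model has actually encoded, so retrieval quality would need to be checked on real data.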

For saree provenance classification, a VAE may not be sufficient by itself because provenance depends on cultural and technical knowledge. However, it can be useful as a representation-learning tool. The learned latent vectors can be combined with CNNs, Vision Transformers, or knowledge graph models.

For example:

\[ \text{Saree image} \rightarrow \text{VAE encoder} \rightarrow \text{Latent representation} \rightarrow \text{Classifier} \rightarrow \text{Craft cluster} \]

Or, in a richer model:

\[ \text{Visual latent representation} + \text{Textile knowledge graph} \rightarrow \text{Saree provenance prediction} \]

This makes VAEs relevant not as a final classification solution alone, but as a powerful way to learn hidden structure in saree image datasets.

15. One-Sentence Summary

The paper explains Variational Autoencoders as a principled framework for learning deep latent variable models, where an encoder approximates the posterior over hidden variables, a decoder generates data from those variables, and the whole model is trained by maximizing the Evidence Lower Bound using the reparameterization trick.

General Disclaimer: This explanation is intended for educational and conceptual understanding. It simplifies some technical details of the original paper while preserving the main ideas, equations, architecture, training objective, extensions, and practical implications.
