Saturday, 6 June 2026

Understanding the Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

An Image is Worth 16x16 Words: Understanding Vision Transformers

The paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy et al. introduced one of the most influential ideas in modern computer vision: the Vision Transformer, commonly called ViT. The central idea of the paper is simple but powerful. Instead of processing an image through convolutional filters, the image is divided into small patches, and these patches are treated like words in a sentence.

This idea changed the way researchers thought about image recognition. Until then, convolutional neural networks dominated computer vision because their architecture builds in assumptions about local image structure. ViT challenged the view that such built-in assumptions are necessary by showing that a nearly standard Transformer, when trained at sufficient scale, can perform extremely well on image classification tasks.

1. Core Idea of the Paper

The main argument of the paper is that an image can be converted into a sequence and then processed by a Transformer, just as text is processed in natural language processing. In NLP, a sentence is broken into tokens. In ViT, an image is broken into patches.

For example, an image of size \(224 \times 224\) can be divided into patches of size \(16 \times 16\). Each patch is flattened into a vector, projected into an embedding space, given positional information, and then passed into a Transformer encoder.

The phrase “An Image is Worth 16x16 Words” comes from this idea. Each \(16 \times 16\) image patch behaves like one visual word. The image becomes a sentence made of visual tokens.

Simple intuition: A CNN looks at an image through local filters. ViT looks at an image as a sequence of patches and learns relationships between patches using self-attention.

2. Why This Paper Was Important

Before this paper, computer vision was dominated by convolutional neural networks such as AlexNet, VGG, ResNet, and EfficientNet. CNNs were considered natural for images because they used locality and translation equivariance. These properties are built into the architecture itself.

Transformers, on the other hand, were dominant in NLP but not yet mainstream in image recognition. Many earlier approaches tried to combine CNNs with attention or use attention only in selected parts of the vision pipeline. This paper asked a bold question: Can we remove convolutions almost entirely and use a plain Transformer directly for image classification?

The answer was yes, but with an important condition. ViT works extremely well when trained on very large datasets. When trained only on mid-sized datasets such as ImageNet-1k without strong regularization, it falls a few percentage points short of ResNets of comparable size. But when pre-trained on large datasets such as ImageNet-21k or JFT-300M and then fine-tuned, it becomes highly competitive and often superior.

3. Vision Transformer Architecture

The Vision Transformer follows the original Transformer encoder architecture quite closely. The authors intentionally kept the design simple so that existing Transformer ideas from NLP could be reused for vision.

The architecture has the following main steps:

  1. The image is split into fixed-size patches.
  2. Each patch is flattened into a vector.
  3. A linear projection converts each patch vector into an embedding.
  4. A learnable class token is added to the beginning of the sequence.
  5. Position embeddings are added so that the model knows where each patch came from.
  6. The sequence is passed through a Transformer encoder.
  7. The final output corresponding to the class token is used for image classification.

| Stage | What Happens | Purpose |
|---|---|---|
| Patch extraction | The image is divided into fixed-size patches such as \(16 \times 16\). | Converts a 2D image into a sequence-like structure. |
| Linear projection | Each flattened patch is projected into a vector of dimension \(D\). | Creates patch embeddings similar to word embeddings. |
| Class token | A learnable token is placed at the start of the sequence. | Acts as the final image-level representation for classification. |
| Position embedding | Position information is added to each patch embedding. | Helps the model understand patch location. |
| Transformer encoder | The patch sequence is processed using self-attention and MLP layers. | Learns relationships between image patches. |
| Classification head | The class token output is passed to a classifier. | Produces the final image class prediction. |
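
To make the stages above concrete, here is a small sketch that traces tensor shapes through the pipeline for a single image. The \(224 \times 224\) RGB input and \(16 \times 16\) patches follow the example used in this article. The embedding width `dim=768` matches the ViT-Base configuration, while `num_classes=1000` and the function name `vit_shape_trace` are assumptions made only for illustration.

```python
def vit_shape_trace(height=224, width=224, channels=3, patch=16,
                    dim=768, num_classes=1000):
    """Return (stage, shape) pairs for one image passing through ViT."""
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    num_patches = (height // patch) * (width // patch)
    patch_len = patch * patch * channels  # length of one flattened patch
    return [
        ("input image", (height, width, channels)),
        ("flattened patches", (num_patches, patch_len)),
        ("patch embeddings", (num_patches, dim)),
        ("with class token", (num_patches + 1, dim)),
        ("with position embeddings", (num_patches + 1, dim)),
        ("encoder output", (num_patches + 1, dim)),
        ("class token output", (dim,)),
        ("class logits", (num_classes,)),
    ]


for stage, shape in vit_shape_trace():
    print(f"{stage:26s} {shape}")
```

Note that the sequence length stays fixed after the class token is added: the encoder changes the content of each token, not the number of tokens.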

4. How Images Become Patches

Suppose the input image is:

\[ x \in \mathbb{R}^{H \times W \times C} \]

Here, \(H\) is image height, \(W\) is image width, and \(C\) is the number of channels. For an RGB image, \(C = 3\).

The image is divided into patches of size:

\[ P \times P \]

The number of patches becomes:

\[ N = \frac{HW}{P^2} \]

For example, if the image size is \(224 \times 224\) and patch size is \(16 \times 16\), then:

\[ N = \frac{224 \times 224}{16^2} = 196 \]

So, the image becomes a sequence of 196 patch tokens. After adding one class token, the total sequence length becomes 197.
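
The patch-splitting step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the helper name `patchify` is hypothetical, and the image is filled with placeholder values so that the result can be checked against direct slicing.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into N flattened patches of length patch*patch*C."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, ordered left to right, then top to bottom.
    return grid.reshape(rows * cols, patch * patch * c)


# Placeholder image: distinct values make the patch order easy to verify.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = patchify(image, patch=16)
print(patches.shape)  # (196, 768)
```

Each row of `patches` is one flattened \(16 \times 16 \times 3\) patch, so the 196 rows correspond to the 196 patch tokens computed above.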

Text analogy: In BERT, a sentence becomes a sequence of word tokens. In ViT, an image becomes a sequence of patch tokens.

5. Important Equations

The paper describes the ViT input sequence as follows:

\[ z_0 = [x_{class}; x_p^1E; x_p^2E; \cdots; x_p^NE] + E_{pos} \]

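Here, \(x_{class}\) is the learnable class token, \(x_p^i\) is the \(i\)-th flattened image patch of length \(P^2 C\), \(E \in \mathbb{R}^{(P^2 C) \times D}\) is the trainable linear projection matrix, and \(E_{pos} \in \mathbb{R}^{(N+1) \times D}\) is the position embedding, with one row for the class token and one for each of the \(N\) patches.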
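
The input equation can be written out directly in NumPy. In this sketch all parameters are random or zero placeholders (in a real model they are learned), and the width `D = 64` is an arbitrary small value chosen for illustration rather than a setting from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, patch_len, D = 196, 768, 64  # D = 64 is an arbitrary illustrative width

x_p = rng.normal(size=(N, patch_len))        # flattened patches x_p^1 ... x_p^N
E = rng.normal(size=(patch_len, D)) * 0.02   # linear projection E (learned in practice)
x_class = np.zeros((1, D))                   # class token (learned in practice)
E_pos = rng.normal(size=(N + 1, D)) * 0.02   # position embeddings (learned in practice)

# z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos
z0 = np.concatenate([x_class, x_p @ E], axis=0) + E_pos
print(z0.shape)  # (197, 64)
```

Row 0 of `z0` is the class token position, and rows 1 to 196 are the embedded patches, each shifted by its own position embedding.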

The Transformer encoder then applies multi-head self-attention and MLP blocks repeatedly:

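\[ z'_{\ell} = \text{MSA}(\text{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1, \ldots, L \]

\[ z_{\ell} = \text{MLP}(\text{LN}(z'_{\ell})) + z'_{\ell}, \qquad \ell = 1, \ldots, L \]

The final image representation is the layer-normalized class-token output of the last layer:

\[ y = \text{LN}(z_L^0) \]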
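
The encoder equations translate almost line by line into code. The following is a simplified NumPy sketch, not the paper's implementation: biases and the learnable scale and shift of layer normalization are omitted, GELU uses the common tanh approximation, the MLP hidden width is assumed to be four times the embedding width, and all weights are small random placeholders.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Simplified LN: no learnable scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def msa(x, w_qkv, w_out, heads):
    """Multi-head self-attention over a (tokens, dim) sequence."""
    n, d = x.shape
    d_h = d // heads
    qkv = (x @ w_qkv).reshape(n, 3, heads, d_h).transpose(1, 2, 0, 3)
    q, k, v = qkv[0], qkv[1], qkv[2]                          # each (heads, n, d_h)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_h))   # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)         # concatenate heads
    return out @ w_out


def mlp(x, w1, w2):
    return gelu(x @ w1) @ w2


def encoder_block(z, p, heads):
    z_prime = msa(layer_norm(z), p["w_qkv"], p["w_out"], heads) + z   # z'_l
    return mlp(layer_norm(z_prime), p["w1"], p["w2"]) + z_prime        # z_l


rng = np.random.default_rng(0)
n, d, heads, layers = 197, 64, 4, 2  # illustrative sizes, not from the paper


def init_params():
    return {
        "w_qkv": rng.normal(size=(d, 3 * d)) * 0.02,
        "w_out": rng.normal(size=(d, d)) * 0.02,
        "w1": rng.normal(size=(d, 4 * d)) * 0.02,
        "w2": rng.normal(size=(4 * d, d)) * 0.02,
    }


z = rng.normal(size=(n, d))          # stands in for z_0
for p in [init_params() for _ in range(layers)]:
    z = encoder_block(z, p, heads)
y = layer_norm(z[0])                 # y = LN(z_L^0), the class-token output
print(z.shape, y.shape)
```

The two residual additions are what make deep stacks trainable: if the attention and MLP branches output zeros, each block simply passes its input through unchanged.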

The most important idea behind these equations is that every token can attend to every other token. This allows the model to learn global image relationships from the very first layer, unlike CNNs, where the receptive field grows gradually through deeper layers.

Self-Attention Equation

Self-attention is based on queries, keys, and values:

\[ A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \]

\[ \text{Attention}(Q,K,V) = AV \]

This means the model calculates how much each patch should attend to every other patch. If one patch contains a motif and another patch contains a related border or texture, self-attention can learn that relationship directly.
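
A single attention head can be sketched as follows. The queries, keys, and values here are random placeholders for five tokens; in a real model they are linear projections of the token sequence. The function name `attention` is chosen for illustration.

```python
import numpy as np


def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the weight matrix A."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    a = np.exp(scores)
    a = a / a.sum(axis=-1, keepdims=True)                 # softmax over keys
    return a @ v, a


rng = np.random.default_rng(0)
n, d_k = 5, 8  # 5 tokens, 8 dimensions per head (illustrative)
q, k, v = (rng.normal(size=(n, d_k)) for _ in range(3))

out, a = attention(q, k, v)
print(a.shape, out.shape)  # (5, 5) (5, 8)
print(a.sum(axis=-1))      # each row of A sums to 1
```

Row \(i\) of \(A\) is a probability distribution describing how much token \(i\) draws from every token in the sequence, so each output row is a weighted average of the value vectors.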

6. Inductive Bias: CNNs vs ViT

A key discussion in the paper is the difference between CNNs and Vision Transformers in terms of inductive bias. Inductive bias means the assumptions built into a model before training begins.

CNNs have strong image-specific inductive biases. They assume that nearby pixels are related, that patterns can appear in different parts of the image, and that local features are important. These assumptions are very useful when data is limited.

ViT has much less image-specific inductive bias. Apart from splitting the image into patches and adding positional embeddings, it does not strongly assume locality or translation equivariance. This makes ViT more flexible, but also more data-hungry.

| Aspect | CNN | Vision Transformer |
|---|---|---|
| Basic unit | Pixels and local receptive fields | Image patches treated as tokens |
| Core operation | Convolution | Self-attention |
| Locality | Strongly built in | Weakly built in through patches |
| Global relationship learning | Usually emerges in deeper layers | Possible from early layers |
| Data requirement | Works well with comparatively smaller datasets | Needs large-scale pre-training for best performance |
| Interpretation | Learns local filters and hierarchical features | Learns relationships between patch tokens |
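
One way to see how little spatial structure self-attention assumes is to shuffle the tokens. Without position embeddings, a self-attention layer treats the patches as an unordered set: permuting the input tokens simply permutes the outputs in the same way. The sketch below checks this with random placeholder weights and a fixed, arbitrarily chosen permutation; it illustrates the general property and is not an experiment from the paper.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a (tokens, dim) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    a = np.exp(scores)
    a = a / a.sum(axis=-1, keepdims=True)
    return a @ v


rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.normal(size=(n, d))                       # placeholder patch embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
perm = np.array([2, 0, 5, 1, 4, 3])               # a fixed shuffle of the tokens

# Without position embeddings: shuffling inputs only shuffles outputs.
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)
order_blind = np.allclose(out_shuffled, out[perm])

# With position embeddings tied to sequence slots, patch order matters.
e_pos = rng.normal(size=(n, d))
out_pos = self_attention(x + e_pos, w_q, w_k, w_v)
out_pos_shuffled = self_attention(x[perm] + e_pos, w_q, w_k, w_v)
order_aware = not np.allclose(out_pos_shuffled, out_pos[perm])

print(order_blind, order_aware)
```

This is why the position embeddings matter: they are the only place where the model receives information about where each patch came from, and everything else about spatial layout has to be learned from data.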

7. Experiments and Results

The authors evaluated ViT on several datasets including ImageNet, ImageNet-ReaL, CIFAR-10, CIFAR-100, Oxford-IIIT Pets, Oxford Flowers-102, and VTAB. They compared ViT against strong convolutional baselines such as BiT and Noisy Student.

One of the major findings was that ViT performed extremely well when pre-trained on large datasets. The best model, ViT-H/14 pre-trained on JFT-300M, achieved very strong results across multiple benchmarks.

| Dataset | ViT-H/14 Accuracy | Why It Matters |
|---|---|---|
| ImageNet | 88.55% | Shows strong performance on a standard image recognition benchmark. |
| ImageNet-ReaL | 90.72% | Shows robustness on cleaned-up ImageNet labels. |
| CIFAR-100 | 94.55% | Shows strong transfer to a smaller classification dataset. |
| Oxford-IIIT Pets | 97.56% | Shows good fine-grained recognition capability. |
| VTAB | 77.63% | Shows generalization across varied visual tasks. |

The paper also reported that ViT reached these results with substantially less pre-training compute than the CNN baselines it was compared against. In other words, large Transformers can use pre-training compute very effectively.

Data Scale Matters

One of the most important conclusions of the paper is:

Large-scale training can compensate for weaker image-specific inductive bias.

When ViT is trained on smaller datasets, CNNs often perform better because their architectural assumptions are useful. But when ViT is trained on very large datasets, it can learn visual structure from data and outperform CNNs.

8. How ViT Looks at Images

The authors also inspected what ViT learns internally. They found that the model learns meaningful patch embeddings and positional relationships. Even though the position embeddings are one-dimensional, the model learns patterns that reflect the two-dimensional structure of images.

Another important observation is that some attention heads attend globally from early layers, while others focus locally. This means ViT can learn both local and global relationships depending on the attention head and layer.
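
One simple way to quantify how local or global a head is, related to the attention-distance analysis in the paper, is to average the pixel distance between each query patch and the patches it attends to, weighted by the attention weights. The sketch below is a simplified version under stated assumptions: the class token is left out, distances are measured between patch origins, and the two attention maps are idealized toy examples rather than weights from a trained model.

```python
import numpy as np


def mean_attention_distance(attn, grid, patch):
    """attn: (N, N) attention weights among patch tokens, N = grid * grid."""
    rows, cols = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([rows, cols], axis=1) * patch      # pixel position of each patch
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))             # (N, N) pixel distances
    # Attention-weighted distance per query patch, averaged over all queries.
    return float((attn * dist).sum(axis=-1).mean())


grid, patch = 14, 16          # 224x224 image with 16x16 patches
n = grid * grid

local_head = np.eye(n)                    # toy head: each patch attends only to itself
global_head = np.full((n, n), 1.0 / n)    # toy head: uniform attention over all patches

local_dist = mean_attention_distance(local_head, grid, patch)
global_dist = mean_attention_distance(global_head, grid, patch)
print(local_dist, global_dist)
```

A head with a small value behaves much like a local filter, while a head with a large value integrates information from across the whole image.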

This is very useful for visual recognition. A model may need to attend to a small texture in one part of the image and connect it with a larger pattern elsewhere. In textiles, this can be especially relevant because motif, border, pallu, weave texture, and color layout may all contribute to classification.

9. Limitations and Challenges

Although ViT was a breakthrough, the paper also makes its limitations clear. ViT is not automatically better than CNNs in all situations. Its performance depends strongly on pre-training scale.

The first limitation is the need for large datasets. CNNs can perform strongly on smaller datasets because their architecture already encodes useful assumptions about images. ViT needs to learn many of these relationships from data.

The second limitation is computational cost. Although ViT may be efficient at scale, training large Transformer models still requires significant hardware and careful optimization.

The third limitation is that the original paper focuses mainly on image classification. Tasks such as detection, segmentation, and localization require additional adaptation.

| Limitation | Explanation | Practical Meaning |
|---|---|---|
| Needs large-scale pre-training | ViT lacks strong image-specific inductive bias. | It may underperform on small datasets if trained from scratch. |
| High compute requirement | Large models and large datasets require substantial resources. | Fine-tuning pre-trained models is often more practical than training from scratch. |
| Patch-level representation | Very small details may be affected by patch size. | Fine-grained domains may need careful patch-size selection. |
| Mainly classification-focused | The original work primarily evaluated recognition benchmarks. | Other vision tasks need architectural or training modifications. |
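
The patch-size trade-off can be made concrete with a little arithmetic. Smaller patches preserve finer detail but produce longer sequences, and the attention matrix grows with the square of the sequence length. The sketch below assumes a \(224 \times 224\) input; the list of patch sizes and the function name `patch_tradeoff` are illustrative choices.

```python
def patch_tradeoff(image_size=224, patch_sizes=(32, 16, 14, 8)):
    """Return (patch size, patches, sequence length, attention entries per head)."""
    rows = []
    for p in patch_sizes:
        assert image_size % p == 0, "image must tile evenly"
        num_patches = (image_size // p) ** 2
        seq_len = num_patches + 1            # plus the class token
        rows.append((p, num_patches, seq_len, seq_len ** 2))
    return rows


for p, num_patches, seq_len, entries in patch_tradeoff():
    print(f"P={p:2d}  patches={num_patches:4d}  seq_len={seq_len:4d}  attn_entries={entries}")
```

Halving the patch size from 16 to 8 quadruples the number of patches and increases the size of each attention matrix roughly sixteenfold, which is why fine-grained domains cannot simply shrink the patch size for free.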

10. Relevance for Textile and Saree Image Classification

This paper is highly relevant for textile image analysis and saree provenance classification. Traditional saree identification often depends on relationships between multiple visual cues: border structure, pallu design, body motifs, weave texture, color placement, zari layout, and regional design grammar.

A CNN is strong at detecting local texture and motif patterns. However, a Vision Transformer can potentially model relationships between distant regions of the saree image. For example, the border may appear on one side, the pallu in another region, and the body motif across the center. Self-attention allows these distant visual regions to communicate with each other.

In saree classification, this is important because provenance is not always determined by a single local feature. A saree may require relational visual reasoning: how the motif relates to the border, how the pallu relates to the body, how the weave structure supports the regional identity, and how color placement follows a traditional craft grammar.

| ViT Concept | Possible Textile Interpretation |
|---|---|
| Patch tokens | Small visual regions of fabric, motif, texture, border, or pallu. |
| Self-attention | Relationship between distant textile features. |
| Position embeddings | Location of design elements within the saree image. |
| Class token | Overall representation used to classify the saree origin or type. |
| Large-scale pre-training | Useful when adapting general vision models to textile datasets. |

For a saree-origin classification project, ViT can be considered in three ways. First, it can be used as a pre-trained feature extractor. Second, it can be fine-tuned on labeled saree images. Third, it can be combined with CNNs or graph-based models to capture both local texture and broader relational structure.
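
The first option, using ViT as a frozen feature extractor, amounts to training only a small linear classifier on top of the class-token features. The sketch below shows that classifier in plain NumPy. It makes several loud assumptions: the features are synthetic placeholders standing in for the outputs of a pre-trained ViT, the three classes and the feature width of 32 are arbitrary, and no real saree data is involved. The function name `train_linear_head` is hypothetical.

```python
import numpy as np


def train_linear_head(features, labels, num_classes, lr=0.05, steps=200):
    """Softmax regression on frozen features, trained with plain gradient descent."""
    n, d = features.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    losses = []
    for _ in range(steps):
        logits = features @ w + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        losses.append(float(-np.log(probs[np.arange(n), labels] + 1e-12).mean()))
        grad = (probs - onehot) / n          # gradient of cross-entropy w.r.t. logits
        w -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b, losses


# Synthetic placeholder features: one cluster per hypothetical class.
rng = np.random.default_rng(0)
num_classes, dim, per_class = 3, 32, 40
centers = rng.normal(size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)
features = centers[labels] + 0.3 * rng.normal(size=(labels.size, dim))

w, b, losses = train_linear_head(features, labels, num_classes)
accuracy = float(((features @ w + b).argmax(axis=1) == labels).mean())
print(losses[0], losses[-1], accuracy)
```

Because only the small head is trained, this approach needs far less labeled data and compute than full fine-tuning, which makes it a reasonable first baseline before trying the second and third options.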

11. Simple Summary

The Vision Transformer paper showed that images do not always need to be processed by convolutional networks. An image can be split into patches, converted into a sequence, and processed by a standard Transformer encoder.

The key insight is that self-attention allows the model to learn relationships between image patches. This makes ViT especially powerful when trained on large datasets. However, because ViT has less built-in image-specific bias than CNNs, it usually needs more data or strong pre-training.

For textile and saree image classification, ViT is important because many textile identities are based not only on local motifs but also on the relationship between different parts of the image. This makes Vision Transformers a valuable model family for future research in fine-grained textile classification.

12. General Disclaimer

This article is an educational explanation of the research paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. It is intended for conceptual understanding and academic learning. The explanations simplify some technical details for readability. Readers interested in implementation, exact experimental settings, and complete mathematical details should refer to the original paper.
