🔍 Key Idea:
Instead of using convolutional neural networks (CNNs), the authors apply a standard Transformer architecture to sequences of fixed-size image patches, treating them like word tokens in NLP. This model, Vision Transformer (ViT), processes image data without relying on inductive biases like translation equivariance or locality inherent in CNNs.
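As a concrete sketch of the patch-as-token idea, the snippet below splits a hypothetical 224×224 RGB image into non-overlapping 16×16 patches and flattens each into a vector, mirroring how the paper turns an image into a token sequence:

```python
import numpy as np

# Hypothetical 224x224 RGB input; 16x16 patches as in the paper.
image = np.random.rand(224, 224, 3)
P = 16
H, W, C = image.shape

# Split into non-overlapping PxP patches and flatten each one,
# yielding a sequence of (H/P * W/P) tokens of dimension P*P*C.
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)

print(patches.shape)  # (196, 768): 196 tokens, each 16*16*3 = 768 values
```

For a 224×224 input this gives 14×14 = 196 tokens, which is why ViT's sequence length stays manageable even for full-resolution images.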
🧠 Model Overview:
- Patch Embedding: Images are divided into fixed-size 16×16 patches, each flattened and linearly projected to an embedding vector.
- Positional Embeddings: Learnable position embeddings are added to preserve spatial information.
- Transformer Encoder: A standard Transformer encoder processes the sequence of patch embeddings.
- Classification Token: A learnable [class] token is prepended to the sequence; its final-layer output is used for classification.
- Variants: ViT-Base (12 layers, 86M parameters), ViT-Large (24 layers, 307M), and ViT-Huge (32 layers, 632M).
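The embedding pipeline above can be sketched end-to-end in plain NumPy. The weights here are random stand-ins for learned parameters; D = 768 matches ViT-Base, but the projection, class token, and positional embeddings are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (ViT-Base: D = 768; N = 196 for a 224x224 input).
N, patch_dim, D = 196, 768, 768

# Flattened patches (random placeholders for real image data).
patch_tokens = rng.normal(size=(N, patch_dim))

# Linear projection of flattened patches (random stand-in for a learned weight).
W_embed = rng.normal(size=(patch_dim, D)) * 0.02
embeddings = patch_tokens @ W_embed

# Prepend a learnable [class] token, then add positional embeddings
# (both random here; in the real model they are trained parameters).
cls_token = rng.normal(size=(1, D))
pos_embed = rng.normal(size=(N + 1, D)) * 0.02

x = np.vstack([cls_token, embeddings]) + pos_embed
print(x.shape)  # (197, 768): the sequence fed into the Transformer encoder
```

Everything downstream of this point is a stock Transformer encoder, which is exactly the paper's point: no vision-specific machinery beyond the patch embedding.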
⚙️ Training and Performance:
- Pretraining Data: Large-scale datasets such as ImageNet-21k and JFT-300M are crucial; ViT underperforms CNNs when pretrained on smaller datasets because it lacks their inductive biases.
- Fine-tuning: Pretrained ViT models transfer well to smaller downstream datasets such as CIFAR-100 and Oxford Pets.
- Self-Supervision: Preliminary experiments with masked patch prediction (analogous to BERT's masked language modeling) show promise but still lag behind supervised pretraining.
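A rough sketch of the masked-patch-prediction corruption step, assuming a BERT-style setup where a fraction of patch embeddings is replaced by a shared mask embedding (the 50% ratio follows the paper; the tensors here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768
tokens = rng.normal(size=(N, D))  # placeholder patch embeddings

# Randomly mask half the patch tokens (BERT-style), replacing them
# with a shared [mask] embedding (random here, learnable in practice).
mask_ratio = 0.5
mask = rng.random(N) < mask_ratio
mask_embedding = rng.normal(size=(D,))
corrupted = np.where(mask[:, None], mask_embedding, tokens)

# The model is then trained to predict properties of the masked
# patches (the paper predicts the mean color of each masked patch).
print(corrupted.shape)  # (196, 768)
```

Unmasked tokens pass through unchanged, so the encoder must use surrounding context to reconstruct the masked positions.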
📊 Results:
- State-of-the-art accuracy on many benchmarks (ImageNet, CIFAR-100, VTAB).
- Faster Training: Pretraining requires significantly fewer TPU-days than CNN-based models like ResNet and EfficientNet.
- Scalability: Performance keeps improving with more data and larger models.
🔬 Insights:
- Global Attention Early: Some attention heads integrate information across the whole image even in shallow layers.
- Learned 2D Structure: Positional embeddings implicitly capture the image's 2D topology.
- Less Bias, More Data: ViT's success hinges on large-scale data rather than CNN-like architectural biases.
🎯 Conclusion:
Vision Transformer shows that pure Transformer architectures can outperform CNNs for image classification when trained at scale. It opens a new direction for computer vision, potentially simplifying architectures and unifying vision and NLP models under the Transformer framework.