Thursday, 15 May 2025

The CLIP Paper: "Learning Transferable Visual Models From Natural Language Supervision" by Alec Radford et al. (2021)


🔍 Objective

To create a scalable, general-purpose vision model trained directly from natural language supervision, eliminating the need for task-specific labeled data.


🧠 Key Idea: Contrastive Learning with Image-Text Pairs

  • CLIP (Contrastive Language–Image Pre-training) is trained on 400 million (image, text) pairs.

  • Instead of classifying images into fixed labels, CLIP learns to associate images with their textual descriptions using a contrastive loss.

  • The goal: maximize similarity for correct image-text pairs and minimize it for incorrect ones.
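The symmetric contrastive objective above can be sketched in a few lines of NumPy. This is a simplified illustration in the spirit of the pseudocode in the paper; the batch size and fixed temperature here are placeholders:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matching (image, text) pairs.

    img_emb, txt_emb: (N, d) arrays; row i of each comes from the same pair.
    """
    # L2-normalize so the dot product equals cosine similarity
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (N, N) matrix of scaled pairwise cosine similarities
    logits = img_emb @ txt_emb.T / temperature

    # Correct pairs lie on the diagonal: image i should match text i
    labels = np.arange(logits.shape[0])

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Cross-entropy in both directions (image-to-text and text-to-image)
    loss_i2t = cross_entropy(logits, labels)
    loss_t2i = cross_entropy(logits.T, labels)
    return (loss_i2t + loss_t2i) / 2
```

In the actual model the temperature is not fixed but is a learned parameter (log-parameterized, initialized to the equivalent of 0.07).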


🧰 Architecture

  • Image Encoder: ResNet (with modifications) or Vision Transformer (ViT).

  • Text Encoder: a GPT-style Transformer over byte-pair-encoded (BPE) text; the activation at the [EOS] token is used as the text representation.

  • Both encoders map inputs to a shared multi-modal embedding space.

  • Cosine similarity between image and text embeddings is used for training.
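A minimal sketch of the shared embedding space, assuming hypothetical feature dimensions and randomly initialized projection matrices (in the real model these projections are learned jointly with the encoders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two encoders' raw outputs, e.g. ViT features (dim 768)
# and Transformer [EOS] features (dim 512); dimensions are illustrative.
image_features = rng.normal(size=(2, 768))
text_features = rng.normal(size=(2, 512))

# Linear projections map both modalities into one shared space (dim 256 here)
W_image = rng.normal(size=(768, 256)) * 0.02
W_text = rng.normal(size=(512, 256)) * 0.02

image_emb = image_features @ W_image
text_emb = text_features @ W_text

# L2-normalize so dot products are cosine similarities in [-1, 1]
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

similarity = image_emb @ text_emb.T  # (2, 2) cosine-similarity matrix
```

The key design choice is that both modalities end up as unit vectors in the same space, so a single matrix multiply compares every image against every caption.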


๐Ÿ‹️ Training and Dataset

  • Dataset: 400 million (image, text) pairs scraped from the web (called WIT – WebImageText).

  • The model learns from this raw, weakly labeled internet data rather than from manually annotated datasets.

  • Trained using a contrastive loss over all image-text combinations in a batch.

  • Scales well: the largest ViT model trained for 12 days on 256 V100 GPUs.


📈 Performance Highlights

  • Zero-shot capabilities: Once trained, CLIP can classify new images using text prompts without additional training.

  • On ImageNet, zero-shot CLIP matches the accuracy of the original ResNet-50 (76.2%) without using any of ImageNet's labeled training examples.

  • It performs well across 30+ diverse datasets: action recognition, OCR, fine-grained classification, etc.

  • CLIP also shows high robustness to distribution shifts and competitive few-shot performance.
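Zero-shot classification reduces to a nearest-neighbor lookup in the shared embedding space: embed one text prompt per class and pick the class whose embedding is most similar to the image's. A toy sketch, with made-up unit vectors standing in for real encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose text embedding is most similar to the image embedding.

    image_emb: (d,) embedding of one image; class_text_embs: (C, d), one row per
    prompt such as "a photo of a dog". All embeddings assumed L2-normalized.
    """
    scores = class_text_embs @ image_emb  # cosine similarities
    return class_names[int(np.argmax(scores))]

# Toy embeddings (an assumption for illustration, not real CLIP features)
dog = np.array([1.0, 0.0, 0.0])
cat = np.array([0.0, 1.0, 0.0])
image = np.array([0.9, 0.1, 0.0])
image /= np.linalg.norm(image)

pred = zero_shot_classify(image, np.stack([dog, cat]), ["dog", "cat"])  # -> "dog"
```

Because the classifier is just a set of text embeddings, swapping in a new label set requires no retraining at all.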


🧪 Comparison to Baselines

  • Greatly outperforms Visual N-Grams on zero-shot ImageNet; the approach builds on and scales up weakly supervised contrastive methods like ConVIRT.

  • Matches or exceeds the performance of few-shot classifiers with just zero-shot prompts.

  • The ViT variants of CLIP are about 3x more compute-efficient than the ResNet variants.


🧭 Limitations

  • Doesn’t surpass state-of-the-art supervised models on all benchmarks.

  • Struggles on tasks like medical imaging, satellite image classification, and object counting.

  • Performance heavily depends on prompt engineering.

  • Bias and fairness concerns: the model is susceptible to egregious misclassifications when class labels are poorly designed.
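The prompt-engineering dependence is typically mitigated by prompt ensembling: embed each class with several templates and average the results. A sketch, with a hypothetical `encode_text` stand-in for the real text encoder (the paper ensembles 80 templates for ImageNet):

```python
import numpy as np

# Illustrative templates; the paper uses many more, tuned per task
TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

def ensemble_class_embedding(class_name, encode_text):
    """Average a text encoder's embeddings over several prompt templates,
    then re-normalize. `encode_text` stands in for CLIP's text encoder:
    it maps a string to a (d,) feature vector.
    """
    embs = np.stack([encode_text(t.format(class_name)) for t in TEMPLATES])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Toy deterministic "encoder" so the sketch runs without the real model
def fake_encode(text):
    return np.random.default_rng(sum(map(ord, text))).normal(size=16)

class_emb = ensemble_class_embedding("dog", fake_encode)
```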


🌍 Broader Impact

  • Democratizes model usage—anyone can define a classifier in natural language.

  • Raises ethical concerns around bias, surveillance, and misuse due to ease of adaptation.


📎 Conclusion

CLIP is a powerful step toward general-purpose vision models that require no task-specific training. Its effectiveness across tasks shows that natural language can be used as a flexible and scalable supervision source for vision models.
