Objective
To create a scalable, general-purpose vision model trained directly from natural language supervision, eliminating the need for task-specific labeled data.
Key Idea: Contrastive Learning with Image-Text Pairs
- CLIP (Contrastive Language–Image Pre-training) is trained on 400 million (image, text) pairs.
- Instead of classifying images into a fixed set of labels, CLIP learns to associate images with their textual descriptions using a contrastive loss.
- The goal: maximize the similarity of correct image-text pairs and minimize it for incorrect ones.
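The objective above can be written as a symmetric cross-entropy over a batch's cosine-similarity matrix. A minimal NumPy sketch (the temperature value is illustrative; in CLIP it is a learned parameter):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch's cosine-similarity matrix.

    image_emb, text_emb: (N, D) arrays where row i of each forms a matched pair.
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # (N, N): all pair similarities
    n = logits.shape[0]

    def cross_entropy(l):
        # log-softmax over each row; the correct pair sits on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average the image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Matched pairs (high diagonal similarity) yield a low loss; shuffled pairs yield a high one, which is exactly the signal the model trains on.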
Architecture
- Image Encoder: a ResNet (with modifications) or a Vision Transformer (ViT).
- Text Encoder: a Transformer operating on byte-pair-encoded (BPE) text, with the activation at the [EOS] token used as the text representation.
- Both encoders map their inputs into a shared multi-modal embedding space.
- Cosine similarity between image and text embeddings provides the training signal.
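A minimal sketch of the shared embedding space: each encoder's output is linearly projected into a common dimensionality and L2-normalized, so a dot product is a cosine similarity. The widths and the random projection matrices here are illustrative stand-ins for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_shared = 2048, 512, 256   # encoder widths and joint-space size (illustrative)

# Hypothetical learned projection matrices; random here for illustration
W_img = rng.normal(size=(d_img, d_shared))
W_txt = rng.normal(size=(d_txt, d_shared))

def to_shared_space(features, W):
    """Linearly project encoder features into the joint space, then L2-normalize."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

img_feat = rng.normal(size=(1, d_img))    # stand-in for the image encoder's output
txt_feat = rng.normal(size=(1, d_txt))    # stand-in for the text encoder's [EOS] state

z_img = to_shared_space(img_feat, W_img)
z_txt = to_shared_space(txt_feat, W_txt)
cosine_sim = float(z_img @ z_txt.T)       # a single cosine similarity in [-1, 1]
```

Because both embeddings are unit-normalized, comparing any image with any text reduces to one dot product, which is what makes the batched similarity matrix cheap to compute.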
Training and Dataset
- Dataset: 400 million (image, text) pairs scraped from the web, called WIT (WebImageText).
- The model learns from this raw, weakly supervised internet data rather than hand-curated labels.
- Trained with a contrastive loss computed over all image-text combinations in a batch.
- Scales well: the largest ViT model was trained on 256 GPUs for 12 days.
Performance Highlights
- Zero-shot capabilities: once trained, CLIP can classify new images from text prompts without any additional training.
- On ImageNet, zero-shot CLIP matches the accuracy of the original ResNet-50 (76.2%) without seeing any of the 1.28 million labeled training examples.
- It performs well across 30+ diverse datasets: action recognition, OCR, fine-grained classification, and more.
- CLIP is also markedly more robust to distribution shifts than standard ImageNet models, and is competitive in few-shot settings.
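Mechanically, zero-shot classification embeds one prompt per class (e.g. "a photo of a {label}") and picks the class whose prompt embedding is most similar to the image embedding. A sketch with a stand-in encoder (random unit vectors replace the real text encoder; class names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in classes]   # the paper's basic prompt template

D = 64
# Stand-in for the real text encoder: one random unit vector per class prompt
text_emb = rng.normal(size=(len(prompts), D))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

def zero_shot_classify(image_emb):
    """Pick the class whose prompt embedding is most cosine-similar to the image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    sims = text_emb @ image_emb          # cosine similarity to every class prompt
    return classes[int(np.argmax(sims))]

# A toy image embedding constructed to lie near the "dog" prompt
image_emb = text_emb[1] + 0.1 * rng.normal(size=D)
predicted = zero_shot_classify(image_emb)
```

Changing the classifier is just a matter of changing the list of class names, with no retraining involved.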
Comparison to Baselines
- Dramatically outperforms Visual N-Grams, the prior zero-shot transfer baseline; the training approach itself is a simplified, scaled-up version of ConVIRT.
- Matches or exceeds few-shot linear classifiers using zero-shot prompts alone.
- ViT variants of CLIP are roughly 3x more compute efficient than the ResNet variants.
Limitations
- Does not surpass state-of-the-art supervised models on all benchmarks.
- Struggles on tasks such as medical imaging, satellite image classification, and object counting.
- Performance depends heavily on prompt engineering.
- Bias and fairness concerns: the model is susceptible to egregious misclassification when class labels are poorly designed.
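One mitigation for the prompt sensitivity noted above is the paper's prompt ensembling: encode several templates per class, average the embeddings, and renormalize. A sketch with a hypothetical stand-in encoder (the templates echo the paper's style; the byte-sum "hash" is just a deterministic toy):

```python
import numpy as np

# A few prompt templates in the spirit of the paper's ensembling trick
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

def fake_text_encoder(text, D=64):
    """Stand-in for CLIP's text encoder: a deterministic pseudo-random unit vector."""
    seed = sum(text.encode("utf-8"))              # crude, deterministic text hash
    v = np.random.default_rng(seed).normal(size=D)
    return v / np.linalg.norm(v)

def ensembled_class_embedding(label):
    """Encode every template for a class, average, and renormalize."""
    embs = np.stack([fake_text_encoder(t.format(label)) for t in templates])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)
```

In the paper, ensembling many such templates improves ImageNet zero-shot accuracy over a single prompt, at no extra inference cost once the class embeddings are cached.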
Broader Impact
- Democratizes model usage: anyone can define a classifier in natural language.
- Raises ethical concerns around bias, surveillance, and misuse, since the model is so easy to adapt.
Conclusion
CLIP is a powerful step toward general-purpose vision models that require no task-specific training. Its effectiveness across tasks shows that natural language can be used as a flexible and scalable supervision source for vision models.