These are NLP-vision crossover models.
1. CLIP (Contrastive Language-Image Pretraining)
By OpenAI (2021)
- Goal: Connect images and natural language.
- How: Trains an image encoder and a text encoder together using contrastive learning, so that matching image-text pairs are close in embedding space.
- Input: An image and a caption.
- Output: Embeddings that can be compared via cosine similarity.
- Use: Zero-shot image classification, image search with text, cross-modal retrieval.
- Impact: Enables tasks like “find all images that look like a red saree with gold border” without fine-tuning.
✅ CLIP — Multimodal Vision-Language Model
- Type: Vision-language model using contrastive learning
- Architecture:
  - Image Encoder: a ResNet (CNN) or Vision Transformer (ViT)
  - Text Encoder: usually a Transformer (like a simplified GPT)
- Not just a vision model: it connects vision and language
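The zero-shot classification idea above can be sketched with plain numpy: embed images and candidate captions, compare them with cosine similarity, and pick the best-matching caption per image. The tiny 3-d vectors below are toy stand-ins for CLIP's real encoder outputs (which are typically 512-d or larger), chosen only to make the matching visible.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-normalize both sets of embeddings, then take all pairwise
    # dot products: entry [i, j] is cos(a_i, b_j).
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy stand-ins for the image and text encoder outputs.
image_emb = np.array([[0.9, 0.1, 0.0],   # embedding of a dog photo
                      [0.1, 0.8, 0.2]])  # embedding of a cat photo
text_emb = np.array([[1.0, 0.0, 0.0],    # "a photo of a dog"
                     [0.0, 1.0, 0.1]])   # "a photo of a cat"
labels = ["a photo of a dog", "a photo of a cat"]

sims = cosine_sim(image_emb, text_emb)
for i, row in enumerate(sims):
    # Zero-shot classification: each image picks its closest caption.
    print(f"image {i} -> {labels[int(np.argmax(row))]}")
```

Contrastive training pushes matching pairs toward high values in this similarity matrix and mismatched pairs toward low ones, which is why the same matrix also powers text-to-image search and retrieval.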
2. DINO (Self-Distillation with No Labels)
By Facebook AI (2021)
- Goal: Learn useful visual features without labels (self-supervised learning).
- How: Uses a student-teacher setup:
  - Both networks see differently augmented views of the same image.
  - The student learns to match the teacher’s output.
  - No labels are needed, just image augmentations.
- Backbone: Often used with Vision Transformers (ViT).
- Use: Can initialize models for classification, detection, segmentation, etc., with fewer labels.
- Impact: Produces high-quality visual representations useful in downstream tasks.
✅ DINO — Self-Supervised Vision Model
- Type: Self-supervised representation learning
- Architecture: Initially used ResNet (CNN), later popularized with ViT (Vision Transformer)
- Not tied to classification directly; trains models to learn rich visual features without labels
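The student-teacher loop can be sketched in miniature. This is a deliberately toy version: each "network" is a single weight matrix, the two augmented views are a vector and a noisy copy of it, and the teacher is updated as an exponential moving average (EMA) of the student while the student minimizes cross-entropy against the teacher's (sharper, lower-temperature) softmax output. Real DINO adds multi-crop augmentation, centering, and ViT backbones; none of that is shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, temp):
    z = x / temp
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

dim, K = 8, 4
student_W = rng.normal(size=(dim, K))
teacher_W = student_W.copy()            # teacher starts as a copy

view_a = rng.normal(size=dim)           # two augmented views of
view_b = view_a + 0.1 * rng.normal(size=dim)  # the "same image"

lr, momentum = 0.1, 0.99
for step in range(200):
    # Teacher sees view A; its low temperature sharpens the target.
    p_teacher = softmax(view_a @ teacher_W, temp=0.07)
    # Student sees view B and is trained to match the teacher.
    p_student = softmax(view_b @ student_W, temp=0.1)
    # Cross-entropy gradient w.r.t. student logits is (p_s - p_t)
    # (temperature factor folded into the learning rate here).
    student_W -= lr * np.outer(view_b, p_student - p_teacher)
    # Teacher gets no gradients: it is an EMA of the student.
    teacher_W = momentum * teacher_W + (1 - momentum) * student_W

loss = -(p_teacher * np.log(p_student + 1e-9)).sum()
print(f"final cross-entropy: {loss:.3f}")
```

The key property, visible even in this toy: no labels ever appear. The supervisory signal comes entirely from forcing two augmented views of the same input to produce matching outputs.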
3. SAM (Segment Anything Model)
By Meta AI (2023)
- Goal: A general-purpose model for object segmentation.
- How: Given a prompt (e.g., a point, box, or mask), SAM returns a segmentation mask.
- Trained on: SA-1B, a massive dataset of over 1 billion masks.
- Key feature: Promptable segmentation: give it a point or box and get the corresponding object’s mask.
- Use: Quickly segment any object in any image, even unseen ones.
- Impact: One of the first models to enable zero-shot, prompt-based segmentation at scale.
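The promptable interface (point in, mask out) can be illustrated with a toy stand-in. The function below flood-fills the region around a clicked point on a synthetic image; this is emphatically not how SAM works internally (SAM uses a learned ViT image encoder plus a prompt encoder and mask decoder), but it shows the same contract: the prompt selects which object's mask you get back.

```python
from collections import deque
import numpy as np

def segment_from_point(image, seed, tol=0):
    """Toy point-prompted segmentation: flood-fill the connected region
    whose pixel values are within `tol` of the seed pixel. Illustrates
    only the prompt -> mask interface, not SAM's architecture."""
    h, w = image.shape
    sr, sc = seed
    target = int(image[sr, sc])
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([(sr, sc)])
    while queue:
        r, c = queue.popleft()
        if not (0 <= r < h and 0 <= c < w) or mask[r, c]:
            continue
        if abs(int(image[r, c]) - target) > tol:
            continue
        mask[r, c] = True
        # Expand to the 4-connected neighbors.
        queue.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return mask

# Synthetic image: a bright 4x4 square "object" on a dark background.
img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 255

mask = segment_from_point(img, seed=(3, 3))  # point prompt inside the object
print(mask.sum())  # 16 pixels: exactly the 4x4 square
```

Clicking a different point (say, on the background) would return the background's mask instead, which is the essence of promptable segmentation: one model, many masks, selected by the prompt.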