Sunday, 11 May 2025

NLP Vision Crossover Models: CLIP, DINO and SAM

These are models that bridge natural language processing (NLP) and computer vision.

1. CLIP (Contrastive Language-Image Pretraining)

By OpenAI (2021)

  • Goal: Connect images and natural language.

  • How: Trains an image encoder and a text encoder together using contrastive learning, so that matching image-text pairs are close in embedding space.

  • Input: An image and a caption.

  • Output: Embeddings that can be compared via cosine similarity.

  • Use: Zero-shot image classification, image search with text, cross-modal retrieval.

  • Impact: Enables tasks like “find all images that look like a red saree with gold border” without fine-tuning.
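The cosine-similarity comparison above can be sketched with plain numpy. The random vectors here are stand-ins for real encoder outputs (actual CLIP embeddings come from its trained image and text encoders); the point is only how zero-shot classification reduces to picking the caption whose embedding is closest to the image's.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs: in real CLIP these would be produced
# by the image encoder and the text encoder (e.g. 512-d vectors).
image_emb = rng.normal(size=512)
text_embs = {
    "a red saree with gold border": image_emb + 0.1 * rng.normal(size=512),  # near-match
    "a photo of a dog": rng.normal(size=512),
    "a city skyline at night": rng.normal(size=512),
}

# Zero-shot classification: pick the caption whose embedding is closest.
scores = {caption: cosine_similarity(image_emb, emb)
          for caption, emb in text_embs.items()}
best = max(scores, key=scores.get)
print(best)
```

No fine-tuning happens here: classification is just a nearest-neighbour lookup in the shared embedding space, which is why CLIP can handle captions it never saw as labels.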


CLIP: Multimodal Vision-Language Model

  • Type: Vision-language model using contrastive learning

  • Architecture:

    • Image Encoder: Can be a ResNet (CNN) or Vision Transformer (ViT)

    • Text Encoder: Usually a Transformer (like a simplified GPT)

  • Not just a vision model—it connects vision and language
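The contrastive objective that ties the two encoders together can be sketched as a symmetric cross-entropy over a batch similarity matrix: row i of each matrix is a matching image-text pair, and every other row acts as a negative. This mirrors the training idea described above, not OpenAI's exact implementation; the temperature value and batch are illustrative.

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching pairs."""
    # L2-normalize so dot products are cosine similarities.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    logits = image_embs @ text_embs.T / temperature  # (N, N) similarity matrix

    def cross_entropy(logits):
        # The correct "class" for row i is column i (its matching pair).
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(1)
pairs = rng.normal(size=(4, 64))
loss_matched = clip_contrastive_loss(pairs, pairs + 0.01 * rng.normal(size=(4, 64)))
loss_random = clip_contrastive_loss(pairs, rng.normal(size=(4, 64)))
print(loss_matched, loss_random)
```

When the paired embeddings really do line up, the diagonal of the similarity matrix dominates and the loss is near zero; mismatched embeddings give a much higher loss, which is the pressure that pulls matching pairs together in the shared space.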


2. DINO (Self-Distillation with No Labels)

By Facebook AI (2021)

  • Goal: Learn useful visual features without labels (self-supervised learning).

  • How: Uses a student-teacher model:

    • Both see differently augmented views of the same image.

    • The student learns to match the teacher’s output.

    • No labels are needed—just image augmentations.

  • Backbone: Often used with Vision Transformers (ViT).

  • Use: Can be used to initialize models for classification, detection, segmentation, etc., with fewer labels.

  • Impact: High-quality visual representations useful in downstream tasks.
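The student-teacher loop above can be sketched with toy linear "networks": the student is trained to match the teacher's softmax output on a differently augmented view, while the teacher is updated only as an exponential moving average (EMA) of the student. The temperatures, learning rate, and noise-as-augmentation are illustrative stand-ins, not DINO's actual values or architecture.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, temperature):
    x = x / temperature
    x = x - x.max()  # numerical stability
    e = np.exp(x)
    return e / e.sum()

# Toy linear "networks"; teacher weights start as a copy of the student's.
dim_in, dim_out = 8, 4
student_w = rng.normal(size=(dim_in, dim_out))
teacher_w = student_w.copy()

image = rng.normal(size=dim_in)

for step in range(50):
    # Two differently augmented views of the same image (noise as a stand-in).
    view_student = image + 0.1 * rng.normal(size=dim_in)
    view_teacher = image + 0.1 * rng.normal(size=dim_in)

    # The teacher uses a sharper (lower) temperature, as in DINO.
    p_teacher = softmax(view_teacher @ teacher_w, temperature=0.04)
    p_student = softmax(view_student @ student_w, temperature=0.1)

    # Cross-entropy H(teacher, student); for softmax + linear layer the
    # gradient w.r.t. the student's logits is (p_student - p_teacher).
    grad_logits = p_student - p_teacher
    student_w -= 0.1 * np.outer(view_student, grad_logits)

    # The teacher is never trained directly -- only EMA, no labels anywhere.
    teacher_w = 0.99 * teacher_w + 0.01 * student_w

print(p_student)
```

Note that no label enters the loop at any point: the only supervision signal is agreement between two views of the same image, which is what makes the learned features transferable to downstream tasks.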


DINO: Self-Supervised Vision Model

  • Type: Self-supervised representation learning

  • Architecture:

    • Initially used ResNet (CNN), later popularized with ViT (Vision Transformer)

  • Not tied to classification directly, but trains models to learn rich visual features without labels


3. SAM (Segment Anything Model)

By Meta AI (2023)

  • Goal: General-purpose model for object segmentation.

  • How: Given a prompt (e.g., a point, box, or mask), SAM returns a segmentation mask.

  • Trained On: SA-1B, a massive dataset of over 1 billion masks across 11 million images.

  • Key Feature: Promptable segmentation—you can give it a point or box and get the corresponding object’s mask.

  • Use: Quickly segment any object in any image, even unseen ones.

  • Impact: One of the first models to enable zero-shot, prompt-based segmentation at scale.


SAM: Promptable Segmentation Model

  • Type: Vision model focused on segmentation

  • Architecture:

    • Image Encoder: Vision Transformer (ViT)

    • Prompt Encoder: Encodes input prompts like points, boxes

    • Mask Decoder: Generates segmentation masks

  • Heavily vision-specific, but flexible and prompt-driven
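The three-part data flow above can be sketched with toy stand-ins: real SAM uses a ViT image encoder, learned prompt embeddings, and a transformer mask decoder, but here each component is a tiny hand-written function, just to show how an image plus a point prompt turns into a binary mask.

```python
import numpy as np

rng = np.random.default_rng(3)

H, W, D = 16, 16, 32  # tiny "image" grid and embedding dimension

def image_encoder(image):
    """Map an (H, W) image to per-location embeddings (H, W, D).
    Stand-in for SAM's ViT image encoder."""
    proj = rng.normal(size=(1, D))
    return image[..., None] @ proj

def prompt_encoder(point):
    """Encode a click at (row, col) as a D-dim prompt embedding.
    Stand-in for SAM's learned prompt encoder."""
    r, c = point
    emb = np.zeros(D)
    emb[0], emb[1] = r / H, c / W  # toy positional encoding
    return emb

def mask_decoder(image_embs, prompt_emb):
    """Score every location against the prompt, threshold into a mask.
    Stand-in for SAM's transformer mask decoder."""
    scores = image_embs @ prompt_emb  # (H, W) similarity map
    return scores > scores.mean()    # binary segmentation mask

image = rng.normal(size=(H, W))
mask = mask_decoder(image_encoder(image), prompt_encoder((4, 7)))
print(mask.shape, mask.dtype)
```

The key design point survives even in this sketch: the expensive image encoding is computed once, and different prompts (points, boxes) can then be decoded into different masks cheaply, which is what makes interactive, prompt-driven segmentation practical.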



So in short:

  • CLIP is multimodal, can use CNNs or Transformers.

  • DINO is a vision model for learning features, typically uses ViT.

  • SAM is a segmentation model using a ViT backbone and prompt-based architecture.


