Sunday, 11 May 2025

NLP Vision Crossover Models: CLIP, DINO and SAM

These are models that bridge natural language processing (NLP) and computer vision.

1. CLIP (Contrastive Language-Image Pretraining)

By OpenAI (2021)

  • Goal: Connect images and natural language.

  • How: Trains an image encoder and a text encoder together using contrastive learning, so that matching image-text pairs are close in embedding space.

  • Input: An image and a caption.

  • Output: Embeddings that can be compared via cosine similarity.

  • Use: Zero-shot image classification, image search with text, cross-modal retrieval.

  • Impact: Enables tasks like “find all images that look like a red saree with gold border” without fine-tuning.
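The cosine-similarity comparison above can be sketched with plain numpy. The random vectors here are stand-ins for real encoder outputs (actual CLIP embeddings come from its trained image and text encoders); the point is only how zero-shot classification reduces to picking the caption whose embedding is closest to the image's.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs: in real CLIP these would be produced
# by the image encoder and the text encoder (e.g. 512-d vectors).
image_emb = rng.normal(size=512)
text_embs = {
    "a red saree with gold border": image_emb + 0.1 * rng.normal(size=512),  # near-match
    "a photo of a dog": rng.normal(size=512),
    "a city skyline at night": rng.normal(size=512),
}

# Zero-shot classification: pick the caption whose embedding is closest.
scores = {caption: cosine_similarity(image_emb, emb)
          for caption, emb in text_embs.items()}
best = max(scores, key=scores.get)
print(best)
```

No fine-tuning happens here: classification is just a nearest-neighbour lookup in the shared embedding space, which is why CLIP can handle captions it never saw as labels.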


CLIP: Multimodal Vision-Language Model

  • Type: Vision-language model using contrastive learning

  • Architecture:

    • Image Encoder: Can be a ResNet (CNN) or Vision Transformer (ViT)

    • Text Encoder: Usually a Transformer (like a simplified GPT)

  • Not just a vision model—it connects vision and language
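The contrastive objective that ties the two encoders together can be sketched as a symmetric cross-entropy over a batch similarity matrix: row i of each matrix is a matching image-text pair, and every other row acts as a negative. This mirrors the training idea described above, not OpenAI's exact implementation; the temperature value and batch are illustrative.

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching pairs."""
    # L2-normalize so dot products are cosine similarities.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    logits = image_embs @ text_embs.T / temperature  # (N, N) similarity matrix

    def cross_entropy(logits):
        # The correct "class" for row i is column i (its matching pair).
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(1)
pairs = rng.normal(size=(4, 64))
loss_matched = clip_contrastive_loss(pairs, pairs + 0.01 * rng.normal(size=(4, 64)))
loss_random = clip_contrastive_loss(pairs, rng.normal(size=(4, 64)))
print(loss_matched, loss_random)
```

When the paired embeddings really do line up, the diagonal of the similarity matrix dominates and the loss is near zero; mismatched embeddings give a much higher loss, which is the pressure that pulls matching pairs together in the shared space.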


2. DINO (Self-Distillation with No Labels)

By Facebook AI (2021)

  • Goal: Learn useful visual features without labels (self-supervised learning).

  • How: Uses a student-teacher model:

    • Both see differently augmented views of the same image.

    • The student learns to match the teacher’s output.

    • No labels are needed—just image augmentations.

  • Backbone: Often used with Vision Transformers (ViT).

  • Use: Can be used to initialize models for classification, detection, segmentation, etc., with fewer labels.

  • Impact: High-quality visual representations useful in downstream tasks.
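The student-teacher loop above can be sketched with toy linear "networks": the student is trained to match the teacher's softmax output on a differently augmented view, while the teacher is updated only as an exponential moving average (EMA) of the student. The temperatures, learning rate, and noise-as-augmentation are illustrative stand-ins, not DINO's actual values or architecture.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, temperature):
    x = x / temperature
    x = x - x.max()  # numerical stability
    e = np.exp(x)
    return e / e.sum()

# Toy linear "networks"; teacher weights start as a copy of the student's.
dim_in, dim_out = 8, 4
student_w = rng.normal(size=(dim_in, dim_out))
teacher_w = student_w.copy()

image = rng.normal(size=dim_in)

for step in range(50):
    # Two differently augmented views of the same image (noise as a stand-in).
    view_student = image + 0.1 * rng.normal(size=dim_in)
    view_teacher = image + 0.1 * rng.normal(size=dim_in)

    # The teacher uses a sharper (lower) temperature, as in DINO.
    p_teacher = softmax(view_teacher @ teacher_w, temperature=0.04)
    p_student = softmax(view_student @ student_w, temperature=0.1)

    # Cross-entropy H(teacher, student); for softmax + linear layer the
    # gradient w.r.t. the student's logits is (p_student - p_teacher).
    grad_logits = p_student - p_teacher
    student_w -= 0.1 * np.outer(view_student, grad_logits)

    # The teacher is never trained directly -- only EMA, no labels anywhere.
    teacher_w = 0.99 * teacher_w + 0.01 * student_w

print(p_student)
```

Note that no label enters the loop at any point: the only supervision signal is agreement between two views of the same image, which is what makes the learned features transferable to downstream tasks.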


DINO: Self-Supervised Vision Model

  • Type: Self-supervised representation learning

  • Architecture:

    • Initially used ResNet (CNN), later popularized with ViT (Vision Transformer)

  • Not tied to classification directly, but trains models to learn rich visual features without labels


3. SAM (Segment Anything Model)

By Meta AI (2023)

  • Goal: General-purpose model for object segmentation.

  • How: Given a prompt (e.g., a point, box, or mask), SAM returns a segmentation mask.

  • Trained On: SA-1B, a massive dataset of over 1 billion masks across 11 million images.

  • Key Feature: Promptable segmentation—you can give it a point or box and get the corresponding object’s mask.

  • Use: Quickly segment any object in any image, even unseen ones.

  • Impact: One of the first models to enable zero-shot, prompt-based segmentation at scale.


SAM: Promptable Segmentation Model

  • Type: Vision model focused on segmentation

  • Architecture:

    • Image Encoder: Vision Transformer (ViT)

    • Prompt Encoder: Encodes input prompts like points, boxes

    • Mask Decoder: Generates segmentation masks

  • Heavily vision-specific, but flexible and prompt-driven
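The three-part data flow above can be sketched with toy stand-ins: real SAM uses a ViT image encoder, learned prompt embeddings, and a transformer mask decoder, but here each component is a tiny hand-written function, just to show how an image plus a point prompt turns into a binary mask.

```python
import numpy as np

rng = np.random.default_rng(3)

H, W, D = 16, 16, 32  # tiny "image" grid and embedding dimension

def image_encoder(image):
    """Map an (H, W) image to per-location embeddings (H, W, D).
    Stand-in for SAM's ViT image encoder."""
    proj = rng.normal(size=(1, D))
    return image[..., None] @ proj

def prompt_encoder(point):
    """Encode a click at (row, col) as a D-dim prompt embedding.
    Stand-in for SAM's learned prompt encoder."""
    r, c = point
    emb = np.zeros(D)
    emb[0], emb[1] = r / H, c / W  # toy positional encoding
    return emb

def mask_decoder(image_embs, prompt_emb):
    """Score every location against the prompt, threshold into a mask.
    Stand-in for SAM's transformer mask decoder."""
    scores = image_embs @ prompt_emb  # (H, W) similarity map
    return scores > scores.mean()    # binary segmentation mask

image = rng.normal(size=(H, W))
mask = mask_decoder(image_encoder(image), prompt_encoder((4, 7)))
print(mask.shape, mask.dtype)
```

The key design point survives even in this sketch: the expensive image encoding is computed once, and different prompts (points, boxes) can then be decoded into different masks cheaply, which is what makes interactive, prompt-driven segmentation practical.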



So in short:

  • CLIP is multimodal, can use CNNs or Transformers.

  • DINO is a vision model for learning features, typically uses ViT.

  • SAM is a segmentation model using a ViT backbone and prompt-based architecture.


