Monday, 21 April 2025

R-CNN, Fast R-CNN and Faster R-CNN

 

🧠 What is R-CNN?

R-CNN (proposed by Ross Girshick et al. in 2014) was a landmark model that combined:

  • Region proposal methods (to suggest "where" objects might be),

  • With Convolutional Neural Networks (CNNs) (to learn "what" those regions contain),

  • To perform object detection — classifying and localizing objects in images.

🔁 Before R-CNN, traditional detection methods (like DPMs) used hand-crafted features (e.g., HOG). R-CNN was one of the first to apply deep learning to detection.


🧱 Architecture Stack of R-CNN

Here’s how R-CNN’s architecture is structured step-by-step:

🧩 1. Input Image → Region Proposals

  • R-CNN starts with the input image.

  • Uses a classical algorithm (usually Selective Search) to generate about 2000 region proposals (bounding boxes likely to contain objects).

📌 These proposals are class-agnostic — they just suggest “something might be here.”


🧩 2. Warping Region Proposals

  • Each region proposal is cropped and resized (typically to 227×227 pixels).

  • This resizing ensures compatibility with the input size required by the CNN.
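A minimal sketch of the warping step, using PyTorch's bilinear resize as a stand-in (the original implementation warps the image pixels directly, but the effect is the same fixed-size input):

```python
import torch
import torch.nn.functional as F

# Warp a variable-sized crop to the fixed 227x227 CNN input (AlexNet-era size).
crop = torch.rand(1, 3, 150, 90)   # one region proposal, arbitrary H x W
warped = F.interpolate(crop, size=(227, 227), mode="bilinear",
                       align_corners=False)
print(warped.shape)                # torch.Size([1, 3, 227, 227])
```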


🧠 3. CNN Feature Extraction

  • Each warped region is passed through a pre-trained CNN (like AlexNet or VGG).

  • The CNN outputs a feature vector (often from the fc7 layer, a fully connected layer).

📌 This stage is the deep learning heart of R-CNN.


📈 4. Region Classification (SVM)

  • For each class, a separate linear SVM is trained using the extracted features.

  • Each region is scored by these SVMs to determine which object class (if any) it contains.

📌 CNN acts as a feature extractor, while SVMs do the classification.
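The per-class SVM stage can be sketched with scikit-learn on synthetic features (the "cat" class and the random data are placeholders, not the paper's setup):

```python
import numpy as np
from sklearn.svm import LinearSVC

# One binary linear SVM per class, trained on CNN feature vectors.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 4096))      # features for 100 region proposals
labels = rng.integers(0, 2, size=100)     # 1 = "cat" region, 0 = background
svm_cat = LinearSVC(C=0.001).fit(feats, labels)   # small C, as in the paper
scores = svm_cat.decision_function(feats)         # per-region "cat" confidence
print(scores.shape)  # (100,)
```

At test time, every proposal is scored by every class's SVM, and high-scoring overlapping boxes are merged by non-maximum suppression.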


📐 5. Bounding Box Regression

  • To refine the position of bounding boxes, a bounding box regressor is trained.

  • It predicts adjustments to make the detected box align better with the ground truth.
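The regressor is trained on the standard R-CNN targets: shifts of the box center scaled by the proposal size, plus log-ratios of width and height. A small sketch:

```python
import numpy as np

def bbox_targets(proposal, gt):
    """R-CNN regression targets mapping a proposal box to its ground truth."""
    px, py, pw, ph = proposal   # center x, center y, width, height
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw,     # t_x: center shift, scaled by width
                     (gy - py) / ph,     # t_y: center shift, scaled by height
                     np.log(gw / pw),    # t_w: log width ratio
                     np.log(gh / ph)])   # t_h: log height ratio

t = bbox_targets((50, 50, 40, 40), (54, 48, 44, 36))
print(t)  # small offsets the regressor learns to predict
```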


🔄 R-CNN Pipeline Summary

[Input Image] → [Selective Search] → [~2000 Proposals] → [Warp each to 227×227] → [CNN (AlexNet/VGG)] → [Feature Vector] → [SVM Classifier] + [Bounding Box Regressor]

📉 Limitations of R-CNN

While it was a breakthrough, R-CNN had limitations:

  1. Very slow: Each region proposal is passed through the CNN separately → expensive!

  2. Multi-stage pipeline: CNN + SVM + regressor are trained separately.

  3. Heavy storage: Extracted features need to be stored on disk for training SVMs.


🧬 Evolution After R-CNN

To overcome these problems, later versions were developed:

  • Fast R-CNN: Uses a single CNN pass for the entire image.

  • Faster R-CNN: Adds a Region Proposal Network (RPN) to make the entire process end-to-end.

  • Mask R-CNN: Adds segmentation capabilities on top of Faster R-CNN.


⚡️ What is Fast R-CNN?

Fast R-CNN (Girshick, 2015) is a refined and faster version of R-CNN. It avoids redundant CNN computations for each region proposal by:

✅ Running the CNN once per image,
✅ Using Region of Interest (RoI) pooling to extract features for each proposal, and
✅ Training everything (classification + bounding box regression) in a single end-to-end network.


🧱 Architecture Stack of Fast R-CNN

Let’s break down the architecture and flow:


1️⃣ Input Image → CNN Feature Map

  • The entire image is passed once through a deep CNN (like VGG16 or ResNet).

  • This results in a convolutional feature map representing the entire image.

📌 Unlike R-CNN, Fast R-CNN avoids repeating this step for every proposal.


2️⃣ Region Proposals (RoIs)

  • Region proposals (still from Selective Search in this version) are generated.

  • Each proposal is a bounding box over the image.


3️⃣ RoI Pooling

  • Each region proposal is mapped onto the feature map.

  • A RoI Pooling layer extracts a fixed-size feature vector (e.g., 7×7) from each proposal's region on the feature map.

📌 This makes it possible to feed different-sized regions into fully connected layers.


4️⃣ Fully Connected Layers (FC layers)

  • The pooled feature vectors from RoI pooling are fed into shared FC layers.

  • This generates high-level features for each RoI.


5️⃣ Two Output Heads (for each RoI)

Each region now branches into two outputs:

✅ a. Softmax Classifier

  • Predicts the class label for each region proposal.

  • Includes a “background” class for non-object regions.

✅ b. Bounding Box Regressor

  • Predicts refined coordinates for the bounding box.
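The two heads can be sketched as a small PyTorch module (layer sizes follow the VGG16 variant; `FastRCNNHead` is a made-up name for illustration):

```python
import torch
import torch.nn as nn

# Detection head: shared FC layers branching into class scores and box deltas.
class FastRCNNHead(nn.Module):
    def __init__(self, in_dim=512 * 7 * 7, num_classes=21):  # 20 + background
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, 4096), nn.ReLU(),
                                nn.Linear(4096, 4096), nn.ReLU())
        self.cls_score = nn.Linear(4096, num_classes)      # softmax logits
        self.bbox_pred = nn.Linear(4096, num_classes * 4)  # per-class deltas

    def forward(self, pooled):          # pooled: (N_rois, 512, 7, 7)
        x = self.fc(pooled.flatten(1))
        return self.cls_score(x), self.bbox_pred(x)

scores, deltas = FastRCNNHead()(torch.rand(8, 512, 7, 7))
print(scores.shape, deltas.shape)  # torch.Size([8, 21]) torch.Size([8, 84])
```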


📊 Fast R-CNN Architecture Flow

[Input Image] → [CNN (e.g., VGG16)] → [Feature Map] + [Region Proposals (RoIs from Selective Search)] → [RoI Pooling (for each proposal)] → [Fully Connected Layers] → [Softmax Classifier] (object class) + [Bounding Box Regressor] (refinement)

🎯 Key Improvements Over R-CNN

| Feature | R-CNN | Fast R-CNN |
|---|---|---|
| CNN runs per proposal | Yes (~2000× per image) | No (1× per image) |
| Feature reuse | ❌ No | ✅ Yes (shared feature map) |
| Training pipeline | 3-stage | End-to-end |
| Detection speed | Slow (~47 s/image) | Fast (~0.3 s/image with VGG16) |
| Storage requirement | High (features saved to disk) | Low (end-to-end, in memory) |

🧩 Limitations of Fast R-CNN

  • Still relies on Selective Search for region proposals, which is slow and external to the network.

  • This bottleneck was solved in the next evolution: Faster R-CNN.


🚀 What is Faster R-CNN?

Faster R-CNN (by Ren, He, Girshick, and Sun, 2015) is an evolution of Fast R-CNN. It solves the biggest bottleneck of Fast R-CNN: slow, external region proposal generation (e.g., Selective Search).

The breakthrough? It introduces a learnable, fully convolutional Region Proposal Network (RPN) inside the CNN itself.

📌 Now, both region proposal generation and object detection are done in a single, unified, deep learning model.


🧱 Architecture Stack of Faster R-CNN

Let’s go step by step through how the architecture is stacked:


1️⃣ Input Image → CNN Feature Map

  • The input image is passed through a backbone CNN (like ResNet, VGG, etc.).

  • This produces a shared convolutional feature map for the entire image.


2️⃣ Region Proposal Network (RPN)

  • A small sliding window goes over the feature map.

  • For each location, it predicts objectness scores + bounding box coordinates for multiple anchors (predefined boxes of different scales/aspect ratios).

  • After scoring and non-maximum suppression, roughly the top 300 proposals per image are kept.

📌 This replaces Selective Search entirely.
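Anchor generation can be sketched as follows (3 scales × 3 aspect ratios gives the paper's 9 anchors per location; the width/height convention used here is one common choice):

```python
import numpy as np

# The 9 reference anchors tested at every sliding-window position:
# 3 scales x 3 aspect ratios, centered on one feature-map cell.
def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)   # keep area ~ s^2
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])  # (x1, y1, x2, y2)
    return np.array(anchors)

print(make_anchors().shape)  # (9, 4)
```

At each of the H×W feature-map positions these 9 anchors are shifted to that position, so the RPN scores H×W×9 candidate boxes in one pass.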


3️⃣ RoI Pooling (or RoI Align)

  • The top-N region proposals from RPN are used.

  • Each proposal is mapped back to the shared feature map and passed through a RoI Pooling layer (or RoI Align for better precision).

  • This extracts fixed-size feature vectors (e.g., 7×7) per region.


4️⃣ Fully Connected Layers

  • These pooled features are passed through fully connected layers (same as Fast R-CNN).


5️⃣ Two Output Heads per RoI

Each RoI now branches into:

✅ a. Softmax Classifier

  • Predicts the object class for the region (including “background”).

✅ b. Bounding Box Regressor

  • Refines the bounding box location.


🧊 Architecture Flow: Faster R-CNN

[Input Image] → [Backbone CNN] → [Feature Map] → [Region Proposal Network (RPN)] → [Top-N Proposals] → [RoI Pooling / RoI Align] → [FC Layers] → [Classifier] (object class) + [BBox Regressor] (refined box)

⚙️ What Makes RPN Special?

  • It’s a fully convolutional network trained jointly with the detection head.

  • It’s trained with a binary classification loss (object vs. background) + regression loss.

  • The RPN shares weights with the base CNN → efficient and elegant.


🎯 Key Improvements Over Fast R-CNN

| Feature | Fast R-CNN | Faster R-CNN |
|---|---|---|
| Region proposals | Selective Search (external, slow) | RPN (internal, fast & learnable) |
| Speed | ~0.3 s/image | ~0.2 s/image (near real-time possible) |
| End-to-end training | Partial | Full end-to-end |
| Proposal quality | Fixed heuristic | Learned from data |

🧩 Limitations?

  • Slower than single-shot detectors like YOLO or SSD for real-time tasks.

  • Slightly more complex to implement due to the two-stage architecture.


✅ Summary

Faster R-CNN = Fast R-CNN + Region Proposal Network, all trained together in one deep learning pipeline.

It is:

  • Accurate 🔍

  • Modular ⚙️

  • End-to-end trainable 🧠

  • And still forms the backbone of modern detectors, including Mask R-CNN, Cascade R-CNN, and more.
