Monday, 21 April 2025

R-CNN, Fast R-CNN and Faster R-CNN

 

🧠 What is R-CNN?

R-CNN (proposed by Ross Girshick et al. in 2014) was a landmark model that combined:

  • Region proposal methods (to suggest "where" objects might be),

  • With Convolutional Neural Networks (CNNs) (to learn "what" those regions contain),

  • To perform object detection — classifying and localizing objects in images.

🔁 Before R-CNN, traditional detection methods (like DPMs) used hand-crafted features (e.g., HOG). R-CNN was one of the first to apply deep learning to detection.


🧱 Architecture Stack of R-CNN

Here’s how R-CNN’s architecture is structured step-by-step:

🧩 1. Input Image → Region Proposals

  • R-CNN starts with the input image.

  • Uses a classical algorithm (usually Selective Search) to generate about 2000 region proposals (bounding boxes likely to contain objects).

📌 These proposals are class-agnostic — they just suggest “something might be here.”


🧩 2. Warping Region Proposals

  • Each region proposal is cropped and resized (typically to 227×227 pixels).

  • This resizing ensures compatibility with the input size required by the CNN.
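A minimal sketch of the warping step, using PyTorch's bilinear resize as a stand-in (the original implementation warps the image pixels directly, but the effect is the same fixed-size input):

```python
import torch
import torch.nn.functional as F

# Warp a variable-sized crop to the fixed 227x227 CNN input (AlexNet-era size).
crop = torch.rand(1, 3, 150, 90)   # one region proposal, arbitrary H x W
warped = F.interpolate(crop, size=(227, 227), mode="bilinear",
                       align_corners=False)
print(warped.shape)                # torch.Size([1, 3, 227, 227])
```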


🧠 3. CNN Feature Extraction

  • Each warped region is passed through a pre-trained CNN (like AlexNet or VGG).

  • The CNN outputs a feature vector (often from the fc7 layer, a fully connected layer).

📌 This stage is the deep learning heart of R-CNN.


📈 4. Region Classification (SVM)

  • For each class, a separate linear SVM is trained using the extracted features.

  • Each region is scored by these SVMs to determine which object class (if any) it contains.

📌 CNN acts as a feature extractor, while SVMs do the classification.
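The per-class SVM stage can be sketched with scikit-learn on synthetic features (the "cat" class and the random data are placeholders, not the paper's setup):

```python
import numpy as np
from sklearn.svm import LinearSVC

# One binary linear SVM per class, trained on CNN feature vectors.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 4096))      # features for 100 region proposals
labels = rng.integers(0, 2, size=100)     # 1 = "cat" region, 0 = background
svm_cat = LinearSVC(C=0.001).fit(feats, labels)   # small C, as in the paper
scores = svm_cat.decision_function(feats)         # per-region "cat" confidence
print(scores.shape)  # (100,)
```

At test time, every proposal is scored by every class's SVM, and high-scoring overlapping boxes are merged by non-maximum suppression.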


📐 5. Bounding Box Regression

  • To refine the position of bounding boxes, a bounding box regressor is trained.

  • It predicts adjustments to make the detected box align better with the ground truth.
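The regressor is trained on the standard R-CNN targets: shifts of the box center scaled by the proposal size, plus log-ratios of width and height. A small sketch:

```python
import numpy as np

def bbox_targets(proposal, gt):
    """R-CNN regression targets mapping a proposal box to its ground truth."""
    px, py, pw, ph = proposal   # center x, center y, width, height
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw,     # t_x: center shift, scaled by width
                     (gy - py) / ph,     # t_y: center shift, scaled by height
                     np.log(gw / pw),    # t_w: log width ratio
                     np.log(gh / ph)])   # t_h: log height ratio

t = bbox_targets((50, 50, 40, 40), (54, 48, 44, 36))
print(t)  # small offsets the regressor learns to predict
```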


🔄 R-CNN Pipeline Summary

[Input Image] → [Selective Search] → [~2000 Proposals] → [Warp each to 227×227] → [CNN (AlexNet/VGG)] → [Feature Vector] → [SVM Classifier] + [Bounding Box Regressor]

📉 Limitations of R-CNN

While it was a breakthrough, R-CNN had limitations:

  1. Very slow: Each region proposal is passed through the CNN separately → expensive!

  2. Multi-stage pipeline: CNN + SVM + regressor are trained separately.

  3. Heavy storage: Extracted features need to be stored on disk for training SVMs.


🧬 Evolution After R-CNN

To overcome these problems, later versions were developed:

  • Fast R-CNN: Uses a single CNN pass for the entire image.

  • Faster R-CNN: Adds a Region Proposal Network (RPN) to make the entire process end-to-end.

  • Mask R-CNN: Adds segmentation capabilities on top of Faster R-CNN.


⚡️ What is Fast R-CNN?

Fast R-CNN (Girshick, 2015) is a refined and faster version of R-CNN. It avoids redundant CNN computations for each region proposal by:

✅ Running the CNN once per image,
✅ Using Region of Interest (RoI) pooling to extract features for each proposal, and
✅ Training everything (classification + bounding box regression) in a single end-to-end network.


🧱 Architecture Stack of Fast R-CNN

Let’s break down the architecture and flow:


1️⃣ Input Image → CNN Feature Map

  • The entire image is passed once through a deep CNN (like VGG16 or ResNet).

  • This results in a convolutional feature map representing the entire image.

📌 Unlike R-CNN, Fast R-CNN avoids repeating this step for every proposal.


2️⃣ Region Proposals (RoIs)

  • Region proposals (still from Selective Search in this version) are generated.

  • Each proposal is a bounding box over the image.


3️⃣ RoI Pooling

  • Each region proposal is mapped onto the feature map.

  • A RoI Pooling layer extracts a fixed-size feature vector (e.g., 7×7) from each proposal's region on the feature map.

📌 This makes it possible to feed different-sized regions into fully connected layers.


4️⃣ Fully Connected Layers (FC layers)

  • The pooled feature vectors from RoI pooling are fed into shared FC layers.

  • This generates high-level features for each RoI.


5️⃣ Two Output Heads (for each RoI)

Each region now branches into two outputs:

✅ a. Softmax Classifier

  • Predicts the class label for each region proposal.

  • Includes a “background” class for non-object regions.

✅ b. Bounding Box Regressor

  • Predicts refined coordinates for the bounding box.
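The two heads can be sketched as a small PyTorch module (layer sizes follow the VGG16 variant; `FastRCNNHead` is a made-up name for illustration):

```python
import torch
import torch.nn as nn

# Detection head: shared FC layers branching into class scores and box deltas.
class FastRCNNHead(nn.Module):
    def __init__(self, in_dim=512 * 7 * 7, num_classes=21):  # 20 + background
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, 4096), nn.ReLU(),
                                nn.Linear(4096, 4096), nn.ReLU())
        self.cls_score = nn.Linear(4096, num_classes)      # softmax logits
        self.bbox_pred = nn.Linear(4096, num_classes * 4)  # per-class deltas

    def forward(self, pooled):          # pooled: (N_rois, 512, 7, 7)
        x = self.fc(pooled.flatten(1))
        return self.cls_score(x), self.bbox_pred(x)

scores, deltas = FastRCNNHead()(torch.rand(8, 512, 7, 7))
print(scores.shape, deltas.shape)  # torch.Size([8, 21]) torch.Size([8, 84])
```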


📊 Fast R-CNN Architecture Flow

[Input Image] → [CNN (e.g., VGG16)] → [Feature Map] + [Region Proposals (RoIs from Selective Search)] → [RoI Pooling (for each proposal)] → [Fully Connected Layers] → [Softmax Classifier] (object class) + [Bounding Box Regressor] (refinement)

🎯 Key Improvements Over R-CNN

| Feature | R-CNN | Fast R-CNN |
|---|---|---|
| CNN runs per proposal | Yes (~2000× per image) | No (1× per image) |
| Feature reuse | ❌ No | ✅ Yes (shared feature map) |
| Training pipeline | 3-stage | End-to-end |
| Detection speed | Slow (~47 s/image) | Fast (~0.3 s/image with VGG16) |
| Storage requirement | High (features saved to disk) | Low (end-to-end, in memory) |

🧩 Limitations of Fast R-CNN

  • Still relies on Selective Search for region proposals, which is slow and external to the network.

  • This bottleneck was solved in the next evolution: Faster R-CNN.


🚀 What is Faster R-CNN?

Faster R-CNN (by Ren, He, Girshick, and Sun, 2015) is an evolution of Fast R-CNN. It solves the biggest bottleneck of Fast R-CNN: slow, external region proposal generation (e.g., Selective Search).

The breakthrough? It introduces a learnable, fully convolutional Region Proposal Network (RPN) inside the CNN itself.

📌 Now, both region proposal generation and object detection are done in a single, unified, deep learning model.


🧱 Architecture Stack of Faster R-CNN

Let’s go step by step through how the architecture is stacked:


1️⃣ Input Image → CNN Feature Map

  • The input image is passed through a backbone CNN (like ResNet, VGG, etc.).

  • This produces a shared convolutional feature map for the entire image.


2️⃣ Region Proposal Network (RPN)

  • A small sliding window goes over the feature map.

  • For each location, it predicts objectness scores + bounding box coordinates for multiple anchors (predefined boxes of different scales/aspect ratios).

  • After scoring and non-maximum suppression, roughly the top 300 proposals per image are kept.

📌 This replaces Selective Search entirely.
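Anchor generation can be sketched as follows (3 scales × 3 aspect ratios gives the paper's 9 anchors per location; the width/height convention used here is one common choice):

```python
import numpy as np

# The 9 reference anchors tested at every sliding-window position:
# 3 scales x 3 aspect ratios, centered on one feature-map cell.
def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)   # keep area ~ s^2
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])  # (x1, y1, x2, y2)
    return np.array(anchors)

print(make_anchors().shape)  # (9, 4)
```

At each of the H×W feature-map positions these 9 anchors are shifted to that position, so the RPN scores H×W×9 candidate boxes in one pass.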


3️⃣ RoI Pooling (or RoI Align)

  • The top-N region proposals from RPN are used.

  • Each proposal is mapped back to the shared feature map and passed through a RoI Pooling layer (or RoI Align for better precision).

  • This extracts fixed-size feature vectors (e.g., 7×7) per region.


4️⃣ Fully Connected Layers

  • These pooled features are passed through fully connected layers (same as Fast R-CNN).


5️⃣ Two Output Heads per RoI

Each RoI now branches into:

✅ a. Softmax Classifier

  • Predicts the object class for the region (including “background”).

✅ b. Bounding Box Regressor

  • Refines the bounding box location.


🧊 Architecture Flow: Faster R-CNN

[Input Image] → [Backbone CNN] → [Feature Map] → [Region Proposal Network (RPN)] → [Top-N Proposals] → [RoI Pooling / RoI Align] → [FC Layers] → [Classifier] (object class) + [BBox Regressor] (refined box)

⚙️ What Makes RPN Special?

  • It’s a fully convolutional network trained jointly with the detection head.

  • It’s trained with a binary classification loss (object vs. background) + regression loss.

  • The RPN shares weights with the base CNN → efficient and elegant.


🎯 Key Improvements Over Fast R-CNN

| Feature | Fast R-CNN | Faster R-CNN |
|---|---|---|
| Region proposals | Selective Search (external, slow) | RPN (internal, fast & learnable) |
| Speed | ~0.3 s/image | ~0.2 s/image (near real-time possible) |
| End-to-end training | Partial | Full end-to-end |
| Proposal quality | Fixed heuristic | Learned from data |

🧩 Limitations?

  • Slower than single-shot detectors like YOLO or SSD for real-time tasks.

  • Slightly more complex to implement due to the two-stage architecture.


✅ Summary

Faster R-CNN = Fast R-CNN + Region Proposal Network, all trained together in one deep learning pipeline.

It is:

  • Accurate 🔍

  • Modular ⚙️

  • End-to-end trainable 🧠

  • And still forms the backbone of modern detectors, including Mask R-CNN, Cascade R-CNN, and more.
