🧠 What is R-CNN?
R-CNN (proposed by Ross Girshick et al. in 2014) was a landmark model that combined:
- Region proposal methods (to suggest "where" objects might be),
- with Convolutional Neural Networks (CNNs) (to learn "what" those regions contain),
- to perform object detection: classifying and localizing objects in images.
🔁 Before R-CNN, traditional detection methods (like DPMs) used hand-crafted features (e.g., HOG). R-CNN was one of the first to apply deep learning to detection.
🧱 Architecture Stack of R-CNN
Here’s how R-CNN’s architecture is structured step-by-step:
🧩 1. Input Image → Region Proposals
- R-CNN starts with the input image.
- A classical algorithm (usually Selective Search) generates about 2,000 region proposals (bounding boxes likely to contain objects).
📌 These proposals are class-agnostic — they just suggest “something might be here.”
🧩 2. Warping Region Proposals
- Each region proposal is cropped and resized (typically to 227×227 pixels).
- This resizing ensures compatibility with the fixed input size required by the CNN.
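The crop-and-warp step can be sketched in a few lines. This is a minimal illustration using nearest-neighbor resampling (the paper warps anisotropically with context padding; the image array and box below are made up for illustration):

```python
import numpy as np

def warp_region(image, box, out_size=227):
    """Crop a region proposal and warp it to a fixed CNN input size.

    image: HxWx3 array; box: (x1, y1, x2, y2) in pixel coordinates.
    Nearest-neighbor sampling stands in for true image warping.
    """
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Map each output pixel back to a source pixel in the crop.
    rows = (np.arange(out_size) * h / out_size).astype(int)
    cols = (np.arange(out_size) * w / out_size).astype(int)
    return crop[rows[:, None], cols[None, :]]

# A dummy 300x400 RGB image and one proposal box.
img = np.zeros((300, 400, 3), dtype=np.uint8)
warped = warp_region(img, (50, 30, 250, 130))
print(warped.shape)  # (227, 227, 3)
```

Every proposal, whatever its shape, comes out as the same 227×227 input the CNN expects.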
🧠 3. CNN Feature Extraction
- Each warped region is passed through a pre-trained CNN (like AlexNet or VGG).
- The CNN outputs a feature vector (often from the fc7 layer, a fully connected layer).
📌 This stage is the deep learning heart of R-CNN.
📈 4. Region Classification (SVM)
- For each class, a separate linear SVM is trained on the extracted features.
- Each region is scored by these SVMs to determine which object class (if any) it contains.
📌 CNN acts as a feature extractor, while SVMs do the classification.
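The per-class scoring step amounts to one dot product per SVM. A minimal sketch with NumPy, using random weights and a made-up three-class list purely for illustration (real SVMs would be trained on stored fc7 features):

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.standard_normal(4096)        # stand-in for an fc7 feature vector

# One linear SVM per class: a weight vector w and a bias b each.
classes = ["cat", "dog", "car"]
W = rng.standard_normal((len(classes), 4096))
b = rng.standard_normal(len(classes))

scores = W @ feature + b                   # one score per class
best = int(np.argmax(scores))
# A region counts as a detection only if its best score clears a
# threshold; otherwise it is treated as background.
label = classes[best] if scores[best] > 0 else "background"
print(label, scores.shape)
```

Note how classification is completely decoupled from feature extraction, which is why the features must be cached on disk during training.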
📐 5. Bounding Box Regression
- To refine the position of bounding boxes, a bounding box regressor is trained.
- It predicts adjustments that align the detected box better with the ground truth.
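The adjustment uses the standard R-CNN box parameterization: two deltas shift the box center proportionally to its size, and two rescale its width and height exponentially. A small self-contained sketch:

```python
import math

def apply_box_deltas(box, deltas):
    """Refine a proposal (x1, y1, x2, y2) with regressor outputs (tx, ty, tw, th).

    The deltas shift the box center proportionally to its size and
    rescale width/height by exp(), as in the R-CNN paper.
    """
    x1, y1, x2, y2 = box
    tx, ty, tw, th = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    cx, cy = cx + tx * w, cy + ty * h
    w, h = w * math.exp(tw), h * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Zero deltas leave the box unchanged; a small tx nudges it right.
print(apply_box_deltas((10, 10, 50, 30), (0, 0, 0, 0)))   # (10.0, 10.0, 50.0, 30.0)
print(apply_box_deltas((10, 10, 50, 30), (0.1, 0, 0, 0)))  # (14.0, 10.0, 54.0, 30.0)
```

The exp() on width and height keeps the predicted box size positive no matter what the regressor outputs.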
🔄 R-CNN Pipeline Summary
Input image → ~2,000 Selective Search proposals → warp each to 227×227 → CNN feature extraction → per-class SVM scoring → bounding box regression.
📉 Limitations of R-CNN
While it was a breakthrough, R-CNN had limitations:
- Very slow: each of the ~2,000 region proposals is passed through the CNN separately, which is expensive.
- Multi-stage pipeline: the CNN, SVMs, and regressor are trained separately.
- Heavy storage: extracted features must be saved to disk to train the SVMs.
🧬 Evolution After R-CNN
To overcome these problems, later versions were developed:
- Fast R-CNN: uses a single CNN pass for the entire image.
- Faster R-CNN: adds a Region Proposal Network (RPN) to make the entire process end-to-end.
- Mask R-CNN: adds segmentation capabilities on top of Faster R-CNN.
⚡️ What is Fast R-CNN?
Fast R-CNN (Girshick, 2015) is a refined and faster version of R-CNN. It avoids redundant CNN computations for each region proposal by:
✅ Running the CNN once per image,
✅ Using Region of Interest (RoI) pooling to extract features for each proposal, and
✅ Training everything (classification + bounding box regression) in a single end-to-end network.
🧱 Architecture Stack of Fast R-CNN
Let’s break down the architecture and flow:
1️⃣ Input Image → CNN Feature Map
- The entire image is passed once through a deep CNN (like VGG16 or ResNet).
- This produces a convolutional feature map representing the entire image.
📌 Unlike R-CNN, Fast R-CNN avoids repeating this step for every proposal.
2️⃣ Region Proposals (RoIs)
- Region proposals (still from Selective Search in this version) are generated.
- Each proposal is a bounding box over the image.
3️⃣ RoI Pooling
- Each region proposal is mapped onto the feature map.
- An RoI Pooling layer extracts a fixed-size feature grid (e.g., 7×7) from each proposal's region of the feature map.
📌 This makes it possible to feed different-sized regions into fully connected layers.
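RoI pooling simply divides each proposal's region into a fixed grid of bins and max-pools each bin. A minimal single-channel NumPy sketch (real implementations handle many channels and batches):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """Max-pool a region of a feature map into a fixed out_size x out_size grid.

    feature_map: HxW array (one channel for simplicity);
    roi: (x1, y1, x2, y2) in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    # Bin edges splitting the region into out_size roughly equal strips.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(64 * 64, dtype=float).reshape(64, 64)
pooled = roi_pool(fmap, (5, 10, 40, 50))
print(pooled.shape)  # (7, 7)
```

Whatever the proposal's size, the output is always 7×7, so it can be flattened and fed into the fully connected layers.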
4️⃣ Fully Connected Layers (FC layers)
- The pooled features from RoI pooling are fed into shared FC layers.
- This generates a high-level feature vector for each RoI.
5️⃣ Two Output Heads (for each RoI)
Each region now branches into two outputs:
✅ a. Softmax Classifier
- Predicts the class label for each region proposal.
- Includes a "background" class for non-object regions.
✅ b. Bounding Box Regressor
- Predicts refined coordinates for the bounding box.
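The two heads share the same RoI feature and branch only at the last layer. A minimal NumPy sketch with random weights (shapes chosen for illustration; a real network learns them end-to-end):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 21                          # e.g. 20 object classes + background
roi_feature = rng.standard_normal(4096)   # stand-in for one RoI's FC-layer output

# Classification head: linear layer + softmax over classes (incl. background).
W_cls = rng.standard_normal((num_classes, 4096)) * 0.01
logits = W_cls @ roi_feature
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Regression head: 4 box deltas (tx, ty, tw, th) per object class.
W_box = rng.standard_normal((num_classes * 4, 4096)) * 0.01
deltas = (W_box @ roi_feature).reshape(num_classes, 4)

print(probs.shape, round(float(probs.sum()), 6), deltas.shape)  # (21,) 1.0 (21, 4)
```

Because both heads are differentiable, the classification and regression losses can be summed and backpropagated through the shared layers in one pass.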
📊 Fast R-CNN Architecture Flow
Input image → CNN (run once) → shared feature map → proposals mapped onto the map → RoI Pooling → FC layers → softmax classifier + bounding box regressor.
🎯 Key Improvements Over R-CNN
| Feature | R-CNN | Fast R-CNN |
|---|---|---|
| CNN run per proposal | Yes (2000× per image) | No (1× per image) |
| Feature reuse | ❌ No | ✅ Yes (shared feature map) |
| Training pipeline | 3-stage | End-to-end |
| Detection speed | Slow (∼47s/image) | Fast (∼0.3s/image with VGG16) |
| Storage requirement | High (features saved to disk) | Low (end-to-end in memory) |
🧩 Limitations of Fast R-CNN
- Still relies on Selective Search for region proposals, which is slow and external to the network.
- This bottleneck was solved in the next evolution: Faster R-CNN.
🚀 What is Faster R-CNN?
Faster R-CNN (by Ren, He, Girshick, and Sun, 2015) is an evolution of Fast R-CNN. It solves the biggest bottleneck of Fast R-CNN: slow, external region proposal generation (e.g., Selective Search).
The breakthrough? It introduces a learnable, fully convolutional Region Proposal Network (RPN) inside the CNN itself.
📌 Now, both region proposal generation and object detection are done in a single, unified, deep learning model.
🧱 Architecture Stack of Faster R-CNN
Let’s go step by step through how the architecture is stacked:
1️⃣ Input Image → CNN Feature Map
- The input image is passed through a backbone CNN (like ResNet, VGG, etc.).
- This produces a shared convolutional feature map for the entire image.
2️⃣ Region Proposal Network (RPN)
- A small sliding window moves over the feature map.
- At each location, it predicts objectness scores and bounding box coordinates for multiple anchors (predefined boxes of different scales and aspect ratios).
- The RPN outputs about 300 high-quality proposals per image.
📌 This replaces Selective Search entirely.
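The anchors at one sliding-window location are easy to generate. A small sketch using the paper's default 3 scales × 3 aspect ratios (the center coordinates below are arbitrary):

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes centered at one feature-map location.

    Each anchor has area scale**2 and aspect ratio w/h = ratio, giving
    len(scales) * len(ratios) boxes (9 by default, as in the paper).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * r ** 0.5
            h = s / r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

boxes = make_anchors(100, 100)
print(len(boxes))  # 9
```

For each of these 9 anchors, the RPN predicts one objectness score and four box deltas; sliding the window over the whole feature map yields thousands of scored candidates, which are then filtered down to the best proposals.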
3️⃣ RoI Pooling (or RoI Align)
- The top-N region proposals from the RPN are used.
- Each proposal is mapped back to the shared feature map and passed through an RoI Pooling layer (or RoI Align, for better precision).
- This extracts a fixed-size feature grid (e.g., 7×7) per region.
4️⃣ Fully Connected Layers
- The pooled features are passed through fully connected layers (as in Fast R-CNN).
5️⃣ Two Output Heads per RoI
Each RoI now branches into:
✅ a. Softmax Classifier
- Predicts the object class for the region (including "background").
✅ b. Bounding Box Regressor
- Refines the bounding box location.
🧊 Architecture Flow: Faster R-CNN
Input image → backbone CNN → shared feature map → RPN proposals → RoI Pooling (or RoI Align) → FC layers → softmax classifier + bounding box regressor.
⚙️ What Makes RPN Special?
- It is a fully convolutional network trained jointly with the detection head.
- It is trained with a binary classification loss (object vs. background) plus a box regression loss.
- The RPN shares convolutional features with the backbone CNN → efficient and elegant.
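The two RPN loss terms can be sketched for a single anchor. This is a minimal, pure-Python illustration of the per-anchor loss shape (binary cross-entropy for objectness plus smooth L1 for the box deltas); the numbers are made up, and the real loss averages over a sampled mini-batch of anchors:

```python
import math

def smooth_l1(pred, target):
    """Smooth L1 loss for box regression: quadratic near zero, linear far away."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1 else d - 0.5
    return total

def binary_cross_entropy(p, label):
    """Objectness loss for one anchor: label is 1 (object) or 0 (background)."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# One anchor: predicted objectness 0.8 (true object), box deltas vs. targets.
cls_loss = binary_cross_entropy(0.8, 1)
reg_loss = smooth_l1([0.1, -0.2, 0.05, 0.0], [0.0, 0.0, 0.0, 0.0])
print(round(cls_loss + reg_loss, 4))  # 0.2494
```

The smooth L1 term is only applied to anchors labeled as objects, so background anchors contribute classification loss alone.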
🎯 Key Improvements Over Fast R-CNN
| Feature | Fast R-CNN | Faster R-CNN |
|---|---|---|
| Region Proposals | Selective Search (external, slow) | RPN (internal, fast & learnable) |
| Speed | ~0.3s/image | ~0.2s/image (Real-time possible) |
| End-to-end training | Partial | Full end-to-end |
| Proposal Quality | Fixed heuristic | Learned from data |
🧩 Limitations?
- Slower than single-shot detectors like YOLO or SSD for real-time tasks.
- Slightly more complex to implement due to its two-stage architecture.
✅ Summary
Faster R-CNN = Fast R-CNN + Region Proposal Network, all trained together in one deep learning pipeline.
It is:
- Accurate 🔍
- Modular ⚙️
- End-to-end trainable 🧠
- Still the backbone of modern detectors, including Mask R-CNN, Cascade R-CNN, and more.