Saturday, 7 December 2024

What is ResNet

 ResNet (Residual Network) is a groundbreaking deep neural network architecture introduced by Microsoft Research in 2015. It was designed to address the vanishing gradient problem and enable the training of very deep networks, which were previously difficult to optimize effectively.


Key Concepts in ResNet

  1. Deep Networks and the Vanishing Gradient Problem:

    • As neural networks become deeper, the gradients during backpropagation tend to diminish, making it challenging to update the weights of earlier layers.
    • This can lead to a network where additional layers degrade performance rather than improve it (called the degradation problem).
  2. Residual Learning:

    • ResNet introduced a concept called residual connections (or skip connections).
    • Instead of learning the direct mapping H(x) (the desired output), it learns the residual function F(x) = H(x) - x, reformulating the problem as H(x) = F(x) + x.
    • The residual connection directly adds the input x to the output of a block, ensuring that the network learns only the residual F(x), which is often easier to optimize.
  3. Residual Block:

    • A residual block is the fundamental building unit of ResNet. It consists of:

      • Two or three convolutional layers.
      • A skip connection that bypasses these layers and adds the input to the output.
      • Batch normalization (BN) and ReLU activation, typically applied after each convolution (with the final ReLU applied after the addition).

      Mathematically:

      y = F(x, {W_i}) + x

      where F(x, {W_i}) represents the convolutional operations with weights {W_i}.

  4. Bottleneck Architecture:

    • For deeper versions of ResNet (e.g., ResNet-50, ResNet-101), a bottleneck block is used to reduce computational cost:
      • First, reduce the dimensionality of the input with a 1×1 convolution.
      • Apply a 3×3 convolution for feature extraction.
      • Restore dimensionality with another 1×1 convolution.
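The residual computation above can be sketched in NumPy. This is a toy illustration, not code from any ResNet implementation: it replaces the block's convolutions with two small dense layers (the weights W1, W2 are made up) so the skip-connection arithmetic y = F(x) + x is easy to see:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # F(x, {W_i}): stand-in for the block's convolutional layers
    f = relu(x @ W1) @ W2
    # Skip connection: add the input back, then apply the final ReLU
    return relu(f + x)

x = np.array([1.0, -2.0, 3.0, 0.5])

# With all-zero weights F(x) = 0, so the block reduces to relu(x):
# the identity mapping is trivially representable, which is exactly
# why adding residual layers does not have to hurt performance.
W_zero = np.zeros((4, 4))
assert np.allclose(residual_block(x, W_zero, W_zero), relu(x))
```

The zero-weight case shows the key property: an extra residual block can always fall back to (approximately) the identity, so depth cannot easily make things worse.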

ResNet Architectures

ResNet comes in various depths, commonly referred to by the number of layers:

  1. ResNet-18: 18 layers (basic blocks).
  2. ResNet-34: 34 layers (basic blocks).
  3. ResNet-50: 50 layers (bottleneck blocks).
  4. ResNet-101: 101 layers (bottleneck blocks).
  5. ResNet-152: 152 layers (bottleneck blocks).

Basic Block (used in ResNet-18, ResNet-34):

  • Two 3×3 convolutions with a skip connection.

Bottleneck Block (used in ResNet-50, ResNet-101, ResNet-152):

  • A 1×1 convolution for dimensionality reduction.
  • A 3×3 convolution for feature extraction.
  • A 1×1 convolution to restore dimensionality.

Strengths of ResNet

  1. Enabling Very Deep Networks:
    • Networks with hundreds or even thousands of layers can be trained effectively.
  2. Improved Gradient Flow:
    • Residual connections ensure that gradients flow directly through the skip paths during backpropagation, reducing the vanishing gradient problem.
  3. High Accuracy:
    • ResNet achieved top results on benchmarks like ImageNet and COCO.

Limitations of ResNet

  1. Computational Cost:
    • Deeper models like ResNet-152 are computationally expensive.
  2. Inefficiency for Small Networks:
    • For small tasks, the residual connections might not provide significant benefits.

Applications of ResNet

  1. Image Classification:
    • Won the ImageNet challenge in 2015.
  2. Object Detection:
    • Backbone for models like Faster R-CNN, Mask R-CNN.
  3. Semantic Segmentation:
    • Used in models like DeepLab.

Variants of ResNet

  1. ResNeXt:
    • Uses grouped convolutions for better accuracy-efficiency trade-off.
  2. Wide ResNet:
    • Increases the width of layers instead of depth for better performance.
  3. ResNet-D:
    • Modifies the downsampling path (e.g., average pooling before the 1×1 shortcut convolution) to reduce information loss, improving accuracy in classification and detection tasks.

What is MobileNet

 MobileNet is a family of efficient convolutional neural network architectures designed primarily for mobile and embedded vision applications where computational resources and power are constrained. It was developed by Google, with the goal of maintaining high accuracy while significantly reducing model size and inference time.

Here’s an overview of MobileNet:


Key Concepts in MobileNet

  1. Depthwise Separable Convolutions:

    • Standard Convolution: Combines spatial filtering and channel-wise projection in a single step.
    • Depthwise Separable Convolution splits this into two steps:
      1. Depthwise Convolution: A single filter per input channel (spatial filtering).
      2. Pointwise Convolution: Uses 1×1 convolutions to combine the output of the depthwise convolution (channel-wise projection).
    • This separation drastically reduces computational cost by performing fewer operations.

    Computational Reduction: If the input has M channels, the output has N channels, and the filter size is D_k × D_k, the multiplication cost per output location is:

    • Standard convolution: M × N × D_k × D_k
    • Depthwise separable convolution: M × D_k × D_k + M × N

    This is a significant reduction in operations, especially for large D_k, M, or N: the ratio of the two costs simplifies to 1/N + 1/D_k².

  2. Width Multiplier (α):

    • Controls the number of channels in each layer.
    • Takes values 0 < α ≤ 1, where smaller α reduces the number of parameters and computations but also decreases model capacity.
  3. Resolution Multiplier (ρ):

    • Reduces the input image resolution by a factor.
    • Helps scale down the model size and computation for lower-resolution inputs.
  4. Bottleneck Layers (in MobileNetV2):

    • In MobileNetV2, a bottleneck structure with an expansion factor is used, introducing:
      • Inverted Residuals: Channels are expanded and then reduced.
      • Linear Bottleneck: Helps retain information better during down-sampling.
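The cost formulas above are easy to check numerically. A small sketch (the channel counts and kernel size are illustrative values, not figures from the MobileNet paper):

```python
# Multiplications per output location, using the formulas above.
M, N, Dk = 256, 256, 3   # input channels, output channels, kernel size

standard = M * N * Dk * Dk           # fused spatial + channel-mixing step
separable = M * Dk * Dk + M * N      # depthwise pass + 1x1 pointwise pass

assert standard == 589824
assert separable == 67840

# The reduction ratio simplifies algebraically to 1/N + 1/Dk^2
ratio = separable / standard
assert abs(ratio - (1 / N + 1 / Dk**2)) < 1e-12

# A width multiplier alpha scales every channel count, so the dominant
# pointwise term shrinks roughly as alpha^2
alpha = 0.5
scaled = (alpha * M) * Dk * Dk + (alpha * M) * (alpha * N)
assert scaled == 17536.0
```

Here the separable version needs roughly 11.5% of the standard convolution's multiplications, and halving the width multiplier cuts the cost by almost another factor of four.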

Versions of MobileNet

MobileNetV1 (2017)

  • Introduced depthwise separable convolutions and width/resolution multipliers.
  • Strikes a good balance between accuracy and efficiency.
  • Suitable for tasks like image classification, object detection, and segmentation on mobile devices.

MobileNetV2 (2018)

  • Introduced inverted residual blocks and linear bottlenecks to improve performance.
  • Achieved higher accuracy for a similar computational cost compared to MobileNetV1.
  • Became the backbone for many mobile-friendly deep learning tasks.

MobileNetV3 (2019)

  • Combines NAS (Neural Architecture Search) with manual design.
  • Incorporates advanced building blocks such as Squeeze-and-Excitation (SE) layers for channel attention.
  • Further optimizations for both latency and accuracy.
  • Released in two variants:
    • MobileNetV3-Small: Prioritizes low latency and efficiency.
    • MobileNetV3-Large: Focuses on higher accuracy for slightly higher computational cost.

Applications of MobileNet

  1. Image Classification: Lightweight models for real-time classification.
  2. Object Detection: Backbone for models like SSD (Single Shot Detector).
  3. Semantic Segmentation: Used in models like DeepLab.
  4. Edge Devices: Running neural networks on smartphones, drones, or IoT devices.

Advantages of MobileNet

  • Lightweight: Small model size and fewer parameters.
  • Fast Inference: Optimized for low-latency applications.
  • Scalable: Adjustable via width and resolution multipliers.
  • Accurate: Retains competitive accuracy despite being lightweight.

What is BERT

 BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art natural language processing (NLP) model developed by Google. It is designed to understand the context of words in a sentence more effectively by considering their surroundings (both before and after the word). BERT is based on the Transformer architecture, specifically focusing on the encoder portion of Transformers.

Here’s a breakdown of what makes BERT unique and powerful:


1. What Does BERT Do?

BERT is a pre-trained language model that can be fine-tuned for a wide range of NLP tasks, such as:

  • Text classification (e.g., sentiment analysis)
  • Named entity recognition (NER) (e.g., identifying proper nouns in text)
  • Question answering (e.g., SQuAD dataset tasks)
  • Language inference (e.g., entailment tasks)
  • Text similarity (e.g., finding similar documents)

2. Key Features of BERT

Bidirectional Context Understanding

Unlike earlier models such as GPT (which processes text left-to-right), BERT processes text bidirectionally. This means it looks at the entire sentence, both before and after a word, to understand its meaning in context.

For example:

  • In the sentence: "I went to the bank to deposit money."
    • BERT understands "bank" as a financial institution because of the surrounding words.
  • In the sentence: "I sat by the bank of the river."
    • BERT understands "bank" as a riverbank due to the context.

Pre-training and Fine-tuning

BERT is trained in two steps:

  1. Pre-training: BERT is trained on large text corpora (like Wikipedia and books) using unsupervised tasks:

    • Masked Language Modeling (MLM): Some words in the sentence are masked (e.g., "I love [MASK] programming."), and the model learns to predict the masked word.
    • Next Sentence Prediction (NSP): The model learns relationships between sentence pairs (e.g., predicting if two sentences are logically connected).
  2. Fine-tuning: Once pre-trained, BERT can be fine-tuned for specific tasks by adding a small, task-specific layer on top of the model.
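The MLM pre-training objective can be sketched in a few lines of plain Python. This is a toy illustration that assumes nothing about BERT's real tokenizer or its full 80/10/10 replacement scheme; it simply hides a random fraction of whitespace-split tokens and records what the model would have to predict:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    # Simplified MLM corruption: each token is independently replaced by
    # [MASK] with probability mask_prob; the original token is kept as
    # the prediction target for that position.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

sentence = "the model learns to predict the hidden words".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)

assert len(masked) == len(sentence)
assert all(masked[i] == "[MASK]" for i in targets)
```

During pre-training, the model sees `masked` as input and is scored only on how well it recovers the tokens stored in `targets`.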

Transformer-Based Architecture

BERT uses the Transformer architecture, which relies on the self-attention mechanism. Self-attention helps the model focus on the most relevant parts of the input sentence for understanding each word.


3. Advantages of BERT

  • Contextualized Word Embeddings: Words are represented dynamically based on context, unlike static embeddings like Word2Vec or GloVe.
  • Versatility: Can be applied to numerous NLP tasks with minimal adjustments.
  • High Accuracy: Achieves state-of-the-art performance on many benchmarks (e.g., GLUE, SQuAD).

4. Limitations of BERT

  • Computationally Expensive: Pre-training and fine-tuning require significant computational resources.
  • Token Limit: Standard BERT models can process a maximum of 512 tokens, limiting their use for very long texts.
  • Data Hunger: Fine-tuning requires large amounts of labeled data for specific tasks.

5. Variants of BERT

Several variations of BERT have been developed to address its limitations or enhance performance:

  • DistilBERT: A smaller, faster version of BERT.
  • RoBERTa: An improved version with better training techniques.
  • ALBERT: A lightweight BERT with reduced parameters for efficiency.
  • TinyBERT: Optimized for mobile and edge devices.

6. Applications of BERT

  • Search Engines: Google Search uses BERT to understand user queries better.
  • Chatbots: Improves conversational understanding.
  • Content Moderation: Detects inappropriate or harmful content.
  • Healthcare: Analyzes medical records or research papers for insights.

In essence, BERT revolutionized NLP by enabling machines to understand the nuances of human language better than ever before.

What is RELU

 ReLU (Rectified Linear Unit) is a widely used activation function in neural networks. It introduces non-linearity into the model, enabling the network to learn complex patterns. Here's a detailed explanation:


Definition

The ReLU activation function is mathematically defined as:

f(x) = max(0, x), i.e. f(x) = x if x > 0, and f(x) = 0 if x ≤ 0

In simpler terms:

  • For positive input (x > 0), the output is the same as the input (f(x) = x).
  • For non-positive input (x ≤ 0), the output is zero (f(x) = 0).
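The definition above translates to one line of NumPy; its gradient, used during backpropagation, is simply 1 for positive inputs and 0 elsewhere:

```python
import numpy as np

def relu(x):
    # max(0, x) elementwise: positives pass through, the rest become 0
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative used in backpropagation: 1 where x > 0, else 0
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
assert np.array_equal(relu(x), np.array([0.0, 0.0, 0.0, 1.5, 3.0]))
assert np.array_equal(relu_grad(x), np.array([0.0, 0.0, 0.0, 1.0, 1.0]))
```

The all-or-nothing gradient is both ReLU's strength (no saturation for x > 0) and the source of the dying-ReLU problem discussed below.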

Key Features of ReLU

  1. Simplicity: ReLU is computationally efficient because it involves only a threshold operation, making it faster than other activation functions like sigmoid or tanh.

  2. Non-linearity: Despite its simplicity, ReLU introduces non-linearity, which is crucial for a neural network to learn complex relationships in data.

  3. Sparsity: ReLU often results in sparsity in activations, meaning only some neurons are activated (non-zero output). This can make the model more efficient and easier to interpret.


Advantages

  • Avoids Vanishing Gradient: Unlike sigmoid or tanh, ReLU does not saturate in the positive region, reducing the chances of vanishing gradients during backpropagation.
  • Computational Efficiency: Simple operations make it faster to compute.
  • Improved Convergence: ReLU often leads to faster convergence during training compared to sigmoid or tanh.

Disadvantages

  1. Dead Neurons: Some neurons may always output zero if they fall into the x ≤ 0 region and never recover. This is known as the dying ReLU problem.

  2. Unbounded Output: ReLU outputs can become very large, which might cause issues in certain scenarios like overfitting or instability in optimization.


Variants of ReLU

To address its limitations, several variants of ReLU have been developed:

  1. Leaky ReLU: Allows a small, non-zero gradient for negative inputs.

    f(x) = x if x > 0, and f(x) = αx if x ≤ 0

    where α is a small positive constant (e.g., 0.01).

  2. Parametric ReLU (PReLU): Similar to Leaky ReLU but learns α during training.

  3. Exponential Linear Unit (ELU): Smoothens the output for negative inputs instead of setting them to zero.

  4. Scaled Exponential Linear Unit (SELU): A self-normalizing variant of ELU.
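The first variants above differ only in how they treat negative inputs. A NumPy sketch (the α values are the commonly used defaults, not mandated by the original papers):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative inputs keep a small slope alpha instead of dying to zero
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth negative branch alpha * (exp(x) - 1), bounded below by -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 2.0])
assert leaky_relu(x)[1] == -0.01      # gradient survives for x < 0
assert np.all(elu(x) > -1.0)          # ELU never goes below -alpha
assert leaky_relu(x)[3] == 2.0        # positive branch is unchanged
```

PReLU uses the same formula as `leaky_relu` but treats `alpha` as a trainable parameter rather than a fixed constant.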


Applications

ReLU is extensively used in:

  • Deep Neural Networks (DNNs)
  • Convolutional Neural Networks (CNNs)
  • Image classification, natural language processing, and other AI tasks.

ReLU has revolutionized deep learning by making training more efficient and enabling deeper networks. Despite its challenges, its simplicity and effectiveness make it a go-to choice for many neural network architectures.

Sunday, 1 December 2024

Contrastive Loss

 Contrastive Loss is a key loss function used in Siamese networks and other neural network architectures for learning embeddings, specifically designed to learn a feature space where similar inputs are close together and dissimilar inputs are far apart. This is especially useful in tasks like face verification, image similarity, and other comparison-based applications.


Definition

The Contrastive Loss is calculated for pairs of inputs, where each pair is labeled as either:

  • Similar (label = 0): The inputs belong to the same class.
  • Dissimilar (label = 1): The inputs belong to different classes.

The loss is formulated to:

  1. Minimize the distance between embeddings of similar pairs.
  2. Maximize the distance between embeddings of dissimilar pairs, up to a defined margin.

Mathematical Formula

L = (1 - Y) · (1/2) · D² + Y · (1/2) · max(0, m - D)²

Where:

  • L: Contrastive loss.
  • Y: Binary label (0 for similar, 1 for dissimilar).
  • D: Distance between the embeddings of the two inputs, typically the Euclidean distance D = ||f(x1) - f(x2)||, where f(x1) and f(x2) are the embeddings of the two inputs.
  • m: Margin, a hyperparameter that defines the minimum distance for dissimilar pairs to not incur loss.
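The formula translates directly to NumPy. A minimal sketch, with made-up embedding values chosen so the distances are easy to verify by hand:

```python
import numpy as np

def contrastive_loss(e1, e2, Y, m=1.0):
    # Y = 0 for similar pairs, 1 for dissimilar pairs, as defined above
    D = np.linalg.norm(e1 - e2)                      # Euclidean distance
    return (1 - Y) * 0.5 * D**2 + Y * 0.5 * max(0.0, m - D) ** 2

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])    # distance from a is exactly 5

assert contrastive_loss(a, a, Y=0) == 0.0    # similar, identical: no loss
assert contrastive_loss(a, b, Y=1) == 0.0    # dissimilar, beyond margin m=1
assert contrastive_loss(a, b, Y=0) == 12.5   # similar but far: 0.5 * 5**2
```

The three cases mirror the behavior described next: similar pairs are pulled together, and dissimilar pairs are only penalized while they remain inside the margin.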

How It Works

  1. Similar Pairs (Y=0):

    • The loss is proportional to D², encouraging the distance D to be as small as possible, i.e., embeddings of similar pairs should be close.
  2. Dissimilar Pairs (Y=1):

    • The loss is proportional to max(0, m - D)².
    • If D ≥ m, the loss is 0, meaning the network does not penalize dissimilar pairs that are already far enough apart.
    • If D < m, the loss increases, pushing the embeddings farther apart.

Intuition Behind the Formula

  • The first term ensures that similar pairs are close in the embedding space.
  • The second term prevents dissimilar pairs from being too close in the embedding space.
  • The margin m acts as a buffer, beyond which dissimilar pairs are considered sufficiently far apart.

Advantages

  • Flexibility: Allows learning embeddings in an unsupervised or semi-supervised manner by using similarity labels.
  • Effectiveness: Ensures meaningful separation of classes in the embedding space, which is essential for tasks like face verification or signature matching.

Challenges

  • Margin Selection: Choosing an appropriate value for m is crucial; too small a margin may not separate classes effectively, and too large a margin may slow down convergence.
  • Pair Construction: Requires carefully balanced positive (similar) and negative (dissimilar) pairs for training.

Applications

  • Face Verification: Learn embeddings where faces of the same person are close and faces of different people are far apart.
  • Signature Verification: Distinguish between genuine and forged signatures.
  • Image Retrieval: Rank images based on their similarity to a query image.

Comparison with Other Loss Functions

  • Triplet Loss: Contrastive loss uses pairs, whereas triplet loss works with triplets (anchor, positive, and negative examples) to optimize embedding distances.
  • Cross-Entropy Loss: Contrastive loss focuses on distances in the embedding space rather than class probabilities.

Contrastive Loss is a powerful tool for metric learning and is particularly well-suited for applications involving similarity or verification.
