Understanding Matrix Dimensions in a Basic Neural Network
By Priyank Goyal
Grasping the role of matrix dimensions in a neural network is not just a technicality—it’s foundational. Many challenges in building and debugging deep learning models arise from mismatched dimensions in forward or backward passes. In this article, we explore each step of a basic feedforward neural network and carefully annotate the shapes of the involved matrices and vectors. This dimensional walkthrough complements the theoretical and code-level understanding of backpropagation, particularly for researchers, data scientists, and machine learning enthusiasts.
1. Network Architecture
We consider a compact neural network with the following configuration:
- Input Layer: 3 features
- Hidden Layer: 2 neurons with ReLU activation
- Output Layer: 1 neuron with sigmoid activation
The network performs binary classification. The full data flow is represented by:
\[ z_1 = W_1 x + b_1,\quad h = \text{ReLU}(z_1),\quad z_2 = W_2 h + b_2,\quad \hat{y} = \sigma(z_2) \]

Each component’s shape determines the feasibility and correctness of these computations.
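As a quick illustration of why shapes matter (the variable values here are arbitrary placeholders, not from the article), matrix multiplication requires the inner dimensions to agree; a mismatch fails immediately in NumPy:

```python
import numpy as np

# Shapes matching the network above: W1 is (2, 3), x is (3, 1).
W1 = np.zeros((2, 3))
x = np.zeros((3, 1))
print((W1 @ x).shape)  # (2, 1): inner dimensions (3 and 3) agree

# A mismatched multiply raises immediately -- the class of error
# this dimensional walkthrough helps you avoid.
try:
    x @ W1  # (3,1) @ (2,3): inner dimensions 1 and 2 do not agree
except ValueError as e:
    print("shape error:", e)
```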
2. Shape Overview of Components
Let’s assign dimensions to each element. The following table summarizes the shapes used in the network:
| Component | Symbol | Shape | Meaning |
|---|---|---|---|
| Input | \( x \) | 3×1 | Input column vector with 3 features |
| Hidden weights | \( W_1 \) | 2×3 | 2 neurons, each connected to 3 inputs |
| Hidden bias | \( b_1 \) | 2×1 | 1 bias per hidden neuron |
| Hidden pre-activation | \( z_1 \) | 2×1 | Weighted sum before activation |
| Hidden activation | \( h \) | 2×1 | After ReLU |
| Output weights | \( W_2 \) | 1×2 | Connects 2 hidden outputs to 1 output |
| Output bias | \( b_2 \) | 1×1 | Bias added at output node |
| Output pre-activation | \( z_2 \) | 1×1 | Scalar score before sigmoid |
| Output prediction | \( \hat{y} \) | 1×1 | Sigmoid output: predicted probability |
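The table above can be verified directly in NumPy. The sketch below (random values, shapes only) asserts that each intermediate quantity has the dimension listed:

```python
import numpy as np

rng = np.random.default_rng(0)

x  = rng.standard_normal((3, 1))   # input: 3 features as a column vector
W1 = rng.standard_normal((2, 3))   # hidden weights
b1 = rng.standard_normal((2, 1))   # hidden bias
W2 = rng.standard_normal((1, 2))   # output weights
b2 = rng.standard_normal((1, 1))   # output bias

z1 = W1 @ x + b1                   # (2,3) @ (3,1) + (2,1) -> (2,1)
h  = np.maximum(0, z1)             # ReLU preserves shape: (2,1)
z2 = W2 @ h + b2                   # (1,2) @ (2,1) + (1,1) -> (1,1)

for name, arr, shape in [("z1", z1, (2, 1)), ("h", h, (2, 1)), ("z2", z2, (1, 1))]:
    assert arr.shape == shape, f"{name}: expected {shape}, got {arr.shape}"
```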
3. Forward Pass: Dimensions at Each Step
Step 1: Hidden Pre-Activation
\[ z_1 = W_1 x + b_1 \]
- \( W_1 \in \mathbb{R}^{2 \times 3} \)
- \( x \in \mathbb{R}^{3 \times 1} \)
- \( W_1 x \in \mathbb{R}^{2 \times 1} \)
- \( b_1 \in \mathbb{R}^{2 \times 1} \)
- Result: \( z_1 \in \mathbb{R}^{2 \times 1} \)

Step 2: Hidden Activation
\[ h = \text{ReLU}(z_1) \in \mathbb{R}^{2 \times 1} \]

Step 3: Output Pre-Activation
\[ z_2 = W_2 h + b_2 \]
- \( W_2 \in \mathbb{R}^{1 \times 2} \)
- \( h \in \mathbb{R}^{2 \times 1} \)
- \( W_2 h \in \mathbb{R}^{1 \times 1} \)
- \( b_2 \in \mathbb{R}^{1 \times 1} \)
- Result: \( z_2 \in \mathbb{R}^{1 \times 1} \)

Step 4: Output
\[ \hat{y} = \sigma(z_2) \in \mathbb{R}^{1 \times 1} \]

4. Backward Pass: Gradient Dimensions
We now track the shape of each gradient flowing backward from the loss \( \mathcal{L} \).
Step 1: Gradient w.r.t. \( z_2 \)
\[ \frac{\partial \mathcal{L}}{\partial z_2} \in \mathbb{R}^{1 \times 1} \]
For binary cross-entropy combined with the sigmoid, this gradient simplifies to \( \hat{y} - y \).

Step 2: Gradient w.r.t. Output Layer Parameters
\[ \frac{\partial \mathcal{L}}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot h^T \in \mathbb{R}^{1 \times 2} \]
\[ \frac{\partial \mathcal{L}}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial z_2} \in \mathbb{R}^{1 \times 1} \]

Step 3: Gradient w.r.t. Hidden Layer Activation
\[ \frac{\partial \mathcal{L}}{\partial h} = W_2^T \cdot \frac{\partial \mathcal{L}}{\partial z_2} \in \mathbb{R}^{2 \times 1} \]

Step 4: Gradient w.r.t. \( z_1 \) Using ReLU Derivative
\[ \frac{\partial \mathcal{L}}{\partial z_1} = \frac{\partial \mathcal{L}}{\partial h} \circ f'(z_1) \in \mathbb{R}^{2 \times 1} \]
Here \( \circ \) denotes the elementwise (Hadamard) product, and \( f'(z_1) \) is the elementwise ReLU derivative: 1 where \( z_1 > 0 \), 0 otherwise.

Step 5: Gradient w.r.t. Hidden Layer Parameters
\[ \frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot x^T \in \mathbb{R}^{2 \times 3} \]
\[ \frac{\partial \mathcal{L}}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial z_1} \in \mathbb{R}^{2 \times 1} \]

5. Python Code with Shape Comments
import numpy as np

# Input
x = np.array([[1.0], [0.5], [-1.5]]) # shape: (3,1)
# Parameters
W1 = np.random.randn(2, 3) # shape: (2,3)
b1 = np.random.randn(2, 1) # shape: (2,1)
W2 = np.random.randn(1, 2) # shape: (1,2)
b2 = np.random.randn(1, 1) # shape: (1,1)
# Forward pass
z1 = W1 @ x + b1 # shape: (2,1)
h = np.maximum(0, z1) # shape: (2,1)
z2 = W2 @ h + b2 # shape: (1,1)
y_pred = 1 / (1 + np.exp(-z2)) # shape: (1,1)
# Backward pass
dy = -(1 / y_pred) * (y_pred * (1 - y_pred)) # dL/dz2 for cross-entropy loss with true label y = 1; shape: (1,1)
dW2 = dy @ h.T # shape: (1,2)
db2 = dy # shape: (1,1)
dh = W2.T @ dy # shape: (2,1)
dz1 = dh * (z1 > 0).astype(float) # shape: (2,1)
dW1 = dz1 @ x.T # shape: (2,3)
db1 = dz1 # shape: (2,1)
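A good habit when writing backpropagation by hand is to confirm the analytic gradients against finite differences. The sketch below (a self-contained check, not part of the article's listing) assumes the same cross-entropy loss with true label \( y = 1 \) used above, i.e. \( \mathcal{L} = -\log \hat{y} \), so \( \partial \mathcal{L} / \partial z_2 = \hat{y} - 1 \):

```python
import numpy as np

def loss(W1, b1, W2, b2, x):
    """Forward pass plus cross-entropy loss for true label y = 1."""
    h = np.maximum(0, W1 @ x + b1)
    y_pred = 1 / (1 + np.exp(-(W2 @ h + b2)))
    return float(-np.log(y_pred))

rng = np.random.default_rng(0)
x  = rng.standard_normal((3, 1))
W1 = rng.standard_normal((2, 3)); b1 = rng.standard_normal((2, 1))
W2 = rng.standard_normal((1, 2)); b2 = rng.standard_normal((1, 1))

# Analytic gradient, exactly as derived in the article
z1 = W1 @ x + b1
h = np.maximum(0, z1)
y_pred = 1 / (1 + np.exp(-(W2 @ h + b2)))
dy = y_pred - 1                        # dL/dz2 for y = 1; shape: (1,1)
dz1 = (W2.T @ dy) * (z1 > 0)           # shape: (2,1)
dW1 = dz1 @ x.T                        # shape: (2,3)

# Numerical gradient of the loss w.r.t. each entry of W1
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps; Wm[i, j] -= eps
        num[i, j] = (loss(Wp, b1, W2, b2, x) - loss(Wm, b1, W2, b2, x)) / (2 * eps)

assert np.allclose(dW1, num, atol=1e-4)
```

If the assertion passes, both the shapes and the values of the hand-derived gradients are consistent. (Finite differences are unreliable only when a pre-activation lands exactly on the ReLU kink at zero, which is vanishingly unlikely with random inputs.)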
6. Final Thoughts
When designing neural networks, especially from scratch, paying attention to matrix dimensions ensures model correctness and computational stability. It helps prevent silent shape mismatches and clarifies how data flows through the layers. Whether you’re performing forward passes, computing loss, or updating gradients, checking and tracking matrix dimensions is not optional—it’s essential.
This discipline becomes even more vital in deeper architectures, recurrent models, or when batching inputs. Understanding shapes is your first line of defense—and a powerful debugging ally.