Understanding Matrix Dimensions in a Basic Neural Network
By Priyank Goyal
Grasping the role of matrix dimensions in a neural network is not just a technicality—it’s foundational. Many challenges in building and debugging deep learning models arise from mismatched dimensions in forward or backward passes. In this article, we explore each step of a basic feedforward neural network and carefully annotate the shapes of the involved matrices and vectors. This dimensional walkthrough complements the theoretical and code-level understanding of backpropagation, particularly for researchers, data scientists, and machine learning enthusiasts.
1. Network Architecture
We consider a compact neural network with the following configuration:
- Input Layer: 3 features
- Hidden Layer: 2 neurons with ReLU activation
- Output Layer: 1 neuron with sigmoid activation
The network performs binary classification. The full data flow is represented by:
\[ z_1 = W_1 x + b_1,\quad h = \text{ReLU}(z_1),\quad z_2 = W_2 h + b_2,\quad \hat{y} = \sigma(z_2) \]

Each component’s shape determines the feasibility and correctness of these computations.
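As a quick illustration of why shapes matter (the variable values here are arbitrary placeholders, not from the article), matrix multiplication requires the inner dimensions to agree; a mismatch fails immediately in NumPy:

```python
import numpy as np

# Shapes matching the network above: W1 is (2, 3), x is (3, 1).
W1 = np.zeros((2, 3))
x = np.zeros((3, 1))
print((W1 @ x).shape)  # (2, 1): inner dimensions (3 and 3) agree

# A mismatched multiply raises immediately -- the class of error
# this dimensional walkthrough helps you avoid.
try:
    x @ W1  # (3,1) @ (2,3): inner dimensions 1 and 2 do not agree
except ValueError as e:
    print("shape error:", e)
```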
2. Shape Overview of Components
Let’s assign dimensions to each element. The following table summarizes the shapes used in the network:
| Component | Symbol | Shape | Meaning |
|---|---|---|---|
| Input | \( x \) | 3×1 | Input column vector with 3 features |
| Hidden weights | \( W_1 \) | 2×3 | 2 neurons, each connected to 3 inputs |
| Hidden bias | \( b_1 \) | 2×1 | 1 bias per hidden neuron |
| Hidden pre-activation | \( z_1 \) | 2×1 | Weighted sum before activation |
| Hidden activation | \( h \) | 2×1 | After ReLU |
| Output weights | \( W_2 \) | 1×2 | Connects 2 hidden outputs to 1 output |
| Output bias | \( b_2 \) | 1×1 | Bias added at output node |
| Output pre-activation | \( z_2 \) | 1×1 | Scalar score before sigmoid |
| Output prediction | \( \hat{y} \) | 1×1 | Sigmoid output: predicted probability |
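The table above can be verified directly in NumPy. The sketch below (random values, shapes only) asserts that each intermediate quantity has the dimension listed:

```python
import numpy as np

rng = np.random.default_rng(0)

x  = rng.standard_normal((3, 1))   # input: 3 features as a column vector
W1 = rng.standard_normal((2, 3))   # hidden weights
b1 = rng.standard_normal((2, 1))   # hidden bias
W2 = rng.standard_normal((1, 2))   # output weights
b2 = rng.standard_normal((1, 1))   # output bias

z1 = W1 @ x + b1                   # (2,3) @ (3,1) + (2,1) -> (2,1)
h  = np.maximum(0, z1)             # ReLU preserves shape: (2,1)
z2 = W2 @ h + b2                   # (1,2) @ (2,1) + (1,1) -> (1,1)

for name, arr, shape in [("z1", z1, (2, 1)), ("h", h, (2, 1)), ("z2", z2, (1, 1))]:
    assert arr.shape == shape, f"{name}: expected {shape}, got {arr.shape}"
```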
3. Forward Pass: Dimensions at Each Step
Step 1: Hidden Pre-Activation
\[ z_1 = W_1 x + b_1 \]
- \( W_1 \in \mathbb{R}^{2 \times 3} \)
- \( x \in \mathbb{R}^{3 \times 1} \)
- \( W_1 x \in \mathbb{R}^{2 \times 1} \)
- \( b_1 \in \mathbb{R}^{2 \times 1} \)
- Result: \( z_1 \in \mathbb{R}^{2 \times 1} \)

Step 2: Hidden Activation
\[ h = \text{ReLU}(z_1) \in \mathbb{R}^{2 \times 1} \]

Step 3: Output Pre-Activation
\[ z_2 = W_2 h + b_2 \]
- \( W_2 \in \mathbb{R}^{1 \times 2} \)
- \( h \in \mathbb{R}^{2 \times 1} \)
- \( W_2 h \in \mathbb{R}^{1 \times 1} \)
- \( b_2 \in \mathbb{R}^{1 \times 1} \)
- Result: \( z_2 \in \mathbb{R}^{1 \times 1} \)

Step 4: Output
\[ \hat{y} = \sigma(z_2) \in \mathbb{R}^{1 \times 1} \]

4. Backward Pass: Gradient Dimensions
We now track the shape of each gradient flowing backward from the loss \( \mathcal{L} \).
Step 1: Gradient w.r.t. \( z_2 \)
\[ \frac{\partial \mathcal{L}}{\partial z_2} \in \mathbb{R}^{1 \times 1} \]
For binary cross-entropy combined with the sigmoid, this gradient simplifies to \( \hat{y} - y \).

Step 2: Gradient w.r.t. Output Layer Parameters
\[ \frac{\partial \mathcal{L}}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot h^T \in \mathbb{R}^{1 \times 2} \]
\[ \frac{\partial \mathcal{L}}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial z_2} \in \mathbb{R}^{1 \times 1} \]

Step 3: Gradient w.r.t. Hidden Layer Activation
\[ \frac{\partial \mathcal{L}}{\partial h} = W_2^T \cdot \frac{\partial \mathcal{L}}{\partial z_2} \in \mathbb{R}^{2 \times 1} \]

Step 4: Gradient w.r.t. \( z_1 \) Using ReLU Derivative
\[ \frac{\partial \mathcal{L}}{\partial z_1} = \frac{\partial \mathcal{L}}{\partial h} \circ f'(z_1) \in \mathbb{R}^{2 \times 1} \]
Here \( \circ \) denotes the elementwise (Hadamard) product, and \( f'(z_1) \) is the elementwise ReLU derivative: 1 where \( z_1 > 0 \), 0 otherwise.

Step 5: Gradient w.r.t. Hidden Layer Parameters
\[ \frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot x^T \in \mathbb{R}^{2 \times 3} \]
\[ \frac{\partial \mathcal{L}}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial z_1} \in \mathbb{R}^{2 \times 1} \]

5. Python Code with Shape Comments
import numpy as np

# Input
x = np.array([[1.0], [0.5], [-1.5]]) # shape: (3,1)
# Parameters
W1 = np.random.randn(2, 3) # shape: (2,3)
b1 = np.random.randn(2, 1) # shape: (2,1)
W2 = np.random.randn(1, 2) # shape: (1,2)
b2 = np.random.randn(1, 1) # shape: (1,1)
# Forward pass
z1 = W1 @ x + b1 # shape: (2,1)
h = np.maximum(0, z1) # shape: (2,1)
z2 = W2 @ h + b2 # shape: (1,1)
y_pred = 1 / (1 + np.exp(-z2)) # shape: (1,1)
# Backward pass
dy = -(1 / y_pred) * (y_pred * (1 - y_pred)) # dL/dz2 for cross-entropy loss with true label y = 1; shape: (1,1)
dW2 = dy @ h.T # shape: (1,2)
db2 = dy # shape: (1,1)
dh = W2.T @ dy # shape: (2,1)
dz1 = dh * (z1 > 0).astype(float) # shape: (2,1)
dW1 = dz1 @ x.T # shape: (2,3)
db1 = dz1 # shape: (2,1)
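A good habit when writing backpropagation by hand is to confirm the analytic gradients against finite differences. The sketch below (a self-contained check, not part of the article's listing) assumes the same cross-entropy loss with true label \( y = 1 \) used above, i.e. \( \mathcal{L} = -\log \hat{y} \), so \( \partial \mathcal{L} / \partial z_2 = \hat{y} - 1 \):

```python
import numpy as np

def loss(W1, b1, W2, b2, x):
    """Forward pass plus cross-entropy loss for true label y = 1."""
    h = np.maximum(0, W1 @ x + b1)
    y_pred = 1 / (1 + np.exp(-(W2 @ h + b2)))
    return float(-np.log(y_pred))

rng = np.random.default_rng(0)
x  = rng.standard_normal((3, 1))
W1 = rng.standard_normal((2, 3)); b1 = rng.standard_normal((2, 1))
W2 = rng.standard_normal((1, 2)); b2 = rng.standard_normal((1, 1))

# Analytic gradient, exactly as derived in the article
z1 = W1 @ x + b1
h = np.maximum(0, z1)
y_pred = 1 / (1 + np.exp(-(W2 @ h + b2)))
dy = y_pred - 1                        # dL/dz2 for y = 1; shape: (1,1)
dz1 = (W2.T @ dy) * (z1 > 0)           # shape: (2,1)
dW1 = dz1 @ x.T                        # shape: (2,3)

# Numerical gradient of the loss w.r.t. each entry of W1
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps; Wm[i, j] -= eps
        num[i, j] = (loss(Wp, b1, W2, b2, x) - loss(Wm, b1, W2, b2, x)) / (2 * eps)

assert np.allclose(dW1, num, atol=1e-4)
```

If the assertion passes, both the shapes and the values of the hand-derived gradients are consistent. (Finite differences are unreliable only when a pre-activation lands exactly on the ReLU kink at zero, which is vanishingly unlikely with random inputs.)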
6. Final Thoughts
When designing neural networks, especially from scratch, paying attention to matrix dimensions ensures model correctness and computational stability. It helps prevent silent shape mismatches and clarifies how data flows through the layers. Whether you’re performing forward passes, computing loss, or updating gradients, checking and tracking matrix dimensions is not optional—it’s essential.
This discipline becomes even more vital in deeper architectures, recurrent models, or when batching inputs. Understanding shapes is your first line of defense—and a powerful debugging ally.