Saturday, 24 May 2025

Backpropagation Demystified: A Step-by-Step Guide with Code

By Priyank Goyal

Backpropagation is the engine that drives learning in neural networks. Despite its intimidating name, it is simply an application of the chain rule from calculus, organized to efficiently compute the gradient of a loss function with respect to all the parameters of a network. In this article, we’ll unpack backpropagation by taking you through a small neural network, step by step—from the forward pass, to the computation of gradients, and finally to updating the parameters. We’ll also write code to perform this process manually using NumPy.


1. Overview of the Neural Network

We’ll use a simple feedforward neural network with:

  • Input layer: 3 neurons (i.e., 3 input features)
  • Hidden layer: 2 neurons with ReLU activation
  • Output layer: 1 neuron with sigmoid activation

This network maps input features to a probability value \( \hat{y} \in [0, 1] \), useful for binary classification.

The mathematical flow of data is as follows:

\[ z_1 = W_1 x + b_1,\quad h = \text{ReLU}(z_1),\quad z_2 = W_2 h + b_2,\quad \hat{y} = \sigma(z_2) \]

Where:

  • \( x \in \mathbb{R}^{3 \times 1} \) is the input column vector
  • \( W_1 \in \mathbb{R}^{2 \times 3},\ b_1 \in \mathbb{R}^{2 \times 1} \)
  • \( W_2 \in \mathbb{R}^{1 \times 2},\ b_2 \in \mathbb{R}^{1 \times 1} \)
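The shapes above can be sanity-checked with a minimal NumPy sketch (parameter values here are zero placeholders; the real values come in the next section):

```python
import numpy as np

# Placeholder parameters with the shapes described above
x = np.zeros((3, 1))                 # input column vector
W1 = np.zeros((2, 3)); b1 = np.zeros((2, 1))
W2 = np.zeros((1, 2)); b2 = np.zeros((1, 1))

z1 = W1 @ x + b1                     # (2,3) @ (3,1) + (2,1) -> (2,1)
h = np.maximum(0, z1)                # ReLU preserves shape: (2,1)
z2 = W2 @ h + b2                     # (1,2) @ (2,1) + (1,1) -> (1,1)
y_hat = 1 / (1 + np.exp(-z2))        # sigmoid, shape (1,1)

print(z1.shape, h.shape, z2.shape, y_hat.shape)
```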

2. Forward Propagation

Let’s begin with initial values for inputs and weights:

import numpy as np

x = np.array([[1.0], [0.5], [-1.5]])

W1 = np.array([[0.2, -0.4, 0.1],
               [0.7,  0.3, -0.5]])
b1 = np.array([[0.1],
               [-0.2]])

W2 = np.array([[0.5, -1.0]])
b2 = np.array([[0.2]])

Compute the intermediate outputs:

\[ z_1 = W_1 x + b_1 = \begin{bmatrix} -0.05 \\ 1.4 \end{bmatrix},\quad h = \text{ReLU}(z_1) = \begin{bmatrix} 0 \\ 1.4 \end{bmatrix} \] \[ z_2 = W_2 h + b_2 = -1.2,\quad \hat{y} = \sigma(z_2) = \frac{1}{1 + e^{1.2}} \approx 0.231 \]
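These intermediate values can be reproduced numerically with the arrays defined above:

```python
import numpy as np

x = np.array([[1.0], [0.5], [-1.5]])
W1 = np.array([[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]])
b1 = np.array([[0.1], [-0.2]])
W2 = np.array([[0.5, -1.0]])
b2 = np.array([[0.2]])

z1 = W1 @ x + b1                  # hidden pre-activations
h = np.maximum(0, z1)             # ReLU
z2 = W2 @ h + b2                  # output pre-activation
y_hat = 1 / (1 + np.exp(-z2))     # sigmoid

print(z1.ravel(), h.ravel(), z2.item(), y_hat.item())
```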

3. Loss Function

Assuming the true label is \( y = 1 \), we use the binary cross-entropy loss:

\[ \mathcal{L} = -\left[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right] \approx -\log(0.231) \approx 1.463 \]

This loss tells us how far off the prediction is from the ground truth.
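In code, recomputing \( \hat{y} \) from the output pre-activation \( z_2 = -1.2 \) so the snippet is self-contained:

```python
import numpy as np

y_true = 1.0
z2 = -1.2                               # output pre-activation from the forward pass
y_hat = 1 / (1 + np.exp(-z2))           # sigmoid, ~0.231

# Binary cross-entropy loss
loss = -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))
print(round(float(loss), 3))            # ~1.463
```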


4. Backpropagation: Computing Gradients

4.1.  Calculating Gradient w.r.t. \( z_2 \)

\[ \frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{1}{\hat{y}}\]

Calculations

Assume the loss function is the binary cross-entropy loss:

\[ \mathcal{L} = -\left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right) \]

We differentiate with respect to \( \hat{y} \):

\[ \frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \]

Now, consider the special case where the true label \( y = 1 \). The loss function simplifies to:

\[ \mathcal{L} = -\log(\hat{y}) \]

Differentiating this gives:

\[ \frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{1}{\hat{y}} \]

Thus, \( \frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{1}{\hat{y}} \) is correct in the case where \( y = 1 \).
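A quick finite-difference check confirms this derivative, here at \( \hat{y} = 0.231 \) (a sketch; the step size 1e-6 is an arbitrary choice):

```python
import numpy as np

def bce(y_hat, y_true=1.0):
    # Binary cross-entropy for a scalar prediction
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

y_hat = 0.231
eps = 1e-6
numeric = (bce(y_hat + eps) - bce(y_hat - eps)) / (2 * eps)  # central difference
analytic = -1.0 / y_hat                                      # -1/y_hat for y = 1
print(numeric, analytic)  # both ~ -4.329
```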

\[\quad \frac{\partial \hat{y}}{\partial z_2} = \hat{y}(1 - \hat{y}) \]

Derivation of \( \frac{\partial \hat{y}}{\partial z_2} = \hat{y}(1 - \hat{y}) \)

Step 1: Define the Sigmoid Function

Let the sigmoid function be defined as:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

We aim to compute the derivative:

\[ \frac{d}{dz} \left( \frac{1}{1 + e^{-z}} \right) \]

Step 2: Apply the Chain Rule

Let: \[ f(z) = \frac{1}{g(z)}, \quad \text{where } g(z) = 1 + e^{-z} \]

Then, using the chain rule: \[ \frac{df}{dz} = -\frac{1}{(g(z))^2} \cdot \frac{dg}{dz} \]

Compute the derivative of the inner function: \[ \frac{d}{dz}(1 + e^{-z}) = -e^{-z} \]

Substituting back: \[ \frac{d}{dz} \left( \frac{1}{1 + e^{-z}} \right) = -\frac{1}{(1 + e^{-z})^2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2} \]

Step 3: Express in Terms of Sigmoid

Recall that: \[ \sigma(z) = \frac{1}{1 + e^{-z}}, \quad 1 - \sigma(z) = \frac{e^{-z}}{1 + e^{-z}} \]

Multiplying both expressions: \[ \sigma(z)(1 - \sigma(z)) = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} \]

✅ Final Result

\[ \frac{d}{dz} \left( \frac{1}{1 + e^{-z}} \right) = \sigma(z)(1 - \sigma(z)) \]
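This identity is also easy to verify numerically (a sketch; z = −1.2 matches the forward pass above, but any z works):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = -1.2
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)  # both ~ 0.1779
```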

\[ \delta_2 = \frac{\partial \mathcal{L}}{\partial z_2} = -\frac{1}{0.231} \cdot 0.231(1 - 0.231) \approx -0.769 \]

4.2. Gradients for Output Layer

\[ \frac{\partial \mathcal{L}}{\partial W_2} = \delta_2 \cdot h^T = -0.769 \cdot \begin{bmatrix} 0 & 1.4 \end{bmatrix} = \begin{bmatrix} 0 & -1.076 \end{bmatrix} \] \[ \frac{\partial \mathcal{L}}{\partial b_2} = \delta_2 = -0.769 \]

4.3. Gradient w.r.t. Hidden Layer Output

\[ \frac{\partial \mathcal{L}}{\partial h} = W_2^T \cdot \delta_2 = \begin{bmatrix} 0.5 \\ -1.0 \end{bmatrix} \cdot -0.769 = \begin{bmatrix} -0.384 \\ 0.769 \end{bmatrix} \]

4.4. Gradient w.r.t. \( z_1 \) (ReLU Derivative)

\[ \frac{\partial \mathcal{L}}{\partial z_1} = \frac{\partial \mathcal{L}}{\partial h} \circ f'(z_1) \]

Where \( f'(z_1) = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \), since ReLU’s derivative is 1 for positive inputs and 0 otherwise.

\[ \delta_1 = \begin{bmatrix} -0.384 \\ 0.769 \end{bmatrix} \circ \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0.769 \end{bmatrix} \]

4.5. Gradients for Hidden Layer

\[ \frac{\partial \mathcal{L}}{\partial W_1} = \delta_1 \cdot x^T = \begin{bmatrix} 0 & 0 & 0 \\ 0.769 & 0.384 & -1.153 \end{bmatrix} \] \[ \frac{\partial \mathcal{L}}{\partial b_1} = \delta_1 = \begin{bmatrix} 0 \\ 0.769 \end{bmatrix} \]
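The hidden-layer gradients (the ReLU mask followed by the outer product with \( x^T \)) can be reproduced in a few lines; \( \delta_2 \) is recomputed from the forward pass so the snippet stands alone:

```python
import numpy as np

x = np.array([[1.0], [0.5], [-1.5]])
W1 = np.array([[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]])
b1 = np.array([[0.1], [-0.2]])
W2 = np.array([[0.5, -1.0]])
b2 = np.array([[0.2]])

# Recompute the forward pass
z1 = W1 @ x + b1
h = np.maximum(0, z1)
y_hat = 1 / (1 + np.exp(-(W2 @ h + b2)))

delta2 = y_hat - 1.0              # dL/dz2 for y = 1
dL_dh = W2.T @ delta2             # (2,1) @ (1,1) -> (2,1)
delta1 = dL_dh * (z1 > 0)         # ReLU mask zeroes rows where z1 <= 0
dL_dW1 = delta1 @ x.T             # (2,1) @ (1,3) -> (2,3)
dL_db1 = delta1
print(np.round(dL_dW1, 3))        # rows ~ [0, 0, 0] and ~ [0.769, 0.384, -1.153]
```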

5. Update Step (Gradient Descent)

With a learning rate \( \eta = 0.1 \), we update:

\[ W_2 := W_2 - \eta \cdot \frac{\partial \mathcal{L}}{\partial W_2},\quad b_2 := b_2 - \eta \cdot \frac{\partial \mathcal{L}}{\partial b_2} \] \[ W_1 := W_1 - \eta \cdot \frac{\partial \mathcal{L}}{\partial W_1},\quad b_1 := b_1 - \eta \cdot \frac{\partial \mathcal{L}}{\partial b_1} \]

This updates all weights and biases in the direction that minimizes the loss.
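Sticking with \( \eta = 0.1 \), here is the update for \( W_2 \) alone (a sketch; the gradient is recomputed from the forward-pass values rather than hard-coded):

```python
import numpy as np

eta = 0.1
W2 = np.array([[0.5, -1.0]])
h = np.array([[0.0], [1.4]])           # hidden activations from the forward pass
y_hat = 1 / (1 + np.exp(1.2))          # sigmoid(z2) with z2 = -1.2
dL_dW2 = (y_hat - 1.0) * h.T           # delta2 * h^T for y = 1

W2 = W2 - eta * dL_dW2                 # gradient descent step
print(np.round(W2, 3))                 # -> W2 ~ [0.5, -0.892]
```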


6. Python Code Implementation (NumPy)


import numpy as np

x = np.array([[1.0], [0.5], [-1.5]])
y_true = 1

W1 = np.array([[0.2, -0.4, 0.1],
               [0.7,  0.3, -0.5]])
b1 = np.array([[0.1],
               [-0.2]])

W2 = np.array([[0.5, -1.0]])
b2 = np.array([[0.2]])

lr = 0.1

# Forward Pass
z1 = W1 @ x + b1
h = np.maximum(0, z1)
z2 = W2 @ h + b2
y_pred = 1 / (1 + np.exp(-z2))
loss = - (y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Backward Pass
dL_dz2 = -(y_true / y_pred) * (y_pred * (1 - y_pred))  # equals y_pred - y_true when y_true = 1 (the case derived above)
dL_dW2 = dL_dz2 @ h.T
dL_db2 = dL_dz2

dL_dh = W2.T @ dL_dz2
dL_dz1 = dL_dh * (z1 > 0)
dL_dW1 = dL_dz1 @ x.T
dL_db1 = dL_dz1

# Parameter Update
W2 -= lr * dL_dW2
b2 -= lr * dL_db2
W1 -= lr * dL_dW1
b1 -= lr * dL_db1
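As a sanity check on the backward pass, we can compare one analytic gradient against a finite-difference estimate (a sketch; only the entry W2[0, 1] is checked, for brevity):

```python
import numpy as np

x = np.array([[1.0], [0.5], [-1.5]])
y_true = 1.0
W1 = np.array([[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]])
b1 = np.array([[0.1], [-0.2]])
W2 = np.array([[0.5, -1.0]])
b2 = np.array([[0.2]])

def forward_loss(W1, b1, W2, b2):
    h = np.maximum(0, W1 @ x + b1)
    y_pred = 1 / (1 + np.exp(-(W2 @ h + b2)))
    return float(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# Numerical gradient of the loss w.r.t. W2[0, 1]
eps = 1e-6
W2p, W2m = W2.copy(), W2.copy()
W2p[0, 1] += eps
W2m[0, 1] -= eps
numeric = (forward_loss(W1, b1, W2p, b2) - forward_loss(W1, b1, W2m, b2)) / (2 * eps)

# Analytic gradient from the derivation: dL/dW2 = (y_pred - 1) * h^T for y = 1
h = np.maximum(0, W1 @ x + b1)
y_pred = 1 / (1 + np.exp(-(W2 @ h + b2)))
analytic = float((y_pred - y_true) * h[1, 0])
print(numeric, analytic)  # both ~ -1.076
```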

7. Python Code Implementation (TensorFlow)


import tensorflow as tf

# Input and true label
x = tf.constant([[1.0], [0.5], [-1.5]])  # Shape (3, 1)
y_true = tf.constant([[1.0]])            # Shape (1, 1)

# Initialize weights and biases
W1 = tf.Variable([[0.2, -0.4, 0.1],
                  [0.7,  0.3, -0.5]], dtype=tf.float32)
b1 = tf.Variable([[0.1],
                  [-0.2]], dtype=tf.float32)

W2 = tf.Variable([[0.5, -1.0]], dtype=tf.float32)
b2 = tf.Variable([[0.2]], dtype=tf.float32)

lr = 0.1  # Learning rate

# Forward and backward pass using GradientTape
with tf.GradientTape() as tape:
    z1 = tf.matmul(W1, x) + b1
    h = tf.nn.relu(z1)
    z2 = tf.matmul(W2, h) + b2
    y_pred = tf.sigmoid(z2)
    loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)

# Compute gradients
grads = tape.gradient(loss, [W1, b1, W2, b2])
dW1, db1, dW2, db2 = grads

# Print gradients
print("Loss:", loss.numpy())
print("dL/dW1:\n", dW1.numpy())
print("dL/db1:\n", db1.numpy())
print("dL/dW2:\n", dW2.numpy())
print("dL/db2:\n", db2.numpy())

# Update weights and biases
W1.assign_sub(lr * dW1)
b1.assign_sub(lr * db1)
W2.assign_sub(lr * dW2)
b2.assign_sub(lr * db2)

8. Final Updated Parameters

Parameter | Updated Value
W1 | \[ \begin{bmatrix} 0.2 & -0.4 & 0.1 \\ 0.623 & 0.262 & -0.385 \end{bmatrix} \]
b1 | \[ \begin{bmatrix} 0.1 \\ -0.277 \end{bmatrix} \]
W2 | \[ \begin{bmatrix} 0.5 & -0.892 \end{bmatrix} \]
b2 | \( 0.277 \)
Loss (before the update) | \( \mathcal{L} \approx 1.463 \)

9. Conclusion

This walk-through shows how backpropagation works at a low level by breaking it into mathematical components and implementing each in Python. While high-level libraries like TensorFlow and PyTorch abstract this process, understanding backpropagation gives you deeper insights into how and why neural networks learn—and is essential for debugging, optimization, and research.

In future posts, we will extend this to multiple training steps, batches, and deeper networks with multiple layers and activation functions. But as always, true mastery begins with simplicity.
