Understanding Backpropagation Through a Basic Neural Network Example
By Priyank Goyal
In this article, we explore the foundational workings of a feedforward neural network with a single hidden layer. Our goal is to demystify the process of forward propagation, loss computation, and especially backpropagation—the core algorithm that powers learning in neural networks. This walkthrough answers three essential questions:
- What is a basic neural network?
- How does backpropagation work in such a network?
- How can this process be translated into working Python code?
1. Anatomy of a Basic Neural Network
Let us consider a minimal neural network consisting of three layers:
- Input Layer with 3 neurons (features)
- Hidden Layer with 2 neurons and ReLU activation
- Output Layer with 1 neuron and sigmoid activation
The goal of this network is to perform binary classification. Each input vector \( x \in \mathbb{R}^3 \) is passed through the network, which outputs a probability \( \hat{y} \in [0, 1] \).
The network's computation can be summarized as:
\[ z_1 = W_1 x + b_1,\quad h = \text{ReLU}(z_1),\quad z_2 = W_2 h + b_2,\quad \hat{y} = \sigma(z_2) \]
where:
- \( W_1 \in \mathbb{R}^{2 \times 3} \), \( b_1 \in \mathbb{R}^{2 \times 1} \)
- \( W_2 \in \mathbb{R}^{1 \times 2} \), \( b_2 \in \mathbb{R} \)
- \( \text{ReLU}(z) = \max(0, z) \)
- \( \sigma(z) = \frac{1}{1 + e^{-z}} \) is the sigmoid function
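Both activations are one-liners in NumPy; a minimal sketch (the function names are my own):

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z): negative inputs become 0, positives pass through
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes any real input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

relu(np.array([-1.0, 2.0]))  # array([0., 2.])
sigmoid(0.0)                 # 0.5
```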
2. Forward Propagation
Given the input vector:
\[ x = \begin{bmatrix} 1.0 \\ 0.5 \\ -1.5 \end{bmatrix} \]
and initial parameters:
\[ W_1 = \begin{bmatrix} 0.2 & -0.4 & 0.1 \\ 0.7 & 0.3 & -0.5 \end{bmatrix}, \quad b_1 = \begin{bmatrix} 0.1 \\ -0.2 \end{bmatrix}, \quad W_2 = \begin{bmatrix} 0.5 & -1.0 \end{bmatrix}, \quad b_2 = 0.2 \]
we compute:
\[ z_1 = W_1 x + b_1 = \begin{bmatrix} -0.05 \\ 1.4 \end{bmatrix},\quad h = \text{ReLU}(z_1) = \begin{bmatrix} 0 \\ 1.4 \end{bmatrix} \]
\[ z_2 = W_2 h + b_2 = -1.2,\quad \hat{y} = \sigma(z_2) \approx 0.231 \]
3. Loss Computation
Assuming the true label is \( y = 1 \), the binary cross-entropy loss is:
\[ \mathcal{L} = -\left[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right] = -\log(0.231) \approx 1.463 \]
4. Backpropagation Step-by-Step
We now compute the gradients of the loss with respect to each parameter in reverse order, using the chain rule.
- Output layer:
\[ \frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{1}{\hat{y}} \ (\text{since } y = 1),\quad \frac{\partial \hat{y}}{\partial z_2} = \hat{y}(1 - \hat{y}) \]
\[ \delta_2 = \frac{\partial \mathcal{L}}{\partial z_2} = \hat{y} - y \approx -0.769 \]
\[ \frac{\partial \mathcal{L}}{\partial W_2} = \delta_2 \cdot h^T,\quad \frac{\partial \mathcal{L}}{\partial b_2} = \delta_2 \]
- Hidden layer:
\[ \delta_1 = (W_2^T \delta_2) \circ \text{ReLU}'(z_1),\quad \text{where } \text{ReLU}'(z) = 1 \text{ if } z > 0, \text{ else } 0 \]
\[ \frac{\partial \mathcal{L}}{\partial W_1} = \delta_1 \cdot x^T,\quad \frac{\partial \mathcal{L}}{\partial b_1} = \delta_1 \]
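These formulas can be sanity-checked numerically: the analytic gradient should agree with a central finite-difference estimate of the loss. A sketch for the \( W_2 \) gradient (the helper name `forward_loss` is my own):

```python
import numpy as np

def forward_loss(W1, b1, W2, b2, x):
    # Full forward pass; BCE reduces to -log(y_hat) when y = 1
    h = np.maximum(0, W1 @ x + b1)                 # hidden layer with ReLU
    y_pred = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))  # output with sigmoid
    return (-np.log(y_pred)).item()

x = np.array([[1.0], [0.5], [-1.5]])
W1 = np.array([[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]])
b1 = np.array([[0.1], [-0.2]])
W2 = np.array([[0.5, -1.0]])
b2 = np.array([[0.2]])

# Analytic gradient for W2 from the chain rule above
h = np.maximum(0, W1 @ x + b1)
y_pred = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))
delta2 = y_pred - 1.0           # dL/dz2 = y_hat - y, with y = 1
grad_W2 = delta2 @ h.T

# Central finite-difference check on each entry of W2
eps = 1e-6
for j in range(W2.shape[1]):
    Wp, Wm = W2.copy(), W2.copy()
    Wp[0, j] += eps
    Wm[0, j] -= eps
    numeric = (forward_loss(W1, b1, Wp, b2, x)
               - forward_loss(W1, b1, Wm, b2, x)) / (2 * eps)
    assert abs(numeric - grad_W2[0, j]) < 1e-5
```

This kind of gradient check is a standard way to catch sign errors or missing terms before trusting a hand-derived backward pass.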
5. Python Implementation
The following code performs a full forward and backward pass on the network and updates weights using gradient descent:
```python
import numpy as np

# Input and true label
x = np.array([[1.0], [0.5], [-1.5]])
y_true = 1

# Initial parameters
W1 = np.array([[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]])
b1 = np.array([[0.1], [-0.2]])
W2 = np.array([[0.5, -1.0]])
b2 = np.array([[0.2]])
lr = 0.1  # learning rate

# Forward pass
z1 = W1 @ x + b1
h = np.maximum(0, z1)            # ReLU
z2 = W2 @ h + b2
y_pred = 1 / (1 + np.exp(-z2))   # sigmoid
loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Backward pass (chain rule, output to input)
dL_dz2 = -(y_true / y_pred) * (y_pred * (1 - y_pred))  # for y_true = 1 this equals y_pred - 1
dL_dW2 = dL_dz2 @ h.T
dL_db2 = dL_dz2
dL_dh = W2.T @ dL_dz2
dL_dz1 = dL_dh * (z1 > 0)        # ReLU derivative masks inactive units
dL_dW1 = dL_dz1 @ x.T
dL_db1 = dL_dz1

# Gradient descent update
W2 -= lr * dL_dW2
b2 -= lr * dL_db2
W1 -= lr * dL_dW1
b1 -= lr * dL_db1
```
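As a usage sketch, the same forward/backward step can be repeated to confirm that gradient descent drives the loss down over time (the 20-step loop and the `losses` list are my own additions; `dL_dz2` is written in its simplified form `y_pred - y_true`):

```python
import numpy as np

# Same data and initial parameters as above
x = np.array([[1.0], [0.5], [-1.5]])
y_true = 1
W1 = np.array([[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]])
b1 = np.array([[0.1], [-0.2]])
W2 = np.array([[0.5, -1.0]])
b2 = np.array([[0.2]])
lr = 0.1

losses = []
for step in range(20):
    # Forward pass
    z1 = W1 @ x + b1
    h = np.maximum(0, z1)
    z2 = W2 @ h + b2
    y_pred = 1 / (1 + np.exp(-z2))
    loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    losses.append(loss.item())
    # Backward pass
    dL_dz2 = y_pred - y_true
    dL_dW2 = dL_dz2 @ h.T
    dL_db2 = dL_dz2
    dL_dz1 = (W2.T @ dL_dz2) * (z1 > 0)
    dL_dW1 = dL_dz1 @ x.T
    dL_db1 = dL_dz1
    # Gradient descent update
    W2 = W2 - lr * dL_dW2
    b2 = b2 - lr * dL_db2
    W1 = W1 - lr * dL_dW1
    b1 = b1 - lr * dL_db1
```

Printing `losses` shows the cross-entropy shrinking from its initial value toward zero as the prediction moves toward the true label.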
6. Final Updated Parameters
| Parameter | Updated Value |
|---|---|
| \( W_1 \) | \( \begin{bmatrix} 0.2 & -0.4 & 0.1 \\ 0.623 & 0.262 & -0.385 \end{bmatrix} \) |
| \( b_1 \) | \( \begin{bmatrix} 0.1 \\ -0.277 \end{bmatrix} \) |
| \( W_2 \) | \( \begin{bmatrix} 0.5 & -0.892 \end{bmatrix} \) |
| \( b_2 \) | \( 0.277 \) |
| Loss (before update) | \( \mathcal{L} \approx 1.463 \) |
7. Conclusion
This example illustrates how a basic neural network performs forward and backward propagation. By following the chain rule through each layer, we calculate how much each parameter contributes to the output error. Backpropagation enables the network to learn from its mistakes by adjusting its weights and biases through gradient descent.
Once you understand this foundation, you're ready to explore deeper networks, regularization, batch training, and advanced optimizers like Adam and RMSprop. But remember, everything starts here—with a dot product, a ReLU, and a sigmoid.