From Linear Limits to Nonlinear Power: Why Activation Functions Matter in Neural Networks
One of the most pivotal reasons behind the success of deep learning lies not in its depth alone, but in the introduction of non-linear activation functions between layers. This article explores fundamental questions about the role and necessity of non-linearity in neural networks, using ReLU, sigmoid, and tanh as primary examples. We also examine how non-linearity affects gradient computation through the lens of the Jacobian matrix.
1. Why Can't We Just Stack Linear Layers?
Suppose we have a series of layers in a neural network where each transformation is linear. That is, each layer performs:
z = W @ x + b
Now, suppose we stack multiple such layers:
y = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3
This entire operation can be collapsed into a single linear transformation:
\[ y = W'x + b' \]
The problem? This is still just a linear function. It doesn’t matter how many layers we stack — we’re still mapping straight lines to straight lines. In mathematical terms, we have not increased the function space the network can represent.
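This collapse is easy to verify numerically. Below is a minimal NumPy sketch (the shapes and random weights are made up for illustration): three stacked linear layers produce exactly the same output as a single linear layer with W' = W3 W2 W1 and b' = W3 W2 b1 + W3 b2 + b3.

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary layer shapes: 3 -> 4 -> 5 -> 2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(5, 4)), rng.normal(size=5)
W3, b3 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

# Three stacked linear layers
y_stacked = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# The equivalent single linear layer
W_prime = W3 @ W2 @ W1
b_prime = W3 @ W2 @ b1 + W3 @ b2 + b3
y_single = W_prime @ x + b_prime

print(np.allclose(y_stacked, y_single))  # True
```

No matter how many layers we add, the composition stays in the same family of affine maps.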
Such a network is incapable of modeling relationships that are nonlinear in nature, such as:
- Nonlinear classification boundaries (e.g., XOR problem)
- Complex visual features (edges, curves, textures)
- Compositional hierarchy in language or images
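The XOR case is concrete enough to demonstrate directly. The sketch below uses hand-picked weights (illustrative, not learned) for a two-layer ReLU network that computes XOR, something no single linear layer can do:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# All four XOR input pairs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-crafted weights: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])   # output: y = h1 - 2*h2

h = relu(X @ W1.T + b1)
y = h @ w2
print(y)  # XOR of each input pair: 0, 1, 1, 0
```

Remove the `relu` and the network reduces to a single affine map, which cannot output 0 at both (0, 0) and (1, 1) while outputting 1 at the mixed pairs.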
2. How Activation Functions Introduce Non-Linearity
To break free of this limitation, we insert a non-linear activation function \( f \) after each linear transformation:
\[ h = f(z) = f(Wx + b) \]
These non-linearities enable the model to approximate arbitrary continuous functions — a fact supported by the Universal Approximation Theorem.
2.1 ReLU (Rectified Linear Unit)
\[ f(x) = \max(0, x) \]
- Piecewise linear but not globally linear.
- Introduces a "kink" at \( x = 0 \), breaking the straight-line assumption.
- Efficient and widely used in deep networks due to simplicity and sparsity.
[Figure: ReLU plot. Sharp transition at 0, linear growth for positive inputs.]
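ReLU and its (sub)gradient are one-liners in NumPy. A minimal sketch:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # Subgradient: 1 where z > 0, else 0 (the value at z = 0 is a convention)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # values: 0, 0, 0, 0.5, 2
print(relu_grad(z))  # values: 0, 0, 0, 1, 1
```

Note that the gradient is exactly 1 on the active side, which is why ReLU avoids the shrinking gradients of the saturating activations below.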
2.2 Sigmoid
\[ f(x) = \frac{1}{1 + e^{-x}} \]
- S-shaped curve mapping real values to \( (0, 1) \).
- Useful in binary classification problems.
- Outputs are not zero-centered, which can slow gradient-based optimization.
- Saturates for large \( |x| \), where the derivative approaches zero, leading to vanishing gradients.
[Figure: Sigmoid plot. Saturates at both ends, making gradients small when far from 0.]
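The saturation is visible in the derivative, \( f'(x) = f(x)(1 - f(x)) \), which peaks at 0.25 and decays quickly in the tails. A short sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)  # maximum value 0.25, reached at z = 0

for z in [0.0, 2.0, 5.0, 10.0]:
    # Gradient shrinks rapidly as |z| grows (roughly 0.25, 0.10, 0.007, 5e-5)
    print(z, sigmoid_grad(z))
```

Chaining many such layers multiplies these small factors together, which is the mechanism behind vanishing gradients in deep sigmoid networks.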
2.3 Tanh (Hyperbolic Tangent)
\[ f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \tanh(x) \]
- S-shaped curve like sigmoid but zero-centered.
- Preferred in networks where zero-mean activation is helpful.
[Figure: Tanh plot. Similar saturation but centered around zero, which helps convergence.]
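Since \( \tanh'(x) = 1 - \tanh^2(x) \), the gradient at 0 is 1 (versus 0.25 for sigmoid), while the tails still saturate. A quick sketch:

```python
import numpy as np

def tanh_grad(z):
    # Derivative of tanh: 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

z = np.array([-2.0, 0.0, 2.0])
print(np.tanh(z))    # zero-centered outputs in (-1, 1)
print(tanh_grad(z))  # gradient largest at 0, shrinking toward the tails
```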
3. Gradient Flow and the Jacobian Matrix
When applying activation functions elementwise, the gradient of the output \( \mathbf{h} = f(\mathbf{z}) \) with respect to \( \mathbf{z} \) is captured by the Jacobian matrix:
\[ \frac{\partial h_i}{\partial z_j} = \begin{cases} f'(z_i), & \text{if } i = j \\ 0, & \text{if } i \ne j \end{cases} \]
This results in a diagonal matrix:
\[ \frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \text{diag}(f'(z_1), f'(z_2), \ldots, f'(z_n)) \]
Understanding this Jacobian is essential for backpropagation, as it determines how the loss gradient flows back through the activation function and affects earlier layers.
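The diagonal structure can be checked numerically. The sketch below (using tanh as the example activation) compares a finite-difference Jacobian against \( \text{diag}(f'(z_1), \ldots, f'(z_n)) \):

```python
import numpy as np

z = np.array([0.5, -1.0, 2.0])
eps = 1e-6

# Central finite-difference Jacobian of h = tanh(z)
J = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J[:, j] = (np.tanh(z + dz) - np.tanh(z - dz)) / (2 * eps)

# Analytic Jacobian: diagonal with entries tanh'(z_i) = 1 - tanh(z_i)^2
analytic = np.diag(1.0 - np.tanh(z) ** 2)
print(np.allclose(J, analytic, atol=1e-8))  # True
```

In practice, frameworks never materialize this diagonal matrix; backpropagation applies it as an elementwise product, `grad_z = grad_h * f_prime(z)`.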
4. Summary: Why Non-Linearity is Indispensable
In deep learning, depth alone does not bring power. Non-linearity is the true enabler of expressiveness. Without it:
- Deep networks are just shallow networks in disguise.
- They can't solve problems with curved or irregular boundaries.
- No compositional or hierarchical abstraction is possible.
With non-linear activation functions like ReLU, tanh, and sigmoid:
- Neural networks can model complex patterns in data.
- Gradient flow becomes meaningful due to differentiable non-linearities.
- The expressiveness of the function space grows exponentially.
Further Exploration
- Visualize the derivative of each activation function to understand gradient strength.
- Experiment with deeper networks with and without non-linearities to see the difference.
- Explore newer activations like Leaky ReLU, GELU, or Swish.
Non-linearity is not a "hack" — it’s the mathematical heart of why deep learning works.
Footnote:
1 In Python, the @ symbol is the matrix multiplication operator. It corresponds to the matrix product of linear algebra (for 1-D arrays it reduces to the dot product). For example, in NumPy:
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
C = A @ B  # Matrix multiplication: C[i, j] = sum over k of A[i, k] * B[k, j]
This is mathematically equivalent to:
\[ C = AB = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} \cdot \begin{bmatrix} 5 & 6 \\ 7 & 8 \\ \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \\ \end{bmatrix} \]
This operator is essential in neural networks, where each layer performs a linear transformation via matrix multiplication followed by a non-linear activation.