Batch normalization has become a key technique for training deep neural networks. But what exactly does it do and why is it so important? In this post, we’ll explain batch normalization using an analogy, walk through the technical details, and cover the math behind it.

The Running Analogy Imagine you are coaching a team of runners training for a marathon. Each day your runners complete workouts and you record their interval times.

You notice their times vary significantly day-to-day. On hot 30°C days, they run slowly. But on cooler 15°C days, they run much quicker intervals.

The problem is these time variations aren’t from fitness gains, but external factors like weather. When you analyze the data, the intervals across days don’t align properly. This makes it hard to track real improvements.

This is the same problem batch normalization aims to solve! Each batch of data fed into a neural network can have a different distribution. Batch normalization normalizes the data so that each batch has the same distribution as training progresses.

## How Batch Normalization Works

In a neural network, batch normalization normalizes the output of a previous activation layer for each batch. It does this by:

Calculating the mean and variance of a batch

Normalizing using the formula: (x - mean) / sqrt(variance + epsilon)

Scaling by a learnable parameter gamma

Shifting by a learnable parameter beta

This process stabilizes the distribution across batches. The network learns the optimal scaling (gamma) and shifting (beta) during training. Applying batch normalization helps the network learn faster and perform better.

## The Math Behind Batch Normalization

Here is the general mathematical formulation for batch normalization:

Given a batch B = {x1, x2, ..., xm} of m examples

Calculate mean μB = 1/m * Σmi=1 xi

Calculate variance σB^2 = 1/m * Σmi=1(xi - μB)2

Normalize each example xi using: xi_norm = (xi - μB) / sqrt(σB^2 + epsilon)

Scale and shift: yi = γ * xi_norm + β

Where γ and β are learned parameters that scale and shift the normalized value. ε is a small constant for numerical stability.

This sequence of operations reduces internal covariate shift within the network and makes the training process more robust. Batch normalization has become a standard technique used by state-of-the-art neural networks.

## Conclusion

Using the running analogy, technical details, and math, we've covered how batch normalization works and why it's so useful. Normalizing activations enables faster training and better performance. Now you have an intuitive understanding of this key neural network technique!