The essentials of activations, gradients, and batch normalization

Elo Hoss
Oct 14, 2024

Hello everyone. As you know, I’m watching the lectures by Andrej Karpathy on building neural networks from scratch.

In this post, I’ll share the highlights from lecture 4, on activations, gradients, and batch normalization. These topics are hugely important and often hard to grasp, and it’s hard to find a better teacher than Karpathy, so do watch his lecture!

Let’s start.

The problem of initialization

Before the arrival of nice, easy-to-use deep learning libraries such as PyTorch and TensorFlow, things were much harder for those implementing neural networks.

That is because there are many details to account for if training is to go as smoothly as possible, and much of this comes down to properly initializing the network.

That means: instead of initializing the model’s parameters with arbitrary random values, choose the initialization in a way that guides the learning process, and, in addition, apply normalization techniques. Both contribute heavily to reaching good performance. Previously, all of this was done by hand, so imagine the burden. We will be going into these details in this post.

Scaling preactivations

So, let’s go. If we look at the loss curve of a poorly initialized network, it looks a bit like a hockey stick: extremely high at the start, then dropping abruptly.

The ‘hockey stick’ loss.
The printed loss values across training iterations. Notice that the loss starts extremely high (around 27), then drops to something reasonable, around 2.8.

If you remember our predict-the-next-character model, that’s a bit weird. We would expect an untrained network to assign roughly the same probability to every character in the vocabulary, since all of them are initially equally likely (1/27 each, with 27 characters).

If we instead initialize the network so that it starts out predicting these uniform probabilities and recalculate the first loss, it drops to roughly 3.3, which is just -log(1/27). The random initialization in the example, by contrast, produces logits with extreme values, messing up the predictions.

To avoid this, we can scale (multiply) the weight matrix of the output layer by a small factor, e.g. 0.01, resulting in smaller weights and therefore smaller logits. By the same logic, its bias can be initialized as a vector of zeros (a small sketch follows below).

  • We don’t set the weights themselves to exactly zero, just very small. All-zero weights can cause symmetry problems, with neurons computing identical outputs and receiving identical gradients, so a little randomness (entropy) is preferred.
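
Here’s a minimal sketch of that fix in PyTorch. The layer sizes and variable names are just illustrative, not taken from the lecture code:

```python
import torch

vocab_size = 27   # our 27-character vocabulary
n_hidden = 200    # hypothetical hidden-layer size

# Output layer: small random weights and a zero bias, so the initial
# logits sit near zero and the softmax is roughly uniform.
W_out = torch.randn(n_hidden, vocab_size) * 0.01
b_out = torch.zeros(vocab_size)

# Sanity check: the loss we expect from a uniform prediction over 27 classes.
print(-torch.log(torch.tensor(1.0 / vocab_size)))  # tensor(3.2958)
```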

Ok, nice! But that isn’t all. So far, we’ve only fixed the scale of the logits at the output. What if the distribution of the preactivations inside the network has a high variance as data flows through it? If their range is too broad, this causes issues when applying nonlinearities. Let’s see why.

Nonlinearity saturation

Nonlinear functions, also known as activation functions, are essential to model complex relations between inputs and outputs. They take the outputs from previous layers and squash them to a smaller range. In the case of tanh, that range is [-1, 1]:

The tanh activation function. Source: Tech-AI-Math.

Now, if the range of inputs is too broad, that transformation won’t work well because the nonlinearity will have to squash a large range of values into a much smaller one. This causes many inputs to be mapped to the extremes of the nonlinearity, either close to -1 (for smaller ones) or 1 (for larger ones).

The distribution of preactivations without proper initialization. Note that it spans roughly [-15, 15]: a very high variance.
The output of tanh after processing that range of values. Most counts are mapped to the extremes (close to -1 or 1), which is a huge issue!

When this happens, we say the nonlinearity is saturated.

  • Think of it like this: saturation is like adjusting a dimmer switch. If the switch is set too high or low, the light stays fully on or off, no matter how much more you twist.
  • So, if the input values have a broad distribution, they can push the function to these extremes, causing outputs to stick at -1 or 1.

These extremes in the nonlinear function are typically flat regions, meaning the gradients there are close to zero. If the gradients are zero, variations in input values no longer affect the loss.

  • The derivative of tanh(x) is 1 - tanh²(x). So, if tanh(x) is close to 1 or -1, the gradient will be approximately zero.
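
A quick toy check of that claim (not from the lecture code; the spread of the fake preactivations is just chosen to mimic the broad histogram above):

```python
import torch

# Fake preactivations with a broad spread, like the [-15, 15] histogram above.
h_pre = torch.randn(10_000) * 5.0
t = torch.tanh(h_pre)

# Fraction of outputs pushed into the flat regions of tanh.
saturated = (t.abs() > 0.99).float().mean().item()
print(f"outputs with |tanh| > 0.99: {saturated:.0%}")

# Local gradient of tanh is 1 - tanh(x)^2: essentially zero for saturated outputs.
print((1 - t**2)[t.abs() > 0.99].max())  # at most ~0.02: gradients barely flow
```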

No gradients mean no weight updates, leading to a dead neuron: a neuron that will never learn.

Not only that, but if the gradient is zero in one neuron, it will stop any gradient flowing through that unit, cutting off the learning process for other neurons connected to it. This problem, known as vanishing gradients, is a nightmare for backpropagation.

  • Dead neurons can also appear during optimization, not only from bad initialization. For example, if the learning rate is too high, some neurons may receive such large updates that they get pushed too far off the data manifold.
  • This knocks them out of the learning space, meaning no input examples will activate those neurons anymore, almost like a permanent brain injury.

Let’s look at this other view from the tanh output for the inputs we’ve just seen:

Columns show a single neuron’s outputs for each input. If an entire column is white, this represents a dead neuron.

To solve this, we can do the same as we did for the output layer: if we push the preactivations toward zero by scaling down the weight matrices of the hidden layers and setting their bias vectors to 0 (or close to it, to add some entropy), we obtain much better histograms.

  • With less extreme inputs, tanh will have a smaller range to squash, leading to less saturation and, hence, fewer gradient kills.
  • I think you’ve got the point: we have to apply this scaling to the whole sandwich of linear + nonlinear layers, making sure that each layer’s input lands in an adequate range for the layer that follows (see the quick check below).

An example of the preactivations after scaling. The [-15, 15] range became roughly [-3, 3], a much healthier spread of preactivations.
There are now very few extreme outputs.
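
Here’s a quick numeric check of that manual scaling, again with illustrative sizes rather than the lecture’s exact ones:

```python
import torch

fan_in = 200
x = torch.randn(32, fan_in)                   # a batch of roughly unit-variance inputs

W_raw    = torch.randn(fan_in, fan_in)        # unscaled hidden weights
W_scaled = torch.randn(fan_in, fan_in) * 0.1  # manually scaled-down weights

print(torch.tanh(x @ W_raw).std())     # close to 1: the tanh is heavily saturated
print(torch.tanh(x @ W_scaled).std())  # a healthier spread well inside (-1, 1)
```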

This improved initialization reduces the loss from the start. We can then spend more time on productive training rather than correcting bad initialization.

This is especially crucial for deep networks, where initialization errors stack up across layers (always visualize the training progress by plotting!).

Ok, but now we have a compounding problem: as each layer’s output becomes the next layer’s input, the appropriate scaling factors can vary from layer to layer.

  • Each layer in a network has different characteristics, such as the number of neurons and activation dynamics that influence how much variance should be passed forward.
  • Without careful tuning, one layer’s outputs might overwhelm the next or be too small to be useful, leading to inefficient learning or gradient issues.

So, how do we define proper scales, specific to each weight and bias across all layers, automatically? Let’s see.

Glorot and He initializations

The idea for both is to control the spread of preactivations to ensure that their standard deviation remains roughly 1 across layers. This helps the network maintain good training performance.

  • Think of it like this: the distributions of the preactivations are like a dough you have to spread. The same even thickness should be kept across layers, avoiding it being too thin or too thick.

For Glorot (Xavier) initialization, the weights are drawn from a normal or uniform distribution. In the normal case, the standard deviation of the weights is set to √(2 / (n_in + n_out)); in the uniform case, the range limit is √(6 / (n_in + n_out)), where n_in and n_out are the number of inputs and outputs of the layer. It works well with activation functions like tanh or sigmoid.

He initialization, on the other hand, was designed to address the needs of ReLU and similar activation functions because these non-linearities handle their inputs in a more aggressive way.

The Rectified Linear Unit (ReLU) activation function. Source: AILEPHANT.
  • For example, ReLU sets all negative values to zero, getting rid of half of the input distribution, which reduces the variance of activations a lot.
  • As a result, there’s a need to boost the weights to restore the variance and keep the standard deviation near 1.

This is where the gain factor comes in. It adjusts the weight initialization to compensate for the specific behavior of the activation function.

  • In the case of ReLU, the gain is √2. So He initialization will sample weights from a distribution with variance 2 / n_in, compensating for the fact that half of the activations are lost, helping to maintain the flow of information.
  • You can also use He initialization with other nonlinearities such as tanh, but the gain factor will be different, as it’s function-specific.

In PyTorch, you can find a table of gain factors for various nonlinearities.
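
In practice you rarely compute these factors by hand. Here’s a small sketch of the built-in PyTorch initializers, with illustrative layer sizes:

```python
import torch.nn as nn

layer_tanh = nn.Linear(200, 200)
layer_relu = nn.Linear(200, 200)

# Glorot/Xavier init for a tanh layer, using the tanh-specific gain (5/3).
nn.init.xavier_normal_(layer_tanh.weight, gain=nn.init.calculate_gain('tanh'))

# He/Kaiming init for a ReLU layer: variance 2 / fan_in.
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity='relu')

print(layer_tanh.weight.std(), layer_relu.weight.std())
```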

Batch normalization

Ok, we’re almost done. Everything we’ve discussed so far has focused on initialization. Now, how can we guarantee that the spread of the preactivations keeps this same nice shape during training?

This is where batch normalization comes in. The solution is pretty simple: after calculating the preactivations, we can simply standardize them to have a mean of zero and a standard deviation of 1!

So, batch norm consists of calculating the mean and standard deviation of the preactivations in the batch and then standardizing each by subtracting the mean and dividing by the standard deviation. That’s it.

  • Remember, we might have e.g. 100,000 training instances, but we will be processing them in batches of 32 for efficiency purposes.

However, forcing this exact distribution is a bit too rigid a constraint for the network. We still want to leave it some room to move the distribution around, with backpropagation as a guide, so we introduce scale and shift components.

The gain component (bngain; don’t confuse it with the initialization gain discussed earlier) scales the normalized preactivations, while the bias component (bnbias) applies an offset, producing the layer’s final output. They are initialized to 1 and 0, respectively.

The network then learns to optimize both during training, as they become part of the network’s set of parameters.
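
Here’s a minimal sketch of that forward pass, loosely following the naming used above (bngain, bnbias); the small epsilon is the usual guard against division by zero:

```python
import torch

n_hidden = 200
hpreact = torch.randn(32, n_hidden) * 3.0   # preactivations for a batch of 32

bngain = torch.ones(1, n_hidden)            # learnable scale, initialized to 1
bnbias = torch.zeros(1, n_hidden)           # learnable shift, initialized to 0

bnmean = hpreact.mean(0, keepdim=True)      # per-feature mean over the batch
bnstd = hpreact.std(0, keepdim=True)        # per-feature std over the batch
hpreact_bn = bngain * (hpreact - bnmean) / (bnstd + 1e-5) + bnbias

print(hpreact_bn.mean().item(), hpreact_bn.std().item())  # ~0 and ~1
```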

In practice, batch norm is applied right after a linear or convolutional layer, effectively controlling the scale of the preactivations. Note that the preactivations (and hence the logits) now become a function of the other examples in the randomly sampled batch.

  • This is actually good for training, as it acts like a regularizer: each example gets “jittered” by the statistics of whichever other examples land in its batch, and this extra noise (a bit like data augmentation of the input) makes the network less prone to overfitting.

Ok, but how can we feed a single example to a network that was trained with batch norm? One option is to compute the mean and standard deviation over the entire training set after training and use those at inference time. In practice, though, this isn’t what’s done.

Instead, two buffers, a running mean and a running standard deviation, are maintained on the side during training, updating these statistics across many mini-batches.

  • How much they are updated is controlled by a momentum parameter (a bit like a “learning rate” for the buffers). Both are then used at inference time, so you don’t have to compute anything manually.
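
And a sketch of how those running buffers are maintained during training; the buffer names and the momentum value are illustrative:

```python
import torch

n_hidden = 200
bnmean_running = torch.zeros(1, n_hidden)   # buffers, not trained by backprop
bnstd_running = torch.ones(1, n_hidden)
momentum = 0.001                            # how quickly the buffers track new batches

hpreact = torch.randn(32, n_hidden)         # preactivations of the current mini-batch
with torch.no_grad():
    bnmean_running = (1 - momentum) * bnmean_running + momentum * hpreact.mean(0, keepdim=True)
    bnstd_running = (1 - momentum) * bnstd_running + momentum * hpreact.std(0, keepdim=True)

# At inference time, a single example is normalized with the running statistics:
x_single = torch.randn(1, n_hidden)
x_bn = (x_single - bnmean_running) / (bnstd_running + 1e-5)
```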

Another detail: since standardization removes the effect of any bias added in the preceding linear layer, we usually turn that layer’s bias off and keep only the bnbias of the batch-norm layer.

Here’s a visual summary:

How a batchnorm layer works. A batch of features is the input. The mean and standard deviation are computed over the batch. The buffers (the “moving averages”) are then updated from the batch statistics and a momentum parameter. Each feature is then normalized, scaled, and shifted by the gain and the bias, generating the outputs. Source: Ketan Doshi.

And that’s all! See you next time.
