
I am new to convolutional neural networks and only have a basic idea of feature maps and how convolution is applied to images to extract features. I would be glad to know some details about applying batch normalization to a CNN.

I read this paper https://arxiv.org/pdf/1502.03167v3.pdf and could understand the BN algorithm as applied to generic data, but at the end they mention that a slight modification is required when it is applied to a CNN:

For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m · pq. We learn a pair of parameters γ(k) and β(k) per feature map, rather than per activation. Alg. 2 is modified similarly so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.

I am totally confused when they say "so that different elements of the same feature map, at different locations, are normalized in the same way".

I know what feature maps mean and different elements are the weights in every feature map. But I could not understand what location or spatial location means.

I also could not understand the following sentence at all: "In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations".

I would be glad if someone could elaborate and explain this to me in much simpler terms.

by (33.1k points)

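The key point is that the output of a convolutional layer is a rank-4 tensor of shape [B, H, W, C], where B is the batch size, (H, W) is the feature map size, and C is the number of channels (feature maps). A "spatial location" is simply an index (x, y) with 0 <= x < H and 0 <= y < W.

Standard (per-activation) batchnorm: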

It normalizes the activations of the previous layer at each batch, which means it applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.

For example:

    mean = mean(t, axis=0)
    stddev = stddev(t, axis=0)
    for i in 0..B-1:
        out[i,:,:,:] = norm(t[i,:,:,:], mean, stddev)

The above code computes H*W*C means and H*W*C standard deviations, each taken across the B elements of the batch.
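As a rough illustration, the per-activation pseudocode above could be written in NumPy along the following lines. This is only a sketch: the function name `batchnorm_per_activation`, the `eps` term, and the random input are assumptions made for the example, and the [B, H, W, C] layout follows the convention used in this answer.

```python
import numpy as np

def batchnorm_per_activation(t, eps=1e-5):
    """Normalize every (x, y, channel) activation over the batch axis only.

    eps is an assumed small constant to avoid division by zero.
    """
    mean = t.mean(axis=0)  # shape (H, W, C): one mean per activation
    var = t.var(axis=0)    # shape (H, W, C): one variance per activation
    out = (t - mean) / np.sqrt(var + eps)  # broadcasts over the batch axis
    return out, mean, var

# Hypothetical input: B=8 examples, 4x4 feature maps, C=3 channels.
rng = np.random.default_rng(0)
t = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 3))

out, mean, var = batchnorm_per_activation(t)
print(mean.shape)  # (4, 4, 3), i.e. H*W*C separate means
```

Each of the H*W*C statistics here is estimated from only B values, which is the case the paper's Alg. 1 describes before the convolutional modification.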

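Batchnorm in a convolutional layer:

A convolutional layer consists of filter weights that are shared across the whole input image, so the same filter produces every value in a given feature map. That's why it's reasonable to normalize the output in the same shared way: every value in a feature map is normalized with a single mean and variance, computed over all B*H*W values of that map, i.e. across the batch and across all spatial locations.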

For example:

    mean = mean(t, axis=(0, 1, 2))
    stddev = stddev(t, axis=(0, 1, 2))
    for i in 0..B-1, x in 0..H-1, y in 0..W-1:
        out[i,x,y,:] = norm(t[i,x,y,:], mean, stddev)

In the above code, there are only C means and C standard deviations, and each one of them is computed over B*H*W values. That is what the paper means by the "effective mini-batch" of size m′ = m · pq (here B*H*W). The difference between the two variants is only in the choice of axes.
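To make the convolutional variant concrete, here is a minimal NumPy sketch of the training-time transform with one (gamma, beta) pair per feature map, as the quoted paragraph describes. The function name `batchnorm_conv`, the `eps` value, and the random input are assumptions for illustration; running averages for inference are left out.

```python
import numpy as np

def batchnorm_conv(t, gamma, beta, eps=1e-5):
    """Normalize a [B, H, W, C] tensor per channel (feature map).

    gamma and beta have shape (C,): one learned pair per feature map.
    eps is an assumed small constant for numerical stability.
    """
    mean = t.mean(axis=(0, 1, 2))  # shape (C,): over batch and locations
    var = t.var(axis=(0, 1, 2))    # shape (C,)
    t_hat = (t - mean) / np.sqrt(var + eps)  # broadcasts over the last axis
    return gamma * t_hat + beta, mean, var

# Hypothetical sizes: batch of 8, 4x4 feature maps, 3 channels.
B, H, W, C = 8, 4, 4, 3
rng = np.random.default_rng(0)
t = rng.normal(loc=3.0, scale=2.0, size=(B, H, W, C))
gamma, beta = np.ones(C), np.zeros(C)  # identity scale and shift

out, mean, var = batchnorm_conv(t, gamma, beta)
print(mean.shape)      # (3,), i.e. only C means
print(t[..., 0].size)  # 128 = B*H*W values behind each mean
```

Because the statistics have shape (C,), every location (x, y) of a given feature map is shifted and scaled by the same numbers, which is the "normalized in the same way" property from the paper.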