The output of a **convolutional layer** is a rank-4 tensor of shape [B, H, W, C], where B is the batch size, (H, W) is the feature map size, and C is the number of channels.

**You can use batchnorm:**

It normalizes the activations of the previous layer for each batch, which means it applies a transformation that keeps the mean activation close to 0 and the activation standard deviation close to 1.

**For example:**

mean = mean(t, axis=0)

stddev = stddev(t, axis=0)

for i in 0..B-1:

out[i,:,:,:] = norm(t[i,:,:,:], mean, stddev)

The above code computes **H*W*C means** and **H*W*C standard deviations**, each of them across the B elements of the batch.
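The pseudocode above can be sketched in NumPy roughly as follows. This is only an illustration: the function name `batchnorm_per_activation`, the `eps` value, and the random input are assumptions made for the example, and the learnable scale and shift that real batchnorm layers apply afterwards are left out.

```python
import numpy as np

def batchnorm_per_activation(t, eps=1e-5):
    """Normalize every one of the H*W*C activations across the batch axis."""
    mean = t.mean(axis=0, keepdims=True)  # shape (1, H, W, C)
    var = t.var(axis=0, keepdims=True)    # shape (1, H, W, C)
    # eps is a small constant (assumed value) that avoids division by zero
    out = (t - mean) / np.sqrt(var + eps)
    return out, mean, var

# Illustrative input of shape [B, H, W, C]
rng = np.random.default_rng(0)
t_dense = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 3))

out_dense, mean_dense, var_dense = batchnorm_per_activation(t_dense)
print(mean_dense.size)  # number of means: H * W * C
```

Broadcasting takes the place of the explicit loop over `i`: the statistics have a leading axis of size 1, so they are applied to every element of the batch.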

**Batchnorm in convolutional layer:**

A convolutional layer consists of filters whose weights are shared across the input image. That's why it's reasonable to normalize the output in the same way, so that each output channel is normalized with the mean and variance of B*H*W values, taken over the batch and over all locations.

**For example:**

mean = mean(t, axis=(0, 1, 2))

stddev = stddev(t, axis=(0, 1, 2))

for i in 0..B-1, x in 0..H-1, y in 0..W-1:

out[i,x,y,:] = norm(t[i,x,y,:], mean, stddev)

In the above code, there are only C means and standard deviations, and each one of them is computed over B*H*W values. This is what the term "effective mini-batch" refers to: the difference between the two variants is only in axis selection.
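A minimal NumPy sketch of the convolutional variant, under the same caveats: `batchnorm_conv`, `EPS`, and the random input are names and values assumed for illustration, and the learnable scale and shift are omitted. It also checks the "effective mini-batch" reading by flattening the tensor to B*H*W rows of C channels and normalizing over the rows.

```python
import numpy as np

EPS = 1e-5  # assumed small constant to avoid division by zero

def batchnorm_conv(t, eps=EPS):
    """Normalize each channel using statistics over batch and spatial axes."""
    mean = t.mean(axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, C)
    var = t.var(axis=(0, 1, 2), keepdims=True)    # shape (1, 1, 1, C)
    out = (t - mean) / np.sqrt(var + eps)
    return out, mean, var

# Illustrative input of shape [B, H, W, C]
B, H, W, C = 8, 4, 4, 3
rng = np.random.default_rng(0)
t_conv = rng.normal(loc=3.0, scale=2.0, size=(B, H, W, C))

out_conv, mean_conv, var_conv = batchnorm_conv(t_conv)

# "Effective mini-batch" view: B*H*W samples with C features each,
# normalized over axis 0 only.
flat = t_conv.reshape(B * H * W, C)
flat_out = (flat - flat.mean(axis=0)) / np.sqrt(flat.var(axis=0) + EPS)
same_result = np.allclose(out_conv, flat_out.reshape(B, H, W, C))

print(mean_conv.size)  # number of means: C
print(same_result)     # whether both views agree
```

Compared with normalizing over `axis=0` alone, only the axes passed to `mean` and `var` change; the normalization formula itself is identical.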

Hope this answer helps.