You should understand that the output of a convolutional layer is a rank-4 tensor [B, H, W, C], where B is the batch size, (H, W) is the feature map size, and C is the number of channels.
First, the usual (per-activation) batchnorm:
It normalizes the activations of the previous layer over each mini-batch, i.e. it applies a transformation that keeps the mean activation close to 0 and the activation standard deviation close to 1.
mean = mean(t, axis=0)
stddev = stddev(t, axis=0)
for i in 0..B-1:
    out[i,:,:,:] = norm(t[i,:,:,:], mean, stddev)
The above code computes H*W*C means and H*W*C standard deviations, each taken across the B batch elements.
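The pseudocode above can be sketched in NumPy (the shapes and the 1e-5 epsilon are illustrative assumptions; a real batchnorm layer also has learnable scale and shift parameters, which are omitted here):

```python
import numpy as np

# Illustrative shapes: batch of 2, 4x4 feature maps, 3 channels.
B, H, W, C = 2, 4, 4, 3
rng = np.random.default_rng(0)
t = rng.standard_normal((B, H, W, C))

# Per-activation statistics: one mean/stddev per (H, W, C) position,
# each computed across the B batch elements only.
mean = t.mean(axis=0)    # shape (H, W, C)
stddev = t.std(axis=0)   # shape (H, W, C)

# Normalize; broadcasting applies the per-position stats to every
# batch element. The small epsilon avoids division by zero.
out = (t - mean) / (stddev + 1e-5)

print(mean.shape)  # (4, 4, 3) -> H*W*C separate means
```

Note that after this normalization, the mean over the batch axis is (approximately) zero at every position.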
Batchnorm in a convolutional layer:
A convolutional layer consists of filter weights that are shared across the input image. That is why it is reasonable to normalize the output in the same shared way, so that each output value is normalized by the mean and variance of B*H*W values, taken at different spatial locations.
mean = mean(t, axis=(0, 1, 2))
stddev = stddev(t, axis=(0, 1, 2))
for i in 0..B-1, x in 0..H-1, y in 0..W-1:
    out[i,x,y,:] = norm(t[i,x,y,:], mean, stddev)
In the above code, there are only C means and standard deviations, and each one of them is computed over B*H*W values. The "effective mini-batch" differs between the two cases only in the choice of axes.
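The per-channel variant can be sketched the same way (again a minimal NumPy sketch with illustrative shapes; scale/shift parameters and running statistics are omitted):

```python
import numpy as np

# Same illustrative shapes as before.
B, H, W, C = 2, 4, 4, 3
rng = np.random.default_rng(0)
t = rng.standard_normal((B, H, W, C))

# Per-channel statistics: one mean/stddev per channel,
# each computed across all B*H*W positions.
mean = t.mean(axis=(0, 1, 2))    # shape (C,)
stddev = t.std(axis=(0, 1, 2))   # shape (C,)

# Broadcasting applies each channel's stats to every
# batch element and spatial location.
out = (t - mean) / (stddev + 1e-5)

print(mean.shape)  # (3,) -> only C means
```

This per-channel behavior is what framework implementations typically do for NHWC conv outputs; for example, `tf.keras.layers.BatchNormalization` defaults to `axis=-1`, which normalizes over all remaining axes.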
Hope this answer helps.