Explore Courses Blog Tutorials Interview Questions
0 votes
in AI and Deep Learning by (50.2k points)

How is the convolution operation carried out when multiple channels are present at the input layer? (e.g. RGB)

After doing some reading on the architecture/implementation of a CNN I understand that each neuron in a feature map references NxM pixels of an image as defined by the kernel size. Each pixel is then factored by the feature maps learned NxM weight set (the kernel/filter), summed, and input into an activation function. For a simple greyscale image, I imagine the operation would be something to adhere to the following pseudo-code:

for i in range(0, image_width-kernel_width+1): 

for j in range(0, image_height-kernel_height+1): 

for x in range(0, kernel_width): 

for y in range(0, kernel_height): 

sum += kernel[x,y] * image[i+x,j+y] 

feature_map[i,j] = act_func(sum) 

sum = 0.0

However, I don't understand how to extend this model to handle multiple channels. Are three separate weight sets required per feature map, shared between each color?

Referencing this tutorial's 'Shared Weights' section: Each neuron in a feature map references layer m-1 with colors being referenced from separate neurons. I don't understand the relationship they are expressing here. Are the neurons kernels or pixels and why do they reference separate parts of the image?

Based on my example, it would seem that a single neurons kernel is exclusive to a particular region in an image. Why have they split the RGB component over several regions?

1 Answer

0 votes
by (108k points)

Answering your first question, in such a case, you have one 2D kernel per input channel (plane).

So you perform each convolution (2D Input, 2D kernel) separately and you sum the contributions which give the final output feature map.

Referring to your second question, yes they share the same weights between each color.

If you consider a given output feature map, you have 3 x 2D kernels (i.e one kernel per input channel). Each 2D kernel shares the same weights along the whole input channel (R, G, or B here).

So the whole convolutional layer is a 4D-tensor (nb. input planes x nb. output planes x kernel width x kernel height).

Why have they split the RGB component over several regions?

They split so that they can have separate input plane and weights.

Interested in learning Artificial Intelligence? Learn more from this AI Course!

Browse Categories