I have an assignment that involves introducing L2 regularization to a network with one hidden ReLU layer. I wonder how to introduce it properly so that ALL weights are penalized, not only the weights of the output layer.
Code for the network without regularization is at the bottom of the post (code to actually run the training is out of the scope of the question).
The obvious way of introducing the L2 penalty is to replace the loss calculation with something like this (with beta = 0.01):

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(out_layer, tf_train_labels)
    + 0.01 * tf.nn.l2_loss(out_weights))
But in that case, it only takes into account the output layer's weights. I am not sure how to properly penalize the weights that feed INTO the hidden ReLU layer. Is that needed at all, or will penalizing the output layer somehow keep the hidden weights in check as well?
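For what it's worth, here is a sketch in plain NumPy of what I imagine the full penalty would compute. tf.nn.l2_loss(t) is documented as sum(t ** 2) / 2, so penalizing all weights should just mean summing that quantity over every weight matrix. The matrix shapes, the beta value, and the placeholder data-loss term are all made up for illustration:

```python
import numpy as np

def l2_loss(w):
    # Mirrors tf.nn.l2_loss: half the sum of squared entries.
    return np.sum(w ** 2) / 2.0

# Hypothetical weight matrices for a 4-3-2 network (input -> hidden -> output).
hidden_weights = np.ones((4, 3))
out_weights = np.ones((3, 2))

beta = 0.01       # regularization strength, chosen arbitrarily here
data_loss = 1.5   # stand-in for the mean softmax cross-entropy term

# Penalize BOTH weight matrices, not just the output layer's.
penalty = beta * (l2_loss(hidden_weights) + l2_loss(out_weights))
total_loss = data_loss + penalty
print(total_loss)  # 1.5 + 0.01 * (12/2 + 6/2) = 1.59
```

In the TensorFlow graph this would presumably correspond to adding `0.01 * (tf.nn.l2_loss(hidden_weights) + tf.nn.l2_loss(out_weights))` to the cross-entropy term, but I'd like confirmation that this is the right approach.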