I successfully built and train the network and introduced the L2 regularization on all weights and biases. Right now I am trying out the dropout for hidden layer in order to improve generalization. I wonder, does it makes sense to both introduce the L2 regularization into the hidden layer and dropout on that same layer? If so, how to do this properly?

During dropout, we literally switch off half of the activations of the hidden layer and double the amount outputted by the rest of the neurons. While using the L2 we compute the L2 norm on all hidden weights. But I am not sure how to compute L2 in case we use dropout. We switch off some activations, shouldn't we remove the weights which are 'not used' now from the L2 calculation? Any references on that matter will be useful, I haven't found any info.

Just in case you are interested, my code for ANN with L2 regularization is below:

#for NeuralNetwork model code is below

#We will use SGD for training to save time. Code is from Assignment 2

#beta is the new parameter - controls the level of regularization. Default is 0.01

#but feel free to play with it

#notice, we introduce L2 for both biases and weights of all layers

beta = 0.01

#building tensorflow graph

graph = tf.Graph()

with graph.as_default():

Input data. For the training data, we use a placeholder that will be fed at run time with a training minibatch.

tf_train_dataset = tf.placeholder(tf.float32,

shape=(batch_size, image_size * image_size))

tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))

tf_valid_dataset = tf.constant(valid_dataset)

tf_test_dataset = tf.constant(test_dataset)

Now let's build our new hidden layer that's how many hidden neurons we want

num_hidden_neurons = 1024

it's weights

hidden_weights = tf.Variable(

tf.truncated_normal([image_size * image_size, num_hidden_neurons]))

hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

Now the layer itself. It multiplies data by weights, adds biases

and takes ReLU over the result

hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

time to go for an output linear layer

out weights connect hidden neurons to output labels

biases are added to output labels

out_weights = tf.Variable(

tf.truncated_normal([num_hidden_neurons, num_labels]))

out_biases = tf.Variable(tf.zeros([num_labels]))

compute output

out_layer = tf.matmul(hidden_layer,out_weights) + out_biases

#our real output is a softmax of prior result

#and we also compute its cross-entropy to get our loss

#Notice - we introduce our L2 here

loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(

out_layer, tf_train_labels) +

beta*tf.nn.l2_loss(hidden_weights) +

beta*tf.nn.l2_loss(hidden_biases) +

beta*tf.nn.l2_loss(out_weights) +

beta*tf.nn.l2_loss(out_biases)))

#now we just minimize this loss to actually train the network

optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

#nice, now let's calculate the predictions on each dataset for evaluating the

#performance so far

# Predictions for the training, validation, and test data.

train_prediction = tf.nn.softmax(out_layer)

valid_relu = tf.nn.relu( tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)

valid_prediction = tf.nn.softmax( tf.matmul(valid_relu, out_weights) + out_biases)

test_relu = tf.nn.relu( tf.matmul( tf_test_dataset, hidden_weights) + hidden_biases)

test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)

#now is the actual training on the ANN we built

#we will run it for some number of steps and evaluate the progress after

#every 500 steps

#number of steps we will train our ANN

num_steps = 3001

#actual training

with tf.Session(graph=graph) as session:

tf.initialize_all_variables().run()

print("Initialized")

for step in range(num_steps):

# Pick an offset within the training data, which has been randomized.

# Note: we could use better randomization across epochs.

offset = (step * batch_size) % (train_labels.shape[0] - batch_size)

# Generate a minibatch.

batch_data = train_dataset[offset:(offset + batch_size), :]

batch_labels = train_labels[offset:(offset + batch_size), :]

# Prepare a dictionary telling the session where to feed the minibatch.

# The key of the dictionary is the placeholder node of the graph to be fed,

# and the value is the numpy array to feed to it.

feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}

_, l, predictions = session.run(

[optimizer, loss, train_prediction], feed_dict=feed_dict)

if (step % 500 == 0):

print("Minibatch loss at step %d: %f" % (step, l))

print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))

print("Validation accuracy: %.1f%%" % accuracy(

valid_prediction.eval(), valid_labels))

print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))