
Recently I started toying with neural networks. I was trying to implement an AND gate with TensorFlow. I am having trouble understanding when to use different cost and activation functions. This is a basic neural network with only input and output layers, with no hidden layers.

First I tried to implement it in this way. As you can see, this is a poor implementation, but I think it gets the job done, at least in some way. So, I tried only the real outputs, no one-hot true outputs. For the activation function I used a sigmoid, and for the cost function I used the squared error cost function (I think it's called that, correct me if I'm wrong).

I've tried using ReLU and Softmax as activation functions (with the same cost function) and they don't work. I figured out why they don't work. I also tried the sigmoid function with the cross-entropy cost function, and that doesn't work either.

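import tensorflow as tf
import numpy

# AND gate: the four input pairs and their expected outputs.
train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[0],[0],[0],[1]])

# TensorFlow 1.x graph API: placeholders are fed at run time.
x = tf.placeholder("float",[None, 2])
y = tf.placeholder("float",[None, 1])

W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1, 1]))

activation = tf.nn.sigmoid(tf.matmul(x, W)+b)
cost = tf.reduce_sum(tf.square(activation - y))/4   # mean squared error over 4 samples
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(cost)

# Replaces the deprecated tf.initialize_all_variables().
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)   # variables must be initialized before training
    for i in range(5000):
        train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})
    result = sess.run(activation, feed_dict={x:train_X})
    print(result)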


After 5000 iterations:

[[ 0.0031316 ]
 [ 0.12012422]
 [ 0.12012422]
 [ 0.85576665]]

1 Answer


Activation functions

There are many activation functions, each with different properties. Placed between two layers of a neural network, an activation function serves as a nonlinearity.

Without it, the two layers behave like a single layer, because composing two linear transformations yields just another linear transformation. Sigmoid and tanh were long the most common choices, but ReLU is now dominant because it does not saturate for positive inputs and is faster to compute.
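
To illustrate the point about layers collapsing, here is a minimal sketch in plain Python (no TensorFlow). The weights W1, b1, W2, b2 and the input x are made-up values chosen only for the demonstration. It applies two stacked layers with no activation in between and compares the result with a single layer built from their product.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights for two stacked layers with no activation in between.
W1 = [[1.0, 2.0], [0.0, -1.0]]
b1 = [0.5, -0.5]
W2 = [[2.0, 1.0]]
b2 = [1.0]

def two_layers(x):
    hidden = [h + c for h, c in zip(matvec(W1, x), b1)]
    return [o + c for o, c in zip(matvec(W2, hidden), b2)]

# The equivalent single layer: W = W2 * W1 and b = W2 * b1 + b2.
W = matmul(W2, W1)
b = [wb + c for wb, c in zip(matvec(W2, b1), b2)]

def one_layer(x):
    return [o + c for o, c in zip(matvec(W, x), b)]

x = [3.0, 4.0]
print(two_layers(x), one_layer(x))  # both print [19.5]
```

Putting any nonlinearity between the two layers breaks this equivalence, which is what lets a deeper network represent more than a single layer can.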

ReLU has a derivative of 1 for all positive inputs, so the gradient for those neurons passes through the activation unit unchanged and gradient descent is not slowed down.

The last layer of the network also needs an activation unit, and the right choice depends on the task.

For regression, you can use the sigmoid or tanh activation when you want the result to be bounded: sigmoid gives values between 0 and 1, and tanh gives values between -1 and 1.

For classification, you want exactly one of your outputs to be one and all the others zero. There is no differentiable way to achieve precisely that, so you can use a softmax to approximate it.
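
As a rough illustration of why this matters, the following plain-Python sketch compares the derivative of ReLU with the derivative of the sigmoid at a few positive inputs. The helper names are my own, and the derivative of ReLU at exactly 0 is taken as 0 by convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), at most 0.25 (reached at z = 0).
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z); taken as 0 at z == 0 by convention.
    return 1.0 if z > 0 else 0.0

for z in (0.5, 2.0, 10.0):
    print(z, relu_grad(z), sigmoid_grad(z))
```

The sigmoid derivative never exceeds 0.25 and shrinks toward zero as the input grows, so every sigmoid layer scales the backpropagated gradient down, while ReLU leaves it untouched for positive inputs.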
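
A minimal sketch of softmax in plain Python may help here. The input scores are arbitrary example values, and subtracting the maximum is a common numerical-stability trick rather than part of the definition.

```python
import math

def softmax(logits):
    # Subtracting the max avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5, -1.0])  # arbitrary example scores
print(probs)
```

The outputs are positive and sum to 1, with the largest score receiving the largest share, so they can be read as a smooth approximation of a one-hot vector.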

For example, the network in your question computes:

sigmoid(W1 * x1 + W2 * x2 + B)

W1 and W2 will always converge to the same value, because the output for (x1, x2) should be equal to the output of (x2, x1).

So the model that you are fitting is:

sigmoid(W * (x1 + x2) + B)
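
The symmetry argument can be checked numerically. This is a sketch in plain Python rather than TensorFlow, assuming the same setup as the question (zero initialization, mean squared error over the four samples, learning rate 0.1, 5000 steps of plain gradient descent); the variable names are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# AND gate training data, as in the question.
train_X = [(0, 0), (0, 1), (1, 0), (1, 1)]
train_Y = [0, 0, 0, 1]

w1 = w2 = b = 0.0   # zero initialization, as in the question
lr = 0.1

for _ in range(5000):
    g1 = g2 = gb = 0.0
    for (x1, x2), y in zip(train_X, train_Y):
        a = sigmoid(w1 * x1 + w2 * x2 + b)
        # Gradient of the mean squared error w.r.t. the pre-activation.
        delta = 2.0 * (a - y) * a * (1.0 - a) / len(train_X)
        g1 += delta * x1
        g2 += delta * x2
        gb += delta
    w1 -= lr * g1
    w2 -= lr * g2
    b -= lr * gb

outputs = [sigmoid(w1 * x1 + w2 * x2 + b) for x1, x2 in train_X]
print(w1, w2, b)
print(outputs)
```

Because the data treats x1 and x2 identically and both weights start at zero, they receive the same update at every step and never diverge, so the outputs for (0, 1) and (1, 0) stay equal, as they do in the results posted in the question.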

Your second example converged better because the softmax function is good at making precisely one output equal to 1 and all the others 0.

To choose which activation and cost functions to use, the following advice works for the majority of cases:

  1. For classification, use softmax for the last layer's nonlinearity and cross-entropy as a cost function.

  2. For regression, use sigmoid or tanh for the last layer's nonlinearity and squared error as a cost function.

  3. Use ReLU as a nonlinearity between layers.

  4. Use better optimizers (AdamOptimizer, AdagradOptimizer) instead of GradientDescentOptimizer, or use momentum for faster convergence.
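
As a sketch of the first recommendation applied to the AND gate, the following plain-Python code trains a two-class softmax output with cross-entropy loss, using one-hot targets. It does not use TensorFlow, and the learning rate, step count, and plain gradient descent (rather than Adam or momentum) are assumptions made to keep the example short.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

train_X = [(0, 0), (0, 1), (1, 0), (1, 1)]
train_Y = [(1, 0), (1, 0), (1, 0), (0, 1)]   # one-hot: (class 0, class 1)

W = [[0.0, 0.0], [0.0, 0.0]]   # W[i][k]: weight from input i to class k
b = [0.0, 0.0]
lr = 0.5                       # assumed learning rate

def logits_for(x):
    return [x[0] * W[0][k] + x[1] * W[1][k] + b[k] for k in range(2)]

for _ in range(2000):
    gW = [[0.0, 0.0], [0.0, 0.0]]
    gb = [0.0, 0.0]
    for x, y in zip(train_X, train_Y):
        p = softmax(logits_for(x))
        for k in range(2):
            # Gradient of mean cross-entropy w.r.t. logit k is (p_k - y_k) / N.
            delta = (p[k] - y[k]) / len(train_X)
            gW[0][k] += delta * x[0]
            gW[1][k] += delta * x[1]
            gb[k] += delta
    for k in range(2):
        W[0][k] -= lr * gW[0][k]
        W[1][k] -= lr * gW[1][k]
        b[k] -= lr * gb[k]

def predict(x):
    p = softmax(logits_for(x))
    return p.index(max(p))

predictions = [predict(x) for x in train_X]
print(predictions)  # [0, 0, 0, 1]
```

With softmax and cross-entropy the gradient with respect to each logit is simply the predicted probability minus the target, with no extra sigmoid-derivative factor to shrink it, which is one reason this pairing tends to train well for classification.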
