
When I compute cross-entropy with a sigmoid activation function, these two losses give different results:

loss1 = -tf.reduce_sum(p*tf.log(q), 1)

loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q),1)

But they are the same with a softmax activation function.

Following is the sample code:

import tensorflow as tf

sess = tf.InteractiveSession()

p = tf.placeholder(tf.float32, shape=[None, 5])

logit_q = tf.placeholder(tf.float32, shape=[None, 5])

q = tf.nn.sigmoid(logit_q)

sess.run(tf.global_variables_initializer())

feed_dict = {p: [[0, 0, 0, 1, 0], [1,0,0,0,0]], logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.2, 0.1, 0.1]]}

loss1 = -tf.reduce_sum(p*tf.log(q),1).eval(feed_dict)

loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q),1).eval(feed_dict)

print(p.eval(feed_dict), "\n", q.eval(feed_dict))

print("\n",loss1, "\n", loss2)

1 Answer


The difference comes from how cross-entropy is defined for binary and for multi-class problems.

Multi-class cross-entropy

Your formula is correct in the multi-class case, where it corresponds directly to tf.nn.softmax_cross_entropy_with_logits.

For example:

-tf.reduce_sum(p * tf.log(q), axis=1)

p and q are expected to be probability distributions over N classes. In particular, N can be 2, as in the following example:

p = tf.placeholder(tf.float32, shape=[None, 2])

logit_q = tf.placeholder(tf.float32, shape=[None, 2])

q = tf.nn.softmax(logit_q)

feed_dict = {

  p: [[0, 1],

      [1, 0],

      [1, 0]],

  logit_q: [[0.2, 0.8],

            [0.7, 0.3],

            [0.5, 0.5]]

}

prob1 = -tf.reduce_sum(p * tf.log(q), axis=1)

prob2 = tf.nn.softmax_cross_entropy_with_logits(labels=p, logits=logit_q)

print(prob1.eval(feed_dict))  # [ 0.43748799 0.51301527 0.69314718]

print(prob2.eval(feed_dict))  # [ 0.43748799 0.51301527 0.69314718]

Note that q is computed with tf.nn.softmax, i.e. it is a probability distribution. So this is still the multi-class cross-entropy formula, only for N = 2.
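To check these numbers without TensorFlow, here is a minimal plain-Python sketch of the same computation. The helper names softmax and cross_entropy are illustrative only and are not TensorFlow API.

```python
import math

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """Multi-class cross-entropy -sum(p_i * log(q_i)) for one example.

    Terms with p_i == 0 contribute nothing, so they are skipped.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Same labels and logits as in the N = 2 example above.
labels = [[0, 1], [1, 0], [1, 0]]
logits = [[0.2, 0.8], [0.7, 0.3], [0.5, 0.5]]

losses = [cross_entropy(p, softmax(x)) for p, x in zip(labels, logits)]
print(losses)  # approximately [0.4375, 0.5130, 0.6931]
```

Each row sums to 1 after the softmax, which is what makes the single -p * log(q) term sufficient here.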

Binary cross-entropy

The correct formula for this problem is

p * -tf.log(q) + (1 - p) * -tf.log(1 - q)

Although this is technically a special case of the multi-class formula, the meaning of p and q is different here. In the simplest case, each of p and q is a single number, corresponding to the probability of class A.

Don't be misled by the common p * -tf.log(q) part and the sum. Previously p was a one-hot vector; now it is a number, zero or one. The same goes for q: it was a probability distribution, and now it is a number (a probability).
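A minimal sketch of the scalar case, assuming p is a single 0/1 label and q a single probability strictly between 0 and 1. The function name binary_cross_entropy is made up for illustration.

```python
import math

def binary_cross_entropy(p, q):
    """Binary cross-entropy for one label p in {0, 1} and one probability q in (0, 1)."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

# Label is 1 and the model assigns probability 0.8 to it: small loss.
loss_pos = binary_cross_entropy(1, 0.8)

# Label is 0 but the model still says 0.8: the loss comes entirely
# from the (1 - p) * -log(1 - q) term, which -p * log(q) alone would miss.
loss_neg = binary_cross_entropy(0, 0.8)

print(loss_pos, loss_neg)
```

With p = 0 the first term vanishes, so dropping the second term would report a loss of zero no matter how wrong q is.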

If p is a vector, then each individual component is an independent binary classification. So p = [0, 0, 0, 1, 0] is not a one-hot vector here, but 5 different features, 4 of which are off and 1 of which is on. Likewise, q = [0.2, 0.2, 0.2, 0.2, 0.2] means that each of the 5 features is on with 20% probability. This answer outlines the difference between softmax and sigmoid functions in TensorFlow.

The goal of the sigmoid function is to squash the logit into the [0, 1] interval.
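As a rough illustration of that squashing, here is a plain-Python sigmoid evaluated on a few logits. The function is a sketch, not the TensorFlow implementation, and it is only meant for moderate inputs (math.exp overflows for very large negative logits).

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps a real-valued logit to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Negative logits land below 0.5, positive logits above 0.5.
for logit in (-4.0, 0.0, 0.2, 4.0):
    print(logit, sigmoid(logit))
```

A logit of 0.2 maps to roughly 0.55, so the 0.2 values fed to logit_q in the question are logits, not 20% probabilities.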

The binary cross-entropy formula can also be applied to multiple independent features, and that's what tf.nn.sigmoid_cross_entropy_with_logits computes:

p = tf.placeholder(tf.float32, shape=[None, 5])

logit_q = tf.placeholder(tf.float32, shape=[None, 5])

q = tf.nn.sigmoid(logit_q)

feed_dict = {

  p: [[0, 0, 0, 1, 0],

      [1, 0, 0, 0, 0]],

  logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2],

            [0.3, 0.3, 0.2, 0.1, 0.1]]

}

prob1 = -p * tf.log(q)

prob2 = p * -tf.log(q) + (1 - p) * -tf.log(1 - q)

prob3 = p * -tf.log(tf.sigmoid(logit_q)) + (1-p) * -tf.log(1-tf.sigmoid(logit_q))

prob4 = tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q)

print(prob1.eval(feed_dict))

print(prob2.eval(feed_dict))

print(prob3.eval(feed_dict))

print(prob4.eval(feed_dict))

Notice that the last three tensors are equal up to floating-point rounding, while prob1 is only a part of the cross-entropy, so it contains the correct value only where p is 1:

[[ 0.          0.          0.          0.59813893  0.        ]

 [ 0.55435514  0.          0.          0.          0.        ]]

[[ 0.79813886  0.79813886  0.79813886  0.59813887  0.79813886]

 [ 0.5543552   0.85435522  0.79813886  0.74439669  0.74439669]]

[[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]

 [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]

[[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]

 [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]

Taking the sum of -p * tf.log(q) along axis=1 therefore doesn't make sense in this setting, though it would be a valid formula in the multi-class case.
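To make the gap concrete, the following sketch recomputes the question's loss1 and loss2 in plain Python. It assumes the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)) given in the TensorFlow documentation for sigmoid cross-entropy with logits; the function names are illustrative.

```python
import math

def sigmoid_xent_with_logits(z, x):
    """Full binary cross-entropy for label z and logit x (stable form)."""
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

def partial_term(z, x):
    """Only the -z * log(sigmoid(x)) part, as in loss1 from the question."""
    return -z * math.log(1.0 / (1.0 + math.exp(-x)))

# Same labels and logits as in the question.
labels = [[0, 0, 0, 1, 0], [1, 0, 0, 0, 0]]
logits = [[0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.2, 0.1, 0.1]]

# Sum over the 5 features of each row (axis=1 in the TensorFlow code).
loss1 = [sum(partial_term(z, x) for z, x in zip(zs, xs))
         for zs, xs in zip(labels, logits)]
loss2 = [sum(sigmoid_xent_with_logits(z, x) for z, x in zip(zs, xs))
         for zs, xs in zip(labels, logits)]

print(loss1)  # roughly [0.598, 0.554]
print(loss2)  # roughly [3.791, 3.696]
```

loss1 counts only the single feature that is on in each row, while loss2 also penalizes the four features that are off, which is why the two values differ.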

Hope this answer helps
