What is the intuition behind using tanh in an LSTM?

Anurag · Answer 1 · 2019-07-05T14:04:25+0000

There are many activation functions in machine learning. Below I explain some of the most commonly used ones.

Sigmoid is used as the gating function for the three gates (input, output, forget) in an LSTM because it outputs a value between 0 and 1, so each gate can allow anything from no flow (0) to complete flow (1) of information. To overcome the vanishing gradient problem, we need a function whose second derivative can sustain for a long range before going to zero. Tanh, which is bounded between -1 and 1, is a good function with this property.
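To make the roles of the two functions concrete, here is a minimal sketch of one step of a scalar LSTM cell using only the Python standard library. The function name lstm_step, the gate names, and the weight values are hypothetical and chosen only for illustration; a real implementation would use vectors and learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell; w maps gate name -> (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("in"))       # input gate, value in (0, 1)
    f = sigmoid(pre("forget"))   # forget gate, value in (0, 1)
    o = sigmoid(pre("out"))      # output gate, value in (0, 1)
    g = math.tanh(pre("cand"))   # candidate value, in (-1, 1)

    c = f * c_prev + i * g       # new cell state
    h = o * math.tanh(c)         # new hidden state, in (-1, 1)
    return h, c

# Hypothetical weights, not taken from any trained model.
weights = {name: (0.5, 0.5, 0.0) for name in ("in", "forget", "out", "cand")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=weights)
print(h, c)
```

The sigmoid outputs only scale other values, which is why they act as gates, while tanh produces the signed values that are actually written to and read from the cell state.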

A good activation function for a neuron should be bounded, easily differentiable, monotonic, and easy to handle. You can use the ReLU function in place of the tanh function, but before changing the activation function you should know the advantages and disadvantages of your choice over the others.

Sigmoid formula:

Mathematical expression:

sigmoid(z) = 1 / (1 + exp(-z))

1st order derivative:

sigmoid'(z) = exp(-z) / (1 + exp(-z))^2 = sigmoid(z) * (1 - sigmoid(z))

Advantages:

Sigmoid function has all the fundamental properties of a good activation function.
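As a quick check of the sigmoid derivative, which can be written as sigmoid(z) * (1 - sigmoid(z)), the sketch below compares the closed form with a central finite difference. The helper numeric_derivative and its step size are my own assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Closed form: sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_derivative(f, z, eps=1e-6):
    # Central finite difference, used only as a sanity check.
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

for z in (-4.0, 0.0, 4.0):
    print(z, sigmoid_prime(z), numeric_derivative(sigmoid, z))
```

The derivative peaks at z = 0, where it equals 0.25, and shrinks toward zero on both sides.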

Tanh formula:

Mathematical expression:

tanh(z) = [exp(z) - exp(-z)] / [exp(z) + exp(-z)]

1st order derivative:

tanh'(z) = 1 - ([exp(z) - exp(-z)] / [exp(z) + exp(-z)])^2 = 1 - tanh^2(z)

Advantages:

(1) Often found to converge faster in practice

(2) Gradient computation is less expensive
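One way to read the point about cheap gradients: since tanh'(z) = 1 - tanh^2(z), the derivative can be computed from the forward output alone, with no extra exponentials. The sketch below illustrates this and prints the sigmoid derivative next to it for comparison. The function names are hypothetical.

```python
import math

def tanh_prime_from_output(y):
    """Derivative of tanh expressed through its own output y = tanh(z)."""
    return 1.0 - y * y

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 1.0, 2.0):
    y = math.tanh(z)  # forward pass value, reused for the gradient
    print(z, tanh_prime_from_output(y), sigmoid_prime(z))
```

At z = 0 the tanh derivative is 1 while the sigmoid derivative is 0.25, so near the origin tanh passes back a larger gradient. Both derivatives decay toward zero as |z| grows.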

Hard Tanh formula:

Mathematical expression:

hardtanh(z) = -1 if z < -1; z if -1 <= z <= 1; 1 if z > 1

1st order derivative:

hardtanh'(z) = 1 if -1 <= z <= 1; 0 otherwise

Advantages:

(1) Computationally cheaper than Tanh

(2) Saturates for magnitudes of z greater than 1
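A minimal sketch of hard tanh and its derivative as defined above, using only comparisons and no exponentials, which is where the lower cost comes from. The function names are my own.

```python
def hardtanh(z):
    # Clamp z to the interval [-1, 1].
    return max(-1.0, min(1.0, z))

def hardtanh_prime(z):
    # Gradient is 1 inside [-1, 1] and 0 in the saturated regions.
    return 1.0 if -1.0 <= z <= 1.0 else 0.0

for z in (-3.0, 0.5, 3.0):
    print(z, hardtanh(z), hardtanh_prime(z))
```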

ReLU formula:

Mathematical expression:

relu(z) = max(z, 0)

1st order derivative:

relu'(z) = 1 if z > 0; 0 otherwise

Advantages:

(1) Does not saturate even for large positive values of z

(2) Found much success in computer vision applications
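A minimal sketch of ReLU and its derivative as defined above. The function names are my own, and the derivative at z = 0 is taken as 0 to match the definition given here.

```python
def relu(z):
    return max(z, 0.0)

def relu_prime(z):
    # 1 for positive inputs, 0 otherwise (including z == 0).
    return 1.0 if z > 0 else 0.0

for z in (-2.0, 0.0, 100.0):
    print(z, relu(z), relu_prime(z))
```

For large positive z the derivative stays at 1, which is the non-saturating behaviour mentioned in advantage (1).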

Leaky ReLU formula:

Mathematical expression:

leaky(z) = max(z, k * z), where 0 < k < 1

1st order derivative:

leaky'(z) = 1 if z > 0; k otherwise

Advantages:

(1) Allows error to propagate for non-positive z, which ReLU does not
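A minimal sketch of leaky ReLU and its derivative as defined above. The function names and the default slope k = 0.01 are assumptions for illustration; any 0 < k < 1 fits the definition.

```python
def leaky_relu(z, k=0.01):
    # For 0 < k < 1, max(z, k * z) returns z when z > 0 and k * z otherwise.
    return max(z, k * z)

def leaky_relu_prime(z, k=0.01):
    # Gradient is k rather than 0 for non-positive z, so error still flows.
    return 1.0 if z > 0 else k

for z in (-2.0, 0.0, 3.0):
    print(z, leaky_relu(z), leaky_relu_prime(z))
```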

I hope this explanation helps.
