In an LSTM network (Understanding LSTMs), why do the input gate and output gate use tanh? What is the intuition behind this? Is it just a nonlinear transformation? If so, can I replace both with another activation function (e.g. ReLU)?

There are many **activation functions** in machine learning. Below I explain some of the most commonly used ones.

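**Sigmoid** is used as the gating function for the three gates (input, output, forget) in an LSTM because it outputs a value between 0 and 1, so a gate can block information completely, let it through completely, or do anything in between. **Tanh** is applied to the candidate values and to the cell state before the output gate; it is bounded and zero-centered, with outputs between -1 and 1, so the cell state can both increase and decrease. To overcome the vanishing gradient problem, we need a function whose second derivative can sustain for a long range before going to zero. Tanh is a good function with these properties.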
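To make the roles of the two functions concrete, here is a minimal NumPy sketch of one LSTM time step. It is an illustration under stated assumptions, not a reference implementation: the names `lstm_step`, `W` and `b`, the stacked weight layout, and the random inputs are all hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative layout, assumed for this sketch).

    W has shape (4 * hidden, n_in + hidden) and b has shape (4 * hidden,).
    The four row blocks are: input gate, forget gate, output gate, candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])  # input gate, values in (0, 1)
    f = sigmoid(z[1 * hidden:2 * hidden])  # forget gate, values in (0, 1)
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate, values in (0, 1)
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate values in (-1, 1)
    c = f * c_prev + i * g                 # gates scale what is kept and added
    h = o * np.tanh(c)                     # squashed cell state, then gated
    return h, c

# Hypothetical sizes and random data, only to exercise the function.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print("h:", h)
print("c:", c)
```

The sigmoid outputs act as multiplicative switches, while tanh supplies the signed, bounded values that are actually written to and read from the cell state.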

**A neuron unit** should be bounded, easily differentiable, monotonic and easy to handle. You can use the **ReLU** **function** in place of the **tanh** **function**, but before changing the activation function you should know the advantages and disadvantages of your choice over the others.
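One disadvantage worth checking before swapping tanh for ReLU is boundedness. The toy below is not an LSTM; it is a deliberately simplified scalar recurrence (the function `run_recurrence` and the values `w=1.5`, `x=1.0` are assumptions made for illustration) that shows how a bounded activation keeps a recurrent state in range while an unbounded one can let it grow.

```python
import numpy as np

def run_recurrence(activation, steps=10, w=1.5, x=1.0):
    """Iterate c <- activation(w * c + x) from c = 0 (toy model, not an LSTM)."""
    c = 0.0
    for _ in range(steps):
        c = activation(w * c + x)
    return c

def relu(z):
    return np.maximum(z, 0.0)

tanh_state = run_recurrence(np.tanh)  # stays inside (-1, 1)
relu_state = run_recurrence(relu)     # grows geometrically because w > 1
print("tanh state after 10 steps:", tanh_state)
print("relu state after 10 steps:", relu_state)
```

In a real LSTM the gates also limit growth, so this only illustrates the general risk; whether ReLU works in a given model is an empirical question.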

**Sigmoid formula:**

Mathematical expression:

sigmoid(z) = 1 / (1 + exp(-z))

1st order derivative:

sigmoid'(z) = exp(-z) / (1 + exp(-z))^2 = sigmoid(z) * (1 - sigmoid(z))

Advantages:

(1) Bounded between 0 and 1, smooth and monotonic, so it has all the fundamental properties of a good activation function
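As a quick sanity check on the sigmoid derivative, the sketch below compares the closed form sigmoid(z) * (1 - sigmoid(z)) with a central finite difference. The helper names `sigmoid` and `sigmoid_grad` and the test grid are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Closed form: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
# Central finite difference as an independent numerical estimate.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print("max abs difference:", np.max(np.abs(numeric - sigmoid_grad(z))))
print("gradient at z = 0:", sigmoid_grad(0.0))  # 0.25, the maximum slope
```

The gradient never exceeds 0.25, which is one reason repeated sigmoid layers shrink gradients quickly.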

**Tanh formula:**

Mathematical expression:

tanh(z) = [exp(z) - exp(-z)] / [exp(z) + exp(-z)]

1st order derivative:

tanh'(z) = 1 - ([exp(z) - exp(-z)] / [exp(z) + exp(-z)])^2 = 1 - tanh^2(z)

Advantages:

(1) Often found to converge faster in practice

(2) Gradient computation is inexpensive, since tanh'(z) = 1 - tanh^2(z) reuses the value already computed in the forward pass
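The identity tanh'(z) = 1 - tanh^2(z) means the gradient can be computed from the forward output alone. The following sketch shows this; the function names `tanh_forward` and `tanh_grad_from_output` are hypothetical and chosen for this example.

```python
import numpy as np

def tanh_forward(z):
    return np.tanh(z)

def tanh_grad_from_output(t):
    # Takes the forward output t = tanh(z), not z itself.
    return 1.0 - t ** 2

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
t = tanh_forward(z)
grad = tanh_grad_from_output(t)

# Independent check with a central finite difference.
eps = 1e-6
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
print("analytic:", grad)
print("numeric: ", numeric)
```

The gradient peaks at 1 for z = 0 and decays toward 0 as |z| grows, which is the saturation behaviour of tanh.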

**Hard Tanh formula:**

Mathematical expression:

hardtanh(z) = -1 if z < -1; z if -1 <= z <= 1; 1 if z > 1

1st order derivative:

hardtanh'(z) = 1 if -1 <= z <= 1; 0 otherwise

Advantages:

(1) Computationally cheaper than Tanh

(2) Saturates for magnitudes of z greater than 1
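Hard tanh is simply a clip to the interval [-1, 1], so it can be sketched in a couple of lines. The helper names below are chosen for this example, and the gradient at the kinks z = -1 and z = 1 follows the convention in the formula above (value 1 on the closed interval).

```python
import numpy as np

def hardtanh(z):
    # Piecewise linear: -1 below -1, identity on [-1, 1], 1 above 1.
    return np.clip(z, -1.0, 1.0)

def hardtanh_grad(z):
    # 1 on the closed interval [-1, 1], 0 elsewhere.
    z = np.asarray(z, dtype=float)
    return ((z >= -1.0) & (z <= 1.0)).astype(float)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(hardtanh(z))
print(hardtanh_grad(z))
```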

**ReLU formula:**

Mathematical expression:

relu(z) = max(z, 0)

1st order derivative:

relu'(z) = 1 if z > 0; 0 otherwise

Advantages:

(1) Does not saturate even for large positive values of z

(2) Has found much success in computer vision applications
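A minimal NumPy sketch of ReLU and its derivative, following the formulas above. The helper names are chosen for this example; the gradient at z = 0 is set to 0, matching the "1 if z > 0; 0 otherwise" convention.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # 1 where z > 0, 0 otherwise (including z == 0).
    return (np.asarray(z) > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0, 100.0])
print(relu(z))
print(relu_grad(z))
```

Unlike tanh, the gradient stays at 1 however large z gets, but it is exactly 0 for all non-positive inputs.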

**Leaky ReLU formula:**

Mathematical expression:

leaky(z) = max(z, k * z), where 0 < k < 1

1st order derivative:

leaky'(z) = 1 if z > 0; k otherwise

Advantages:

(1) Allows propagation of error for non-positive z which ReLU doesn't
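The leaky variant can be sketched the same way. The helper names and the default slope `k=0.01` are assumptions for this example; any 0 < k < 1 fits the definition above.

```python
import numpy as np

def leaky_relu(z, k=0.01):
    # For 0 < k < 1, max(z, k * z) is z when z > 0 and k * z otherwise.
    return np.maximum(z, k * z)

def leaky_relu_grad(z, k=0.01):
    # Slope 1 for positive inputs, slope k for non-positive inputs.
    return np.where(np.asarray(z) > 0, 1.0, k)

z = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(z, k=0.1))
print(leaky_relu_grad(z, k=0.1))
```

Because the slope for non-positive z is k instead of 0, some gradient always flows back through the unit.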

I hope this explanation helps.