
While digging through the topic of neural networks and how to train them efficiently, I came across the method of using very simple activation functions, such as the rectified linear unit (ReLU), instead of the classic smooth sigmoids. The ReLU function is not differentiable at the origin, so according to my understanding the backpropagation algorithm (BPA) is not suitable for training a neural network with ReLUs, since the chain rule of multivariable calculus applies to smooth functions only. However, none of the papers about using ReLUs that I have read address this issue. ReLUs seem to be very effective and seem to be used virtually everywhere without causing any unexpected behavior. Can somebody explain to me why ReLUs can be trained at all via the backpropagation algorithm?

by (6.8k points)

To understand how backpropagation is even possible with functions like ReLU, you need to understand which property of the derivative makes the backpropagation algorithm work so well. This property is that:

f(x) ~ f(x0) + f'(x0)(x - x0)

If you treat x0 as the current value of your parameter, then, knowing the value of the cost function and its derivative at x0, you can tell how the cost function will behave when you change your parameters a little bit. This is the most crucial thing in backpropagation.

Because backpropagation relies on this approximation, you need your cost function to satisfy the property stated above. It's easy to check that ReLU satisfies this property everywhere except in a small neighborhood of 0. And this is the only problem with ReLU: we cannot use this property when we are close to 0.
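As a quick numerical check of the property above, here is a minimal sketch in plain Python. The helper names (relu, relu_grad, linear_approx) and the sample points are assumptions chosen for illustration only; they are not taken from any library.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU away from 0 (the value at 0 is a convention, here 0)."""
    return 1.0 if x > 0 else 0.0


def linear_approx(x0, x):
    """First-order approximation f(x0) + f'(x0) * (x - x0) for f = relu."""
    return relu(x0) + relu_grad(x0) * (x - x0)


# Away from 0 the approximation matches ReLU (up to rounding),
# because ReLU is linear on each side of 0.
err_far = abs(relu(2.1) - linear_approx(2.0, 2.1))

# If the step from x0 to x crosses 0, the approximation breaks down.
err_cross = abs(relu(0.05) - linear_approx(-0.05, 0.05))

print("error away from 0:", err_far)
print("error across 0:   ", err_cross)
```

In this sketch the error is only noticeable when the step from x0 to x crosses 0, which is the "small neighborhood" the argument refers to.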

To overcome that, you may simply choose the value of the ReLU derivative at 0 to be either 1 or 0. On the other hand, most researchers don't treat this problem as serious, simply because being close to 0 during ReLU computations is relatively rare.
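To make the convention concrete, the following is a rough sketch of backpropagation through a single ReLU neuron with a squared-error cost. Everything here is an illustrative assumption: the function names, the grad_at_zero parameter, the toy data point, and the learning rate are all hypothetical and not taken from any particular framework.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x, grad_at_zero=0.0):
    """Derivative of ReLU; grad_at_zero is the value chosen by convention at x == 0."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return grad_at_zero


def loss_and_grads(w, b, x, y):
    """Forward and backward pass for loss = (relu(w * x + b) - y) ** 2."""
    z = w * x + b                 # pre-activation
    a = relu(z)                   # activation
    loss = (a - y) ** 2
    dloss_da = 2.0 * (a - y)      # derivative of the squared error
    da_dz = relu_grad(z)          # the only place where the kink at 0 matters
    dloss_dz = dloss_da * da_dz   # chain rule
    return loss, dloss_dz * x, dloss_dz  # loss, dloss/dw, dloss/db


# Toy training loop on one data point (assumed values).
w, b = 0.5, 0.1
x, y = 2.0, 2.0
learning_rate = 0.05
for _ in range(50):
    loss, dw, db = loss_and_grads(w, b, x, y)
    w -= learning_rate * dw
    b -= learning_rate * db

final_loss, _, _ = loss_and_grads(w, b, x, y)
print("final loss:", final_loss)
```

In this toy run the pre-activation never lands exactly on 0, so the choice of grad_at_zero has no effect on the result, which is in line with the point that hitting 0 is rare.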

So, from a purely mathematical point of view, it is of course not strictly justified to use ReLU with the backpropagation algorithm. In practice, however, this odd behavior around 0 usually makes no difference.
