
I have noticed that NaNs are frequently introduced during training.

I suspect they appear because the weights in the fully-connected (inner-product) or convolution layers blow up.

So what causes the NaNs? Is it the gradient computation blowing up, the nature of the input data, or the weight initialization (and if it is the initialization, why does it have such a large effect)?

In short: what is the most probable reason for NaNs occurring during training, what methods can be used to fight them, and when should each method be used?

1 Answer


NaNs can appear during training for many reasons. Below are the causes I know of:

Gradient blow up

It occurs when large gradients throw the learning process off track.

To resolve:

Decrease base_lr by at least an order of magnitude. If you have several loss layers, inspect the log to find which layer is causing the gradient to blow up, then decrease the loss_weight of that specific layer.
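To illustrate why lowering the learning rate helps, here is a minimal sketch in plain Python (not Caffe code). It assumes a toy one-parameter loss, 0.5 * curvature * w * w, and the names train, curvature and w0 are hypothetical. With too large a rate every update overshoots and the weight grows until it is no longer finite; a rate one order of magnitude smaller converges.

```python
import math

def train(base_lr, steps=2000, curvature=10.0, w0=1.0):
    """Plain gradient descent on the toy loss 0.5 * curvature * w * w.

    Returns (step, w), where step is the first iteration at which the
    weight stopped being finite, or None if it stayed finite throughout.
    """
    w = w0
    for step in range(steps):
        grad = curvature * w           # derivative of the toy loss w.r.t. w
        w = w - base_lr * grad         # standard gradient-descent update
        if not math.isfinite(w):       # catches inf, -inf and NaN
            return step, w
    return None, w

# 0.3 * 10 = 3 > 2, so each update overshoots and |w| roughly doubles per step.
bad_step, bad_w = train(base_lr=0.3)
# One order of magnitude smaller: |w| shrinks by a factor of about 0.7 per step.
good_step, good_w = train(base_lr=0.03)

print("base_lr=0.3  -> non-finite at step", bad_step, "w =", bad_w)
print("base_lr=0.03 -> non-finite at step", good_step, "w =", good_w)
```

In a real network the same effect usually shows up in the log as a loss that climbs rapidly from one iteration to the next before turning into inf or NaN.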

Bad learning rate policy and params

It occurs when Caffe fails to compute a valid learning rate and gets 'inf' or 'NaN' instead.

To resolve: 

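Check every parameter in the solver.prototxt file that affects the learning rate (for example base_lr, lr_policy, gamma, power, stepsize and max_iter) and fix any that are missing or inconsistent with the chosen policy.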
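As an illustration, the sketch below re-implements the "poly" policy in plain Python, assuming the formula base_lr * (1 - iter/max_iter) ^ power, and scans a range of iterations for the first invalid rate. It is not Caffe code: poly_lr and first_invalid_iter are hypothetical helpers, an unset max_iter is assumed to be read as 0, and Python exceptions are mapped to NaN as a stand-in for what C-style float math would produce.

```python
import math

def poly_lr(base_lr, it, max_iter, power):
    """Learning rate under an assumed 'poly' policy.

    Python raises where C-style float math would yield inf or NaN,
    so both failure modes are mapped to NaN here.
    """
    try:
        return base_lr * math.pow(1.0 - it / max_iter, power)
    except (ZeroDivisionError, ValueError):
        return math.nan

def first_invalid_iter(base_lr, max_iter, power, n_iters):
    """Return the first iteration whose learning rate is not finite, else None."""
    for it in range(n_iters):
        if not math.isfinite(poly_lr(base_lr, it, max_iter, power)):
            return it
    return None

# max_iter left at 0: the very first rate is already invalid.
missing_max_iter = first_invalid_iter(0.01, max_iter=0, power=0.5, n_iters=10)
# Training past max_iter with a fractional power: negative base, invalid rate.
overrun = first_invalid_iter(0.01, max_iter=1000, power=0.5, n_iters=1200)
# Consistent settings: every rate is finite.
consistent = first_invalid_iter(0.01, max_iter=1000, power=0.5, n_iters=1000)

print(missing_max_iter, overrun, consistent)
```

A dry run of this kind over the planned number of iterations is a cheap way to catch a bad solver configuration before starting a long training job.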

Faulty loss function

The loss computation inside a loss layer, especially a custom one, may itself produce NaN.

To resolve:

Add printouts to the loss layer and debug the error.
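The kind of printout meant here can be sketched as follows. This is a plain Python stand-in, not a Caffe layer: it assumes a naive binary log loss, and c_style_log, log_loss and debug_losses are hypothetical names. c_style_log mimics C float behaviour (log(0) is -inf), because Python's math.log would raise instead.

```python
import math

def c_style_log(x):
    # Mimic C float math: log(0) is -inf and log of a negative number is NaN.
    # Python's math.log raises ValueError in both cases.
    if x > 0:
        return math.log(x)
    return -math.inf if x == 0 else math.nan

def log_loss(p, t):
    """Naive binary log loss for predicted probability p and label t (0 or 1)."""
    return -(t * c_style_log(p) + (1 - t) * c_style_log(1 - p))

def debug_losses(preds, labels):
    """Print and collect every sample whose loss is not finite."""
    bad = []
    for i, (p, t) in enumerate(zip(preds, labels)):
        loss = log_loss(p, t)
        if not math.isfinite(loss):
            print(f"sample {i}: p={p} label={t} loss={loss}")
            bad.append(i)
    return bad

# p=0.0 with label 0 gives 0 * -inf, which is NaN;
# p=1.0 with label 0 gives -log(0), which is inf.
bad_samples = debug_losses([0.9, 0.0, 0.2, 1.0], [1, 0, 0, 0])
```

Printing the inputs next to the loss value shows whether the problem is in the loss formula itself (as with the saturated probabilities here) or in what the layers below are feeding it.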

Faulty input

NaNs can also come from an input that already contains NaN. When the learning process hits the faulty input, the output becomes NaN.

To resolve:

Rebuild the input datasets and make sure that your training and validation sets do not contain bad image files. You can also build a simple net that only reads the input layer and runs through all the inputs: any faulty input will produce a NaN, and you can then remove the inputs that cause it.
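The same check can be done outside the network. The sketch below is a minimal example in plain Python, assuming each input has already been decoded into a flat list of floats; find_bad_inputs and the tiny dataset are hypothetical.

```python
import math

def find_bad_inputs(samples):
    """Return the indices of samples containing NaN or +/-inf values."""
    bad = []
    for idx, sample in enumerate(samples):
        if any(not math.isfinite(v) for v in sample):
            bad.append(idx)
    return bad

# Hypothetical dataset: samples 1 and 2 are corrupted.
dataset = [
    [0.1, 0.5],
    [float("nan"), 1.0],
    [2.0, float("inf")],
    [0.0, 0.0],
]

bad = find_bad_inputs(dataset)
bad_set = set(bad)
# Keep only the samples that passed the check.
clean = [s for i, s in enumerate(dataset) if i not in bad_set]

print("bad indices:", bad)
print("clean samples:", clean)
```

Running such a scan once when the dataset is built is usually cheaper than discovering the bad sample partway through training.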

Hope this helps!
