There are many reasons why NaNs can appear during training; here are the causes I have come across:
Gradient blow up
It occurs when large gradients throw the learning process off track.
To resolve:
Decrease the base_lr by at least an order of magnitude. If you have several loss layers, inspect the log to find which layer is causing the gradient to blow up and decrease the loss_weight for that specific layer.
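A quick way to see which layer is blowing up is to step the solver from pycaffe and watch the per-parameter diffs. This is only a sketch: the solver path 'solver.prototxt', the iteration count, and the 1e3 threshold are assumptions you should adapt to your own setup.

```python
import numpy as np
import caffe

caffe.set_mode_gpu()  # or caffe.set_mode_cpu()

# Hypothetical solver file -- point this at your own solver.prototxt.
solver = caffe.SGDSolver('solver.prototxt')

for it in range(1000):
    solver.step(1)  # one forward/backward/update pass
    # Inspect the diff of every learnable parameter blob; after step(),
    # diff holds the update computed for that blob (a scaled gradient).
    for layer_name, params in solver.net.params.items():
        for i, blob in enumerate(params):
            g = blob.diff
            max_abs = np.abs(g).max()
            # 1e3 is an arbitrary "suspiciously large" threshold.
            if not np.isfinite(g).all() or max_abs > 1e3:
                print('iter %d: layer %s, param %d, max |update| = %g'
                      % (it, layer_name, i, max_abs))
```

The layer whose updates explode first (or go non-finite first) is usually the one whose loss_weight needs to come down.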
Bad learning rate policy and params
It occurs when Caffe fails to compute a valid learning rate and ends up with 'inf' or 'nan' instead; that invalid rate then corrupts every parameter update.
To resolve:
Check all the parameters that affect the learning rate in your solver.prototxt (base_lr, gamma, power, stepsize, and the chosen lr_policy) and make sure they produce a finite rate over the whole training run.
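As a sanity check, you can evaluate the schedule offline before training. The sketch below reproduces Caffe's documented formulas for the common lr_policy values; the solver settings in the loop at the bottom are placeholders, so plug in your own.

```python
import numpy as np

def caffe_lr(policy, it, base_lr, gamma=0.0, power=0.0,
             stepsize=1, max_iter=1):
    """Learning rate at iteration `it`, following Caffe's lr_policy formulas."""
    if policy == 'fixed':
        return base_lr
    if policy == 'step':
        return base_lr * gamma ** np.floor(it / stepsize)
    if policy == 'exp':
        return base_lr * gamma ** it
    if policy == 'inv':
        return base_lr * (1.0 + gamma * it) ** (-power)
    if policy == 'poly':
        return base_lr * (1.0 - float(it) / max_iter) ** power
    if policy == 'sigmoid':
        return base_lr / (1.0 + np.exp(-gamma * (it - stepsize)))
    raise ValueError('unknown lr_policy: %s' % policy)

# Placeholder solver settings -- substitute the values from your solver.prototxt.
for it in range(0, 100000, 5000):
    rate = caffe_lr('inv', it, base_lr=0.01, gamma=0.0001, power=0.75)
    if not np.isfinite(rate):
        print('iteration %d: learning rate is %r -- fix the solver params' % (it, rate))
```

If any iteration prints a non-finite rate, the solver parameters are the culprit.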
Faulty loss function
The loss computation itself can produce a NaN, for example through a log(0) or a 0/0 inside a custom loss layer.
To resolve:
Add printouts to the loss layer to see where the value becomes NaN, and debug from there.
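If you can swap the loss in for a Python layer while debugging, the printouts are easy to add in forward(). The sketch below is a Euclidean-style loss with two bottoms (prediction and label), modelled on Caffe's pyloss.py example, plus NaN checks; treat it as a debugging aid under those assumptions, not a drop-in replacement for an arbitrary loss.

```python
import numpy as np
import caffe

class DebugEuclideanLoss(caffe.Layer):
    """Euclidean loss with printouts that flag NaNs in the inputs and the output."""

    def setup(self, bottom, top):
        if len(bottom) != 2:
            raise Exception('Need two bottoms: prediction and label.')

    def reshape(self, bottom, top):
        # Difference buffer and a scalar loss output.
        self.diff = np.zeros_like(bottom[0].data, dtype=np.float32)
        top[0].reshape(1)

    def forward(self, bottom, top):
        for i, name in enumerate(('prediction', 'label')):
            if np.isnan(bottom[i].data).any():
                print('NaN found in the %s blob' % name)
        self.diff[...] = bottom[0].data - bottom[1].data
        top[0].data[...] = np.sum(self.diff ** 2) / bottom[0].num / 2.0
        if np.isnan(top[0].data).any():
            print('the loss value itself is NaN')

    def backward(self, top, propagate_down, bottom):
        for i in range(2):
            if not propagate_down[i]:
                continue
            sign = 1 if i == 0 else -1
            bottom[i].diff[...] = sign * self.diff / bottom[i].num
```

You hook it in with a layer of type "Python" in the train prototxt, pointing python_param's module/layer at this file and class.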
Faulty input
NaNs can also come from the data itself: if an input sample contains a NaN, the outputs turn into NaN as soon as the learning process hits that sample.
To resolve:
Rebuild the input datasets and make sure your training and validation sets contain no bad image files. You can also build a simple net that reads only the input layer and run it through all the inputs; whenever it produces a NaN you know which samples are faulty, and you can remove them.
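One way to do that "simple net" scan is to strip the train prototxt down to just the data layer(s) and run forward passes over a full epoch, checking the input blobs for NaNs. The file name data_only.prototxt, the batch count, and the blob names 'data'/'label' below are assumptions; match them to your own net.

```python
import numpy as np
import caffe

caffe.set_mode_cpu()

# Hypothetical prototxt containing only the data layer(s) of your net.
net = caffe.Net('data_only.prototxt', caffe.TEST)

num_batches = 1000  # enough batches to cover the whole dataset (dataset size / batch size)
for b in range(num_batches):
    net.forward()
    for blob_name in ('data', 'label'):
        if blob_name in net.blobs and np.isnan(net.blobs[blob_name].data).any():
            print('batch %d: NaN found in blob "%s"' % (b, blob_name))
```

Any batch flagged here points you at the samples to inspect and drop when you rebuild the dataset.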
Hope this helps!