There are many reasons why NaNs can appear during training; here are the causes I have come across:
Gradient blow up
It occurs when large gradients throw the learning process off track.
To resolve:
Decrease the base_lr by at least an order of magnitude. If you have several loss layers, inspect the log to find which layer is causing the gradient to blow up and decrease the loss_weight for that specific layer.
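A quick way to see which layer is blowing up is to step the solver from pycaffe and watch the per-parameter diffs. This is only a sketch: the solver path 'solver.prototxt', the iteration count, and the 1e3 threshold are assumptions you should adapt to your own setup.

```python
import numpy as np
import caffe

caffe.set_mode_gpu()  # or caffe.set_mode_cpu()

# Hypothetical solver file -- point this at your own solver.prototxt.
solver = caffe.SGDSolver('solver.prototxt')

for it in range(1000):
    solver.step(1)  # one forward/backward/update pass
    # Inspect the diff of every learnable parameter blob; after step(),
    # diff holds the update computed for that blob (a scaled gradient).
    for layer_name, params in solver.net.params.items():
        for i, blob in enumerate(params):
            g = blob.diff
            max_abs = np.abs(g).max()
            # 1e3 is an arbitrary "suspiciously large" threshold.
            if not np.isfinite(g).all() or max_abs > 1e3:
                print('iter %d: layer %s, param %d, max |update| = %g'
                      % (it, layer_name, i, max_abs))
```

The layer whose updates explode first (or go non-finite first) is usually the one whose loss_weight needs to come down.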
Bad learning rate policy and params
It occurs when Caffe fails to compute a valid learning rate and ends up with 'inf' or 'nan' instead; that invalid rate then corrupts every parameter update.
To resolve:
Check all the parameters that affect the learning rate in your solver.prototxt (base_lr, gamma, power, stepsize, and the chosen lr_policy) and make sure they produce a finite rate over the whole training run.
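As a sanity check, you can evaluate the schedule offline before training. The sketch below reproduces Caffe's documented formulas for the common lr_policy values; the solver settings in the loop at the bottom are placeholders, so plug in your own.

```python
import numpy as np

def caffe_lr(policy, it, base_lr, gamma=0.0, power=0.0,
             stepsize=1, max_iter=1):
    """Learning rate at iteration `it`, following Caffe's lr_policy formulas."""
    if policy == 'fixed':
        return base_lr
    if policy == 'step':
        return base_lr * gamma ** np.floor(it / stepsize)
    if policy == 'exp':
        return base_lr * gamma ** it
    if policy == 'inv':
        return base_lr * (1.0 + gamma * it) ** (-power)
    if policy == 'poly':
        return base_lr * (1.0 - float(it) / max_iter) ** power
    if policy == 'sigmoid':
        return base_lr / (1.0 + np.exp(-gamma * (it - stepsize)))
    raise ValueError('unknown lr_policy: %s' % policy)

# Placeholder solver settings -- substitute the values from your solver.prototxt.
for it in range(0, 100000, 5000):
    rate = caffe_lr('inv', it, base_lr=0.01, gamma=0.0001, power=0.75)
    if not np.isfinite(rate):
        print('iteration %d: learning rate is %r -- fix the solver params' % (it, rate))
```

If any iteration prints a non-finite rate, the solver parameters are the culprit.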
Faulty loss function
The loss computation itself can produce a NaN, for example through a log(0) or a 0/0 inside a custom loss layer.
To resolve:
Add printouts to the loss layer to see where the value becomes NaN, and debug from there.
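If you can swap the loss in for a Python layer while debugging, the printouts are easy to add in forward(). The sketch below is a Euclidean-style loss with two bottoms (prediction and label), modelled on Caffe's pyloss.py example, plus NaN checks; treat it as a debugging aid under those assumptions, not a drop-in replacement for an arbitrary loss.

```python
import numpy as np
import caffe

class DebugEuclideanLoss(caffe.Layer):
    """Euclidean loss with printouts that flag NaNs in the inputs and the output."""

    def setup(self, bottom, top):
        if len(bottom) != 2:
            raise Exception('Need two bottoms: prediction and label.')

    def reshape(self, bottom, top):
        # Difference buffer and a scalar loss output.
        self.diff = np.zeros_like(bottom[0].data, dtype=np.float32)
        top[0].reshape(1)

    def forward(self, bottom, top):
        for i, name in enumerate(('prediction', 'label')):
            if np.isnan(bottom[i].data).any():
                print('NaN found in the %s blob' % name)
        self.diff[...] = bottom[0].data - bottom[1].data
        top[0].data[...] = np.sum(self.diff ** 2) / bottom[0].num / 2.0
        if np.isnan(top[0].data).any():
            print('the loss value itself is NaN')

    def backward(self, top, propagate_down, bottom):
        for i in range(2):
            if not propagate_down[i]:
                continue
            sign = 1 if i == 0 else -1
            bottom[i].diff[...] = sign * self.diff / bottom[i].num
```

You hook it in with a layer of type "Python" in the train prototxt, pointing python_param's module/layer at this file and class.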
Faulty input
NaNs can also come from the data itself: if an input sample contains a NaN, the outputs turn into NaN as soon as the learning process hits that sample.
To resolve:
Rebuild the input datasets and make sure your training and validation sets contain no bad image files. You can also build a simple net that reads only the input layer and run it through all the inputs; whenever it produces a NaN you know which samples are faulty, and you can remove them.
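One way to do that "simple net" scan is to strip the train prototxt down to just the data layer(s) and run forward passes over a full epoch, checking the input blobs for NaNs. The file name data_only.prototxt, the batch count, and the blob names 'data'/'label' below are assumptions; match them to your own net.

```python
import numpy as np
import caffe

caffe.set_mode_cpu()

# Hypothetical prototxt containing only the data layer(s) of your net.
net = caffe.Net('data_only.prototxt', caffe.TEST)

num_batches = 1000  # enough batches to cover the whole dataset (dataset size / batch size)
for b in range(num_batches):
    net.forward()
    for blob_name in ('data', 'label'):
        if blob_name in net.blobs and np.isnan(net.blobs[blob_name].data).any():
            print('batch %d: NaN found in blob "%s"' % (b, blob_name))
```

Any batch flagged here points you at the samples to inspect and drop when you rebuild the dataset.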
Hope this helps!