The following points explain why using the softmax function in a hidden layer is generally a bad idea:
1. Variable independence: a lot of regularization and effort is required to keep your hidden representations independent, uncorrelated, and reasonably sparse. Because softmax outputs always sum to 1, using it as a hidden-layer activation makes the nodes linearly dependent, which can cause many problems and lead to poor generalization (see the NumPy sketch after this list).
2. Training issues: to make your network perform better, you often need to push some hidden-layer activations a little lower. Because softmax couples its outputs through the sum-to-1 constraint, lowering some activations automatically raises the mean activation of the rest, which can in fact increase the error and harm training.
3. Mathematical issues: placing constraints on your model's activations (here, forcing them to sum to 1) reduces its expressive power without any principled justification.
4. Batch normalization does it better: you may think that a stable mean output from a hidden layer is useful for training, but Batch Normalization has already been shown to achieve this more effectively, whereas using softmax as a hidden-layer activation has been reported to reduce both accuracy and learning speed (see the PyTorch sketch below).
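
To make the sum-to-1 constraint from points 1-3 concrete, here is a minimal NumPy sketch (the logit values are arbitrary, chosen only for illustration). It shows that softmax outputs always sum to 1, so one degree of freedom is lost, and that lowering one activation necessarily raises the others' share:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.5, -1.0])  # arbitrary example logits
a = softmax(z)
print(a, a.sum())        # outputs always sum to 1 -> the nodes are linearly dependent

# Point 2: lowering one logit automatically raises every other output's share.
z2 = z.copy()
z2[0] -= 1.0             # push the largest activation down
print(softmax(z2) - a)   # entry 0 decreases, all remaining entries increase
```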
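And here is a minimal PyTorch sketch of the alternative from point 4: Batch Normalization in the hidden layer, with softmax reserved for the output (the layer sizes, batch shape, and class count are hypothetical, chosen just for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 64 input features, 128 hidden units, 10 classes.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),  # normalizes hidden activations without a sum-to-1 constraint
    nn.ReLU(),            # unconstrained nonlinearity keeps full expressive power
    nn.Linear(128, 10),   # raw logits; softmax is applied only at the output
)
loss_fn = nn.CrossEntropyLoss()  # applies log-softmax to the logits internally

x = torch.randn(32, 64)                    # dummy batch of 32 examples
loss = loss_fn(model(x), torch.randint(0, 10, (32,)))
loss.backward()
```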
Hope this answer helps you!