To put it simply, soft attention is deterministic while hard attention is stochastic. Here is an example of a model generating a caption using the attention mechanism:
- The top row shows how soft attention was used to determine what the model is focusing on; the bottom row shows the same thing but with hard attention. Each time the model infers a new word in the caption, it produces an attention map (which is actually a probability distribution over locations, with sum = 1). The black-and-white overlay you see in the top row is that map: lighter means a higher value.
- With soft attention, you multiply this attention map over the image feature map (produced by feeding the image through a convolutional neural network) and sum it up. This makes features in the attended regions (the bright regions) dominate the irrelevant features (the dark regions) at this time step.
- For hard attention, you still have the same distribution, but now you sample only one or a few locations on the feature map. The sampling is, of course, done according to the probability distribution. Because the focused region is chosen by sampling, the process is not deterministic. However, we can still expect that a region with higher probability is more likely to be attended to than the others.
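The two mechanisms above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the shapes (a 14×14 grid of 512-dimensional CNN features) and the random scores are assumptions chosen only to show the weighted-sum vs. sampling difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: 14x14 = 196 spatial locations, 512-d feature vectors.
features = rng.normal(size=(196, 512))   # flattened CNN feature map
scores = rng.normal(size=196)            # unnormalized attention scores

# Attention map: a probability distribution over the 196 locations (sums to 1).
attn = np.exp(scores - scores.max())
attn /= attn.sum()

# Soft attention: deterministic weighted sum over ALL locations.
soft_context = attn @ features           # shape (512,)

# Hard attention: sample ONE location according to the distribution,
# then take just that feature vector. Stochastic by construction.
idx = rng.choice(len(attn), p=attn)
hard_context = features[idx]             # shape (512,)
```

Note that the soft context is a blend of every location (so gradients flow everywhere), while the hard context commits to a single sampled location, which is why hard attention is typically trained with techniques like REINFORCE rather than plain backpropagation.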
One thing you should note is that the attention function a hard-attention model uses might not be the same function used by soft attention.
This example is taken from a computer vision problem, but the idea is essentially the same for other kinds of problems as well. I used it here because it is easier to visualize.
The image is taken from this paper: https://arxiv.org/pdf/1502.03044...