In the blog post The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy mentions future directions for neural-network-based machine learning:

The concept of attention is the most interesting recent architectural innovation in neural networks. [...] soft attention scheme for memory addressing is convenient because it keeps the model fully-differentiable, but unfortunately one sacrifices efficiency because everything that can be attended to is attended to (but softly). Think of this as declaring a pointer in C that doesn't point to a specific address but instead defines an entire distribution over all addresses in the entire memory, and dereferencing the pointer returns a weighted sum of the pointed content (that would be an expensive operation!). This has motivated multiple authors to swap soft attention models for hard attention where one samples a particular chunk of memory to attend to (e.g. a read/write action for some memory cell instead of reading/writing from all cells to some degree). This model is significantly more philosophically appealing, scalable and efficient, but unfortunately it is also non-differentiable.

I think I understood the pointer metaphor, but what exactly is attention, and why is hard attention not differentiable?

I found an explanation of attention here, but I am still confused about the soft/hard distinction.

1 Answer

To put it simply, soft attention is deterministic while hard attention is stochastic. Here is an example of a model generating a caption using the attention mechanism:

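[Image: attention maps overlaid on a captioned picture; top row soft attention, bottom row hard attention]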

  • The top row shows where the model focuses when soft attention is used; the bottom row shows the same thing with hard attention. Each time the model infers a new word in the caption, it produces an attention map (effectively a probability distribution over image locations that sums to 1). The black-and-white overlay in the top row is that map. Lighter = higher value.
  • With soft attention, you multiply this attention map element-wise with the image feature map (produced by feeding the image through a convolutional neural network) and sum the result. This makes features in the attended regions (the bright regions) dominate the irrelevant features (the dark regions) at this time step.
  • For hard attention, you still have a probability distribution, but now you sample only a single location on the feature map, with the sampling done according to that distribution. Because the attended region is chosen by sampling, the result is not deterministic, although regions with higher probability are more likely to be selected. The sampling step is also why hard attention is not differentiable: a discrete choice has no gradient with respect to the attention weights, so such models are typically trained with sampling-based gradient estimators such as REINFORCE.
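
The difference described in the bullets above can be sketched in a few lines of plain Python. This is only a toy illustration, not the implementation from the paper: the feature values, the relevance scores, and the helper names `softmax`, `soft_attention`, and `hard_attention` are all made up for the example, and a real model would compute the scores with a learned network.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into a probability distribution that sums to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, weights):
    """Deterministic: weighted sum over ALL feature vectors."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def hard_attention(features, weights, rng):
    """Stochastic: sample ONE location according to the weights."""
    index = rng.choices(range(len(features)), weights=weights, k=1)[0]
    return index, features[index]

# Toy "feature map": 4 locations with a 2-dimensional feature each
# (hypothetical values, chosen only for illustration).
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
scores = [0.1, 0.2, 2.0, 0.5]  # hypothetical relevance scores

weights = softmax(scores)  # the attention map
context_soft = soft_attention(features, weights)

rng = random.Random(0)  # seeded so repeated runs give the same sample
index, context_hard = hard_attention(features, weights, rng)

print("attention weights:", [round(w, 3) for w in weights])
print("soft context:", [round(c, 3) for c in context_soft])
print("hard context (sampled location %d):" % index, context_hard)
```

The soft context is a smooth function of the weights, so gradients can flow back through it. The hard context depends on a discrete sampled index, which is why plain backpropagation cannot pass through that step.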

One thing to note is that the density function used by a hard attention model might not be the same as the one used by a soft attention model.

This example comes from a computer vision problem, but the idea is essentially the same for other kinds of problems. I used it here because it is easier to visualize.

Image is taken from this paper: https://arxiv.org/pdf/1502.03044...
