07 Feb

## Show, Attend and Tell – Soft and hard attention – An instance of Attention

### First Words

Recently, Attention, first proposed by Bahdanau et al., has been used in a large number of Artificial Intelligence (AI) projects at Pixta Vietnam. In this document, we will explore the story of Attention together and the reasons why it is an attractive option for Pixta’s AI solutions. Before you go further, we suggest that you go back and review our introduction to Attention, which was published last week.

As far as anyone knows, Attention was first introduced to handle Machine Translation problems. However, as we mentioned in the first part of the Attention series, Attention can take many forms, limited only by our imagination; for us, Attention is applied to tasks in Computer Vision. To accomplish these tasks, we build on the idea of applying attention to Image Captioning from Kelvin Xu et al. in Show, Attend and Tell. This idea is valuable for us to study and apply in a wide range of research projects in the future.

Unfortunately, published papers are not always the best place to find all the terms we need. Don’t worry. We will try to carefully explain the story behind Bahdanau et al., except for the LSTM, which is used to decode the output word sequence.

Pay attention! Mathematics ahead. If you are an AI researcher or an engineer, this is the article for you; if not, please first read the background material on Attention carefully.

### Attention-based method for MNIST

To begin with, let us give some examples and recall some definitions that might help you understand the literature more easily. As mentioned above, the original Attention mechanism was proposed in 2014 by Bahdanau et al. Before you dive deeper into the next section, once again, I recommend that you carefully read the overview of the Attention mechanism written by Tony from LAB Team, or take a look at The blog explains Attention. Now, it’s time for my example.

If you live in AI land, the image above will look familiar to you; it is a handwritten-digit image from the MNIST dataset. Please answer this question for yourself: does the number 9 exist in the image? If you say yes, please continue with this question: where is the number 9 located in the image? A little bit confusing, right?

Well, in this case, some AI engineers might hint that Object Detection solves this question. However, in practice, training an Object Detection model on your own objects or data set (not the MNIST data set) requires an enormous amount of data, and the true nightmare is labeling that data. In addition, Object Detection (OD) basically comes in 2 types: 1-stage and 2-stage. 1-stage OD, which AI engineers might know from SSD or YOLO, works extremely fast to get the object label, region, and location, but it does not let us extract the feature map of each object, which would be valuable for plugging into another model such as an LSTM for further tasks. On the other hand, 2-stage OD allows you to extract the feature maps of each region, but it runs rather slowly.

Fortunately, one solution is to use a Global Average Pooling layer instead of the Fully Connected layer of a traditional classifier such as ResNet or VGG, which can produce an effect comparable to Attention. However, for now, let us set aside the story of Global Average Pooling.

How does the attention mechanism work? I assume that every input image always includes 5 handwritten digits located at pre-defined positions, like my sample image above. Now, we split the image into 5 portions.

###### Figure 1. Split the input image into 5 portions.

The purpose of this step is to map this problem onto an Attention problem. At this point, we have a total of 5 independent images; we will train our model to find the number nine (“9”), and the model will then align to the exact location of the “9” in the image. The attention here is denoted by $\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5$, which measure how well each specific portion and the target output (number nine) are aligned. So far so good? To understand the attention layer in this model, please look at the figure below.

###### Figure 2. The simple attention layer. (1) All portions are fed to a CNN to extract feature maps. (2) The feature maps then go through the attention layer, which is parameterized by a simple feed-forward network, to compute the attention weights. (3) Finally, the feature maps are combined in a weighted sum using the computed attention weights and fed to a binary classifier.

The logic in the Attention layer can vary. In this sample, we can formulate a deterministic attention model by computing a weighted annotation vector as proposed by Bahdanau et al., which is called Additive Attention. If you are still confused or don’t know exactly how to apply that formula to this problem, let’s dive deeper into it.

In common cases, we can combine the features of all portions into one (a context vector) by applying sum, average or max. With attention, you instead produce the context vector as the weighted sum of all portions’ features, with the weights coming from the Attention layer. In Bahdanau’s paper, the alignment score is parametrized by a feed-forward network with a single hidden layer, and this network is jointly trained with the other parts of the model.

As you know, to compute the alignment score we need query and key terms (again, please review these notions in Mr. Tony’s article). In Bahdanau’s paper, the query and key are the encoder hidden states and the decoder hidden state at the previous word; in the Show, Attend and Tell paper, they are the set of feature vectors $a_i$ and the decoder hidden state $h_{t-1}$ at the previous word in the caption. For instance, in Show, Attend and Tell, the soft attention model is formed as:

$$e_{ti} = f_{att}(a_i, h_{t-1}), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$$

where $f_{att}$ is a small feed-forward network whose weight matrices are learned as part of the attention model. The illustration will be like the following:

###### Figure 3. Scoring function in Attention Layer of Show, Attend and Tell.
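As a minimal sketch of the scoring step illustrated above, the NumPy snippet below assumes the scoring function is a one-hidden-layer additive network, $e_{ti} = w^\top \tanh(W_a a_i + W_h h_{t-1})$, followed by a softmax. The names `W_a`, `W_h`, `w`, the layer sizes, and the random inputs are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def additive_attention(a, h_prev, W_a, W_h, w):
    """a: (L, D) feature vectors, h_prev: (H,) previous decoder state.

    Returns the attention weights (L,) and the context vector (D,).
    """
    # Project features and hidden state into a shared space of assumed size A.
    hidden = np.tanh(a @ W_a + h_prev @ W_h)   # (L, A)
    scores = hidden @ w                        # (L,) unnormalized scores
    alpha = softmax(scores)                    # (L,) weights that sum to 1
    context = alpha @ a                        # (D,) weighted sum of features
    return alpha, context

rng = np.random.default_rng(0)
L, D, H, A = 5, 8, 6, 4                        # illustrative sizes
a = rng.normal(size=(L, D))                    # stand-in for CNN features
h_prev = rng.normal(size=H)
W_a = rng.normal(size=(D, A))
W_h = rng.normal(size=(H, A))
w = rng.normal(size=A)

alpha, context = additive_attention(a, h_prev, W_a, W_h, w)
```

In a real model the three weight arrays would be trained jointly with the CNN and the LSTM; here they are random so the sketch only demonstrates shapes and data flow.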

The code to implement the Attention Layer will be shown later in this document. For our handwritten-digit data set, where we classify whether the image includes the number nine or not, the scoring function can be a little different, because we align all portions to one specific output (number nine), which plays the role of the query in the scoring function. Since there is only one, fixed query, we no longer need a second input term to compute $\alpha_i$. The scoring function becomes something like:

$$e_i = f_{att}(a_i), \qquad \alpha_i = \frac{\exp(e_i)}{\sum_{k=1}^{5} \exp(e_k)}$$

where $a_i$ is the feature vector of the $i$-th portion and $f_{att}$ is a small feed-forward network.

Hold on! So, what is the difference between the above Attention Layer and a normal MLP layer? The answer in this case is: ABSOLUTELY NONE.
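To make that point concrete, here is a small sketch of the five-portion "is there a 9?" model, assuming the features of each portion have already been extracted by a CNN (random stand-ins are used here). The function name `nine_detector` and all weight names are hypothetical. Note that the scoring step is nothing more than a two-layer MLP applied to every portion, followed by a softmax across portions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nine_detector(features, W1, b1, w2, w_clf, b_clf):
    """features: (5, D) array, one row of CNN features per image portion."""
    # Scoring: an ordinary two-layer MLP applied to each portion independently.
    scores = np.tanh(features @ W1 + b1) @ w2       # (5,)
    alpha = softmax(scores)                         # attention over portions
    context = alpha @ features                      # (D,) weighted sum
    prob_has_nine = sigmoid(context @ w_clf + b_clf)
    return alpha, prob_has_nine

rng = np.random.default_rng(0)
D, A = 8, 4                                         # illustrative sizes
features = rng.normal(size=(5, D))                  # stand-in CNN features
W1, b1 = rng.normal(size=(D, A)), np.zeros(A)
w2 = rng.normal(size=A)
w_clf, b_clf = rng.normal(size=D), 0.0

alpha, prob_has_nine = nine_detector(features, W1, b1, w2, w_clf, b_clf)
likely_position = int(alpha.argmax())               # portion with most weight
```

After training, `alpha` is what one would visualize to see which portion the model aligned with the digit nine.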

However, life is no dream; remember that our problems will not be limited to binary classification, that is, verifying whether the image includes the number nine. In many problems, we must extend our approach to localize all the numbers in the image. In that case, what should we do?

For example, say we have a total of 10 numbers as labels. Each label is embedded as a query vector $q_j$. Now we can construct the scoring function in the Attention Layer in the same way as above:

$$e_{ij} = f_{att}(a_i, q_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{5} \exp(e_{kj})}$$

where $a_i$ is the feature vector of the $i$-th portion. As mentioned above, the scoring output $\alpha_{ij}$ represents how well the $i$-th portion and the target output $j$ are aligned. So if we visualize the output of the scoring function after training, we expect something like:

###### Figure 4. A visualization of attention weight αi.
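The multi-label case can be sketched the same way. The snippet below assumes each of the 10 digit labels has a learned embedding used as a query, and that the score is additive, as in the earlier examples. The embedding table `label_emb`, the weight names, and the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
num_portions, num_labels, D, Q, A = 5, 10, 8, 4, 6

features = rng.normal(size=(num_portions, D))    # stand-in CNN features
label_emb = rng.normal(size=(num_labels, Q))     # one query embedding per digit
W_a = rng.normal(size=(D, A))
W_q = rng.normal(size=(Q, A))
w = rng.normal(size=A)

# Broadcast to score every (portion, label) pair: shape (5, 10, A).
hidden = np.tanh((features @ W_a)[:, None, :] + (label_emb @ W_q)[None, :, :])
scores = hidden @ w                              # (5, 10)

# Normalize over portions so that each label's column sums to 1.
scores = scores - scores.max(axis=0, keepdims=True)
alpha = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

# For each digit label, the portion that receives the largest weight.
best_portion = alpha.argmax(axis=0)              # (10,)
```

Plotting `alpha` as a 5 x 10 heat map gives the kind of visualization shown in Figure 4, one column per label.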

That’s it. Now, you should be confident in implementing an Attention layer into a computer vision task.

### Image captioning with Show, Attend and Tell

###### Figure 5. The overall network. Approach overview: in step (2), image features are captured at lower convolutional layers. In step (3), a feature is sampled and fed to the LSTM to generate the corresponding word. Step (3) is repeated K times to generate a K-word caption.

A simple description of this problem: given an image, the proposed CNN-LSTM network generates a caption for it. Figure 5 indicates that, for every output word, the model attends to specific portions of the image. For instance, the word “bird”, circled in blue, was inferred from the blue square region, which corresponds to the bird’s position in the image. Similarly, the word “water” was inferred from the water region that the bird is flying over (not from the bird anymore).

### CNN fits for Attention Mechanism

In figure 2, we split the image into 5 small images and then used CNNs to get the features corresponding to each of them. In the image captioning problem, we cannot split the input image and run a CNN on each part, as this can lead to poor performance.

###### Figure 5. Image owned by Yunchen. The image shows how a CNN helps to get the features corresponding to small sections of the image.

As figure 5 shows, we can get a feature corresponding to only a small subsection of the image. The output of a convolutional layer encodes local information rather than information pertaining to the whole cluttered image. The outputs of the convolutional layer are 2-D feature maps, where each location is influenced by a small region of the image corresponding to the receptive field of the convolutional kernel. The vector extracted at a particular location across all feature maps (channels) represents the feature of a local region of the image.
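The idea of reading one feature vector per spatial location can be sketched in a few lines of NumPy. The 512 x 14 x 14 shape is an assumption chosen for illustration (it matches the VGG feature map size reported in Show, Attend and Tell), and the random array stands in for a real convolutional output.

```python
import numpy as np

# Assumed shape of a lower conv layer output: D channels on an H x W grid.
D, H, W = 512, 14, 14
feature_map = np.random.default_rng(0).normal(size=(D, H, W))

# Flatten the spatial grid: each of the L = H * W locations becomes one
# D-dimensional feature vector.
annotations = feature_map.reshape(D, H * W).T    # (L, D) = (196, 512)

# The vector for grid cell (row, col) is the channel vector at that location.
row, col = 3, 7
a_i = annotations[row * W + col]
```

Each row of `annotations` therefore describes one local region of the image, which is what lets attention weights be mapped back onto image locations.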

The paper says: “In order to obtain a correspondence between the feature vectors and portions of the 2-D image, we extract features from a lower convolutional layer unlike previous work which instead used a fully connected layer.” They no longer use a fully-connected layer (or, nowadays, Global Average Pooling); they use a lower convolutional layer to get 2-D feature maps that represent small regions of the image. The resulting $L$ feature vectors (annotation vectors), each of dimension $D$, are denoted by:

$$a = \{a_1, \dots, a_L\}, \quad a_i \in \mathbb{R}^D$$

Using a lower-level representation helps preserve useful information in the image, but working with these features requires a powerful mechanism to steer the model toward the information that matters for the task at hand. Therefore, the paper shows how to learn to attend to different locations in order to generate a caption.

They present two variants of the function $\phi$, which builds the context vector from the annotation vectors and their attention weights: a “hard” stochastic attention mechanism and a “soft” deterministic attention mechanism, both discussed next.

### Stochastic “Hard” Attention

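They introduce a location variable $s_t$ that indicates where the model decides to focus attention when generating the $t$-th word. For example, if we have 4 portions of the image, we have the set of indicators $s_t = (s_{t,1}, s_{t,2}, s_{t,3}, s_{t,4})$. If the model focuses attention on the first portion, we have $s_t = (1, 0, 0, 0)$.

The attention score $\alpha_{t,i}$ is used as the probability of the $i$-th location being selected:

$$p(s_{t,i} = 1 \mid s_{j<t}, a) = \alpha_{t,i} \tag{1}$$

The context vector $\hat{z}_t$ then becomes the sum over the locations the model focuses on:

$$\hat{z}_t = \sum_{i} s_{t,i}\, a_i \tag{2}$$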

We could use a simple argmax to select $s_t$, but it is not differentiable. Techniques such as sampling, variational inference, variance reduction or reinforcement learning are used to realize the idea in the paper; let me explain them briefly.
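As a small illustration of the sampling alternative to argmax, the sketch below draws one location from a categorical distribution given by the attention weights and builds the resulting one-hot indicator and context vector. The weights and feature vectors are made-up values, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4, 3
a = rng.normal(size=(L, D))               # feature vectors, one per location
alpha = np.array([0.1, 0.6, 0.2, 0.1])    # attention weights, sum to 1

# Sample one location index with probability alpha_i (categorical draw).
idx = rng.choice(L, p=alpha)
s_t = np.zeros(L)
s_t[idx] = 1.0                            # one-hot location indicator

# Hard context vector: the sum over s_i * a_i picks exactly one a_i.
z_hat = s_t @ a

# Averaging many sampled contexts approaches the alpha-weighted average.
samples = rng.choice(L, size=20000, p=alpha)
mc_mean = a[samples].mean(axis=0)
soft_context = alpha @ a
```

The last three lines hint at the link to soft attention: the expected value of the sampled context is the weighted sum of all feature vectors.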

Our model takes a single raw image and generates a caption y encoded as a sequence of 1-of-K encoded words.

Our target now is to maximize the marginal log-likelihood $\log p(y \mid a)$ of observing the sequence of words $y$ given the image features $a$. In practice, we cannot maximize that function directly; instead, we maximize a variational lower bound, the evidence lower bound (ELBO), on that marginal log-likelihood. Here are the steps to get the ELBO $L$.

There are 2 ways to derive the ELBO: marginalizing and KL-divergence. Both give the same result.

#### Marginalizing

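• Using the product rule of probability and marginalizing over the latent variable $Z$, we have:

$$p(X) = \int p(X, Z)\,dZ = \int p(X \mid Z)\,p(Z)\,dZ$$

• So, by Jensen’s inequality (the logarithm is concave), we have:

$$\log p(X) = \log \int p(Z)\,p(X \mid Z)\,dZ \;\ge\; \int p(Z)\,\log p(X \mid Z)\,dZ = L$$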

#### KL-divergence

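• According to Bayes’ theorem, we have:

$$p(Z \mid X) = \frac{p(X \mid Z)\,p(Z)}{p(X)}$$

In practice, $p(X)$ is hard to compute. So we look for an approximating distribution $q(Z)$ that is close to the true posterior $p(Z \mid X)$. The KL divergence measures how different the approximating distribution $q(Z)$ is from the posterior distribution $p(Z \mid X)$:

$$KL\big(q(Z)\,\|\,p(Z \mid X)\big) = \int q(Z)\,\log\frac{q(Z)}{p(Z \mid X)}\,dZ = \log p(X) - L, \qquad L = \int q(Z)\,\log\frac{p(X, Z)}{q(Z)}\,dZ$$

Why can we read the ELBO ($L$) off the formula above? Here is the explanation:

$$\begin{aligned}
KL\big(q(Z)\,\|\,p(Z \mid X)\big) &= \int q(Z)\,\log\frac{q(Z)\,p(X)}{p(X, Z)}\,dZ \\
&= \log p(X)\int q(Z)\,dZ - \int q(Z)\,\log\frac{p(X, Z)}{q(Z)}\,dZ \\
&= \log p(X) - L
\end{aligned}$$

Since the KL divergence is never negative, $\log p(X) \ge L$. Choosing $q(Z) = p(Z)$ gives $L = \int p(Z)\,\log p(X \mid Z)\,dZ$, which is exactly what the marginalizing route yields.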

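That’s it. Now, for our problem, just replace $X$ with $y$ and $Z$ with $s$, take the approximating distribution to be the prior over locations, and replace the integral with a summation, since our latent variable is discrete:

$$L = \sum_{s} p(s)\,\log p(y \mid s) \;\le\; \log \sum_{s} p(s)\,p(y \mid s) = \log p(y) \tag{3}$$

Equation (4) is obtained by additionally conditioning on the image features $a$:

$$L_s = \sum_{s} p(s \mid a)\,\log p(y \mid s, a) \;\le\; \log \sum_{s} p(s \mid a)\,p(y \mid s, a) = \log p(y \mid a) \tag{4}$$

So you can see that equations (4) and (3) have exactly the same form.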
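A quick numerical check can make the lower bound less abstract. The sketch below assumes a toy setting with 4 attention locations and made-up probabilities for $p(s \mid a)$ and $p(y \mid s, a)$; it compares $\sum_s p(s \mid a) \log p(y \mid s, a)$ with $\log p(y \mid a)$ and confirms that the gap is the KL divergence between $p(s \mid a)$ and the posterior $p(s \mid y, a)$.

```python
import numpy as np

# Toy setting: 4 attention locations, made-up probabilities.
p_s_given_a = np.array([0.1, 0.6, 0.2, 0.1])          # p(s | a)
p_y_given_s_a = np.array([0.05, 0.40, 0.10, 0.02])    # p(y | s, a)

# Lower bound: sum_s p(s|a) log p(y|s,a)
lower_bound = np.sum(p_s_given_a * np.log(p_y_given_s_a))

# Marginal log-likelihood: log p(y|a) = log sum_s p(s|a) p(y|s,a)
p_y_given_a = np.sum(p_s_given_a * p_y_given_s_a)
log_marginal = np.log(p_y_given_a)

# The gap is non-negative (Jensen's inequality) ...
gap = log_marginal - lower_bound

# ... and equals KL(p(s|a) || p(s|y,a)), with the posterior from Bayes' theorem.
posterior = p_s_given_a * p_y_given_s_a / p_y_given_a
kl = np.sum(p_s_given_a * np.log(p_s_given_a / posterior))
```

The bound becomes tight only when the location distribution already matches the posterior, which is why maximizing the bound also pushes the attention toward locations that explain the caption.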

#### Monte Carlo Sampling

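Once we have the variational lower bound $L_s$ in equation (4), the learning algorithm for the parameters $W$ of the model can be derived by directly optimizing $L_s$, following its gradient:

$$\frac{\partial L_s}{\partial W} = \sum_{s} p(s \mid a)\left[\frac{\partial \log p(y \mid s, a)}{\partial W} + \log p(y \mid s, a)\,\frac{\partial \log p(s \mid a)}{\partial W}\right] \tag{5}$$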

In equation (5), the summation over all $s$ cannot be computed exactly during learning. Monte Carlo sampling allows the summation over $s$ to be approximated by a finite sum:

$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N}\sum_{n=1}^{N}\left[\frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \log p(y \mid \tilde{s}^n, a)\,\frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W}\right]$$

where $\tilde{s}^n = (s_1^n, s_2^n, \dots)$ is a sequence of sampled attention locations. Note that the authors sample each location from a Multinoulli (categorical) distribution, $\tilde{s}_t^n \sim \mathrm{Multinoulli}_L(\{\alpha_i^n\})$, because they treat the attention locations as intermediate latent variables. The Multinoulli distribution in this case describes the possible outcomes of an attention location, which can fall on one of $L$ possible positions. This distribution is parameterized by the set $\{\alpha_i\}$, the same as in equation (1).
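To see the Monte Carlo approximation at work, here is a toy sketch under strong simplifying assumptions: a single attention step, $p(s \mid a)$ given by a softmax over four logits `theta` (the only parameters), and a fixed, made-up table for $\log p(y \mid s, a)$. Because that table does not depend on `theta`, only the second term of the gradient contributes. The exact gradient is compared with the sampled estimate.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.5, 1.0, 0.3])        # logits defining p(s | a)
log_p_y = np.log([0.05, 0.40, 0.10, 0.02])     # fixed table of log p(y | s, a)

p = softmax(theta)

# Exact gradient of sum_s p(s) log p(y|s) with respect to theta.
exact_grad = p * (log_p_y - p @ log_p_y)

# Monte Carlo estimate: sample locations, then average
# log p(y | s_n, a) * d log p(s_n | a) / d theta, where the score function
# of a softmax is one_hot(s_n) - p.
N = 200_000
samples = rng.choice(len(p), size=N, p=p)
one_hot = np.eye(len(p))[samples]              # (N, 4)
mc_grad = (log_p_y[samples][:, None] * (one_hot - p)).mean(axis=0)
```

With enough samples `mc_grad` lands close to `exact_grad`; in the real model the exact sum is intractable because $s$ ranges over whole sequences of locations, so only the sampled version is available.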

In the paper, the authors use further techniques to improve the Monte Carlo estimate, such as variance reduction with a moving-average baseline, and show that the result is equivalent to the REINFORCE learning rule, where the reward for the attention choosing a sequence of actions is a real value proportional to the log-likelihood of the target sentence under the sampled attention trajectory. However, I will dive deeper into that in separate articles.
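The effect of a moving-average baseline can be previewed with a toy experiment. The sketch assumes a single attention step with made-up probabilities, treats $\log p(y \mid s, a)$ as the reward, and uses an exponential moving average with decay 0.9 as the baseline; the setup and variable names are illustrative and much simpler than the full training rule in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.1, 0.5, 0.2])             # p(s | a), made-up values
log_p_y = np.log([0.05, 0.40, 0.10, 0.02])     # log p(y | s, a), made-up values

baseline = 0.0
plain_terms, baselined_terms = [], []
for s in rng.choice(len(p), size=5000, p=p):
    reward = log_p_y[s]
    score = np.eye(len(p))[s] - p              # d log p(s|a) / d logits
    plain_terms.append(reward * score)
    # Use the baseline built from earlier samples only, so that subtracting
    # it does not bias the gradient estimate.
    baselined_terms.append((reward - baseline) * score)
    # Exponential moving average of the reward with decay 0.9.
    baseline = 0.9 * baseline + 0.1 * reward

# Total variance of the per-sample gradient terms, summed over coordinates.
plain_var = np.var(plain_terms, axis=0).sum()
baselined_var = np.var(baselined_terms, axis=0).sum()
```

Subtracting the baseline leaves the expected gradient unchanged but shrinks the spread of the individual terms, which is the whole purpose of the technique.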

### Deterministic “Soft” Attention

Continuing with the notation from the Hard Attention section, instead of sampling the attention location each time we train the model, we can take the expectation of the context vector directly:

$$\mathbb{E}_{p(s_t \mid a)}[\hat{z}_t] = \sum_{i=1}^{L} \alpha_{t,i}\, a_i$$

and formulate a deterministic attention model by computing a soft attention weighted annotation vector $\phi(\{a_i\}, \{\alpha_i\}) = \sum_{i}^{L} \alpha_i a_i$, as proposed by Bahdanau et al. (mentioned in the first example of this article). This corresponds to feeding a soft $\alpha$-weighted context into the system.

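As a result, the model is smooth and differentiable under deterministic attention, so learning end-to-end with standard back-propagation is straightforward. Additionally, the soft attention model predicts a gating scalar $\beta$ from the previous hidden state $h_{t-1}$ at each time step $t$, such that

$$\phi(\{a_i\}, \{\alpha_i\}) = \beta \sum_{i}^{L} \alpha_i a_i, \qquad \beta_t = \sigma\big(f_\beta(h_{t-1})\big)$$

In practice, an implementation computes the attention weights $\alpha_t$ with a small feed-forward network followed by a softmax over the $L$ locations.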
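A minimal sketch of one soft attention step with the gating scalar is given below. It assumes the attention weights are already computed, and that the gate network is a single linear layer followed by a sigmoid; `w_beta`, `b_beta`, and the sizes are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
L, D, H = 4, 3, 5
a = rng.normal(size=(L, D))               # feature vectors, one per location
alpha = np.array([0.1, 0.6, 0.2, 0.1])    # attention weights for this step
h_prev = rng.normal(size=H)               # previous hidden state

# Expected context vector: weighted sum of all locations (no sampling needed).
soft_context = alpha @ a                  # (D,)

# Gating scalar beta = sigmoid(f_beta(h_prev)); f_beta is assumed here to be
# a single linear layer with hypothetical weights w_beta and bias b_beta.
w_beta = rng.normal(size=H)
b_beta = 0.0
beta = sigmoid(h_prev @ w_beta + b_beta)

gated_context = beta * soft_context       # context passed on to the decoder
```

Every operation here is differentiable, so gradients flow through `alpha` and `beta` with ordinary back-propagation, in contrast to the sampled hard attention.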

Here is a repository that already implements deterministic “Soft” Attention: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning. It is well documented, and I think it is very useful for those who want to see how it works in practice.

Written by: Tuan Anh Vu

Edit: Tony Nguyen
