
Attention (again)!  A long survey 

Birth of the attention idea

Attention may be much older than many of the data scientists and AI engineers working today. By 2020, the algorithm was already more than 55 years old. The original idea of Attention should be credited to Geoffrey Watson and Elizbar Nadaraya. These scientists introduced the algorithm in 1964, when they set out to solve the regression problem: estimating a label y for a new value of x, given observed pairs (x1, y1), …, (xm, ym) [Figure 1]. Watson and Nadaraya proposed that weighting the labels yi according to how close their locations xi are to x could achieve better results. Let’s take a look at the equation the two authors formulated that year:
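Since the original figure is not reproduced here, the estimator can be sketched in the form in which it is usually written today. The kernel symbol K (assumed non-negative), the query point x, and the weight symbol α are notational assumptions, not taken from this article:

```latex
\hat{y}(x) = \sum_{i=1}^{m} \alpha(x, x_i)\, y_i,
\qquad
\alpha(x, x_i) = \frac{K(x - x_i)}{\sum_{j=1}^{m} K(x - x_j)}
```

The weights are non-negative and sum to one, so the prediction is a weighted average of the labels, dominated by the labels whose locations lie close to x.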


Figure 1. Regression problem [reference 1]


This beautiful equation definitely reminds us of the “modern” attention equation, with its key, query, and value [1]. Don’t rush; we will get to these terms later in this article.
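To make the connection concrete, here is a minimal sketch of the Watson-Nadaraya idea in NumPy. The Gaussian kernel, the `bandwidth` parameter, and the toy data are my assumptions for illustration; the original authors did not prescribe them. Read x_query as the query, x_train as the keys, and y_train as the values:

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=1.0):
    """Kernel-weighted average of y_train, evaluated at each point of x_query."""
    # Pairwise differences between queries and training locations: shape (n_query, m)
    diff = x_query[:, None] - x_train[None, :]
    # Gaussian kernel in log space (an assumed choice): closer points score higher
    scores = -0.5 * (diff / bandwidth) ** 2
    # Normalize the scores per query so that the weights sum to 1 (a softmax)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted sum of the labels
    return weights @ y_train, weights

x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
y_hat, weights = nadaraya_watson(np.array([1.0, 2.5]), x_train, y_train, bandwidth=0.5)
```

A smaller bandwidth makes the weights sharper, so each prediction leans on fewer neighbors.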


Why bother with a nearly 60-year-old algorithm?

In recent years, artificial intelligence (AI) systems have become more capable by imitating the structure of the human brain as well as human behavior. One such imitation is the Attention technique. Attention is the human cognitive behavior of concentrating on a few pieces of information while ignoring other perceivable information. Human beings have possessed this capability for many thousands of years; for machines, however, the idea was first presented in its deep learning form by Bahdanau et al. at ICLR 2015, for Neural Machine Translation. Although Bahdanau started the Attention trend by applying it to machine translation, attention has since been applied in several other domains thanks to the contributions of the worldwide AI community.

Since 2017, Attention has gathered tons of attention from machine learning and deep learning researchers, specifically those working in Natural Language Processing (NLP) and Neural Machine Translation (NMT). In fact, the technique could outperform the previous state-of-the-art recurrent models (RNN, LSTM) in processing very long sentences, and it achieved higher BLEU scores. Plus, attention opens up a huge number of opportunities for AI researchers to apply it in their own research.

Recently, besides its success in Natural Language Processing, Attention has also shown promising results in computer vision, for example in generating captions for images. These appealing results were described in “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”.

2017 was the year of Attention. In 2017, Google Brain published an exciting paper named “Attention is all you need”, which once again stirred up AI researchers and Deep Learning lovers.


Hold on, what the heck is Attention?

Please learn by heart that attention involves three components: a query Q and a set of key-value pairs (K, V). Once again, please repeat after me: Queries, Keys, and Values.

The algorithm computes a weighted sum of the Values, where the weight of each value depends on the Query and the corresponding Key. A query determines which values to focus on; or simply, we can say that the query ‘attends’ to those values.
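For a single query q and n key-value pairs (k_i, v_i), this weighted sum can be sketched as follows. The scoring function a(·,·) and the weight symbol α_i are notation I assume here; the concrete choice of a (dot product, a small network, and so on) differs between attention variants:

```latex
\mathrm{Attention}(q, K, V) = \sum_{i=1}^{n} \alpha_i\, v_i,
\qquad
\alpha_i = \frac{\exp\big(a(q, k_i)\big)}{\sum_{j=1}^{n} \exp\big(a(q, k_j)\big)}
```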


Equation of attention with query, key, and value

The attention architecture, that is, Attention in the Deep Learning sense, was introduced in the paper of Bahdanau[2], based on the seq2seq encoder-decoder architecture[3]. The attention structure is an advance over the traditional encoder-decoder architecture because it maintains long-range dependencies better: the traditional seq2seq encoder compresses the whole meaning of an input sequence into a single vector, the output of its last hidden state.


Sequence to Sequence neural network structure [source Medium]

Attention is more advanced in that it combines the relevant information from the input sequence when producing each output. Instead of encapsulating the information of a long sequence into one vector, attention combines information from different parts of the sequence before passing it to the output.


Attention neural network structure [source Medium]

Each context vector is the weighted sum of the encoder hidden states, with the weights given by the corresponding attention weights.
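Written out, with notation that I am assuming here (h_s for the encoder hidden state at input position s, S for the input length, e_{ts} for the unnormalized score between output step t and input position s), the context vector for output step t is roughly:

```latex
c_t = \sum_{s=1}^{S} \alpha_{ts}\, h_s,
\qquad
\alpha_{ts} = \frac{\exp(e_{ts})}{\sum_{s'=1}^{S} \exp(e_{ts'})}
```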


But why Query, Key, Values?

For many people, the three terms Query, Key, and Value might be confusing because they have very little to do (at least in their literal meaning) with Machine Learning and Deep Learning. For me, these terms are best related to a retrieval system. For instance, when you enter a query to search for photos on PixtaStock.com, the search service maps the query against a set of keys (tags, descriptions, categories, etc.) associated with candidate images in the database, and then presents you with the best-matched images (values).

For attention, one can understand that the query and the set of keys are used to find the most important parts among a set of values.
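The difference from a real search engine is that attention performs a "soft" lookup: it returns a blend of all values instead of a single best hit. Here is a toy sketch; the key and value vectors below are made up for illustration and have nothing to do with any real image database:

```python
import numpy as np

# Hypothetical toy database: each image has a key vector (think of tag features)
# and a value vector (standing in for the image itself).
keys = np.array([[1.0, 0.0],     # "cat" photo
                 [0.0, 1.0],     # "street" photo
                 [0.7, 0.7]])    # "cat on a street" photo
values = np.array([[10.0, 0.0],
                   [0.0, 10.0],
                   [5.0, 5.0]])

def soft_lookup(query, keys, values):
    scores = keys @ query                     # similarity of the query to every key
    weights = np.exp(scores - scores.max())   # softmax, computed stably
    weights /= weights.sum()
    return weights @ values, weights          # a blend of all values, not one hit

query = np.array([4.0, 0.0])                  # a query that mostly means "cat"
result, weights = soft_lookup(query, keys, values)
```

The "cat" photo gets the largest weight, the "cat on a street" photo a smaller one, and the "street" photo almost none.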


Family of Attentions


“Self-Attention, also known as intra-attention, is an attention mechanism that relates distinctive positions of an input sequence in order to compute the representation of the same sequence.”

In mathematical terms, self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. All of these vectors have the same dimension k. For example, let’s call the input vectors x1, x2, …, xt and the output vectors y1, y2, …, yt. To produce yi, self-attention simply takes a weighted average over all the input vectors.
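In symbols, using the same names as in the text, this weighted average reads:

```latex
y_i = \sum_{j=1}^{t} w_{ij}\, x_j,
\qquad
\sum_{j=1}^{t} w_{ij} = 1
```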


The weight wij is derived from a function of xi and xj; the simplest choice is the dot product.
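With the dot product as that function, the raw (not yet normalized) weight can be written as below. The prime is a notation I assume here to distinguish the raw score from the final weight wij:

```latex
w'_{ij} = x_i^{\top} x_j
```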



Because the dot product can return any value between negative and positive infinity, we apply softmax[2] to map the values into the range [0, 1] and to ensure that they sum to 1 over the whole sequence.
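The three steps above (dot products, softmax, weighted sum) fit in a few lines of NumPy. This is only a minimal sketch of basic self-attention with random toy inputs, not the full Transformer layer:

```python
import numpy as np

def basic_self_attention(x):
    """x: array of shape (t, k), one row per input vector. Returns (y, w)."""
    raw = x @ x.T                                # raw[i, j] = dot product of x_i and x_j
    raw = raw - raw.max(axis=1, keepdims=True)   # stabilizes exp, softmax is unchanged
    w = np.exp(raw)
    w = w / w.sum(axis=1, keepdims=True)         # row-wise softmax: each row sums to 1
    y = w @ x                                    # y_i = sum_j w[i, j] * x_j
    return y, w

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                      # t = 6 vectors of dimension k = 4
y, w = basic_self_attention(x)
```

Note that the output has exactly the same shape as the input: six vectors in, six vectors out.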


The equations and steps are simple. However, the story behind them is not obvious.

The most astonishing point about self-attention is that attention by itself is a mechanism strong enough to do all the learning. For that reason, A. Vaswani and his co-authors named their paper “Attention is all you need”.

For each word, we calculate a Query, a Key, and a Value by multiplying the word’s embedding with three trained matrices. Query, Key, and Value are useful abstractions that clearly demonstrate the concept of attention; they are explored in more detail in the following sections. These outputs are 64-dimensional vectors, much smaller than the embedding dimensionality of 512.
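As a rough sketch of this projection step: the three matrices below are random stand-ins for the trained ones, and the names W_q, W_k, W_v as well as the sequence length of 5 are my assumptions. Only the sizes 512 and 64 come from the text:

```python
import numpy as np

d_model, d_k = 512, 64                       # sizes quoted in the text
rng = np.random.default_rng(0)

# Random stand-ins for the three trained matrices (hypothetical names)
W_q = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
W_k = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
W_v = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))

embeddings = rng.normal(size=(5, d_model))   # 5 words, one 512-dimensional embedding each

Q = embeddings @ W_q                         # one 64-dimensional query per word
K = embeddings @ W_k                         # one 64-dimensional key per word
V = embeddings @ W_v                         # one 64-dimensional value per word
```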


Attention flow [source Medium]

Why does self-attention work?

I would love to explain this phenomenon in my own words; however, Peter Bloem at Vrije Universiteit Amsterdam explained the theory in an absolutely intuitive way in an article on his blog in October 2019.

“To apply self-attention, we simply assign each word t in our vocabulary an embedding vector vt (the values of which we’ll learn). This is what’s known as an embedding layer in sequence modeling. It turns the word sequence

the, cat, walks, on, the, street

into the vector sequence

𝐯the, 𝐯cat, 𝐯walks, 𝐯on, 𝐯the, 𝐯street.


If we feed this sequence into a self-attention layer, the output is another sequence of vectors

𝐲the, 𝐲cat, 𝐲walks, 𝐲on, 𝐲the, 𝐲street


where 𝐲cat is a weighted sum over all the embedding vectors in the first sequence, weighted by their (normalized) dot-product with 𝐯cat.

Since we are learning what the values in 𝐯t should be, how “related” two words are is entirely determined by the task. In most cases, the definite article the is not very relevant to the interpretation of the other words in the sentence; therefore, we will likely end up with an embedding 𝐯the that has a low or negative dot product with all other words. On the other hand, to interpret what walks means in this sentence, it’s very helpful to work out who is doing the walking. This is likely expressed by a noun, so for nouns like cat and verbs like walks, we will likely learn embeddings 𝐯cat and 𝐯walks that have a high, positive dot product together.

This is the basic intuition behind self-attention. The dot product expresses how related two vectors in the input sequence are, with “related” defined by the learning task, and the output vectors are weighted sums over the whole input sequence, with the weights determined by these dot products.”


Basic self-attention, as described above, has two properties:

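  1. There are no parameters (yet). What basic self-attention does is entirely determined by the input sequence; the trained Query, Key, and Value matrices are an addition on top of this basic operation.
  2. Self-attention sees its input as a set instead of a sequence. It is permutation equivariant, a property clearly defined in the Deep Sets paper [4]: if the input vectors are permuted, the output vectors stay the same, only permuted in the same way. The mechanism itself ignores the order of the word vectors in the input sequence.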
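The second property is easy to check numerically. The sketch below uses the basic, parameter-free self-attention (dot product plus softmax) on random toy vectors; the sizes and the random permutation are arbitrary choices of mine:

```python
import numpy as np

def basic_self_attention(x):
    raw = x @ x.T                                        # pairwise dot products
    w = np.exp(raw - raw.max(axis=1, keepdims=True))     # stable row-wise softmax
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
perm = rng.permutation(6)                                # a random reordering of the inputs

out_then_permute = basic_self_attention(x)[perm]         # attend first, shuffle afterwards
permute_then_out = basic_self_attention(x[perm])         # shuffle first, attend afterwards
is_equivariant = np.allclose(out_then_permute, permute_then_out)
```

Both orders of operations give the same result, which is exactly why the Transformer needs extra information about word positions.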



Researchers at Google Brain came up with the Transformer in the paper “Attention is all you need” in 2017. Like other members of the Attention family, the Transformer contains two blocks, an Encoder and a Decoder; except that it eliminates recurrent network units.



We would like to describe the details of Transformer by the following equation:
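The equation image did not survive here; the scaled dot-product attention formula from the paper [2] is:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```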


In the equation from the paper, Q denotes the query (the vector representation of one word in the input sequence; in matrix form, the queries of all words are stacked together), while K and V are the corresponding keys and values, which are also vector representations of all the words in the sequence.



The encoder in the Transformer compresses the input sequence into an attention-based representation that is capable of locating a specific piece of information in a potentially infinitely large context.


Transformer’s Encoder [2]

In the original Transformer paper, the authors constructed the encoding path as a stack of six encoders (N=6). I guess this is not a magic number but a choice based on the authors’ experiments.

Each encoder has the same network structure, which combines a multi-head self-attention layer and a simple position-wise fully connected feed-forward network.

Each of these two sub-layers is wrapped in a residual connection followed by layer normalization. Mathematically, the authors write this as LayerNorm(x + Sublayer(x)). To make the residual connections possible, all sub-layers as well as the embedding layers produce outputs of dimension d = 512.
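A minimal sketch of LayerNorm(x + Sublayer(x)) could look like this. The learned gain and bias of a full layer normalization are left out, and the "sub-layer" is just a random linear map that I made up as a stand-in for attention or the feed-forward network:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    # The learned gain and bias of a full LayerNorm are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    # LayerNorm(x + Sublayer(x)); sublayer(x) must have the same shape as x
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d_model = 512
x = rng.normal(size=(10, d_model))                               # 10 positions
W = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))   # hypothetical sub-layer weights
out = residual_block(x, lambda h: h @ W)
```

The addition x + sublayer(x) only works because both terms have the same dimension, which is the reason for the fixed output size of 512.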





Transformer’s decoder [2]

The decoder is also a stack of six identical layers (N=6). However, in addition to the two sub-layers of each encoder layer, each decoder layer adds a third sub-layer, a Multi-Head Attention that performs attention over the output of the encoder stack. The authors also modified the self-attention blocks in the decoder with masks, which ensures that the model is auto-regressive[5]: the prediction for a position can depend only on the outputs at earlier positions.
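One common way to implement such a mask, sketched here under my own naming (causal_mask and masked_softmax are hypothetical helpers, not functions from the paper), is to set the scores of all "future" positions to negative infinity before the softmax:

```python
import numpy as np

def causal_mask(t):
    # True where position i is allowed to attend to position j (only j <= i)
    return np.tril(np.ones((t, t), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they receive exactly zero weight
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))             # toy attention scores for 4 positions
weights = masked_softmax(scores, causal_mask(4))
```

The first position can only attend to itself, so its weight on itself is 1; every row still sums to 1.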


Besides the attention sub-layers, each encoder and decoder layer contains a fully connected feed-forward network, which consists of two linear transformations with a ReLU activation in between.
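In the paper [2] this network is written as FFN(x) = max(0, xW1 + b1)W2 + b2, with an inner dimension of 2048. A small sketch with random stand-in weights (the initialization is my assumption, not the trained values):

```python
import numpy as np

d_model, d_ff = 512, 2048                    # model and inner dimensions from the paper
rng = np.random.default_rng(0)

# Random stand-ins for the trained parameters
W1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
b2 = np.zeros(d_model)

def feed_forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)    # first linear transformation + ReLU
    return hidden @ W2 + b2                  # second linear transformation

x = rng.normal(size=(10, d_model))           # 10 positions
out = feed_forward(x)
```

"Position-wise" means that the same weights are applied to every position independently: feeding a single row gives the same result as that row of the full output.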



The Linear layer is simply a fully connected layer that projects the output of the decoder into a logits vector. Each value in this vector corresponds to the score of a unique word in the vocabulary. The Softmax layer then converts these scores into probabilities. In the simplest case, the word with the highest probability is chosen as the final output.
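Here is a toy sketch of this last step. The five-word vocabulary, the small dimension of 8, and the random weights W_out are all made up for illustration:

```python
import numpy as np

vocab = ["the", "cat", "walks", "on", "street"]    # hypothetical toy vocabulary
d_model = 8                                        # tiny dimension, for illustration only
rng = np.random.default_rng(0)
W_out = rng.normal(size=(d_model, len(vocab)))     # stand-in for the trained linear layer

decoder_output = rng.normal(size=d_model)          # decoder output for one position
logits = decoder_output @ W_out                    # one score per word in the vocabulary
probs = np.exp(logits - logits.max())              # softmax, computed stably
probs /= probs.sum()
predicted_word = vocab[int(probs.argmax())]        # pick the most probable word
```

Since softmax keeps the ordering of the scores, the most probable word is always the one with the largest logit.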


Due to the absence of convolutional and recurrent structures in the Transformer, the authors have to preserve the order of the input sequence by adding positional encodings to the input embeddings. The positional encodings are represented by sine and cosine functions of different frequencies. The authors chose this sinusoidal form because they hypothesized that it would allow the model to learn to attend by relative positions easily: for any fixed offset k, the Positional Encoding at position Pos + k can be represented as a linear function of the Positional Encoding at Pos.
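The sinusoidal encoding from the paper [2], PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), can be sketched as follows. The function name and the max_len argument are my own choices, and d_model is assumed to be even:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # positions 0 .. max_len - 1
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)  # one frequency per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
```

The relative-position property follows from the angle addition formulas: each (sine, cosine) pair at position Pos + k is a rotation of the same pair at position Pos, and the rotation depends only on k.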


Attention in the Transformer is called “Scaled Dot-Product Attention” and is described in the figure below.


Scaled Dot-Product Attention[2]

The input consists of queries Q and keys K of dimension dk, and values V. We compute the dot products of the queries with all keys, divide each by sqrt(dk), and apply softmax to obtain the weights on the values V.

Readers who came across attention before reading “Attention is all you need” might be familiar with additive attention[6], which computes the compatibility function using a feed-forward network with a single hidden layer. Compared with it, dot-product attention is the better option in practice because it is faster and more space-efficient. The scaling factor 1/sqrt(dk) counteracts the growth of the dot products: for large dk, the dot products grow large in magnitude and push the softmax function into regions where it has extremely small gradients.
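Putting the pieces together, scaled dot-product attention can be sketched in NumPy as below. The random inputs and their sizes are toy assumptions; the last two lines simply compare the spread of the scores with and without the scaling factor:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # stabilizes exp, result is unchanged
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # dot products, scaled by 1/sqrt(d_k)
    weights = softmax(scores)                # one weight distribution per query
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))                 # 3 queries
K = rng.normal(size=(5, 64))                 # 5 keys
V = rng.normal(size=(5, 64))                 # 5 values
out, weights = scaled_dot_product_attention(Q, K, V)

# Spread of the scores with and without scaling
unscaled_std = (Q @ K.T).std()
scaled_std = (Q @ K.T / np.sqrt(64)).std()
```

With dk = 64 the scaling shrinks the scores by a factor of 8, which keeps the softmax away from its saturated regions.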


Multi-head attention brings two main advantages:

  1. The model is able to concentrate on several different positions.
  2. It gives the attention layer multiple “representation subspaces”, because instead of using one set of Query/Key/Value weight matrices, it uses multiple sets, one per head.

Rather than only computing the attention once, the multi-head mechanism runs through the scaled dot-product attention multiple times in parallel. The independent outputs are simply concatenated and linearly transformed into the expected dimensions. According to the paper, “multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.”
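A compact sketch of this mechanism is shown below, using the h = 8 heads of the paper [2], so that each head works in 512 / 8 = 64 dimensions. The weights are random stand-ins, and the heads are computed in a plain loop for readability rather than truly in parallel:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o):
    """x: (t, d_model); W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(W_q, W_k, W_v):          # one projection set per head
        Q, K, V = x @ Wq_h, x @ Wk_h, x @ Wv_h
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(weights @ V)                        # scaled dot-product attention
    concat = np.concatenate(heads, axis=-1)              # (t, h * d_k)
    return concat @ W_o                                  # final linear transformation

h, d_model = 8, 512
d_k = d_model // h
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(scale=d_model ** -0.5, size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(scale=d_model ** -0.5, size=(h * d_k, d_model))
x = rng.normal(size=(6, d_model))                        # 6 positions
out = multi_head_attention(x, W_q, W_k, W_v, W_o)
```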


Multi-head attention in “Attention is all you need”

Keep transforming attention! 

What has been covered above is just a grain of sand in the ocean. Since 2017, AI researchers have produced an enormous number of variations on Attention as well as the Transformer. For that reason, I suggest that you try to develop your own Transformer implementation for tasks far beyond Machine Translation, for instance Computer Vision or Speech Recognition (you might want to name it the Decepticon or Autobot version :D). You should also take a look at the interesting references below. One of them is a tutorial by Lukasz Kaiser, one of the authors of “Attention Is All You Need”. He gave his talk on this topic at Pi School in Rome, Italy, where he works as a data science mentor, and where I worked as a scholar.


[1] Alex Smola and Aston Zhang, Attention in Deep Learning, tutorial at ICML 2019

[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention Is All You Need (2017)

[3] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, Sequence to Sequence Learning with Neural Networks (2014)

[4] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola, Deep Sets (2018)

[5] https://en.wiktionary.org/wiki/autoaggressive


[7] Lukasz Kaiser, tutorial at Pi School, Rome, Italy

[8] Jay Alammar, The Illustrated Transformer

[9] Google’s Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding) and the Tensor2Tensor announcement


Written by: Tony Trần