Explain Self-Attention, and Masked Self-Attention as used in Transformers

Related Questions:
– What are transformers? Discuss the major breakthroughs in transformer models
– Explain the Transformer Architecture
– What are the primary advantages of transformer models?
– What are the limitations of transformer models?
– What is Natural Language Processing (NLP)? List the different types of NLP tasks

Attention mechanisms are a crucial component in modern deep learning architectures, particularly in sequence-to-sequence tasks and natural language processing (NLP) models like Transformers. Attention allows the model to weigh the importance of different parts of an input sequence when processing each element, which is essential for capturing long-range dependencies and improving performance on various tasks.

Let’s first define some of the jargon associated with Attention:

  • Attention, also known as scaled dot-product attention, refers to the mechanism of assigning importance weights to all other input tokens when processing the current token
  • Self-Attention: when the importance weights for the input tokens are computed from tokens within the same input sequence
  • Cross-Attention: when the importance weights for the input tokens are computed from tokens of some other sequence. This is commonly used in machine translation, where encodings of the input sentence are used to assign weights to the tokens of the output sequence.
    Related Question: What is Cross-Attention, and how does it differ from Self-Attention?
  • Masked Self-Attention: when some tokens of the input sequence are purposefully omitted (masked) from contributing to the attention weights. For example, future words are masked when training the decoder layer of a machine translation model. Masking is also used in masked language modeling tasks.
    Related Question: What is Masked Self-Attention?
  • Multi-Head Attention: when multiple self-attention heads are applied in parallel to the same input sequence, so that each head learns different aspects or patterns in the data. The results from all heads are concatenated and linearly transformed to obtain the final output (a minimal code sketch follows this list).
    Related Question: What is Multi-head Attention and how does it improve model performance over a single attention head?
  • Masked Multi-Head Attention: when multiple masked self-attention heads are applied in parallel to the same input.
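
To make the multi-head idea concrete, the following is a minimal, hypothetical sketch (all dimensions are arbitrary illustrative values; the attention computation itself is explained step by step in the next section). Each head applies its own Key/Query/Value projections and attention, and the per-head outputs are concatenated and linearly projected back to the embedding dimension:

Python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, embed, num_heads, head_size = 8, 64, 4, 16   # illustrative dimensions
x = torch.randn(n, embed)                       # token embeddings for n tokens

head_outputs = []
for _ in range(num_heads):
    # each head has its own Key, Query, and Value projections
    W_K, W_Q, W_V = (nn.Linear(embed, head_size, bias=False) for _ in range(3))
    k, q, v = W_K(x), W_Q(x), W_V(x)
    wei = F.softmax(q @ k.T / head_size ** 0.5, dim=-1)   # per-head attention weights
    head_outputs.append(wei @ v)                          # dim(n, head_size)

# concatenate the head outputs and project back to the embedding dimension
proj = nn.Linear(num_heads * head_size, embed)
out = proj(torch.cat(head_outputs, dim=-1))
print(out.shape)   # torch.Size([8, 64])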

The following figure provides a step-by-step explanation of the basic components of self-attention:

Title: Step-by-Step explanation of the basic components of Self-Attention (Source: AIML.com Research)

Step-by-step explanation of Basic Components of Self-Attention

1. Input Tokens

– If we are processing n tokens at a time, where each token is represented as an embedding of dimension embed, we can represent the input as a matrix X of dim(n, embed).

– For the first layer, the token embeddings are taken from a static embedding lookup matrix; for subsequent layers, the output of the previous layer serves as the token embeddings.
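
As a minimal sketch of this step (the vocabulary size and dimensions below are arbitrary illustrative values), the input matrix X for the first layer can come from a learned embedding lookup:

Python
import torch
import torch.nn as nn

n, embed = 8, 64        # 8 tokens, each embedded in 64 dimensions (illustrative values)
vocab_size = 1000       # assumed vocabulary size for this toy example

token_ids = torch.randint(0, vocab_size, (n,))    # token indices for one sequence
embedding_table = nn.Embedding(vocab_size, embed) # static embedding lookup matrix

X = embedding_table(token_ids)   # dim(n, embed): input to the first attention layer
print(X.shape)                   # torch.Size([8, 64])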

2. Key, Query, and Value projection matrices and associated vectors

– For each self-attention head, the following three projection matrices of dim(embed, head_size) are initialized: Key (WK), Query (WQ), and Value (WV).

– The input token embeddings are multiplied by these projection matrices to generate the Key (k), Query (q), and Value (v) vectors for each token.
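
A minimal PyTorch sketch of this step (head_size = 16 and the other dimensions are arbitrary illustrative choices):

Python
import torch
import torch.nn as nn

n, embed, head_size = 8, 64, 16
X = torch.randn(n, embed)                      # token embeddings from Step 1

W_K = nn.Linear(embed, head_size, bias=False)  # Key projection,   dim(embed, head_size)
W_Q = nn.Linear(embed, head_size, bias=False)  # Query projection, dim(embed, head_size)
W_V = nn.Linear(embed, head_size, bias=False)  # Value projection, dim(embed, head_size)

k, q, v = W_K(X), W_Q(X), W_V(X)               # each of dim(n, head_size)
print(k.shape, q.shape, v.shape)               # torch.Size([8, 16]) each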

3. Dot-Product Attention

   – The core operation in self-attention is the dot product between Query and Key vectors. It measures how much attention the current (Query) token should pay to each (Key) token.

   – For a given token n, which we will call the Query token, the attention score with every other token i is calculated as the dot product of the query vector of token n (q_n) with the key vector of token i (k_i), i.e.:

Attention(Token_n, Token_i) = q_n · k_i

3.1 Scaling the Dot-Product

   – To stabilize training, the dot product is typically scaled by the square root of the dimension of the Key vectors (denoted d_k):

Attention(Token_n, Token_i) = (q_n · k_i) / sqrt(d_k)
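
To see why this scaling helps, here is a small illustration with random vectors (all values are arbitrary): without dividing by sqrt(d_k), the magnitude of the dot products grows with the dimension, which tends to push the softmax toward a near one-hot distribution and makes gradients vanish.

Python
import torch
import torch.nn.functional as F

d_k = 64
q = torch.randn(d_k)               # query vector of the current token
k = torch.randn(5, d_k)            # key vectors of 5 tokens

raw_scores = k @ q                 # unscaled dot products; magnitude grows with d_k
scaled_scores = raw_scores / d_k ** 0.5

print(F.softmax(raw_scores, dim=-1))     # typically close to one-hot
print(F.softmax(scaled_scores, dim=-1))  # smoother, better-behaved distribution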

3.2 Normalizing Attention Weights

   – For each query token, the attention scores are normalized using softmax so that they sum to 1:

Attention_weights[Token_n, Token_i] = softmax(Attention(Token_n, Token_i))

4. Weighted Sum of Values

   – The output for each Query token is a weighted combination of the Value vectors, where the weights are the normalized attention scores:

Output[Token_n] = sum(Attention_weights[Token_n, Token_i] * v_i for i in range(sequence_length))
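
Putting Steps 3 through 4 together, here is a minimal sketch with random Query, Key, and Value vectors (shapes follow the notation above):

Python
import torch
import torch.nn.functional as F

n, head_size = 8, 16
q = torch.randn(n, head_size)    # Query vectors, one per token
k = torch.randn(n, head_size)    # Key vectors
v = torch.randn(n, head_size)    # Value vectors

scores = q @ k.T / head_size ** 0.5     # Steps 3 and 3.1: scaled dot products, dim(n, n)
weights = F.softmax(scores, dim=-1)     # Step 3.2: each row sums to 1
output = weights @ v                    # Step 4: weighted sum of Values, dim(n, head_size)

print(weights.sum(dim=-1))   # all ones
print(output.shape)          # torch.Size([8, 16])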

Why Does Self-Attention Work?

In many sequence-based tasks, such as machine translation or text summarization, it’s often important to give varying degrees of importance to different parts of the input sequence when generating an output sequence. Traditional sequence models like Recurrent Neural Networks (RNNs) have limitations in capturing long-range dependencies effectively. Attention mechanisms address these limitations by allowing the model to “pay attention” to specific parts of the input sequence dynamically.

Self-attention is powerful because it allows the model to evaluate all positions in the input sequence simultaneously (and in parallel) when making predictions for a particular position. This allows the model to capture long-range dependencies and relationships between elements, which is crucial for tasks like natural language understanding and generation. Additionally, using multiple attention heads allows the model to learn different types of dependencies and patterns between input tokens, making it highly expressive.

What is Masked Self-Attention?

Masked self-attention is used to ensure that the model doesn’t attend to some of the tokens in the input sequence during training or generation.

For example, when working with sequence-to-sequence tasks like machine translation, it's important to prevent information leakage from future positions. That is, when translating a sentence from one language to another, you wouldn't want the model to attend to words that come after the current word being translated, as that would introduce information from the future.

Here's a simplified example. Suppose that while training, we are learning to translate the sentence "I am eating an apple." into French (complete French translation: "Je mange une pomme."), and the model is about to emit the word "mange" (eating). In this scenario, we would want to mask the words "une" and "pomme" so that the model cannot attend to them, ensuring that the generation of "mange" is conditioned only on the preceding word "Je".

In terms of code implementation, tokens are masked by setting their attention scores to -inf before applying the softmax normalization. This ensures that after normalization these positions receive a weight of zero, effectively forcing the model to pay attention only to positions that precede the current position in the sequence.
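
A minimal sketch of this masking step, assuming a square matrix `wei` of raw (scaled) attention scores like the one computed in the implementation below:

Python
import torch
import torch.nn.functional as F

num_tokens = 4
wei = torch.randn(num_tokens, num_tokens)        # raw (scaled) attention scores

# lower-triangular matrix: position i may only attend to positions <= i
tril = torch.tril(torch.ones(num_tokens, num_tokens))
wei = wei.masked_fill(tril == 0, float('-inf'))  # future positions get a score of -inf

wei = F.softmax(wei, dim=-1)                     # -inf becomes a weight of exactly 0
print(wei)                                       # upper triangle is all zeros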

Masked self-attention is crucial for autoregressive language models like GPT (Generative Pre-trained Transformer) because it enforces causality and ensures that the model generates text one token at a time, conditioning each prediction only on the preceding context. This prevents the model from cheating by looking at future information during training and generation. A related form of masking is used by encoder models like BERT, which randomly mask some of the input tokens during masked language modeling.

Self-Attention Implemented from scratch in 30 lines of PyTorch Code

The following is a complete and heavily annotated implementation of an Attention Head in roughly 30 lines of PyTorch code. The code snippet is adapted from Andrej Karpathy's GitHub repo that accompanies his video lecture on implementing a GPT model from scratch.

Python
# Please refer to the "Step-by-Step explanation of Attention" figure above, and
# to the step-by-step explanation of Basic components of Attention above for more info
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    """ one head of self-attention """

    def __init__(self, embed, head_size):
        super().__init__()
        # Step 2: Three projection matrices W^K, W^Q, and W^V of dim(embed, head_size)
        # are initialized for an AttentionHead
        self.key = nn.Linear(embed, head_size, bias=False)    # dim(embed, head_size)
        self.query = nn.Linear(embed, head_size, bias=False)  # dim(embed, head_size)
        self.value = nn.Linear(embed, head_size, bias=False)  # dim(embed, head_size)
        self.head_size = head_size

    def forward(self, x):
        # input of size (num_tokens, embed)
        # output of size (num_tokens, head_size)
        # Key, Query and Value vectors of dim(head_size) are computed for each token
        k = self.key(x)   # dim(num_tokens, head_size) Key Vectors
        q = self.query(x) # dim(num_tokens, head_size) Query Vectors
        v = self.value(x) # dim(num_tokens, head_size) Value Vectors
        # START: compute attention weights ("affinities"): START
        # Step 3: compute Dot product of Query vectors with Key vectors for every token
        wei = q @ k.transpose(-2, -1)         # dim(num_tokens, num_tokens)
        # Step 3.1: To ensure stability of model training, scale the dot product
        wei = wei / self.head_size ** 0.5     # dim(num_tokens, num_tokens)
        # Step 3.2: Normalize the Attention weights to sum to 1 using Softmax
        wei = F.softmax(wei, dim=-1)          # dim(num_tokens, num_tokens)
        # END: compute attention weights ("affinities"): END
        # perform the weighted aggregation of the values for each token
        out = wei @ v                         # dim(num_tokens, head_size)
        return out
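
As a quick sanity check, the head above can be exercised on random input; the dimension values below are arbitrary illustrative choices:

Python
# Example usage of the AttentionHead defined above (illustrative dimensions)
embed, head_size, num_tokens = 64, 16, 8
head = AttentionHead(embed, head_size)
x = torch.randn(num_tokens, embed)   # stand-in for a sequence of token embeddings
out = head(x)
print(out.shape)                     # torch.Size([8, 16]): one head_size vector per token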

Video Explanation

In the following video, Andrej Karpathy implements a GPT model from scratch in PyTorch. In the process, he explains Attention, Self-Attention, and Multi-Head Attention in detail, in addition to the other components of the transformer model.

Even though the video's runtime is about 2 hours, it is "all you need" to understand Attention and Transformers. The whole GPT model is implemented in less than 250 lines of Python code, and the associated code is shared on GitHub.

YouTube Link: https://www.youtube.com/watch?v=kCc8FmEb1nY (Runtime: 2 hrs)

Title: This video is “All you need to understand and implement Attention and Transformers inside-out”

Other Video Recommendations:

The following two video lectures from Prof. Pascal Poupart:

Attention Easter-Egg: The Origin Story

The attention mechanism was first introduced by Dzmitry Bahdanau in the 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate". This work was done while he was interning with Prof. Yoshua Bengio at the University of Montreal. Dzmitry initially called the mechanism "RNNSearch", but the name was later changed to "Attention" by Yoshua Bengio.

Pasted below is an excerpt from an email correspondence Andrej Karpathy had with Dzmitry regarding the origin story of the Attention mechanism:

Title: Andrej Karpathy shares an excerpt from an email correspondence he had with Dzmitry regarding the origin story of Attention. Worth a read on how it could have been called RNNSearch instead of Attention 😀
