The website is in Maintenance mode. We are in the process of adding more features.
Any new bookmarks, comments, or user profiles made during this time will not be saved.

Machine Learning Resources

Explain the need for Positional Encoding in Transformer models

Bookmark this question

Related Questions:

Explain the Transformer Architecture
What are word embeddings? Compare Static embeddings with Contextualized embeddings

Positional encoding is a technique used in the Transformer architecture and other sequence-to-sequence models to provide information about the order and position of elements in an input sequence.

Title: Use of Positional Encodings in Transformers (explained with an example):
Left: the orange highlight blocks in the figure shows the use of Positional Encoding as input to both the Encoder and Decoder blocks of the transformer model
Right: shows an example of calculating positional encoding for input Token n with embeddings of dimension 5. The positional encodings are then added element-wise to the token embedding to generate input embedding for the model
Source: Research

The need for Positional Encoding

In many sequence-based tasks, such as natural language processing, the order of elements in the input sequence is crucial for understanding the context and meaning. However, standard embeddings (e.g., word embeddings) don’t inherently contain information about the position of the elements. This is why positional encoding is necessary.

Unlike recurrent neural networks, the Transformer architecture processes all input tokens in parallel. Without positional information, the input tokens are treated as a bag-of-words, thereby making it difficult for the model to understand the sequential nature of the input. Therefore, positional encoding is added to the input embeddings to help the model understand the sequential structure of the data and differentiate between elements in different positions.

Specifically, in the Transformer architecture, positional encoding is added to the input embeddings before feeding the data into the encoder and decoder stacks. This allows the model to understand the sequential relationships between tokens in the input sequence and generate coherent output sequences, such as translations or text generation.

Here’s how positional encoding works:

Mathematical Representation

The formula for positional encoding is designed to provide a unique encoding for each position in the sequence. The positional encoding vector is then element-wise added to the original input embeddings to generate embeddings that include both semantic as well as positional information. The formula for positional encoding is as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))


pos is the position of the element in the sequence.

i refers to the dimension within the positional encoding vector.

d_model is the dimension of the input embeddings.

Frequency-Based Encoding

The use of sine and cosine functions with different frequencies ensures that different positions have different representations.

– The sin terms create a cycle over positions, with a frequency that decreases exponentially. This means that the positional encoding for each dimension captures a different part of the cycle.

– The cos terms create another cycle with the same properties but with an offset phase.

The choice of 10,000 as the base for the exponential function and the use of both sine and cosine functions are empirical choices that have been found to work well in practice.

Addition to Embeddings

After calculating the positional encoding vectors using the formula above, they are element-wise added to the input embeddings. This addition combines the positional information with the semantic information contained in the embeddings.

Input_with_positional_encoding =
Input_embeddings + Positional_encoding

Visualizing Positional Encoding with a change in input token position

Since, positional encoding is designed to differentiate between different positions of the input tokens, we decided to plot the following figure that shows the value of positional encoding for the first 16 input tokens with 64 dimensional embedding. As the position of input token increases, so does the number of sine and cosine cycles thereby allowing the model to understand the position and order of tokens.

Plotting positional encoding for 16 input tokens of dimension 64
Title: Value of positional encoding as position of input token changes. As the position of input token increases, so does the number of sine and cosine cycles, thereby allowing the model to understand the position and order of input sequence.
Source: Research

Leave your Comments and Suggestions below:

Please Login or Sign Up to leave a comment

Partner Ad  

Find out all the ways
that you can

Explore Questions by Topics

Partner Ad

Learn Data Science with Travis - your AI-powered tutor |