– What are transformers? Discuss the major breakthroughs in transformer models
– What is Natural Language Processing (NLP)? List the different types of NLP tasks
– What are some of the most common practical applications of NLP?
Transformer models, the architecture underlying today's foundation models, offer several key advantages that have contributed to their widespread adoption and success in Natural Language Processing tasks and across other industry domains.
The key advantages are: (1) parallelization, (2) better capture of long-range dependencies, (3) high scalability, (4) effective use of transfer learning (pretraining + finetuning), and (5) wide applicability across industry domains. Each advantage is explained in more detail below:
- Parallelization
Until 2017, the leading language models were Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. The landscape shifted dramatically, however, with the introduction of transformer models in Google's 2017 paper "Attention Is All You Need".
Traditional NLP models unrolled "left to right", processing the sequence one element at a time. Their forward and backward passes therefore involved O(sequence length) inherently sequential operations. Although GPUs can perform many independent operations simultaneously, this sequential dependency prevented their potential from being fully harnessed, which in turn inhibited training on large datasets.
Transformer models, on the other hand, brought a paradigm shift to model training. They introduced a mechanism known as self-attention, which operates on the entire sequence at once: any two positions can interact in O(1) sequential steps, at the cost of O(n²) pairwise comparisons that parallelize extremely well. This made transformer models highly parallelizable, accelerating both training and inference and letting them fully exploit GPU and TPU infrastructure. As a result, transformers represented a pivotal advance in how neural network and NLP models are built and trained.
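To make the contrast concrete, here is a minimal NumPy sketch (toy dimensions and randomly initialized weights, purely illustrative): the RNN-style recurrence must loop over the sequence step by step, while self-attention touches the whole sequence in a few matrix multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
x = rng.standard_normal((seq_len, d))    # toy input sequence

# RNN-style recurrence: O(seq_len) sequential steps; each step
# depends on the previous hidden state and cannot be parallelized.
W_h = rng.standard_normal((d, d)) * 0.1
W_x = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Self-attention: the whole sequence interacts in a few matrix
# multiplications, which GPUs/TPUs execute in parallel.
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)                  # (seq_len, seq_len): all pairs at once
scores -= scores.max(axis=1, keepdims=True)    # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
out = weights @ V                              # (seq_len, d)
```

The loop over `t` is the bottleneck in the recurrent case; the attention path replaces it with dense matrix products over the full sequence.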
- Long-range dependencies
“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
“… the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously.” (J. R. Firth 1935)
From rule-based NLP models in the 1960s to statistical modeling techniques (N-gram models) in the 1980s, and from Hidden Markov Models to Neural Language Models (RNN, LSTM) in the 2000s, culminating in transformers in 2017, the field of NLP has seen a series of new discoveries. These advancements have predominantly aimed at enhancing the language model’s performance through a deeper contextual understanding.
Before transformers came into the picture, state-of-the-art language models processed data sequentially. RNNs, for example, maintained a hidden state that summarized the preceding elements, so distant word pairs required O(sequence length) steps to interact. These hidden states captured contextual information about the sequence, but they struggled with long-range dependencies because of vanishing and exploding gradients.
Transformer models, by contrast, use self-attention to capture contextual information. The maximum interaction distance is O(1): when making a prediction for a given position, every word in the sequence attends to every word in the previous layer. This lets the model relate elements that are far apart in the sequence, effectively modeling long-range relationships.
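A small sketch of this idea (single attention head, no learned projections, NumPy only, all values synthetic): the softmax attention matrix gives every position a direct, nonzero connection to every other position, so even the first and last tokens interact within a single layer.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 8, 16
x = rng.standard_normal((seq_len, d))          # toy token embeddings

# Single-head self-attention without learned projections (illustration only).
scores = x @ x.T / np.sqrt(d)                  # similarity of every pair of positions
scores -= scores.max(axis=1, keepdims=True)    # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax

# Every position attends directly to every other one: the path length
# between the first and last token is 1, regardless of seq_len.
direct_link = weights[-1, 0]                   # strictly positive by construction
```

In an RNN, the same first-to-last interaction would have to survive `seq_len` recurrent steps, which is exactly where vanishing gradients bite.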
- Scalability
Transformers can be scaled to handle large datasets and complex tasks. This scalability is essential for training on massive amounts of data and achieving state-of-the-art performance. Researchers have been continuously pushing the limits of model sizes and training data, aiming to achieve superior performance.
The graph below illustrates the rise of large language models over the past few years. All these models are built upon the original Transformer architecture. While certain models incorporate both encoder and decoder blocks, others utilize either an encoder or a decoder.
- Transfer Learning
Pretraining followed by finetuning has proven to be a highly successful approach for building machine learning models for language-driven tasks. Transformers can be pretrained on large, general-language corpora, capturing rich linguistic patterns and semantics. These pretrained models can then be fine-tuned for specific tasks, greatly reducing the need for task-specific labeled data. The following figure illustrates the process of pretraining, finetuning, and transfer learning with transformer models.
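The pretrain-then-finetune recipe can be sketched in miniature (everything here is a synthetic stand-in, not a real pretrained model): a frozen "pretrained" feature extractor supplies representations, and only a small task head is trained on the downstream labeled data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a pretrained encoder: in practice these weights come from
# pretraining on a large corpus; here they are hypothetical toy weights.
d_in, d_feat = 10, 32
W_pretrained = rng.standard_normal((d_in, d_feat)) * 0.3

def encode(X):
    """Frozen pretrained feature extractor (weights never updated)."""
    return np.tanh(X @ W_pretrained)

# Small labeled dataset for the downstream task.
X = rng.standard_normal((200, d_in))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Finetuning: train only a lightweight logistic-regression head
# on top of the frozen features.
w, b = np.zeros(d_feat), 0.0
lr = 0.5
feats = encode(X)
for _ in range(300):
    p = 1 / (1 + np.exp(-(feats @ w + b)))  # sigmoid task head
    grad = p - y                            # gradient of cross-entropy loss
    w -= lr * feats.T @ grad / len(y)
    b -= lr * grad.mean()

acc = ((p > 0.5) == y).mean()               # training accuracy of the head
```

Because only the head's parameters are updated, the labeled-data and compute requirements are a fraction of what full training from scratch would need; real finetuning may also unfreeze some encoder layers.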
- Wide application of transformer models across domains
Transformers have demonstrated remarkable adaptability across a variety of NLP tasks, including machine translation, sentiment analysis, text classification, and question answering, among others. Moreover, transformers are extensively employed in a diverse array of industries and functional domains, spanning Computer Vision, Speech Recognition, Robotics, Healthcare, Finance, Recommendation Systems, and beyond.
Transformer models have thus emerged as the new state-of-the-art modeling architecture, revolutionizing the fields of NLP and deep learning.