Related Questions:
– Explain Self-Attention, and Masked Self-Attention as used in Transformers
– What are transformers? Discuss the major breakthroughs in transformer models
– Explain the Transformer Architecture
Multi-head attention extends single-head attention by running multiple attention heads in parallel on the same input sequence. This allows the model to learn different types of relationships and patterns within the input data simultaneously, considerably enhancing the model's expressive power compared to using just a single attention head.
Related Question: Explain Attention, and Masked Self-Attention as used in Transformers

Figure: Multi-Head Attention consists of several attention layers running in parallel.
Source: "Attention Is All You Need" (2017)
Implementation of Multi-Head Attention
- Instead of having a single set of learnable Q, K, and V projection matrices, multiple sets are initialized (one for each attention head).
- Each attention head independently computes attention scores and produces its own attention-weighted output.
- The outputs from all the attention heads are concatenated and passed through a linear transformation to produce the final multi-head attention output (see the sketch after this list).
- The key innovation is that each attention head may focus on different parts of the input, thereby capturing different patterns and relationships within the data.
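To make these steps concrete, here is a minimal NumPy sketch of multi-head attention. It is not the original paper's code; the shapes, weight names (W_q, W_k, W_v, W_o), and dimensions are illustrative assumptions, and biases, masking, and dropout are omitted for brevity.

```python
# Minimal multi-head attention sketch (illustrative, not production code).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project the input, then split each projection into per-head slices.
    Q = (x @ W_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (x @ W_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (x @ W_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention, computed independently for each head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    head_outputs = weights @ V                             # (num_heads, seq_len, d_head)

    # Concatenate the heads and apply the final linear projection.
    concat = head_outputs.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                    # (seq_len, d_model)

# Example usage with random weights (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 16)
```

Note the output has the same shape as the input, which is what allows attention blocks to be stacked and wrapped with residual connections in the Transformer.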
Benefits and Limitations of Multi-Head Attention over Single-Head Attention
– Increased Expressiveness: Multi-head attention allows the model to capture different types of dependencies and patterns simultaneously. This is crucial for understanding complex relationships in the data.
– Improved Generalization: By learning multiple sets of attention parameters, the model becomes more robust and adaptable to different tasks and datasets.
– Increased Computational Complexity: While multi-head attention enhances the model's capabilities, it also increases computational cost and memory use, requiring more compute resources. To help mitigate this at inference time, a technique called head pruning can be employed to discard heads that contribute little to the output (see the sketch below).
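As a rough illustration of head pruning, the sketch below zeroes out the outputs of selected heads before the final projection. This is a hypothetical, simplified version; real pruning implementations typically remove the corresponding weight slices entirely to actually save compute, and the head indices here are arbitrary.

```python
# Hypothetical head-pruning sketch: silence selected heads by zeroing
# their outputs before the output projection (illustrative only).
import numpy as np

def prune_heads(head_outputs, heads_to_prune):
    """head_outputs: (num_heads, seq_len, d_head); zero out the pruned heads."""
    pruned = head_outputs.copy()
    pruned[list(heads_to_prune)] = 0.0  # these heads no longer contribute
    return pruned

# Example: drop heads 1 and 3 of a 4-head attention output.
rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(4, 5, 8))
pruned = prune_heads(head_outputs, heads_to_prune=[1, 3])
print(np.allclose(pruned[1], 0), np.allclose(pruned[3], 0))  # True True
```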