*Related Questions:*

*– What is an activation function? Discuss the different types and their pros and cons*

*– Discuss the basic architecture and training of a Neural Network*

*– What do you mean by vanishing gradient and why is that a problem?*

Activation functions transform the weighted sum of a node's inputs plus its bias into an output, which lets each node learn part of a complex function. The most basic activation function is the linear (identity) one, which simply passes that weighted sum through unchanged. No matter how many layers or units are present in the network, using a linear activation function at every node yields nothing more than a standard linear model, because a composition of linear maps is itself linear. Much of the power of Neural Networks therefore comes from using nonlinear activation functions at each node. **ReLU** (Rectified Linear Unit) is one such non-linear activation function and is defined as:

$$f(x) = \max(0, x)$$
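In code, ReLU is simply `max(0, x)`. The minimal sketch below also illustrates the point about linear activations: two stacked scalar "layers" with no activation collapse into a single linear map, while inserting ReLU between them breaks that equivalence. The weights `W1`, `B1`, `W2`, `B2` and the helper names are hypothetical values chosen for illustration.

```python
def relu(x: float) -> float:
    """Rectified Linear Unit: pass positive inputs through, clamp the rest to 0."""
    return max(0.0, x)


def two_layer(x, w1, b1, w2, b2, activation=None):
    """A toy network with one scalar hidden unit and one scalar output unit."""
    h = w1 * x + b1
    if activation is not None:
        h = activation(h)
    return w2 * h + b2


# Hypothetical weights and biases, chosen only for illustration.
W1, B1, W2, B2 = 2.0, -1.0, -3.0, 0.5


def collapsed_linear(x):
    """w2*(w1*x + b1) + b2 rewritten as one linear map (w2*w1)*x + (w2*b1 + b2)."""
    return (W2 * W1) * x + (W2 * B1 + B2)


for x in (-2.0, 0.25, 3.0):
    linear_out = two_layer(x, W1, B1, W2, B2)                   # no activation
    relu_out = two_layer(x, W1, B1, W2, B2, activation=relu)    # ReLU in between
    print(x, linear_out, collapsed_linear(x), relu_out)
```

For every input the first two printed outputs agree, whereas the ReLU network differs wherever the hidden pre-activation is negative.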

### Properties of ReLU activation function

**Linearity (for positive values):** For positive input values, ReLU is a linear function with a slope of 1, meaning it allows the gradient to pass through unchanged. This property simplifies the training process and can accelerate convergence.

**Non-Linearity (for negative values)**: For negative input values, ReLU returns an output of 0. The resulting kink at zero makes the function non-linear overall, which is crucial for modeling complex, non-linear relationships in data.

**Sparsity:** ReLU tends to produce sparse activations because it sets negative values to zero. Sparse activations can be advantageous in some cases, such as reducing the computational cost during inference.
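These properties can be checked on a handful of values. The sketch below assumes the common convention of using 0 as the derivative at exactly `x = 0` (ReLU is not differentiable there), and the `pre_activations` list is made-up sample data.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (convention at x == 0)."""
    return 1.0 if x > 0 else 0.0


# Made-up pre-activation values for a layer of six units.
pre_activations = [-2.0, -0.5, 0.0, 0.3, 1.5, 4.0]

outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

# Sparsity: the fraction of units whose activation is exactly zero.
sparsity = sum(1 for a in outputs if a == 0.0) / len(outputs)

print(outputs)   # [0.0, 0.0, 0.0, 0.3, 1.5, 4.0]
print(grads)     # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(sparsity)  # 0.5
```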

### Advantages and Disadvantages of ReLU

The ReLU, or Rectified Linear Unit, is a strong default choice for hidden layers and is one of the most widely used activation functions in modern neural networks. The key advantages of the ReLU activation function are:

**Advantages of ReLU:**

**Avoids Vanishing Gradient:** Unlike the sigmoid and tanh activation functions, ReLU does not saturate for positive values. This property helps mitigate the vanishing gradient problem, making it suitable for training deep neural networks.

**Simple, Fast and Efficient**: ReLU is cheap to compute because it involves only a simple thresholding at zero. This efficiency makes it well-suited for large-scale neural networks and deep architectures.
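As a rough illustration of why saturation matters, the sketch below multiplies the same local derivative across ten layers. It is a deliberate simplification: it ignores the weight terms that also appear in the chain rule, and the helper `chained_gradient` is a hypothetical name. The sigmoid derivative is at most 0.25 (reached at `x = 0`), so even in the best case the product shrinks quickly, while the ReLU derivative stays at 1 for positive inputs.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)  # never larger than 0.25


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


def chained_gradient(local_grad, pre_activation: float, depth: int) -> float:
    """Product of `depth` identical local derivatives (weights ignored)."""
    return local_grad(pre_activation) ** depth


depth = 10
sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, depth)  # best case for sigmoid
relu_chain = chained_gradient(relu_grad, 2.0, depth)        # active ReLU units

print(sigmoid_chain)  # 0.25 ** 10, roughly 9.5e-07
print(relu_chain)     # 1.0
```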

ReLU also has a few shortcomings that have led to the development of numerous ReLU variants discussed later in the article. Disadvantages of using ReLU include:

**Disadvantages of ReLU:**

**Dead Neurons**: A common issue with ReLU is the occurrence of “dead neurons.” These are neurons that always output zero, effectively becoming inactive and not contributing to the learning process. Dead neurons can occur if the weights are initialized, or pushed by a large update during training, in such a way that the neuron’s pre-activation is negative for all inputs. Because the gradient of ReLU is zero in that region, the neuron’s weights then stop being updated. Variants of ReLU, such as Leaky ReLU and Parametric ReLU, have been introduced to address this issue.

**Non-zero centered output:** The output of ReLU is always non-negative and, therefore, not centered around zero. Non-zero-centered outputs can slow convergence. Consequently, the ReLU activation function is often combined with normalization techniques such as batch normalization or layer normalization, which aid in achieving faster convergence.
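Both shortcomings can be reproduced with a single toy neuron. In this sketch the weight `w`, bias `b`, and `inputs` are hypothetical values picked so that the pre-activation `w * x + b` is negative for every input, which leaves the neuron with zero output and zero gradient everywhere.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def neuron(x: float, w: float, b: float) -> float:
    return relu(w * x + b)


def grad_wrt_weight(x: float, w: float, b: float) -> float:
    """d relu(w*x + b) / dw: equals x where the unit is active, 0 otherwise."""
    return x if w * x + b > 0 else 0.0


inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]  # symmetric around zero
w, b = 0.5, -2.0                      # hypothetical: large negative bias

outputs = [neuron(x, w, b) for x in inputs]
weight_grads = [grad_wrt_weight(x, w, b) for x in inputs]

# A neuron is "dead" on this data if it never fires and never receives gradient.
is_dead = all(o == 0.0 for o in outputs) and all(g == 0.0 for g in weight_grads)
print(is_dead)  # True

# Non-zero-centered output: zero-mean inputs give a positive-mean ReLU output.
mean_input = sum(inputs) / len(inputs)
mean_relu = sum(relu(x) for x in inputs) / len(inputs)
print(mean_input, mean_relu)
```

Since every gradient is zero, gradient descent cannot move `w` or `b`, so the neuron stays dead on this data.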

### Comparison of ReLU with other activation functions

### Other Variants of ReLU

Several variants make slight modifications to the ReLU function, such as the Leaky ReLU, Parametric ReLU, and ELU, to deal with the shortcomings of ReLU.

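**Leaky ReLU**: Leaky ReLU addresses the “dead neurons” problem, in which the gradient vanishes for negative inputs, by allowing a small, non-zero gradient for negative inputs. It has the form:

$$
f(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \le 0
\end{cases}
$$

where α is a small fixed constant, for example 0.01.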
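A minimal sketch of Leaky ReLU and its derivative follows. The default slope `alpha=0.01` is a commonly used value and is an assumption here, not something fixed by the definition.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Identity for positive inputs, a small linear slope `alpha` otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Derivative: 1 for positive inputs, `alpha` (never zero) otherwise."""
    return 1.0 if x > 0 else alpha


print(leaky_relu(3.0))        # 3.0
print(leaky_relu(-2.0))       # -0.02
print(leaky_relu_grad(-2.0))  # 0.01
```

Because the derivative for negative inputs is `alpha` rather than 0, a unit whose pre-activation is negative still receives a gradient signal.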

**Parametric ReLU (PReLU)**: PReLU has the same form as Leaky ReLU, but the slope parameter α is learned during training rather than fixed in advance.
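To make "learned during training" concrete, the sketch below treats α as a trainable scalar and applies one plain gradient-descent step to it. The function names, the learning rate, the starting value of α, and the upstream gradients are all hypothetical. Real frameworks handle this through automatic differentiation.

```python
def prelu(x: float, alpha: float) -> float:
    return x if x > 0 else alpha * x


def prelu_grad_alpha(x: float) -> float:
    """d prelu / d alpha: x for non-positive inputs, 0 for positive inputs."""
    return 0.0 if x > 0 else x


def sgd_step_alpha(alpha, xs, upstream_grads, lr=0.1):
    """One gradient-descent update of alpha, summing over a batch via the chain rule."""
    grad = sum(g * prelu_grad_alpha(x) for x, g in zip(xs, upstream_grads))
    return alpha - lr * grad


alpha = 0.25               # hypothetical starting slope
xs = [-2.0, 1.0]           # made-up pre-activations
upstream = [1.0, 1.0]      # made-up dLoss/dOutput values

new_alpha = sgd_step_alpha(alpha, xs, upstream)
print(new_alpha)
```

Only the negative input contributes to the update, since α has no effect on the output for positive inputs.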

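**Exponential Linear Unit (ELU)**: ELU is another variant that smooths the transition for negative inputs to avoid dead neurons. It has the following form:

$$
f(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha \left(e^{x} - 1\right) & \text{if } x \le 0
\end{cases}
$$

where α > 0 sets the value, −α, toward which the output saturates for large negative inputs.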
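A minimal sketch of ELU and its derivative is shown below, assuming the common default `alpha=1.0`. It uses `math.expm1`, which computes `exp(x) - 1` accurately for small `x`.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative: 1 for positive inputs, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)


print(elu(2.0))        # 2.0
print(elu(0.0))        # 0.0
print(elu(-1000.0))    # -1.0, the saturation value -alpha
print(elu_grad(-1.0))  # positive, so the unit still receives gradient
```

Unlike Leaky ReLU, the negative branch saturates toward `-alpha` instead of growing without bound, and the gradient there is small but non-zero.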