What is an activation function, and what are some of the most common choices for activation functions?

Activation functions transform the weighted sum of a node's inputs (plus a bias) into the node's output, giving each node the ability to learn part of a complex function. The most basic choice is the linear (identity) activation, which simply passes the weighted sum through unchanged. No matter how many layers or units a network has, using a linear activation at every node collapses the whole network into a standard linear model. Much of the power of Neural Networks therefore comes from using nonlinear activation functions at each node. Some of the most common choices are as follows:

  • Sigmoid (Logistic): The sigmoid function, familiar from Logistic Regression, outputs values in the range (0, 1). It is therefore well-suited for the output layer of a binary classifier, where the output is interpreted as a probability.
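A minimal sketch of the sigmoid in plain Python (the function name is illustrative):

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large positive inputs saturate toward 1, large negative toward 0,
# and sigmoid(0) is exactly 0.5.
```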

  • TanH: The hyperbolic tangent function works similarly to the sigmoid but outputs values in the range (-1, 1), so its outputs are zero-centered. For a long time, TanH was considered an acceptable default activation function for the hidden layers of a network, and it is still commonly used in Recurrent Neural Networks.
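A quick sketch showing TanH alongside the sigmoid (helper names are illustrative); TanH is in fact a rescaled, shifted sigmoid, which makes the similarity concrete:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_act(z):
    # Hyperbolic tangent: zero-centered output in (-1, 1)
    return math.tanh(z)

# Identity relating the two: tanh(z) = 2 * sigmoid(2z) - 1
```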

  • ReLU: The ReLU, or Rectified Linear Unit, is generally considered the best-performing default activation function for the hidden layers of Artificial Neural Networks. It is very simply defined as f(z) = max(0, z): if the input is less than 0, it outputs 0, and otherwise it outputs the input unchanged. There are alternate formulations of the ReLU that make slight modifications, such as the Leaky ReLU, which instead of outputting 0 for negative inputs outputs a small multiple of the input (e.g., 0.01z). The ReLU activation is often preferred because it helps mitigate the vanishing gradient problem, which frequently occurs in the training of Neural Networks.

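Both variants can be written in a couple of lines; the 0.01 slope for the Leaky ReLU below is a common default, not a fixed part of the definition:

```python
def relu(z):
    # f(z) = max(0, z): negative inputs are clipped to 0,
    # positive inputs pass through unchanged.
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # For negative inputs, output a small multiple of the input
    # instead of 0, so the gradient there is alpha rather than 0.
    return z if z > 0 else alpha * z
```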
  • Softmax: In the case where the output layer has multiple units, as in multiclass classification, the Softmax activation is appropriate. Its outputs can be interpreted as the probabilities of an observation belonging to each class, and they sum to 1.
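A minimal Softmax sketch in plain Python; subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import math

def softmax(logits):
    # Shift by the max logit so math.exp never overflows;
    # the shift cancels out in the normalization.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Each output is in (0, 1) and the outputs sum to 1,
    # so they can be read as class probabilities.
    return [e / total for e in exps]
```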