What are some guidelines for choosing activation functions?

For the output layer, the activation function must match the form of the target being predicted. In a regression problem the output is a continuous numeric value, so a linear (identity) activation is the usual choice; the ReLU or TanH are reasonable only when the target is known to be non-negative or bounded between -1 and 1, respectively, since those are the only values these functions can produce. In a binary classification setting, the sigmoid is the standard activation for the output layer, since its output lies between 0 and 1 and can be interpreted as the probability of belonging to the target class. In a multiclass setting, the softmax is the canonical choice, so that the output layer produces one probability per class and those probabilities sum to 1.
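
As a rough illustration of the output-layer choices described above, here is a minimal NumPy sketch. The function names and the example `logits` values are assumptions made for this example and do not come from any particular library.

```python
import numpy as np

def linear(z):
    # Identity activation: passes the raw value through (regression).
    return z

def sigmoid(z):
    # Squashes a real number into (0, 1) (binary classification).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    # Normalize so the outputs sum to 1 (multiclass classification).
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical pre-activation values from a final layer.
logits = np.array([2.0, -1.0, 0.5])

regression_output = linear(logits[0])  # unbounded real value
binary_prob = sigmoid(logits[0])       # probability of the positive class
class_probs = softmax(logits)          # one probability per class
```

Here `class_probs` sums to 1 and its largest entry corresponds to the largest logit, which is what allows the softmax output to be read as a distribution over classes.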

There is no rigid criterion for choosing the activation for the hidden layers, but usually the same activation function is applied to all units in a hidden layer. The motivation for building deep networks with many hidden layers is to learn complex functions, and this requires nonlinear activations: a stack of layers with only linear activations collapses to a single linear transformation. In artificial neural networks, the most common choices are the ReLU, TanH, and Sigmoid. While the Sigmoid and TanH are nonlinear, they are prone to saturation: for weighted sums of large magnitude, their outputs concentrate toward the flat extreme ends of the curve, where the gradient is close to zero. This makes them more likely to suffer from the vanishing gradient problem, in which gradients shrink as they are propagated back through many layers, so the early layers learn slowly or not at all. The ReLU does not saturate for positive inputs, so it is less prone to these issues and is now generally the default choice for hidden layers.
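
To make the saturation argument concrete, the sketch below evaluates the derivative of each activation at a few pre-activation values. It is only an illustration: the helper names, the sample inputs, and the depth of 10 are assumptions chosen for this example, and the ReLU derivative at exactly 0 is set to 0 by convention.

```python
import numpy as np

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0.
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    # Derivative of tanh: 1 - tanh(z)^2, which peaks at 1 when z = 0.
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return (z > 0).astype(float)

# Hypothetical pre-activation values: large negative, zero, large positive.
z = np.array([-10.0, 0.0, 10.0])

sig_g = sigmoid_grad(z)   # near zero at both extremes
tanh_g = tanh_grad(z)     # near zero at both extremes
relu_g = relu_grad(z)     # stays at 1 for the positive input

# Backpropagation multiplies one such derivative per layer. Even in the
# best case for the sigmoid (derivative 0.25 at every layer), the product
# over an assumed depth of 10 layers is already very small.
depth = 10
best_case_sigmoid = 0.25 ** depth
```

The sigmoid and TanH derivatives are tiny at both extremes, while the ReLU derivative remains 1 for the positive input, which is why gradients pass through active ReLU units without shrinking.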