Activation functions

When choosing an activation function, weigh each option's strengths, weaknesses, and typical usage, as summarized for the functions below.

Figure 1: Overview

Sigmoid

Strengths: Maps any real-valued number to a value between 0 and 1, making it suitable for binary classification problems.

Weaknesses: Saturates (i.e., output values approach 0 or 1) for large inputs, leading to vanishing gradients during backpropagation.

Usage: Binary classification, logistic regression.

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
Figure 2: Sigmoid Functions

Hyperbolic Tangent (Tanh)

Strengths: Similar to sigmoid, but maps to (-1, 1), so outputs are zero-centered, which can help optimization.

Weaknesses: Also saturates, leading to vanishing gradients.

Usage: Hidden layers where zero-centered outputs are preferred; commonly used in recurrent networks (LSTM/GRU).

\[ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \]

def tanh(x):
    return np.tanh(x)
Figure 3: Hyperbolic Tangent

Rectified Linear Unit (ReLU)

Strengths: Computationally efficient, non-saturating, and easy to compute.

Weaknesses: Not differentiable at x = 0, and neurons can "die" (output zero for every input) if they settle into the negative region, where the gradient is zero.

Usage: Default activation function in many deep learning frameworks, suitable for most neural networks.

\[ \text{ReLU}(x) = \max(0, x) \]

def relu(x):
    return np.maximum(0, x)
Figure 4: ReLU and Variants

Leaky ReLU

Strengths: Similar to ReLU, but allows a small, non-zero output (αx) for negative inputs, which keeps gradients flowing and helps prevent dying neurons.

Weaknesses: Still non-differentiable at x=0.

Usage: Alternative to ReLU, especially when dealing with dying neurons.

\[ \text{Leaky ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases} \]

def leaky_relu(x, alpha=0.01):
    # where α is a small constant (e.g., 0.01)
    return np.where(x > 0, x, x * alpha)
Figure 5: Leaky ReLU

Swish

The general (parametric) form is \( \text{Swish}(x) = x \cdot \sigma(\beta x) \), where \(\beta\) is a fixed or learnable parameter; setting \(\beta = 1\) gives the form shown below.

Strengths: Self-gated, smooth, and non-saturating for positive inputs.

Weaknesses: More expensive to compute than ReLU; the parametric form adds a learnable parameter \(\beta\).

Usage: Can be used in place of ReLU or other activations, but may not always outperform them.

\[ \text{Swish}(x) = x \cdot \sigma(x) \]

def swish(x):
    return x * sigmoid(x)
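
A minimal sketch of the parametric form mentioned above, with \(\beta\) exposed as an argument (the name swish_beta is just for illustration; beta=1 reproduces swish):

def swish_beta(x, beta=1.0):
    # parametric Swish: x * sigmoid(beta * x); beta may be fixed or learned
    return x * sigmoid(beta * x)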

See also: sigmoid

Figure 6: Swish

Mish

Strengths: Smooth, non-monotonic, and non-saturating for positive inputs.

Weaknesses: Not as well-studied as ReLU, and more expensive to compute (it composes softplus and tanh).

Usage: Alternative to ReLU, especially in computer vision tasks.

\[ \text{Mish}(x) = x \cdot \tanh(\text{Softplus}(x)) \]

def mish(x):
    # softplus is defined in the SoftPlus section below
    return x * np.tanh(softplus(x))
Figure 7: Mish

See also: softplus, tanh

Softmax

Strengths: Normalizes output to ensure probabilities sum to 1, making it suitable for multi-class classification.

Weaknesses: Only suitable as an output-layer activation, and the raw exponentials can overflow unless the inputs are shifted as in the code below.

Usage: Output layer activation for multi-class classification problems.

\[ \text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}} \]

def softmax(x):
    # subtract the max for numerical stability (does not change the result)
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
Figure 8: SoftMax
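
A quick usage sketch, applying the same idea row-wise to a batch of logits (the axis handling here is an addition for illustration, not part of softmax above):

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])
e = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)  # each row sums to 1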

Softsign

Strengths: Similar to tanh (output in (-1, 1)), but approaches its asymptotes more gradually, polynomially rather than exponentially.

Weaknesses: Not commonly used, may not provide significant benefits over sigmoid or tanh.

Usage: Alternative to sigmoid or tanh in certain situations.

\[ \text{Softsign}(x) = \frac{x}{1 + |x|} \]

def softsign(x):
    return x / (1 + np.abs(x))
Figure 9: SoftSign

SoftPlus

Strengths: Smooth, differentiable everywhere, and strictly positive; a soft approximation of ReLU.

Weaknesses: Not commonly used, may not outperform other activations.

Usage: Experimental or niche applications.

\[ \text{Softplus}(x) = \log(1 + e^x) \]

def softplus(x):
    # log(1 + e^x), computed stably to avoid overflow for large x
    return np.logaddexp(0, x)
Figure 10: SoftPlus

ArcTan

Strengths: Smooth, zero-centered, and bounded in (-π/2, π/2), saturating more gently than tanh.

Weaknesses: Rarely used in practice; still saturates for large |x|, so vanishing gradients remain a concern.

Usage: Experimental or niche applications.

\[ \text{ArcTan}(x) = \tan^{-1}(x) \]

def arctan(x):
    return np.arctan(x)
Figure 11: Arc Tangent

Gaussian Error Linear Unit (GELU)

Strengths: Smooth, non-monotonic near zero, and non-saturating for positive inputs.

Weaknesses: Not as well-studied as ReLU, and more expensive to compute (the exact form requires the Gaussian CDF).

Usage: Alternative to ReLU, widely used in Transformer models such as BERT and GPT.

\[ \text{GELU}(x) = x \cdot \Phi(x) \]

where \( \Phi(x) \) is the cumulative distribution function of the standard normal distribution.

def gelu(x):
    # tanh approximation of GELU (the exact form uses the Gaussian CDF)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi)
                                  * (x + 0.044715 * np.power(x, 3))))
Figure 12: GELU

See also: tanh

Sigmoid Linear Unit (SiLU)

Strengths: Non-saturating, smooth, and computationally efficient.

Weaknesses: Not as well-studied as ReLU or other activations.

Usage: Alternative to ReLU, especially in computer vision models (e.g., EfficientNet, YOLO); identical to Swish with \(\beta = 1\).

\[ \text{SiLU}(x) = x \cdot \sigma(x) \]

def silu(x):
    # x * sigmoid(x), written in closed form
    return x / (1 + np.exp(-x))
Figure 13: SiLU

GELU Approximation (GELU Approx.)

\[ \text{GELU}(x) \approx 0.5\,x \left( 1 + \tanh\!\left( \sqrt{2/\pi}\,\bigl( x + 0.044715\,x^{3} \bigr) \right) \right) \]

Strengths: Fast, non-saturating, and smooth.

Weaknesses: Approximation, not exactly equal to GELU.

Usage: Alternative to GELU, especially when computational efficiency is crucial.
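
For comparison, a minimal sketch of the exact GELU via the error function (the gelu function above already implements the tanh approximation); the name gelu_exact and the SciPy dependency are assumptions for illustration:

from scipy.special import erf

def gelu_exact(x):
    # exact GELU: x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))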

SELU (Scaled Exponential Linear Unit)

\[ \text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \leq 0 \end{cases} \]

Strengths: Self-normalizing, non-saturating, and computationally efficient.

Weaknesses: Self-normalization only holds with specific weight initialization (e.g., LeCun normal) and mostly plain feed-forward architectures; \(\lambda\) and \(\alpha\) are fixed constants derived analytically rather than tuned.

Usage: Alternative to ReLU, especially in deep neural networks.
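
A minimal NumPy sketch in the style of the other examples, using the commonly published constants \(\lambda \approx 1.0507\) and \(\alpha \approx 1.6733\):

def selu(x, lam=1.0507, alpha=1.6733):
    # lambda * x for x > 0, lambda * alpha * (e^x - 1) otherwise
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1))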