Activation Functions: The Gates of Neural Networks
The Mathematics of Decision Making
In a neural network, an activation function decides whether a neuron should "fire" or not. But more importantly, it introduces non-linearity. Without it, no matter how many layers you add, the whole network would collapse into a single linear transformation—incapable of learning complex patterns like faces or speech.
1. The Sigmoid Function
Sigmoid maps any value to a range between 0 and 1. Historically, it was the gold standard because it mimics the firing rate of biological neurons.
Equation: σ(x) = 1 / (1 + e^(-x))
Downside: It suffers from the "Vanishing Gradient" problem. The gradient is at most 0.25 (at x = 0), and for inputs of large magnitude it becomes almost zero, causing early layers of the network to stop learning.
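A minimal NumPy sketch (function names are my own) showing both the sigmoid and why its gradient vanishes: the derivative σ(x)·(1 − σ(x)) peaks at 0.25 and collapses toward zero for large |x|.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25 — the maximum possible gradient
print(sigmoid_grad(10.0))  # ~4.5e-05 — effectively zero: learning stalls
```

During backpropagation these small factors multiply across layers, which is why deep sigmoid networks train so slowly.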
2. The Tanh (Hyperbolic Tangent)
Tanh is similar to Sigmoid but maps values to the range (-1, 1). Being "zero-centered" makes it generally superior to Sigmoid for hidden layers: its outputs are not all positive, so the gradients flowing to the next layer are less biased in one direction, which makes optimization more stable. It still saturates at the extremes, however, so it does not escape the vanishing gradient problem.
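A quick illustration, assuming NumPy's built-in `np.tanh`: for a symmetric batch of inputs, tanh outputs are centered around zero, while sigmoid outputs all sit above 0.5 on average.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

tanh_out = np.tanh(x)                 # values in (-1, 1), mean ≈ 0
sig_out = 1.0 / (1.0 + np.exp(-x))   # values in (0, 1), mean > 0

print(tanh_out.mean())  # 0.0 for symmetric input — "zero-centered"
print(sig_out.mean())   # 0.5 for symmetric input — always positive
```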
3. ReLU (Rectified Linear Unit)
ReLU is the current industry standard. It’s incredibly simple: if the input is negative, the output is 0. If the input is positive, the output is the same as the input.
Equation: f(x) = max(0, x)
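The equation above is one line of NumPy. A sketch (function names are my own) of ReLU and the Leaky ReLU variant mentioned below, which keeps a small slope for negative inputs so neurons can't get permanently stuck at zero:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): zero for negatives, identity for positives
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small negative slope (alpha) avoids "dead" neurons
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, 0.0, 3.0])
print(relu(x))        # [0. 0. 3.]
print(leaky_relu(x))  # [-0.03  0.    3.  ]
```

Note the gradient of ReLU is exactly 1 for all positive inputs, which is why it sidesteps the vanishing gradient problem that plagues Sigmoid and Tanh.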
Which to use?
- Hidden Layers: Almost always use ReLU (or variants like Leaky ReLU).
- Output Layer (Binary): Use Sigmoid to get a probability between 0 and 1.
- Output Layer (Multi-class): Use Softmax to get a probability distribution across categories.
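Softmax deserves a concrete sketch since it's the only function above that operates on a whole vector rather than element-wise. This minimal version (names are my own) uses the standard max-subtraction trick so large logits don't overflow `exp`:

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick;
    # it doesn't change the result because it cancels in the ratio.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw scores for 3 classes
probs = softmax(logits)

print(probs)        # a valid probability distribution
print(probs.sum())  # 1.0
```

The largest logit always gets the largest probability, so softmax preserves the network's ranking of classes while turning raw scores into something you can threshold or sample from.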