Why Neural Networks Need He Init, Clipping, and Momentum
A step-by-step interactive example exploring common neural network training problems and their solutions.
Overview
Goal: Build a regression neural network that learns to approximate x² over the range [−4, 4]. The focus is on keeping the implementation minimal (no libraries, just raw JavaScript) so training runs fast and the feedback loop stays tight. This reveals the core challenges in training deep networks: instability, dead neurons, and the need for careful initialization and optimization. The network is trained using full-batch gradient descent with mean squared error (MSE) loss.
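As a point of reference, here is a minimal sketch of that setup (variable names are illustrative, not taken from the original code): the training set samples x² on [−4, 4], and the loss is plain MSE over the full batch.

```js
// Full-batch training data: sample the target function y = x^2 on [-4, 4].
const xs = [];
const ys = [];
for (let x = -4; x <= 4; x += 0.25) {
  xs.push(x);
  ys.push(x * x);
}

// Mean squared error of a prediction function over the whole batch.
function mseLoss(predict) {
  let sum = 0;
  for (let i = 0; i < xs.length; i++) {
    const diff = predict(xs[i]) - ys[i];
    sum += diff * diff;
  }
  return sum / xs.length;
}
```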
Linear Model
A single-layer network with no activation function can only fit straight lines. Stacking multiple linear layers doesn't help—they just combine into one linear operation.
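To see the collapse concretely, here is a tiny sketch with made-up weights: two stacked linear layers compute exactly the same function as one linear layer with slope w2·w1 and intercept w2·b1 + b2.

```js
// Two stacked linear "layers" (one neuron each, no activation).
const w1 = 0.5, b1 = 0.25;
const w2 = -1.5, b2 = 0.75;
const stacked = x => w2 * (w1 * x + b1) + b2;

// The equivalent single linear layer.
const collapsed = x => (w2 * w1) * x + (w2 * b1 + b2);

console.log(stacked(2), collapsed(2)); // same output: still a straight line
```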
Result: The network draws a straight line through the parabola. It cannot capture the curve.
ReLU Activation
Introducing ReLU (max(0, x)) allows the network to create piecewise linear segments. With enough neurons, these segments approximate the curve.
This aligns with the Universal Approximation Theorem: a single hidden layer with sufficient width can approximate any continuous function to arbitrary precision. For simple functions like x², a shallow network is often sufficient.
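To make the piecewise-linear idea concrete, here is a sketch with hand-picked weights (not weights found by training): four ReLU hinges, summed by the output layer, match x² exactly at x = −4, −2, 0, 2, 4 and interpolate linearly in between.

```js
const relu = z => Math.max(0, z);

// Four hidden units, each contributing a "hinge" at x = -b / w.
const hidden = [
  { w: -1, b: -2 }, // active for x < -2
  { w: -1, b: 0 },  // active for x < 0
  { w: 1, b: 0 },   // active for x > 0
  { w: 1, b: -2 },  // active for x > 2
];
const outW = [4, 2, 2, 4]; // output-layer weights (output bias is zero)

function predict(x) {
  let y = 0;
  for (let i = 0; i < hidden.length; i++) {
    y += outW[i] * relu(hidden[i].w * x + hidden[i].b);
  }
  return y;
}

for (const x of [-4, -2, -1, 0, 1, 2, 4]) {
  console.log(x, predict(x), x * x); // exact at the knots, linear in between
}
```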
Result: With 16 neurons, the network fits the parabola.
Gradient Explosion
Adding a second hidden layer (16 neurons each) increases capacity. While overkill for a parabola, this setup demonstrates gradient instability in deeper networks. With standard normal initialization via randn(), gradients explode during backpropagation: the signal is amplified uncontrollably as it passes through multiple layers.
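The effect is easy to reproduce. In this sketch, randn() is a Box-Muller stand-in for the sampler mentioned above, and the layer width matches the 16-unit layers in this section: with N(0, 1) weights, each 16-input matrix multiply scales the signal by roughly sqrt(16) = 4, and the same amplification hits gradients on the way back.

```js
// Standard normal sampler (Box-Muller), standing in for randn().
function randn() {
  const u = 1 - Math.random(); // avoid log(0)
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * Math.random());
}

// Naive init: every weight drawn from N(0, 1).
const initLayer = (fanIn, fanOut) =>
  Array.from({ length: fanOut }, () =>
    Array.from({ length: fanIn }, () => randn()));

// Push a random signal through four naively initialized 16-wide layers
// and watch its scale grow by roughly 4x per layer.
let signal = Array.from({ length: 16 }, () => randn());
for (let layer = 1; layer <= 4; layer++) {
  const W = initLayer(16, 16);
  signal = W.map(row => row.reduce((sum, w, i) => sum + w * signal[i], 0));
  const rms = Math.sqrt(signal.reduce((s, v) => s + v * v, 0) / signal.length);
  console.log(`layer ${layer}: RMS of activations ~ ${rms.toFixed(1)}`);
}
```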
Result: Loss spikes to NaN within a few epochs.
He Initialization
He initialization draws weights from a standard normal distribution and scales them by sqrt(2 / fan_in), where fan_in is the number of inputs to the layer. This keeps signal variance roughly stable across layers, preventing both explosion and vanishing.
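A sketch of the corresponding change, reusing the randn() helper from the previous snippet: the only difference is the sqrt(2 / fan_in) scale factor, which keeps the RMS of activations roughly constant instead of quadrupling per layer.

```js
// He initialization: standard-normal weights scaled by sqrt(2 / fan_in).
const heInitLayer = (fanIn, fanOut) => {
  const scale = Math.sqrt(2 / fanIn);
  return Array.from({ length: fanOut }, () =>
    Array.from({ length: fanIn }, () => randn() * scale));
};
```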
Result: Gradient explosion is largely resolved; NaN errors become much rarer. However, the output often remains flat due to dying ReLU: when a neuron's pre-activation is negative for every input, ReLU outputs zero and its gradient is zero, so the neuron stops learning and never recovers.
Gradient Clipping
Gradient clipping addresses the dying ReLU problem by capping gradient magnitudes (e.g., at ±5). Because no single update can be huge, higher learning rates can be used without pushing neurons so far into the negative region that they stop learning permanently.
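A sketch of element-wise clipping at the ±5 threshold mentioned above, applied just before the weight update:

```js
// Cap each gradient component at +/- limit before the update step.
const clipGradient = (grad, limit = 5) =>
  Math.max(-limit, Math.min(limit, grad));

// Usage inside the update loop (sketch):
// weight -= learningRate * clipGradient(dLossDWeight);
```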
Result: Works well for this simple task. On real problems, gradient clipping often reduces learning speed.
Momentum
Standard gradient descent treats each step independently. Momentum accumulates a velocity term that persists across iterations, accelerating convergence in consistent directions.
Update rule: velocity = β × velocity - lr × gradient; weight += velocity
Typical momentum coefficient: β = 0.9.
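A sketch of that update applied to a flat array of parameters; one velocity entry is kept per weight and persists across epochs (names here are illustrative).

```js
const beta = 0.9;          // momentum coefficient
const learningRate = 0.01; // typically reduced compared with plain gradient descent

// weights, grads, and velocities are parallel arrays; velocities start at 0.
function momentumStep(weights, grads, velocities) {
  for (let i = 0; i < weights.length; i++) {
    velocities[i] = beta * velocities[i] - learningRate * grads[i];
    weights[i] += velocities[i];
  }
}
```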
Result: The network converges faster. The learning rate must be reduced to prevent overshooting due to the accumulated velocity.
From Loops to Matrices
The previous examples used simple loops for calculations. Real deep learning systems use vectorization—processing entire arrays of numbers at once.
In JavaScript this approach is actually slower, because the overhead of the mathjs library outweighs the gains. But frameworks like PyTorch and TensorFlow are built around this style because it lets them offload the computation to the GPU for massive speedups.
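For illustration, here is what one hidden layer's forward pass looks like in the matrix style, using mathjs's multiply, add, and map functions (the shapes and weights here are made up):

```js
import { multiply, add, map } from 'mathjs';

const X = [[-4], [-2], [0], [2], [4]]; // batch of 5 inputs, shape [5, 1]
const W1 = [[0.5, -1.0, 1.5]];         // hidden weights, shape [1, 3]
const b1 = [0.1, 0.2, -0.3];           // hidden biases

// Pre-activations for the whole batch in a single matrix multiply: shape [5, 3].
const Z1 = multiply(X, W1);

// Add the bias row to every example, then apply ReLU element-wise.
const B1 = X.map(() => b1);
const H = map(add(Z1, B1), z => Math.max(0, z));

console.log(H);
```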
Summary
Even a simple two-layer network reveals the core challenges of deep learning. While He Initialization, Gradient Clipping, and Momentum fix basic instability, real-world models often require additional tools like Normalization, Dropout, and Regularization to work reliably.
If you want to dig deeper, Aurélien Géron's Hands-On Machine Learning is an excellent start.