LeNet-5: A Visual Guide
Interactive visualizations of the pioneering neural network that learned to read handwritten digits for banks.
LeNet-5
In 1998, Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner published Gradient-Based Learning Applied to Document Recognition. The paper culminates ten years of research and introduces LeNet-5, a convolutional neural network trained on MNIST and deployed by banks to read checks. It was one of the first neural networks to succeed in a real-world application.
The visualizer below shows data flowing through each layer. You can draw a digit to run it through the network.
Architecture
LeNet-5 processes an image through 7 trainable layers. Convolutional layers detect local features by applying learned filters, followed by a squashing activation. A digit is recognized by how its features relate to each other, not by their exact pixel positions. To achieve this, subsampling layers reduce the resolution of each feature map, making the output less sensitive to shifts and distortions.

Architecture of LeNet-5. Reproduced from the original paper using NN-SVG
- C layers — convolution over 5×5 areas, extract local features
- S layers — 2×2 averaging with trainable weight and bias, reduce sensitivity to shifts and distortions
- F6 — fully connected layer, 84 units
- Output — 10 units, each measuring distance to a 7×12 digit pattern
Every layer through F6 applies a scaled hyperbolic tangent, f(a) = 1.7159 · tanh(2a/3), as its activation function. Compared to plain tanh, it has a steeper slope in the active input range, giving gradients more signal during training. The switch in the top right of each visualizer toggles pre- and post-activation values.

Scaled tanh has steeper slope in the active input range
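In code, the activation is a one-liner; the constants 1.7159 and 2/3 are the ones given in the paper:

```python
import numpy as np

def scaled_tanh(a):
    """LeNet-5 activation: f(a) = 1.7159 * tanh(2a/3).
    The constants are chosen so that f(1) = 1 and f(-1) = -1,
    keeping unit-variance inputs in the high-gain region."""
    return 1.7159 * np.tanh(2.0 / 3.0 * a)

# f(+-1) lands almost exactly on the target values +-1
print(scaled_tanh(np.array([-1.0, 0.0, 1.0])))  # ~ [-1.  0.  1.]
```

The slope at the origin is 1.7159 · 2/3 ≈ 1.14, slightly steeper than plain tanh's slope of 1.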
Input
Pixel values are first divided by 255, then linearly shifted. The background level maps to −0.1 and the foreground level to 1.175. This keeps the mean close to 0 and the variance close to 1, where the scaled tanh is most sensitive.
The original MNIST digits are 28×28. Two pixels of padding are added on each side, giving a 32×32 input. This ensures features near the edges are fully visible to the filters.
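A minimal NumPy sketch of this preprocessing; the linear map follows directly from the −0.1 and 1.175 endpoints above:

```python
import numpy as np

def preprocess(digit_28x28):
    """Normalize and pad a raw MNIST digit as described above.

    Pixels in [0, 255] are mapped linearly so that background (0)
    becomes -0.1 and full foreground (255) becomes 1.175, then the
    28x28 image is padded with two background pixels on each side.
    """
    x = digit_28x28.astype(np.float32) / 255.0  # -> [0, 1]
    x = x * 1.275 - 0.1                         # -> [-0.1, 1.175]
    return np.pad(x, 2, constant_values=-0.1)   # -> 32x32

img = np.zeros((28, 28), dtype=np.uint8)
out = preprocess(img)
print(out.shape)  # (32, 32)
```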
C1 Layer
C1 applies 6 learned filters to the input, each producing a 28×28 feature map. Each map captures a different local pattern. The math behind each output value is shown in the visualizer.
Try drawing a + in the visualizer above. Filter 4 lights up strongly along the horizontal bar. If you draw an X, Filter 5 responds most strongly to the / diagonal.
For more details on how image kernels work, see this visual explanation.
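As a rough sketch of what a C1 unit computes, here is a plain "valid" 2D convolution in NumPy. The hand-written horizontal-edge kernel is a hypothetical stand-in for a learned filter, and the trainable bias and scaled tanh that follow in the real layer are omitted:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation: slide a k x k kernel over the
    image and take a weighted sum at every position. A 32x32 input
    and a 5x5 kernel give a (32 - 5 + 1) = 28x28 feature map."""
    k = kernel.shape[0]
    h = image.shape[0] - k + 1
    w = image.shape[1] - k + 1
    out = np.empty((h, w), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

x = np.random.rand(32, 32).astype(np.float32)
# Hypothetical horizontal-edge detector: bright above, dark below
horizontal_edge = np.array(
    [[1] * 5, [1] * 5, [0] * 5, [-1] * 5, [-1] * 5], dtype=np.float32)
fmap = conv2d_valid(x, horizontal_edge)
print(fmap.shape)  # (28, 28)
```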
S2 Layer
Each feature map is downsampled by averaging non-overlapping 2×2 blocks, halving spatial dimensions to 6 maps of 14×14. Unlike modern average pooling, each map has a trainable weight and bias applied after the average.
A digit shifted one pixel should still be recognized the same way. Averaging over 2×2 blocks absorbs small changes in position. Try shifting the input to see how the output changes.
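The S2 operation can be sketched in a few lines of NumPy; the weight and bias here are placeholders for trained values, and the activation is omitted:

```python
import numpy as np

def subsample(fmap, weight, bias):
    """S2-style subsampling: average each non-overlapping 2x2 block,
    then apply one trainable weight and bias per feature map.
    28x28 -> 14x14."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)
    avg = blocks.mean(axis=(1, 3))
    return weight * avg + bias

fmap = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)
out = subsample(fmap, weight=1.0, bias=0.0)
print(out.shape)  # (14, 14)
```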
C3 Layer
A second convolution produces 16 maps of 10×10. Each output map connects to only a subset of the 6 input maps. The paper uses this sparse connection table to reduce parameters and break symmetry, forcing maps to learn different feature combinations.
|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|
| 1 | X |   |   |   | X | X | X |   |   | X  | X  | X  | X  |    | X  | X  |
| 2 | X | X |   |   |   | X | X | X |   |    | X  | X  | X  | X  |    | X  |
| 3 | X | X | X |   |   |   | X | X | X |    |    | X  |    | X  | X  | X  |
| 4 |   | X | X | X |   |   | X | X | X | X  |    |    | X  |    | X  | X  |
| 5 |   |   | X | X | X |   |   | X | X | X  | X  |    | X  | X  |    | X  |
| 6 |   |   |   | X | X | X |   |   | X | X  | X  | X  |    | X  | X  | X  |
Each column indicates which S2 feature maps are combined by the units in a particular C3 feature map. Reproduced from the original paper.
Where C1 detected low-level features like edges and corners, C3 encodes relationships between them. This is how it distinguishes a "3" from an "8".
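The connection scheme can be encoded as a plain lookup table (0-indexed here). As a sanity check, counting one 5×5 kernel per connection plus one bias per map reproduces the paper's parameter count for C3:

```python
# Which S2 maps feed each of the 16 C3 maps (Table I of the paper):
# maps 0-5 take 3 consecutive S2 maps, 6-11 take 4 consecutive,
# 12-14 take 4 with a gap, and map 15 takes all 6.
C3_CONNECTIONS = [
    (0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (0, 4, 5), (0, 1, 5),
    (0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5), (0, 3, 4, 5),
    (0, 1, 4, 5), (0, 1, 2, 5),
    (0, 1, 3, 4), (1, 2, 4, 5), (0, 2, 3, 5),
    (0, 1, 2, 3, 4, 5),
]

# Parameter count: one 5x5 kernel per connection, plus one bias per map
n_params = sum(len(c) * 25 for c in C3_CONNECTIONS) + 16
print(n_params)  # 1516, matching the paper's count for C3
```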
S4 Layer
The same operation as S2: 2×2 averaging with trainable weight and bias shrinks the 16 maps from 10×10 to 5×5.
C5 Layer
C5 applies 120 filters to the 16 maps of 5×5. Each filter spans all 16 maps and matches their full 5×5 extent, so each filter produces a single value: 120 maps of 1×1.
In a modern network this would typically be a flatten followed by a fully connected layer. The original paper keeps it as a convolution on purpose. The network was designed to process multiple characters in a single pass. This is how LeNet-5 is used inside the SDNN.
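The difference shows up in the convolution arithmetic alone: a "valid" convolution outputs input width − kernel width + 1 positions, so on a single digit C5 collapses to one value per filter, while on a wider strip the same filters simply slide further (the 5×21 width below is a hypothetical example):

```python
def conv_width(in_w, k=5):
    """Output width of a 'valid' convolution with kernel width k."""
    return in_w - k + 1

# On a single digit, S4 leaves 16 maps of 5x5, so C5's 5x5 filters
# fit exactly once: one value per filter, i.e. 120 maps of 1x1.
print(conv_width(5))   # 1

# On a hypothetical wider strip where the S4 maps are 5x21, the same
# filters slide horizontally and yield 17 positions -- a sequence of
# 120-vectors, one per position, with no retraining.
print(conv_width(21))  # 17
```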
F6 Layer
A fully connected layer maps the 120 activations down to 84 units. The number 84 corresponds to a 7×12 grid encoding of ASCII characters, the target representation the network was originally designed to match.
Output
A modern network would end with a softmax over the target classes. LeNet-5 does something different. Each output unit holds a fixed 7×12 pattern, one per class.

Target patterns for each digit class and a blank space class. Reproduced from the original paper.
The 84 dimensions of F6 were chosen to match these patterns, mapping the output onto the same 7×12 grid.
Each RBF (Radial Basis Function) unit computes the squared Euclidean distance between the F6 output and its class pattern. The class with the smallest distance is the predicted digit.
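A sketch of the output computation, with random ±1 vectors standing in for the real 7×12 bitmaps:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the fixed 7x12 target bitmaps,
# flattened to 84-dimensional vectors (one per digit class).
targets = rng.choice([-1.0, 1.0], size=(10, 84))
f6 = targets[3] + 0.1 * rng.standard_normal(84)  # a noisy "3"

# Each RBF unit outputs the squared Euclidean distance between the
# F6 vector and its class pattern; the smallest distance wins.
distances = np.sum((f6 - targets) ** 2, axis=1)
print(int(np.argmin(distances)))  # 3
```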
SDNN
The paper presents two approaches to multiple character recognition. The first is Heuristic Over-Segmentation (HOS), which uses a 32×32 sliding window, running LeNet-5 on each extracted patch. The second is the Space Displacement Neural Network (SDNN), which runs LeNet-5 once across the full strip without any segmentation.
This works because C5 is a convolution rather than a fully connected layer. On a wider input, the network produces a sequence of F6 vectors instead of just one. A Graph Transformer Network (GTN) then removes weak and duplicate candidates to produce the final digit sequence.
Training
The network was trained on 60,000 MNIST examples plus 6,000 blank spaces for the SDNN pipeline. Weights are initialized small enough to keep early activations in the sensitive region of the scaled tanh.
Training ran for 20 epochs using online stochastic gradient descent (batch=1), as the paper recommends. The learning rate is stepped down across epochs to allow finer convergence.
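The stepped schedule the paper reports (0.0005 for the first two epochs, 0.0002 for the next three, 0.0001 for the next three, 0.00005 for the next four, 0.00001 thereafter) can be sketched as:

```python
def learning_rate(epoch):
    """Stepped learning-rate schedule as reported in the paper
    (epochs are 0-indexed here)."""
    if epoch < 2:
        return 0.0005
    if epoch < 5:
        return 0.0002
    if epoch < 8:
        return 0.0001
    if epoch < 12:
        return 0.00005
    return 0.00001

print([learning_rate(e) for e in (0, 2, 8, 19)])
```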
The loss function has two parts. One pulls the F6 output toward the correct target pattern. The other penalizes incorrect classes when their distance is too small.
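A sketch of the paper's discriminative criterion for one example, assuming the RBF distances are available; `j` is a constant the paper adds so the competing term stays bounded, and the sum here runs over all classes:

```python
import numpy as np

def map_loss(distances, label, j=0.1):
    """Two-part criterion: the first term pulls down the correct
    class's distance; the log term grows when any class's distance
    gets too small, pushing incorrect classes away."""
    return distances[label] + np.log(np.exp(-j) + np.sum(np.exp(-distances)))

# Hypothetical RBF outputs: class 0 is close, the rest are far
d = np.array([0.5, 8.0, 9.0, 7.5, 8.2, 9.1, 8.8, 7.9, 9.4, 8.6])
print(float(map_loss(d, label=0)))  # ~ 0.91
```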
Training and test error were tracked over 20 epochs. Both curves drop steadily, with the test error closely following the training error.

Error is used instead of accuracy to match the original paper. Best test error: 0.77%.
The full training code is available in the Jupyter notebook.
Conclusion
LeNet-5 set the standard for the core pattern still used in convolutional neural networks today: detect local features, reduce spatial size, then classify with fully connected layers.
Much of what came out of this work is still present in modern deep learning:
- MNIST — assembled from earlier NIST handwritten digit datasets by the same team and used to train LeNet. It later became the "hello world" dataset of machine learning.
- Data augmentation — the 60K MNIST training set was expanded with distorted digits, and the paper showed that the extra data reduced test error. AlexNet used the same idea in 2012, and it has been standard practice since.
- Input padding — digits were padded to 32×32 so filters could detect features near the edges. Padding is now a standard parameter in convolution layers.
- C3 partial connections — each feature map connects to only a subset of the previous layer's maps, breaking symmetry and forcing maps to learn different feature combinations. AlexNet later used grouped convolutions for the same reason.
- C5 as convolution instead of fully connected — allows the network to run on larger inputs without retraining. Fully Convolutional Networks used the same idea in 2015.
- Competitive loss — pulls the correct class closer and pushes incorrect ones away. This became a core idea in contrastive learning (SimCLR, 2020).
- Weight initialization — weights were kept small so activations stayed in the sensitive region of scaled tanh. Xavier/Glorot initialization formalized the same idea in 2010.
- Scaled tanh — the authors modified the activation to reduce saturation. Later work addressing gradient flow in deep networks led to the widespread adoption of ReLU.
This article was inspired by LeCun's demonstration of LeNet-5.