LeNet-5: A Visual Guide

Interactive visualizations of the pioneering neural network that learned to read handwritten digits for banks.

LeNet-5

In 1998, Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner published Gradient-Based Learning Applied to Document Recognition. The paper caps ten years of research and introduces LeNet-5, a convolutional neural network trained on MNIST and deployed by banks to read checks. It was one of the first neural networks to succeed in a real-world application.

The visualizer below shows data flowing through each layer. You can draw a digit to run it through the network.

[Interactive visualizer: draw a digit and follow it through Raw, Input, C1, S2, C3, S4, C5, F6, and Output, with a −Max…+Max activation/distance color scale.]

Architecture

LeNet-5 processes an image through 7 trainable layers. Convolutional layers detect local features by applying learned filters, followed by a squashing activation. A digit is recognized by how its features relate to each other, not by their exact pixel positions. To achieve this, subsampling layers reduce the resolution of each feature map, making the output less sensitive to shifts and distortions.

Architecture of LeNet-5. Reproduced from the original paper using NN-SVG

  • C layers — convolution over 5×5 areas, extract local features
  • S layers — 2×2 averaging with trainable weight and bias, reduce sensitivity to shifts and distortions
  • F6 — fully connected layer, 84 units
  • Output — 10 units, each measuring distance to a 7×12 digit pattern
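The layer shapes listed above can be traced with a quick sketch: a "valid" 5×5 convolution shrinks each side by 4, and 2×2 subsampling halves it.

```python
# Walk the shapes through LeNet-5. Shapes are (channels, height, width);
# a "valid" 5x5 conv shrinks each side by 4, 2x2 subsampling halves it.
trace = {"Input": (1, 32, 32)}                      # padded input
c, h, w = trace["Input"]
trace["C1"] = (6, h - 4, w - 4)                     # 6 @ 28x28
c, h, w = trace["C1"]
trace["S2"] = (c, h // 2, w // 2)                   # 6 @ 14x14
c, h, w = trace["S2"]
trace["C3"] = (16, h - 4, w - 4)                    # 16 @ 10x10
c, h, w = trace["C3"]
trace["S4"] = (c, h // 2, w // 2)                   # 16 @ 5x5
c, h, w = trace["S4"]
trace["C5"] = (120, h - 4, w - 4)                   # 120 @ 1x1
trace["F6"] = (84,)
trace["Output"] = (10,)
for name, shape in trace.items():
    print(name, shape)
```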

Every layer up to F6 applies a scaled hyperbolic tangent as its activation function. Compared to plain tanh, it has a steeper slope in the active input range, giving gradients more signal during training. The switch in the top right of each visualizer toggles between pre- and post-activation values.

scaled_tanh(x) = 1.7159 × tanh(2/3 × x)

Scaled tanh has steeper slope in the active input range
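In code, the activation is a one-liner using the constants from the formula above:

```python
import numpy as np

def scaled_tanh(x):
    """LeNet-5's squashing function: 1.7159 * tanh(2/3 * x)."""
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

# The constants are chosen so that f(+1) = +1 and f(-1) = -1, keeping
# unit-variance inputs in the near-linear, high-gradient region.
print(round(float(scaled_tanh(1.0)), 3))  # 1.0
```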

Input

Pixel values are first divided by 255, then linearly shifted. The background level maps to −0.1 and the foreground level to 1.175. This keeps the mean close to 0 and the variance close to 1, where the scaled tanh is most sensitive.

The original MNIST digits are 28×28. Two pixels of padding are added on each side, giving a 32×32 input. This ensures features near the edges are fully visible to the filters.
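A minimal sketch of this preprocessing (the function name is mine; the mapping 0 → −0.1, 255 → 1.175 and the 2-pixel padding are from the text above):

```python
import numpy as np

def normalize_input(img28):
    """Map pixel 0 -> -0.1 and pixel 255 -> 1.175, then pad the
    28x28 image to 32x32 with the background value."""
    x = img28.astype(np.float32) / 255.0        # scale to [0, 1]
    x = x * 1.275 - 0.1                         # shift to [-0.1, 1.175]
    return np.pad(x, 2, constant_values=-0.1)   # 28x28 -> 32x32

img = np.zeros((28, 28), dtype=np.uint8)
img[10:18, 10:18] = 255                         # a white square "stroke"
out = normalize_input(img)
print(out.shape)  # (32, 32)
```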

[Interactive visualizer: the raw 28×28 digit beside its normalized, padded 32×32 version, with per-pixel values.]

C1 Layer

C1 applies 6 learned filters to the input, each producing a 28×28 feature map. Each map captures a different local pattern. The math behind each output value is shown in the visualizer.

Try drawing a + in the visualizer above. Filter 4 lights up strongly along the horizontal bar. If you draw an X, Filter 5 responds most strongly to the / diagonal.
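A naive sketch of what one C1 map computes; random weights here stand in for a learned filter:

```python
import numpy as np

def conv_valid(x, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel over every
    position where it fully fits inside the input."""
    kh, kw = kernel.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32)).astype(np.float32)    # normalized input
kernel = rng.standard_normal((5, 5)).astype(np.float32) # one learned filter
bias = 0.0

# One feature map: convolve, add bias, squash with the scaled tanh.
fmap = 1.7159 * np.tanh(2 / 3 * (conv_valid(x, kernel) + bias))
print(fmap.shape)  # (28, 28)
```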

[Interactive visualizer: the six C1 feature maps, with the convolution arithmetic behind any selected output value.]

For more details on how image kernels work, see this visual explanation.

S2 Layer

Each feature map is downsampled by averaging non-overlapping 2×2 blocks, halving spatial dimensions to 6 maps of 14×14. Unlike modern average pooling, each map has a trainable weight and bias applied after the average.

A digit shifted one pixel should still be recognized the same way. Averaging over 2×2 blocks absorbs small changes in position. Try shifting the input to see how the output changes.
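The subsampling step, including the trainable per-map weight and bias, can be sketched as:

```python
import numpy as np

def subsample(fmap, weight, bias):
    """LeNet-5 S-layer: average each non-overlapping 2x2 block, then
    scale by a trainable weight and add a trainable bias (the squashing
    activation is applied afterwards)."""
    h, w = fmap.shape
    pooled = fmap.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return pooled * weight + bias

fmap = np.arange(16, dtype=np.float32).reshape(4, 4)
print(subsample(fmap, weight=1.0, bias=0.0))
# [[ 2.5  4.5]
#  [10.5 12.5]]
```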

[Interactive visualizer: the six 14×14 S2 maps produced by 2×2 averaging.]

C3 Layer

A second convolution produces 16 maps of 10×10. Each output map connects to only a subset of the 6 input maps. The paper uses this sparse connection table to reduce parameters and break symmetry, forcing maps to learn different feature combinations.

            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
        1   X           X  X  X        X  X  X  X     X  X
        2   X  X           X  X  X        X  X  X  X     X
        3   X  X  X           X  X  X        X     X  X  X
        4      X  X  X        X  X  X  X        X     X  X
        5         X  X  X        X  X  X  X     X  X     X
        6            X  X  X        X  X  X  X     X  X  X

Each column indicates which S2 feature maps are combined by the units in a particular C3 feature map. Reproduced from the original paper.

Where C1 detected low-level features like edges and corners, C3 encodes relationships between them. This is how it distinguishes a "3" from an "8".
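The connection scheme (the paper's Table I) can be written out and its parameter count verified: 1,516 weights and biases, versus 2,416 for full connectivity.

```python
# Which S2 maps (0-5) feed each of the 16 C3 maps, per the paper's Table I.
C3_TABLE = [
    [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [0, 4, 5], [0, 1, 5],
    [0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5], [0, 3, 4, 5], [0, 1, 4, 5],
    [0, 1, 2, 5], [0, 1, 3, 4], [1, 2, 4, 5], [0, 2, 3, 5],
    [0, 1, 2, 3, 4, 5],
]

# Each connected input map contributes a 5x5 kernel; each C3 map has one bias.
sparse_params = sum(25 * len(srcs) + 1 for srcs in C3_TABLE)
full_params = 16 * (25 * 6 + 1)
print(sparse_params, full_params)  # 1516 2416
```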

[Interactive visualizer: one C3 map and the S2 channels it combines, summed before the activation.]

S4 Layer

Same operation as S2. 2×2 average pooling shrinks the 16 maps from 10×10 to 16 maps of 5×5.

[Interactive visualizer: the sixteen 5×5 S4 maps.]

C5 Layer

120 filters, each spanning all 16 input maps over a 5×5 area. The filters are exactly the size of the maps, so each output is a single value: 120 maps of 1×1.

In a modern network this would typically be a flatten followed by a fully connected layer. The original paper keeps it as a convolution on purpose. The network was designed to process multiple characters in a single pass. This is how LeNet-5 is used inside the SDNN.
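Because each C5 filter exactly covers its input, the "convolution" collapses to a dot product, numerically identical to flatten followed by fully connected. Random weights stand in for learned ones here:

```python
import numpy as np

rng = np.random.default_rng(0)
s4 = rng.standard_normal((16, 5, 5))             # S4 output: 16 maps of 5x5
filters = rng.standard_normal((120, 16, 5, 5))   # C5: 120 filters over all maps

# Convolution view: each filter fully covers the input -> one value per filter.
c5_conv = np.array([np.sum(f * s4) for f in filters])

# Fully connected view: flatten everything and do a matrix-vector product.
c5_fc = filters.reshape(120, -1) @ s4.ravel()

print(np.allclose(c5_conv, c5_fc))  # True
```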

[Interactive visualizer: the 16 S4 maps, the 16 kernels of a selected C5 unit, and their sum producing one of the 120 C5 activations.]

F6 Layer

A fully connected layer maps the 120 activations down to 84 units. The number 84 corresponds to a 7×12 grid encoding of ASCII characters, the target representation the network was originally designed to match.
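F6 is a standard dense layer followed by the scaled tanh. The weights below are random stand-ins, kept small in the spirit of the paper's initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
c5 = rng.standard_normal(120)                  # C5 activations
W = rng.standard_normal((84, 120)) * 0.05      # small init keeps tanh unsaturated
b = np.zeros(84)

f6 = 1.7159 * np.tanh(2 / 3 * (W @ c5 + b))    # 84 units
grid = f6.reshape(12, 7)                       # viewable as a 7-wide, 12-tall bitmap
print(f6.shape, grid.shape)  # (84,) (12, 7)
```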

[Interactive visualizer: the 120 C5 activations multiplied by the 84×120 weight matrix to produce the 84 F6 units.]

Output

A modern network would end with a softmax over the target classes. LeNet-5 does something different. Each output unit holds a fixed 7×12 pattern, one per class.

Target patterns for each digit class and a blank space class. Reproduced from the original paper.

The 84 dimensions of F6 were chosen to match these patterns, mapping the output onto the same 7×12 grid.

[Interactive visualizer: the 84 F6 values rearranged into a 7×12 grid.]

Each RBF (Radial Basis Function) unit computes the squared Euclidean distance between the F6 grid and its pattern. The class with the smallest distance is the predicted digit.
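A sketch of this output stage, with random ±1 bitmaps standing in for the paper's stylized digit patterns:

```python
import numpy as np

def rbf_output(f6, patterns):
    """Squared Euclidean distance from the 84-d F6 vector to each
    class's fixed target pattern; the smallest distance wins."""
    return np.sum((patterns - f6) ** 2, axis=1)

rng = np.random.default_rng(0)
patterns = rng.choice([-1.0, 1.0], size=(10, 84))   # stand-in target bitmaps
f6 = patterns[3] + 0.1 * rng.standard_normal(84)    # an input near class "3"

distances = rbf_output(f6, patterns)
print(int(np.argmin(distances)))  # 3
```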

[Interactive visualizer: the F6 grid compared against each class pattern, with the ten resulting distances.]
The 7×12 grid is a visual convenience. The distance is computed directly on the 84-dimensional vectors.

SDNN

The paper presents two approaches to multiple character recognition. The first is Heuristic Over-Segmentation (HOS), which uses a 32×32 sliding window, running LeNet-5 on each extracted patch. The second is the Space Displacement Neural Network (SDNN), which runs LeNet-5 once across the full strip without any segmentation.

This works because C5 is a convolution rather than a fully connected layer. On a wider input, the network produces a sequence of F6 vectors instead of just one. A Graph Transformer Network (GTN) then removes weak and duplicate candidates to produce the final digit sequence.
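The shape arithmetic behind this: on a wider strip, every layer simply produces wider maps, and C5's 5×5 window slides horizontally instead of collapsing to a single position. For a 32×60 strip:

```python
def conv5(h, w):        # "valid" 5x5 convolution
    return h - 4, w - 4

def pool2(h, w):        # 2x2 subsampling
    return h // 2, w // 2

h, w = 32, 60           # a wider input strip
h, w = conv5(h, w)      # C1 -> 28x56
h, w = pool2(h, w)      # S2 -> 14x28
h, w = conv5(h, w)      # C3 -> 10x24
h, w = pool2(h, w)      # S4 -> 5x12
h, w = conv5(h, w)      # C5 -> 1x8
print(h, w)             # 1 8 -> eight F6 vectors, one per horizontal position
```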

[Interactive visualizer: a 32×60 strip flowing through the network, producing C1 28×56, S2 14×28, C3 10×24, S4 5×12, C5 120×8, and F6 84×8.]
The visualizer shows the F6 outputs as 7×12 grids. The GTN filtering step is beyond the scope of this article.

Training

The network was trained on 60,000 MNIST examples plus 6,000 blank spaces for the SDNN pipeline. Weights are initialized small enough to keep early activations in the sensitive region of the scaled tanh.

Training ran for 20 epochs using online stochastic gradient descent (batch=1), as the paper recommends. The learning rate is stepped down across epochs to allow finer convergence.

The loss function has two parts. One pulls the F6 output toward the correct target pattern. The other penalizes incorrect classes when their distance is too small.
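A simplified sketch of such a two-part criterion; the exact constants and summation follow the paper's discriminative loss, and the penalty cap `j` here is an assumed value:

```python
import numpy as np

def two_part_loss(distances, target, j=1.0):
    """First term pulls the correct class's RBF distance toward zero;
    the second term grows when any wrong class's distance gets small.
    Simplified sketch of the paper's discriminative criterion."""
    wrong = np.delete(distances, target)
    competitive = np.log(np.exp(-j) + np.sum(np.exp(-wrong)))
    return float(distances[target] + competitive)

good = two_part_loss(np.array([9.0, 9.0, 0.5]), target=2)  # confident and correct
bad = two_part_loss(np.array([0.5, 9.0, 9.0]), target=2)   # a wrong class is close
print(good < bad)  # True
```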

Training and test error were tracked over 20 epochs. Both curves drop steadily, with the test error closely following the training error.

Training and test error curves over 20 epochs

Error is used instead of accuracy to match the original paper. Best test error: 0.77%.

The full training code is available in the Jupyter notebook.

Conclusion

LeNet-5 set the standard for the core pattern still used in convolutional neural networks today: detect local features, reduce spatial size, then classify with fully connected layers.

Much of what came out of this work is still present in modern deep learning:

  • MNIST — assembled from earlier NIST handwritten digit datasets by the same team and used to train LeNet. It later became the "hello world" dataset of machine learning.
  • Data augmentation — the 60K MNIST training set was expanded with distorted digits, and the paper showed that the extra data reduced test error. AlexNet used the same idea in 2012, and it has been standard practice since.
  • Input padding — digits were padded to 32×32 so filters could detect features near the edges. Padding is now a standard parameter in convolution layers.
  • C3 partial connections — each feature map connects to only a subset of previous maps to break symmetry and force maps to learn different feature combinations. AlexNet later used grouped convolutions.
  • C5 as convolution instead of fully connected — allows the network to run on larger inputs without retraining. Fully Convolutional Networks used the same idea in 2015.
  • Competitive loss — pulls the correct class closer and pushes incorrect ones away. This became a core idea in contrastive learning (SimCLR, 2020).
  • Weight initialization — weights were kept small so activations stayed in the sensitive region of scaled tanh. Xavier/Glorot initialization formalized the same idea in 2010.
  • Scaled tanh — the authors modified the activation to reduce saturation. Later work addressing gradient flow in deep networks led to the widespread adoption of ReLU.

This article was inspired by LeCun's demonstration of LeNet-5.