How a Transformer Plays Tic-Tac-Toe

An interactive guide to how attention, embeddings, and positional encoding work together to predict the next move.

In 2017, the paper "Attention Is All You Need" introduced the Transformer architecture, which is now the foundation of modern large language models. Text prediction requires complex models, and what happens inside them is difficult to see. This article uses fading Tic-Tac-Toe instead. The game is simple enough that even a basic model plays it well, and every step is easy to follow.

Play


The goal is to get three in a row before your moves fade. Only the last 6 moves stay on the board; older ones vanish. Moves about to disappear turn light gray, and the latest move is green.

Each square is numbered 0 to 8, so a game can be encoded as a sequence of moves:

X5 O6 X4 O3 X0 O8 X2 O1 X6 O4 X7 …

Because old moves disappear, the game can keep going for as long as it takes for someone to win. Fading also makes move order matter. In standard Tic-Tac-Toe only position counts. Here, when a move was played matters too.
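The fading rule can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code: the function name `visible_board` and the string format of the moves are assumptions made for the example.

```python
def visible_board(moves, keep=6):
    """Return a 9-cell board that shows only the last `keep` moves."""
    board = [" "] * 9
    for move in moves[-keep:]:
        player, square = move[0], int(move[1])
        board[square] = player
    return board

# First eight moves of the example sequence: X5 and O6 have already faded.
game = "X5 O6 X4 O3 X0 O8 X2 O1".split()
board = visible_board(game)
for row in range(3):
    print("|".join(board[3 * row: 3 * row + 3]))
```

Squares 5 and 6 come back empty because the two oldest moves are no longer among the last six.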

The Transformer receives this sequence and predicts the next move. Six models are available. Two are complete architectures; four are ablations with one component removed:

  • 1-head — single-head attention, 32-dimensional.
  • 2-head — two-head attention, 16 dimensions each.
  • no-pos — positional encoding removed. The model sees tokens but not their order.
  • no-causal — causal mask removed. The model can attend to future positions during training.
  • no-MLP — MLP block removed. Attention is the only transformation.
  • no-res — residual stream removed. Each block overwrites rather than accumulates.

Architecture

The Transformer consists of six main blocks: tokenizer, embedding, positional encoding, attention, MLP, and unembedding.

The model processes a sequence of moves in parallel, one vector per token. Each vector flows through the blocks via the residual stream, which carries information forward so that every block adds to the representation instead of overwriting it. At the end, each vector produces a prediction. Only the last one is used to choose the next move (the others are used for training). The chosen move is appended to the sequence, and the process repeats.

Attention is the only block where tokens interact with each other. In every other block they are processed in total isolation.

[Figure: Transformer architecture]

Blocks are duplicated to show they operate independently on each token. Only attention uses all tokens at once.

  • Tokenizer — Converts moves into integers.
  • Embedding — Maps each integer to a 32-dimensional vector.
  • Positional encoding — Marks when each move was played in the sequence.
  • Attention — The communication layer where tokens read information from each other.
  • MLP — A reasoning step that refines the representation of each move.
  • Unembedding — Turns each vector into scores over possible next moves.

Tokenizer

The model works with vectors stored in a matrix, so it needs a numerical index to fetch them. The tokenizer maps each move to that integer index.

The 20-token vocabulary includes 9 squares for X and 9 squares for O, plus the special tokens START and PAD.

X0 → 0, X1 → 1, X2 → 2, X3 → 3, X4 → 4, X5 → 5, X6 → 6, X7 → 7, X8 → 8
O0 → 9, O1 → 10, O2 → 11, O3 → 12, O4 → 13, O5 → 14, O6 → 15, O7 → 16, O8 → 17
START → 18, PAD → 19
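The mapping above is regular enough to compute rather than store. A minimal sketch follows. The function names `encode` and `decode` are illustrative and not taken from the article's notebook.

```python
SPECIAL = {"START": 18, "PAD": 19}
SPECIAL_NAMES = {index: name for name, index in SPECIAL.items()}

def encode(move):
    """Map a move such as 'X5' or 'O6' to its integer token."""
    if move in SPECIAL:
        return SPECIAL[move]
    player, square = move[0], int(move[1])
    if player not in "XO" or not 0 <= square <= 8:
        raise ValueError(f"unknown move: {move}")
    # X squares occupy 0..8, O squares occupy 9..17.
    return square + (0 if player == "X" else 9)

def decode(token):
    """Inverse of encode."""
    if token in SPECIAL_NAMES:
        return SPECIAL_NAMES[token]
    return ("X" if token < 9 else "O") + str(token % 9)

print([encode(m) for m in ["START", "X5", "O6", "X4"]])
```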

Embedding

The embedding layer maps each token to a point in a 32-dimensional vector space the model can do math with.

[Interactive figure: each token selects one row of the 20×32 embedding table, producing one 32-dimensional embedding per token]

These vectors are learned. They shift during training until the model finds a useful arrangement.
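An embedding lookup is just row selection. The sketch below uses a randomly initialised table as a stand-in for the learned one; the initialisation scale and the variable names are assumptions for illustration only.

```python
import random

VOCAB_SIZE, WIDTH = 20, 32
rng = random.Random(0)

# Stand-in for the learned table: one 32-dimensional row per token.
embedding_table = [[rng.gauss(0.0, 0.02) for _ in range(WIDTH)]
                   for _ in range(VOCAB_SIZE)]

def embed(tokens):
    """Look up one row of the table per token index."""
    return [embedding_table[t] for t in tokens]

tokens = [18, 5, 15]  # START, X5, O6
vectors = embed(tokens)
print(len(vectors), len(vectors[0]))
```

The same token always yields the same vector at this stage, which is why a separate positional signal is needed.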

Positional Encoding

The math in the attention block is order-agnostic. Without help, the model cannot distinguish between sequences like X1 O2 and O2 X1.

Positional encoding fixes this by adding a unique vector to each move based on its position in the sequence. This "stamps" every token with its timing to keep track of move order.

[Interactive figure: embeddings + positions (rows of a 16×32 table) = residual stream, with one 32-dimensional row per token]

The model uses learned positional embeddings (as in GPT-2). These parameters shift during training to represent the game's timeline.
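The addition can be sketched as follows. Both tables are random stand-ins for learned parameters, and the helper names are hypothetical; the point is only that the same move played at two different times ends up with two different vectors.

```python
import random

VOCAB_SIZE, CONTEXT, WIDTH = 20, 16, 32
rng = random.Random(0)

def random_table(rows, cols):
    """Random stand-in for a learned parameter table."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(VOCAB_SIZE, WIDTH)
position_table = random_table(CONTEXT, WIDTH)

def embed_with_positions(tokens):
    """Add the position vector for slot i to the embedding of token i."""
    if len(tokens) > CONTEXT:
        raise ValueError("sequence longer than the context length")
    return [[e + p for e, p in zip(token_table[t], position_table[i])]
            for i, t in enumerate(tokens)]

same_move_twice = embed_with_positions([5, 5])
print(same_move_twice[0] != same_move_twice[1])
```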

Attention

Attention is the communication layer. Each move looks back at the sequence to gather context about the game state. The entire exchange is described by a single formula:

$$z = \text{softmax}\left(\frac{QK^\top + M}{\sqrt{d}}\right)V$$

This formula says: compare every move to every other move (QKᵀ), restrict to the past (M), turn scores into weights (softmax), and use them to mix information (V).

Before computing attention, the vectors are normalized so all values are on a comparable scale for stable training.

[Interactive figure: residual stream → normalized]

Note: Layer normalization is not necessary for this small model, but it is shown here as it is a standard part of the Transformer architecture.
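For reference, layer normalization on a single vector can be sketched like this. The `gain` and `bias` arguments stand for the learned scale and shift, and the `eps` value is a common default rather than a setting confirmed by the article.

```python
import math

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Shift a vector to zero mean and scale it to unit variance."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n
    gain = gain or [1.0] * n   # learned scale, identity by default
    bias = bias or [0.0] * n   # learned shift, zero by default
    return [g * (v - mean) / math.sqrt(variance + eps) + b
            for v, g, b in zip(x, gain, bias)]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in normalized])
```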

The normalized vectors are transformed into three learned projections: Queries (Q), Keys (K), and Values (V). Because all three come from the same sequence, this is called self-attention.

[Interactive figure: normalized × W_Q = Q, with Q highlighted in the attention formula]

Note: Matrix multiplications are visualised as a row-column diagram. Rows are separated to show that each token is processed independently.

Multiplying Q and K is the first time tokens communicate. It produces a score for every pair of moves, measuring how relevant each move is to every other.

[Interactive figure: Q × Kᵀ = scores, with QKᵀ highlighted in the attention formula]

A causal mask (M) sets future positions to -∞ so the model can only look at the past. Scores are divided by $\sqrt{d}$ for stability, where d = 32 is the attention head dimension. Row-wise softmax then turns each row into attention weights that sum to 1.

[Interactive figure: scores → masked and scaled scores → attention weights]
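The masking, scaling, and softmax steps can be sketched on a plain score matrix. The function name is illustrative. The order of operations follows the formula in this article, where the mask is added before the division by the square root of d.

```python
import math

def causal_attention_weights(scores, d):
    """Apply the causal mask, scale by sqrt(d), then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may only look at positions j <= i.
        masked = [(s if j <= i else -math.inf) / math.sqrt(d)
                  for j, s in enumerate(row)]
        peak = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]
for row in causal_attention_weights(scores, d=32):
    print([round(w, 3) for w in row])
```

The first row puts all of its weight on the first token, since every later position is masked out.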

The model uses these weights to pick information from each move's Value (V). This is the final exchange of information. Weighted rows of V are combined into context vectors summarizing the game state.

[Interactive figure: attention × V = context vectors]

The context vectors are added back to the residual stream, updating each token with information from the sequence.

[Interactive figure: residual stream + context vectors = updated residual stream]

Multi-Head Attention

Multi-head attention runs several attention operations in parallel, each on a lower-dimensional slice of the residual stream. The 2-head model uses two heads of 16 dimensions each. The benefit is not capacity but diversity, as each head can specialise in different patterns.

Both heads compute Q, K, and V independently from the same input. The total Q, K, V capacity remains the same as single-head attention, just split equally between the two heads.

[Interactive figure: normalized × W_Q = Q, computed separately for each of the 2 heads]

Each head computes its own score matrix. The two are entirely independent and can weight the sequence differently.

[Interactive figure: Q₁ × K₁ᵀ = scores for the selected head]

The causal mask and softmax are applied per head. Because the space is split, the head dimension is now d = 16.

[Interactive figure: scores → masked and scaled scores → attention weights, per head]

Each set of attention weights is multiplied by its own V to produce 16-dimensional context vectors.

[Interactive figure: attention × V₁ = context vectors for the selected head]

Since each output is only 16-dimensional, the two are concatenated and mixed through a learned projection back into 32 dimensions. The result is then added to the residual stream. This projection is the only addition of learned weights compared to single-head attention.

$$z = [z_1, z_2] W_O$$

[Interactive figure: concatenated context vectors × W_O = residual update, 32 dimensions per token]
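Putting the pieces together, two-head causal self-attention can be sketched in plain Python. The weights here are random stand-ins, layer normalization is left out to keep the example short, and all function names are assumptions rather than the article's code.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """One causal head: softmax((Q K^T + M) / sqrt(d)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        row = []
        for j, k_j in enumerate(k):
            score = sum(a * b for a, b in zip(q_i, k_j))
            row.append((score if j <= i else -math.inf) / math.sqrt(d))
        weights.append(softmax(row))
    return matmul(weights, v)

def multi_head_update(x, heads, w_o):
    """Run every head, concatenate their outputs per token, then mix with W_O."""
    outputs = [attention_head(x, *weights) for weights in heads]
    concat = [[value for out in outputs for value in out[i]] for i in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

WIDTH, HEADS, HEAD_DIM = 32, 2, 16
heads = [(rand(WIDTH, HEAD_DIM), rand(WIDTH, HEAD_DIM), rand(WIDTH, HEAD_DIM))
         for _ in range(HEADS)]
w_o = rand(WIDTH, WIDTH)

x = rand(4, WIDTH)  # four tokens in the residual stream
update = multi_head_update(x, heads, w_o)
residual = [[a + b for a, b in zip(r, u)] for r, u in zip(x, update)]
print(len(residual), len(residual[0]))
```

Because of the causal mask, changing the last token leaves the update for earlier tokens untouched.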

MLP

While attention moves information between tokens, the MLP (Multi-Layer Perceptron) processes each move individually. It acts as a reasoning step that updates the representation of each move.

The MLP also begins with normalization, using its own independent learned parameters.

[Interactive figure: residual stream → normalized]

The first linear layer expands the vector to 128 dimensions. A GELU activation provides the nonlinearity needed to recognize strategic patterns like winning lines, blocks, and traps.

$$m = x W_{up} + b$$

[Interactive figure: normalized × W_up (128×32) = expanded features, 128 dimensions per token]

The second linear layer projects the 128 dimensions back down to 32. This compresses the reasoning so it fits back into the residual stream.

$$r = m W_{down} + b$$

[Interactive figure: expanded features × W_down (32×128) = MLP features, 32 dimensions per token]

The final result is added back to the stream. This update is much larger than the values in the residual stream and reshapes how the model sees each move.

[Interactive figure: residual stream + MLP features = updated residual stream]
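The MLP steps can be sketched for a single token vector. The weights are random stand-ins, normalization is left out to keep the sketch short, and the exact (erf-based) form of GELU is assumed, since the article does not say which variant is used. Weight matrices are stored with one row per output unit, matching the 128×32 and 32×128 shapes shown above.

```python
import math
import random

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    """weight holds one row per output unit, e.g. 128 rows of 32 values."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def mlp_block(x, w_up, b_up, w_down, b_down):
    """Expand to the hidden width, apply GELU, project back, add to the stream."""
    hidden = [gelu(h) for h in linear(x, w_up, b_up)]
    update = linear(hidden, w_down, b_down)
    return [a + u for a, u in zip(x, update)]

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

WIDTH, HIDDEN = 32, 128
w_up, b_up = rand(HIDDEN, WIDTH), [0.0] * HIDDEN
w_down, b_down = rand(WIDTH, HIDDEN), [0.0] * WIDTH

x = rand(1, WIDTH)[0]  # one token's residual vector
y = mlp_block(x, w_up, b_up, w_down, b_down)
print(len(y))
```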

The architecture can stack more attention-MLP layers to handle harder tasks. For Tic-Tac-Toe, one layer is enough to produce an accurate prediction.

Unembedding

Unembedding is the final step. The model projects every vector in the residual stream into scores (logits) for the 20 possible tokens in the vocabulary.

These logits are then converted into a probability distribution via row-wise, temperature-scaled softmax.

$$\text{logits} = x W_U$$

[Interactive figure: residual stream × W_U = logits over the 20 tokens X0 to X8, O0 to O8, START (S), and PAD (P)]

The temperature controls the sharpness of the distribution. Low temperature concentrates probability on the top choice. High temperature flattens it, making weaker moves more likely.

To predict the next move, the model takes the last row of probabilities and builds a cumulative distribution. It then draws a random horizontal line and selects the move where it crosses the staircase.

[Interactive figure: cumulative distribution (staircase) over the 20 tokens at T = 1.00, with a random horizontal line selecting the move]

The selected token is appended to the sequence, and the process repeats for the next turn.
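Temperature scaling and the staircase selection can be sketched as follows. The function names are illustrative, and treating a temperature of 0 as plain argmax is an assumption about how the argmax mode might be handled.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        # Assumed handling of argmax mode: all probability on the top logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_from_cdf(probs, u):
    """Return the first index whose cumulative probability exceeds u."""
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against rounding when u is close to 1

logits = [2.0, 1.0, 0.0]  # toy logits for three tokens
probs = softmax_with_temperature(logits, temperature=0.5)
move = sample_from_cdf(probs, random.random())
print([round(p, 3) for p in probs], move)
```

Here `u` plays the role of the random horizontal line, and the loop walks up the staircase until the line is crossed.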

Analysis

Attention

Every model learns a different attention pattern.

  • 1-head — learns to ignore moves that have faded from the board.
  • 2-head — does the same, but splits attention across two heads: one tracks the opponent's moves, the other its own.
  • no-res — unlike the 1-head and 2-head models, the diagonal is activated because there is no residual stream to carry the current move forward.
  • no-pos — fails to identify which moves have disappeared due to the lack of a timeline.
  • no-causal — cheats by looking at the next move to predict it, so it has no strategy for the final token when the next move is unknown.
  • no-MLP — tracks player turns and basic lines but lacks the logic to calculate blocks or winning moves.

Order

Without positional information, the no-pos model cannot tell which X move is about to fade. It plays O5 in both games regardless of the order. The 1-head model blocks correctly on O4 in the second game.

Note: in the no-pos model, the attention scores for each move in the last row are identical regardless of its position in the sequence.

Swapping the last move does change the output, since its attention scores are what drive the prediction. No-pos now plays O1 in both cases instead of O5, but still misses the block on O4 in the second game.

Temperature

Fading Tic-Tac-Toe is deterministic. X has a forced winning strategy (try playing as O against the 1-head or 2-head model in argmax mode). Most positions have one correct move, so temperature near 0 works best. A small amount of randomness helps vary the opening, where several moves are equally good. Around temperature 1 the model starts blundering, and by temperature 2 it may predict illegal moves like placing on an occupied square or playing twice for the same player.

Training

Data

In practice, training a neural network to play Tic-Tac-Toe would be overkill. The game is small enough to solve with the minimax algorithm. An adaptation of this approach is used to simulate 50000 fading Tic-Tac-Toe games, with 20% of moves chosen at random so the model sees weaker lines and learns to respond when the opponent makes mistakes.

[START,  X1,  O6,  X4,  O7,  X8,  O0,  X2,  O5,  X6,  O1,  X4, PAD, PAD, PAD, PAD, PAD]
[START,  X5,  O0,  X4,  O3,  X1,  O6, PAD, PAD, PAD, PAD, PAD, PAD, PAD, PAD, PAD, PAD]
[START,  X5,  O0,  X3,  O4,  X8,  O1,  X7,  O5,  X6, PAD, PAD, PAD, PAD, PAD, PAD, PAD]

START marks the beginning of every game, providing context for the first move. PAD is used for training only. It fills the remaining slots to a fixed length of 17 so the model can be trained in batches.
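Padding a game to the fixed length can be sketched as below. Splitting the 17 tokens into 16 inputs and 16 shifted targets is the usual next-token setup and is an assumption here, although it matches the 16 training signals per game mentioned later. The function names are hypothetical.

```python
def pad_game(moves, length=17):
    """Prefix START and fill the remaining slots with PAD."""
    sequence = ["START"] + list(moves)
    if len(sequence) > length:
        raise ValueError("game is longer than the fixed training length")
    return sequence + ["PAD"] * (length - len(sequence))

def inputs_and_targets(sequence):
    """Each position is trained to predict the token that follows it."""
    return sequence[:-1], sequence[1:]

sequence = pad_game("X5 O0 X4 O3 X1 O6".split())
inputs, targets = inputs_and_targets(sequence)
print(len(sequence), inputs[0], targets[0])
```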

Model

The 1-head and 2-head models use a single attention-MLP layer. The four ablations are created from the 1-head model by removing one component at a time.

Width D = 32 provides the best balance between performance and visualisation. D = 16 is too narrow to learn reliable play. D = 64 produces a stronger model but the visualisations become cumbersome.

The context length is 16. In fading Tic-Tac-Toe only 6 moves can be on the board at once, so a context of 6 would suffice. The longer window lets the model learn to ignore fading moves, one of the key behaviours explored in the Analysis section.

If a game runs longer than 16 moves, the sequence is truncated from the left. The START token is always kept, followed by the next X move to maintain a consistent sequence structure. This does not impact the game since moves older than 6 turns have already faded from the board.
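One way to implement this truncation is sketched below. It assumes that the 16-token context includes START, and that a leading O move is dropped so the retained moves begin with X; the article does not spell out these details, so treat them as one plausible reading.

```python
def truncate(tokens, context=16):
    """Keep START plus the most recent moves, beginning with an X move."""
    if len(tokens) <= context:
        return list(tokens)
    tail = tokens[-(context - 1):]      # leave one slot for START
    if tail[0].startswith("O"):
        tail = tail[1:]                 # drop the O so the window opens on X
    return [tokens[0]] + tail

moves = [("X" if i % 2 == 0 else "O") + str(i % 9) for i in range(20)]
window = truncate(["START"] + moves)
print(len(window), window[:3])
```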

Training

Both models were trained for 30000 steps with a batch size of 512 using AdamW. Play uses only the last prediction, but training uses every row. This provides 16 training signals per game.

The loss is cross-entropy, calculated for all positions in parallel. This makes training much faster than sequence models like RNNs or LSTMs, which process tokens one by one.
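The loss over all positions can be sketched as a mean of per-position cross-entropies. Whether PAD targets are excluded from the average is not stated in the article, so this sketch simply averages over every row; the function name is illustrative.

```python
import math

def cross_entropy(logit_rows, targets):
    """Mean negative log-probability assigned to each target token."""
    losses = []
    for logits, target in zip(logit_rows, targets):
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        losses.append(log_norm - logits[target])
    return sum(losses) / len(losses)

VOCAB_SIZE = 20
# An untrained model that scores every token equally at all 16 positions.
uniform = [[0.0] * VOCAB_SIZE for _ in range(16)]
targets = [5, 15, 4, 12] + [19] * 12
print(round(cross_entropy(uniform, targets), 3))  # ln(20), about 2.996
```

A uniform prediction gives ln(20), which is a useful baseline to compare a training curve against.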

For scale, here is how the 1-head model compares to larger models.

Model     Width  Context  Layers  Vocab  Params
1-head       32       16       1     20     14K
nanoGPT     384      256       6     65     10M
GPT-2       768     1024      12    50K    117M
Gemma 2    4608     8192      46   256K     27B

1-head model compared to nanoGPT, GPT-2, and Gemma 2.

The full training code is available in the Jupyter notebook.

Conclusion

The Transformer consists of six blocks. The tokenizer, embedding, and positional encoding prepare the vectors. Attention and the MLP process the information. Unembedding produces the final prediction. A residual stream ties these components together.

The same mechanism is used in much larger models. Large language models predict the next word from a vast vocabulary using the exact same matrix operations. Every prediction is the result of these blocks working in sequence.

The following resources were helpful while making this article: