GPT Under the Hood - A Technical Step-by-Step Breakdown (Part 3)
This is Part 3 of a series on the Transformer and large language models. Part 1 covers the original Transformer paper. Part 2 explains how BERT and GPT emerged from it.
GPT (Generative Pre-trained Transformer) is often described as a “large language model”, but under the hood it is made up of several clear components working together. While the system is complex at scale, the core ideas are structured and logical.
This article walks through each component in detail - with a diagram at every stage.
Overview: The Full GPT Pipeline
Before diving into each part, here is the complete picture of how GPT transforms raw text into a predicted next token:
flowchart TB
IN["Raw Text\n'The weather today is'"]
IN --> TOK["1. Tokeniser\nText → Token IDs"]
TOK --> EMB["2. Token Embeddings\nIDs → Vectors"]
EMB --> POS["3. Positional Encoding\n+ Position Information"]
POS --> B1["4. Transformer Block 1\nAttention + Feed-Forward"]
B1 --> B2["Transformer Block 2"]
B2 --> DOTS["...repeated N times..."]
DOTS --> BN["Transformer Block N"]
BN --> LN["5. Layer Normalisation"]
LN --> PROJ["6. Linear Projection\nVectors → Vocabulary Logits"]
PROJ --> SOFT["7. Softmax\nLogits → Probabilities"]
SOFT --> OUT["Predicted Next Token\n'cold'"]
style IN fill:#1e3a5f,color:#93c5fd
style OUT fill:#14532d,color:#86efac
style DOTS fill:#1e1e2e,color:#6b7280
1. Tokenisation: Turning Text Into Numbers
Before GPT can process language, it must convert text into a format the model can understand. This is done through tokenisation.
Instead of reading full words, GPT breaks text into smaller units called tokens using an algorithm called Byte Pair Encoding (BPE). BPE starts with individual characters and merges the most common pairs repeatedly until it builds a vocabulary of useful subword units.
For example:
"unbelievable"→["un", "believ", "able"]
Each token is then assigned a unique integer ID from a fixed vocabulary. GPT-2 uses a vocabulary of 50,257 tokens. GPT-3 uses the same vocabulary size.
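The merge loop described above can be sketched in a few lines. This is a toy illustration rather than GPT's actual tokeniser: the tiny corpus and the helper names `most_frequent_pair`, `merge_pair` and `train_bpe` are assumptions made for this example, and production BPE works on bytes over a far larger corpus.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    words = Counter(tuple(w) for w in corpus.split())  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single symbol, while "lower" and "lowest" are still split into "low" plus leftover characters. That is the same pattern as common words becoming single tokens and rare words being split.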
Why subword tokenisation rather than whole words?
- Whole-word vocabulary would need millions of entries and still fail on rare words
- Character-by-character would make sequences far too long for the model to handle
- Subword units strike a balance: common words are single tokens, rare words are split
flowchart LR
A["Raw Text\n'unbelievable weather'"] --> B["BPE Tokeniser"]
B --> C["Tokens\n['un','believ','able',' weather']"]
C --> D["Token IDs\n[403, 9171, 540, 6193]"]
D --> E["Input to Embedding Layer"]
style A fill:#1e3a5f,color:#93c5fd
style E fill:#14532d,color:#86efac
2. Embeddings: Giving Tokens Meaning
Once tokens are converted into numbers, they are mapped into embeddings - dense vectors of floating-point numbers.
Each token ID is looked up in an embedding matrix (a large table of learned vectors). The result is a high-dimensional vector that represents meaning.
For example, GPT-3 uses 12,288-dimensional embeddings. Each token becomes a list of 12,288 numbers.
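As a minimal sketch, the lookup is nothing more than row indexing into a table. The sizes below (10 tokens, 4 dimensions) are toy values chosen for readability, the random numbers stand in for learned weights, and the function names are assumptions for this example.

```python
import random


def make_embedding_matrix(vocab_size, d_model, seed=0):
    """One row of d_model floats per token ID (random here; learned in a real model)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]


def embed(token_ids, embedding_matrix):
    """Embedding lookup is plain row indexing: token ID -> row vector."""
    return [embedding_matrix[t] for t in token_ids]


# Toy sizes (assumed). GPT-3 would use vocab_size=50,257 and d_model=12,288.
matrix = make_embedding_matrix(vocab_size=10, d_model=4)
vectors = embed([3, 7, 3], matrix)
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: the same ID always maps to the same row
```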
Why vectors?
- Similar words end up with similar vectors after training
- Relationships between concepts can be captured mathematically
- The famous example:
king − man + woman ≈ queen in vector space
flowchart TB
T["Token IDs\n[403, 9171, 540]"] --> EM["Embedding Matrix\n50,257 × d_model"]
EM --> TV["Token Vectors\nShape: [seq_len × d_model]"]
TV --> NOTE["GPT-3: d_model = 12,288\nEach token = 12,288 numbers"]
style EM fill:#3b2f0f,color:#fcd34d
style NOTE fill:#1e1e2e,color:#6b7280
These vectors are not fixed - they are learned during training and gradually shift to capture the meaning the model encounters across billions of examples.
3. Positional Encoding: Adding Word Order
GPT processes all tokens in parallel, so it does not naturally know the order of words. To fix this, it uses positional encoding.
Unlike the original 2017 Transformer which used fixed sinusoidal functions, GPT uses learned positional embeddings - a separate embedding matrix indexed by position (0, 1, 2, …).
For each position in the sequence, a learned vector is added to the corresponding token embedding.
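The addition can be sketched as follows. The sizes are toy values, the random numbers stand in for learned position vectors, and `add_positions` is a name made up for this example.

```python
import random


def add_positions(token_vectors, position_matrix):
    """Add the learned vector for position i to the token vector at position i."""
    if len(token_vectors) > len(position_matrix):
        raise ValueError("sequence is longer than the context window")
    return [
        [t + p for t, p in zip(tok, position_matrix[i])]
        for i, tok in enumerate(token_vectors)
    ]


rng = random.Random(0)
d_model, context_window = 4, 8  # toy sizes (assumed)
position_matrix = [
    [rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(context_window)
]

# The same token vector placed at two positions yields two different combined vectors.
same_token = [0.5, -0.1, 0.3, 0.0]
combined = add_positions([same_token, same_token], position_matrix)
print(combined[0] != combined[1])  # True
```

Because the position matrix has one row per position, its number of rows is what fixes the maximum sequence length the model can accept.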
flowchart TB
TE["Token Embeddings\n[seq_len × d_model]"]
PE["Positional Embeddings\n[seq_len × d_model]\n(learned, one per position)"]
TE --> ADD["➕ Element-wise Addition"]
PE --> ADD
ADD --> OUT["Combined Representation\n(meaning + position)"]
style ADD fill:#3b2f0f,color:#fcd34d
style OUT fill:#14532d,color:#86efac
This combined vector now carries two kinds of information:
- What the token means (from the embedding)
- Where in the sentence it sits (from the positional encoding)
GPT-2 supports a context window of 1,024 tokens. GPT-3 supports 2,048 tokens. Later models go much further: GPT-4 Turbo extends this to 128,000 tokens.
4. Transformer Blocks: The Core Engine
The main power of GPT comes from stacking many identical layers called Transformer blocks. GPT-1 had 12 layers. GPT-3 had 96 layers. Each block contains three key parts.
flowchart TB
IN["Input from Previous Layer\n[seq_len × d_model]"]
IN --> LN1["Layer Norm 1"]
LN1 --> ATT["Causal Multi-Head\nSelf-Attention"]
ATT --> RES1["➕ Residual Connection"]
IN --> RES1
RES1 --> LN2["Layer Norm 2"]
LN2 --> FFN["Feed-Forward Network\n(MLP)"]
FFN --> RES2["➕ Residual Connection"]
RES1 --> RES2
RES2 --> OUT["Output to Next Layer\n[seq_len × d_model]"]
style ATT fill:#1e3a5f,color:#93c5fd
style FFN fill:#581c87,color:#d8b4fe
style RES1 fill:#3b2f0f,color:#fcd34d
style RES2 fill:#3b2f0f,color:#fcd34d
4.1 Causal Self-Attention
This is the most important part of GPT.
Self-attention allows the model to decide which other tokens in the sequence are most relevant when processing each token.
For every token, the mechanism computes three vectors:
- Query (Q) - what this token is looking for
- Key (K) - what each token offers
- Value (V) - what each token actually contributes
The attention scores between tokens, and the output they produce, are computed for the whole sequence at once as:
$$\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
The $\sqrt{d_k}$ scaling factor prevents the dot products from growing too large in high dimensions, which would push the softmax into regions with very small gradients.
flowchart LR
X["Input Vectors\n[seq_len × d_model]"]
X --> Q["Query\nW_Q × X"]
X --> K["Key\nW_K × X"]
X --> V["Value\nW_V × X"]
Q --> SC["Scores\nQ · Kᵀ / √d_k"]
K --> SC
SC --> MASK["Causal Mask\n(set future positions to −∞)"]
MASK --> SF["Softmax\n(attention weights)"]
SF --> OUT["Weighted Sum\nweights · V"]
V --> OUT
style MASK fill:#7f1d1d,color:#fca5a5
style OUT fill:#14532d,color:#86efac
Causal masking is what makes GPT a generative model. By forcing each position to attend only to previous tokens (setting future positions to $-\infty$ before the softmax), the model learns to predict the next token rather than simply memorise the full sequence.
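The attention formula and the causal mask fit together as in the sketch below. It covers a single head in plain Python, assumes Q, K and V have already been projected, and uses tiny hand-picked vectors; `causal_attention` is a name chosen for this example, and real implementations do the same arithmetic with batched matrix operations.

```python
import math


def softmax(xs):
    """Numerically stable softmax; exp(-inf) is 0.0, so masked entries get zero weight."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of row vectors, one row per token.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                scores.append(float("-inf"))  # future position: masked out
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(Q, K, V)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```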
Multi-head attention runs this process in parallel across several independent “heads” (GPT-3 uses 96 heads). Each head learns to attend to different kinds of relationships - one might track grammatical agreement, another might track topic consistency, and another might resolve pronoun references.
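In the standard multi-head formulation, the projected Q, K and V vectors are cut into equal slices, one per head, and the per-head outputs are concatenated afterwards (a final output projection, not shown here, usually follows). The sketch below shows only that split-and-merge bookkeeping; the helper names are assumptions for this example.

```python
def split_heads(vector, n_heads):
    """Slice one d_model-sized vector into n_heads chunks of size d_model // n_heads."""
    d_model = len(vector)
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    return [vector[h * d_head:(h + 1) * d_head] for h in range(n_heads)]


def merge_heads(heads):
    """Concatenate per-head outputs back into one d_model-sized vector."""
    return [x for head in heads for x in head]


heads = split_heads(list(range(12)), n_heads=3)
print(heads)          # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(12_288 // 96)   # 128: per-head size implied by GPT-3's d_model and head count
```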
4.2 Feed-Forward Network (MLP)
After attention, each token passes through a small feed-forward network applied independently at every position.
In GPT, this network has the structure:
$$\text{FFN}(x) = \text{GELU}(x W_1 + b_1)\, W_2 + b_2$$
The hidden layer is typically 4× wider than the model dimension. In GPT-3 with d_model = 12,288, the hidden size is 49,152.
- GELU (Gaussian Error Linear Unit) is a smooth non-linearity used in place of the simpler ReLU; its smoothness helps training stability
- The expansion and compression give the network capacity to store learned facts and patterns
- While attention handles relationships between tokens, the MLP processes each token's representation on its own, refining what the model has gathered about that token
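The FFN formula can be written out directly. The sketch below uses toy sizes and random weights in place of learned ones, and it uses the exact erf-based GELU (GPT-2's released code uses a tanh approximation of the same curve). The helper names are assumptions for this example.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, W, b):
    """Compute x W + b for a row vector x and a matrix W of shape [len(x)][len(b)]."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = GELU(x W1 + b1) W2 + b2, applied to one token vector."""
    hidden = [gelu(h) for h in linear(x, W1, b1)]  # expand to d_hidden
    return linear(hidden, W2, b2)                  # compress back to d_model


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model = 4              # toy size (assumed); GPT-3 uses 12,288
d_hidden = 4 * d_model   # the 4x expansion described above
W1, b1 = random_matrix(d_model, d_hidden, rng), [0.0] * d_hidden
W2, b2 = random_matrix(d_hidden, d_model, rng), [0.0] * d_model

y = ffn([1.0, -0.5, 0.25, 0.0], W1, b1, W2, b2)
print(len(y))  # 4: same width in, same width out
```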
4.3 Residual Connections and Layer Normalisation
Two techniques make it possible to train very deep networks:
Residual connections add the input of a sub-layer directly to its output:
$$\text{output} = x + \text{SubLayer}(x)$$
This allows gradients to flow backwards through the network without vanishing, and ensures that information is never completely overwritten.
Layer normalisation standardises each token's activation vector across its features (zero mean and unit variance, followed by a learned scale and shift), keeping values in a stable range during training regardless of sequence length or batch size.
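Both techniques are short enough to sketch in full. The code below follows the ordering in the block diagram in section 4, where layer norm is applied before each sub-layer and the residual is added after it. The function names and the `eps` value are assumptions for this example.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance across its features."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    gamma = gamma or [1.0] * n  # learned scale (identity in this sketch)
    beta = beta or [0.0] * n    # learned shift (zero in this sketch)
    return [g * v + b for g, v, b in zip(gamma, normed, beta)]


def pre_norm_residual(x, sublayer):
    """output = x + SubLayer(LayerNorm(x)): the input always has a direct path through."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]


x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
print(abs(sum(normed)) < 1e-9)  # True: the normalised vector has zero mean

# Even if a sub-layer outputs nothing useful (all zeros), the input survives unchanged.
print(pre_norm_residual(x, lambda h: [0.0] * len(h)) == x)  # True
```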
5. Scale: How GPT Grew Over Time
One of the most important discoveries in recent AI research is that simply making Transformer models larger - with more layers, wider dimensions, and more training data - produces dramatically better capabilities.
| Model | Parameters | Layers | Heads | d_model | Context |
|---|---|---|---|---|---|
| GPT-1 | 117M | 12 | 12 | 768 | 512 |
| GPT-2 | 1.5B | 48 | 25 | 1,600 | 1,024 |
| GPT-3 | 175B | 96 | 96 | 12,288 | 2,048 |
| GPT-4 | ~1T (est.) | - | - | - | 128,000 (GPT-4 Turbo) |
Each jump in scale brought qualitatively new abilities that were not present in smaller models - from coherent paragraphs (GPT-2) to few-shot reasoning (GPT-3) to complex multi-step problem solving (GPT-4).
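The parameter counts in the table can be roughly reproduced from the layer count and d_model alone. The estimate below is a common back-of-the-envelope rule and not an official formula: it assumes about 12 × d_model² weights per block and ignores biases, layer-norm weights and positional embeddings.

```python
def approx_params(n_layers, d_model, vocab_size=50_257):
    """Rough parameter count: 12 * d_model^2 per block plus the token embedding matrix.

    Per block: 4 * d^2 for attention (Q, K, V and output projections)
    and 8 * d^2 for the feed-forward network (d x 4d and 4d x d).
    Biases, layer-norm weights and positional embeddings are ignored.
    """
    per_block = 12 * d_model ** 2
    return n_layers * per_block + vocab_size * d_model


for name, layers, d in [("GPT-2", 48, 1_600), ("GPT-3", 96, 12_288)]:
    print(f"{name}: ~{approx_params(layers, d) / 1e9:.2f}B parameters")
# GPT-2: ~1.55B parameters
# GPT-3: ~174.56B parameters
```

Both estimates land close to the published 1.5B and 175B figures, which shows that almost all of the parameters sit in the attention and feed-forward matrices of the stacked blocks.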
6. Training: Next Token Prediction
GPT is trained on one surprisingly simple objective:
Predict the next token given all previous tokens.
For every position in the training data, the model produces a probability distribution over the full vocabulary. The cross-entropy loss measures how wrong that distribution is compared to the actual next token.
$$\mathcal{L} = -\sum_{t} \log P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$
The model is then updated using gradient descent to increase the probability of the correct token.
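The loss itself is easy to compute by hand. The sketch below uses a toy vocabulary of three tokens and made-up logits; `next_token_loss` is a name chosen for this example. It sums over positions to match the formula above, whereas training frameworks commonly report the mean instead.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, target_ids):
    """Sum of -log P(correct next token) over all positions."""
    loss = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        loss += -math.log(probs[target])
    return loss


# Two positions, vocabulary of 3 tokens (toy values).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(round(next_token_loss(logits, targets), 3))  # 1.338

# Raising the logit of the correct token at position 2 lowers the loss.
better = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
print(next_token_loss(better, targets) < next_token_loss(logits, targets))  # True
```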
flowchart LR
SEQ["Training Sequence\n'The cat sat on the mat'"]
SEQ --> PAIRS["Input → Target Pairs\n'The' → 'cat'\n'The cat' → 'sat'\n'The cat sat' → 'on'\n..."]
PAIRS --> LOSS["Cross-Entropy Loss\n−log P(correct token)"]
LOSS --> GRAD["Backpropagation\n(gradients)"]
GRAD --> UPDATE["Weight Update\n(gradient descent)"]
UPDATE --> IMPROVE["Model Improves\n→ repeat billions of times"]
style LOSS fill:#7f1d1d,color:#fca5a5
style IMPROVE fill:#14532d,color:#86efac
Across billions of training examples, the model learns grammar, facts, reasoning patterns, writing styles, and general world knowledge - all from this one repeated task.
After pre-training, GPT models are typically fine-tuned with Reinforcement Learning from Human Feedback (RLHF) to align responses with human preferences. This is the step that turned a GPT-3.5-series model into ChatGPT.
7. Inference: How GPT Generates Text
Once trained, GPT generates text autoregressively - one token at a time:
- It receives an input prompt
- It predicts the probability distribution over all vocabulary tokens
- A token is sampled from that distribution (or the most likely one is chosen)
- That token is appended to the sequence
- The full sequence becomes the new input, and the process repeats
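The loop above can be sketched with a stand-in for the model. `toy_model` below is entirely hypothetical: it simply favours "last token + 1" so that the output is easy to follow. A real GPT forward pass would take its place. The sketch uses greedy decoding, always choosing the most likely token.

```python
def toy_model(token_ids, vocab_size=5):
    """Stand-in for a GPT forward pass (hypothetical): favours (last token + 1) mod vocab_size."""
    favoured = (token_ids[-1] + 1) % vocab_size
    return [0.8 if t == favoured else 0.2 / (vocab_size - 1) for t in range(vocab_size)]


def generate(model, prompt_ids, max_new_tokens, stop_id=None):
    """Greedy autoregressive decoding: predict, append, feed the longer sequence back in."""
    sequence = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = model(sequence)  # distribution over the vocabulary
        next_id = max(range(len(probs)), key=probs.__getitem__)  # most likely token
        sequence.append(next_id)
        if next_id == stop_id:  # stop token generated
            break
    return sequence


print(generate(toy_model, [0, 1], max_new_tokens=5, stop_id=4))  # [0, 1, 2, 3, 4]
```

Generation ends here after three new tokens because the stop token appears before the length limit is reached.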
flowchart TD
P["Prompt\n'Artificial intelligence is'"]
P --> GPT1["GPT Forward Pass"]
GPT1 --> T1["Predicted token: 'transform'"]
T1 --> S1["New sequence:\n'Artificial intelligence is transform'"]
S1 --> GPT2["GPT Forward Pass"]
GPT2 --> T2["Predicted token: 'ing'"]
T2 --> S2["'Artificial intelligence is transforming'"]
S2 --> DOTS["... continues token by token ..."]
DOTS --> STOP["Stop when:\nmax length reached\nor stop token generated"]
style P fill:#1e3a5f,color:#93c5fd
style STOP fill:#14532d,color:#86efac
style DOTS fill:#1e1e2e,color:#6b7280
Temperature and top-p sampling control how creative or conservative the output is. A temperature of 0 always picks the most likely token (deterministic). Higher temperatures introduce randomness and variety.
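A minimal sketch of both controls, under the usual definitions: temperature divides the logits before the softmax, and top-p (nucleus) sampling keeps only the smallest set of most-likely tokens whose cumulative probability reaches p. The function name and the toy logits are assumptions for this example.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick a token ID from logits using temperature scaling and top-p filtering."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy, deterministic

    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [1.0, 3.0, 2.0]
print(sample_next_token(logits, temperature=0))  # 1
```

With these logits the top token holds roughly two thirds of the probability mass, so any `top_p` at or below that share also reduces to always choosing token 1.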
8. Why This Design Scales So Well
The strength of GPT comes from how all parts fit together:
flowchart LR
A["Tokenisation\nText → usable numbers"] --> B["Embeddings\nNumbers → meaning"]
B --> C["Positional Encoding\nMeaning + order"]
C --> D["Self-Attention\nRelationships across tokens"]
D --> E["Feed-Forward\nDeeper token understanding"]
E --> F["Residual + LayerNorm\nStable deep training"]
F --> G["96× repeated layers\nEmergent reasoning"]
G --> H["Next token prediction\nSimple unified objective"]
style A fill:#1e3a5f,color:#93c5fd
style H fill:#14532d,color:#86efac
And crucially - there is only one training objective. This simplicity allows the model to scale predictably. More data and more compute reliably improve performance, which is a rare and powerful property.
Final Thought
Although GPT looks extremely complex from the outside, its core design is surprisingly structured and elegant.
At its heart, it is still based on one simple idea:
Read a sequence, understand context, and predict what comes next.
From that single principle - repeated across billions of parameters and trillions of training examples - an entire generation of modern AI has grown.