GPT Under the Hood - A Technical Step-by-Step Breakdown (Part 3)

2026-01-27

This is Part 3 of a series on the Transformer and large language models. Part 1 covers the original Transformer paper. Part 2 explains how BERT and GPT emerged from it.

GPT (Generative Pre-trained Transformer) is often described as a “large language model”, but under the hood it is made up of several clear components working together. While the system is complex at scale, the core ideas are structured and logical.

This article walks through each component in detail - with a diagram at every stage.

Overview: The Full GPT Pipeline

Before diving into each part, here is the complete picture of how GPT transforms raw text into a predicted next token:

flowchart TB
    IN["Raw Text\n'The weather today is'"]
    IN --> TOK["1. Tokeniser\nText → Token IDs"]
    TOK --> EMB["2. Token Embeddings\nIDs → Vectors"]
    EMB --> POS["3. Positional Encoding\n+ Position Information"]
    POS --> B1["4. Transformer Block 1\nAttention + Feed-Forward"]
    B1 --> B2["Transformer Block 2"]
    B2 --> DOTS["...repeated N times..."]
    DOTS --> BN["Transformer Block N"]
    BN --> LN["5. Layer Normalisation"]
    LN --> PROJ["6. Linear Projection\nVectors → Vocabulary Logits"]
    PROJ --> SOFT["7. Softmax\nLogits → Probabilities"]
    SOFT --> OUT["Predicted Next Token\n'cold'"]

    style IN fill:#1e3a5f,color:#93c5fd
    style OUT fill:#14532d,color:#86efac
    style DOTS fill:#1e1e2e,color:#6b7280

1. Tokenisation: Turning Text Into Numbers

Before GPT can process language, it must convert text into a format the model can understand. This is done through tokenisation.

Instead of reading full words, GPT breaks text into smaller units called tokens using an algorithm called Byte Pair Encoding (BPE). BPE starts with individual characters (raw bytes, in the byte-level variant used from GPT-2 onwards) and repeatedly merges the most frequent adjacent pair until the vocabulary reaches a target size, leaving a set of useful subword units.

For example (an illustrative split; the exact pieces depend on the trained vocabulary):

"unbelievable" → ["un", "believ", "able"]

Each token is then assigned a unique integer ID from a fixed vocabulary. GPT-2 uses a vocabulary of 50,257 tokens. GPT-3 uses the same vocabulary size.

Why subword tokenisation rather than whole words?

A whole-word vocabulary would need millions of entries and would still fail on words it has never seen
Character-by-character tokenisation would make sequences far too long for the model to handle
Subword units strike a balance: common words are single tokens, rare words are split into several
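
The merge loop described above can be sketched in a few lines. This is a toy, character-level version under stated assumptions: the corpus, the word counts, and the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are made up for illustration, and real GPT tokenisers work on bytes with a far larger corpus.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # max() returns the first pair seen when several pairs tie.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Start from single characters and apply `num_merges` merges."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Hypothetical toy corpus: word -> number of occurrences.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

With this toy corpus the frequent suffix "est" is merged into a single symbol within two steps, which is the same effect that makes "able" a single token in the example above.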

flowchart LR
    A["Raw Text\n'unbelievable weather'"] --> B["BPE Tokeniser"]
    B --> C["Tokens\n['un','believ','able',' weather']"]
    C --> D["Token IDs\n[403, 9171, 540, 6193]"]
    D --> E["Input to Embedding Layer"]

    style A fill:#1e3a5f,color:#93c5fd
    style E fill:#14532d,color:#86efac

2. Embeddings: Giving Tokens Meaning

Once tokens are converted into numbers, they are mapped into embeddings - dense vectors of floating-point numbers.

Each token ID is looked up in an embedding matrix (a large table of learned vectors). The result is a high-dimensional vector that represents meaning.

For example, GPT-3 uses 12,288-dimensional embeddings. Each token becomes a list of 12,288 numbers.

Why vectors?

Similar words end up with similar vectors after training
Relationships between concepts can be captured mathematically
The famous example: king − man + woman ≈ queen in vector space

flowchart TB
    T["Token IDs\n[403, 9171, 540]"] --> EM["Embedding Matrix\n50,257 × d_model"]
    EM --> TV["Token Vectors\nShape: [seq_len × d_model]"]

    TV --> NOTE["GPT-3: d_model = 12,288\nEach token = 12,288 numbers"]

    style EM fill:#3b2f0f,color:#fcd34d
    style NOTE fill:#1e1e2e,color:#6b7280

These vectors are not fixed - they are learned during training and gradually shift to capture the meaning the model encounters across billions of examples.
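
As a rough sketch, the lookup is nothing more than indexing rows of a table. The sizes, the random initialisation, and the `embed` helper below are illustrative assumptions, chosen small enough to print.

```python
import random

random.seed(0)

VOCAB_SIZE = 10  # toy value; GPT-2 and GPT-3 use 50,257
D_MODEL = 4      # toy value; GPT-3 uses 12,288

# One row per token ID. Rows start as small random numbers;
# training would gradually adjust them.
embedding_matrix = [
    [random.gauss(0.0, 0.02) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Return one embedding row per token ID (shape: seq_len x d_model)."""
    return [embedding_matrix[token_id] for token_id in token_ids]

token_vectors = embed([3, 7, 3])
print(len(token_vectors), len(token_vectors[0]))
```

The same token ID always maps to the same row, so at this stage the two occurrences of token 3 are indistinguishable.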

3. Positional Encoding: Adding Word Order

GPT processes all tokens in parallel, so it does not naturally know the order of words. To fix this, it uses positional encoding.

Unlike the original 2017 Transformer which used fixed sinusoidal functions, GPT uses learned positional embeddings - a separate embedding matrix indexed by position (0, 1, 2, …).

For each position in the sequence, a learned vector is added to the corresponding token embedding.

flowchart TB
    TE["Token Embeddings\n[seq_len × d_model]"]
    PE["Positional Embeddings\n[seq_len × d_model]\n(learned, one per position)"]
    TE --> ADD["➕  Element-wise Addition"]
    PE --> ADD
    ADD --> OUT["Combined Representation\n(meaning + position)"]

    style ADD fill:#3b2f0f,color:#fcd34d
    style OUT fill:#14532d,color:#86efac

This combined vector now carries two kinds of information:

What the token means (from the embedding)
Where in the sentence it sits (from the positional encoding)

GPT-2 supports a context window of 1,024 tokens. GPT-3 supports 2,048 tokens. Later models extend this much further: GPT-4 Turbo accepts up to 128,000 tokens.
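
A minimal sketch of the addition step, assuming a tiny hand-written position table (the numbers and the `add_positions` helper are hypothetical). It also shows why a learned position table implies a hard context limit: there is simply no row for positions beyond the table.

```python
def add_positions(token_vectors, position_table):
    """Add the position vector for index i to the i-th token vector."""
    if len(token_vectors) > len(position_table):
        raise ValueError("sequence is longer than the context window")
    return [
        [t + p for t, p in zip(token_vec, position_table[i])]
        for i, token_vec in enumerate(token_vectors)
    ]

# The same token appearing twice: identical token embeddings.
token_vectors = [[0.1, 0.2], [0.1, 0.2]]

# Toy position table with a "context window" of 3 positions.
position_table = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

combined = add_positions(token_vectors, position_table)
print(combined)
```

After the addition the two copies of the token have different vectors, so later layers can tell them apart by position.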

4. Transformer Blocks: The Core Engine

The main power of GPT comes from stacking many identical layers called Transformer blocks. GPT-1 had 12 layers. GPT-3 had 96 layers. Each block contains three key parts.

flowchart TB
    IN["Input from Previous Layer\n[seq_len × d_model]"]
    IN --> LN1["Layer Norm 1"]
    LN1 --> ATT["Causal Multi-Head\nSelf-Attention"]
    ATT --> RES1["➕ Residual Connection"]
    IN --> RES1
    RES1 --> LN2["Layer Norm 2"]
    LN2 --> FFN["Feed-Forward Network\n(MLP)"]
    FFN --> RES2["➕ Residual Connection"]
    RES1 --> RES2
    RES2 --> OUT["Output to Next Layer\n[seq_len × d_model]"]

    style ATT fill:#1e3a5f,color:#93c5fd
    style FFN fill:#581c87,color:#d8b4fe
    style RES1 fill:#3b2f0f,color:#fcd34d
    style RES2 fill:#3b2f0f,color:#fcd34d

4.1 Causal Self-Attention

This is the most important part of GPT.

Self-attention allows the model to decide which other tokens in the sequence are most relevant when processing each token.

For every token, the mechanism computes three vectors:

Query (Q) - what this token is looking for
Key (K) - what each token offers
Value (V) - what each token actually contributes

The attention output for the whole sequence is computed as:

$$\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

The $\sqrt{d_k}$ scaling factor prevents the dot products from growing too large in high dimensions, which would push the softmax into regions with very small gradients.
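
The scaling argument can be made concrete with a short calculation. As a simplifying assumption, treat the components of one query $q$ and one key $k$ as independent random variables with zero mean and unit variance. Then:

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad \mathbb{E}[q \cdot k] = 0, \qquad \operatorname{Var}(q \cdot k) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k$$

$$\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1$$

Under that assumption an unscaled score has a standard deviation of $\sqrt{d_k}$, which is about 11.3 for $d_k = 128$ (12,288 dimensions split across 96 heads). Dividing by $\sqrt{d_k}$ brings it back to roughly 1, whatever the head size.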

flowchart LR
    X["Input Vectors\n[seq_len × d_model]"]
    X --> Q["Query\nW_Q × X"]
    X --> K["Key\nW_K × X"]
    X --> V["Value\nW_V × X"]
    Q --> SC["Scores\nQ · Kᵀ / √d_k"]
    K --> SC
    SC --> MASK["Causal Mask\n(set future positions to −∞)"]
    MASK --> SF["Softmax\n(attention weights)"]
    SF --> OUT["Weighted Sum\nweights · V"]
    V --> OUT

    style MASK fill:#7f1d1d,color:#fca5a5
    style OUT fill:#14532d,color:#86efac

Causal masking is what lets GPT be trained as a generative model. Each position may attend only to itself and to earlier tokens: scores for future positions are set to $-\infty$ before the softmax, so their attention weights become zero. Because the model can never see the token it is asked to predict, it has to learn to predict it from context rather than copy it from the input.
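
The pieces above (scores, scaling, mask, softmax, weighted sum) fit into one short function. This is a single-head sketch in plain Python for readability, not how a production model is written; the toy inputs reuse the same matrix for Q, K and V, whereas a real model would first multiply the input by the learned matrices W_Q, W_K and W_V.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K and V are lists of row vectors, one row per token.
    Returns (outputs, attention_weights).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                scores.append(float("-inf"))  # future position: masked out
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w[j] * V[j][c] for j in range(len(V)))
            for c in range(len(V[0]))
        ])
    return outputs, weights

# Toy input: three tokens, two dimensions each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower-triangular: the first token can only attend to itself, the second to the first two, and so on.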

Multi-head attention runs this process in parallel across several independent “heads” (GPT-3 uses 96 heads). Each head learns to attend to different kinds of relationships - one might track grammatical agreement, another might track topic consistency, and another might resolve pronoun references.
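
One detail worth making explicit is that the heads share the model dimension rather than each getting a full copy of it. The sketch below only shows the splitting and re-joining of a single vector; the helper names are hypothetical, and in a real model the split is applied to the projected queries, keys and values, with a final learned projection after the heads are concatenated.

```python
def split_heads(vector, n_heads):
    """Cut one d_model-sized vector into n_heads equal slices."""
    d_model = len(vector)
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    head_dim = d_model // n_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(n_heads)]

def merge_heads(heads):
    """Concatenate the per-head slices back into one vector."""
    return [x for head in heads for x in head]

vector = list(range(8))           # toy d_model = 8
heads = split_heads(vector, 4)    # 4 heads of size 2
print(heads)

# Head size for GPT-3: 12,288 dimensions shared across 96 heads.
gpt3_head_dim = 12288 // 96
print(gpt3_head_dim)
```

Each head therefore works in a much smaller space (128 dimensions in GPT-3), which keeps the total cost close to that of one full-width attention.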

4.2 Feed-Forward Network (MLP)

After attention, each token passes through a small feed-forward network applied independently at every position.

In GPT, this network has the structure:

$$\text{FFN}(x) = \text{GELU}(x W_1 + b_1)\, W_2 + b_2$$

The hidden layer is typically 4× wider than the model dimension. In GPT-3 with d_model = 12,288, the hidden size is 49,152.

GELU (Gaussian Error Linear Unit) is a smooth non-linearity used in place of the simpler ReLU; its smoothness is generally credited with helping training
The expansion to the wider hidden layer and the compression back to d_model give the network capacity to store learned facts and patterns
While attention moves information between tokens, the MLP transforms each token's representation on its own
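
The formula above translates almost directly into code. This sketch uses the exact erf-based definition of GELU (some implementations use a tanh approximation instead) and tiny hand-made weights; `linear` and `ffn` are illustrative helper names.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """Compute x W + b, where W has len(x) rows and len(b) columns."""
    return [
        sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
        for j in range(len(b))
    ]

def ffn(x, W1, b1, W2, b2):
    """Expand to the hidden size, apply GELU, project back to d_model."""
    hidden = [gelu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

d_model, d_hidden = 2, 8  # hidden layer is 4x wider, as in GPT

# Arbitrary toy weights, just to make the shapes concrete.
W1 = [[0.1 * (j + 1) for j in range(d_hidden)],
      [-0.1 * (j + 1) for j in range(d_hidden)]]
b1 = [0.0] * d_hidden
W2 = [[0.05, -0.05] for _ in range(d_hidden)]
b2 = [0.0] * d_model

y = ffn([1.0, 0.5], W1, b1, W2, b2)
print(y)
```

The function takes one token vector and returns one token vector of the same size, which is what "applied independently at every position" means in practice.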

4.3 Residual Connections and Layer Normalisation

Two techniques make it possible to train very deep networks:

Residual connections add the input of a sub-layer directly to its output:

$$\text{output} = x + \text{SubLayer}(x)$$

This allows gradients to flow backwards through the network without vanishing, and ensures that information is never completely overwritten.

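Layer normalisation rescales each token's activation vector to zero mean and unit variance across its features, then applies a learned scale and shift. Because the statistics are computed per token, values stay in a stable range during training regardless of sequence length or batch size.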
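
Both techniques are small enough to write out. The sketch below works on a single token vector; `gamma` and `beta` stand for the learned scale and shift, and the `eps` value and the placeholder sub-layer are assumptions for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [
        g * (v - mean) / math.sqrt(var + eps) + b
        for v, g, b in zip(x, gamma, beta)
    ]

def residual(x, sublayer):
    """output = x + SubLayer(x)"""
    return [a + b for a, b in zip(x, sublayer(x))]

x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * 4  # scale, learned during training
beta = [0.0] * 4   # shift, learned during training

normed = layer_norm(x, gamma, beta)
print([round(v, 3) for v in normed])

# Placeholder sub-layer: normalise, then halve every value.
def toy_sublayer(v):
    return [0.5 * n for n in layer_norm(v, gamma, beta)]

out = residual(x, toy_sublayer)
print([round(v, 3) for v in out])
```

Note that the residual path carries `x` through unchanged; the sub-layer only adds a correction on top of it.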

5. Scale: How GPT Grew Over Time

One of the most important discoveries in recent AI research is that simply making Transformer models larger - with more layers, wider dimensions, and more training data - produces dramatically better capabilities.

Model	Parameters	Layers	Heads	d_model	Context (tokens)
GPT-1	117M	12	12	768	512
GPT-2	1.5B	48	25	1,600	1,024
GPT-3	175B	96	96	12,288	2,048
GPT-4	~1T (unofficial est.)	-	-	-	128,000 (GPT-4 Turbo)

Each jump in scale brought qualitatively new abilities that were not present in smaller models - from coherent paragraphs (GPT-2) to few-shot reasoning (GPT-3) to complex multi-step problem solving (GPT-4).
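
The parameter counts in the table can be roughly reproduced from the other columns. The estimate below is a back-of-the-envelope approximation, not an official formula: it assumes about 12 × d_model² weights per block (4 × d_model² for the attention projections, 8 × d_model² for the 4×-wide MLP) and ignores biases and layer-norm parameters.

```python
def approx_params(n_layers, d_model, vocab_size, context):
    """Rough parameter count for a GPT-style model."""
    per_block = 12 * d_model ** 2                  # attention (4 d^2) + MLP (8 d^2)
    embeddings = (vocab_size + context) * d_model  # token + position tables
    return n_layers * per_block + embeddings

gpt2 = approx_params(n_layers=48, d_model=1600, vocab_size=50257, context=1024)
gpt3 = approx_params(n_layers=96, d_model=12288, vocab_size=50257, context=2048)

print(f"GPT-2: {gpt2 / 1e9:.2f}B")
print(f"GPT-3: {gpt3 / 1e9:.1f}B")
```

Both estimates land close to the published 1.5B and 175B figures, and they show that almost all of the parameters sit in the Transformer blocks rather than in the embeddings.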

6. Training: Next Token Prediction

GPT is trained on one surprisingly simple objective:

Predict the next token given all previous tokens.

For every position in the training data, the model produces a probability distribution over the full vocabulary. The cross-entropy loss measures how wrong that distribution is compared to the actual next token.

$$\mathcal{L} = -\sum_{t} \log P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

The model is then updated using gradient descent to increase the probability of the correct token.
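
To make the objective concrete, the sketch below builds the input and target pairs for one sentence and evaluates the loss for a single prediction. The probabilities are invented for illustration; a real model would produce them over the full vocabulary.

```python
import math

def next_token_pairs(tokens):
    """Every prefix of the sequence is an input; the token after it is the target."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

def cross_entropy(probs, target):
    """Loss for one position: -log P(correct token)."""
    return -math.log(probs[target])

tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)

# Hypothetical model output after seeing "The".
predicted = {"cat": 0.5, "dog": 0.3, "mat": 0.2}
loss_good = cross_entropy(predicted, "cat")  # correct token has high probability
loss_bad = cross_entropy(predicted, "mat")   # correct token has low probability
print(round(loss_good, 3), round(loss_bad, 3))
```

The loss is small when the model puts high probability on the correct token and grows quickly as that probability shrinks, which is the signal gradient descent uses.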

flowchart LR
    SEQ["Training Sequence\n'The cat sat on the mat'"]
    SEQ --> PAIRS["Input → Target Pairs\n'The' → 'cat'\n'The cat' → 'sat'\n'The cat sat' → 'on'\n..."]
    PAIRS --> LOSS["Cross-Entropy Loss\n−log P(correct token)"]
    LOSS --> GRAD["Backpropagation\n(gradients)"]
    GRAD --> UPDATE["Weight Update\n(gradient descent)"]
    UPDATE --> IMPROVE["Model Improves\n→ repeat billions of times"]

    style LOSS fill:#7f1d1d,color:#fca5a5
    style IMPROVE fill:#14532d,color:#86efac

Across billions of training examples, the model learns grammar, facts, reasoning patterns, writing styles, and general world knowledge - all from this one repeated task.

After pre-training, GPT models are typically fine-tuned, first on human-written demonstrations and then with Reinforcement Learning from Human Feedback (RLHF), to align responses with human preferences. This is the step that turned the GPT-3.5 family of models into ChatGPT.

7. Inference: How GPT Generates Text

Once trained, GPT generates text autoregressively - one token at a time:

It receives an input prompt
It predicts the probability distribution over all vocabulary tokens
A token is sampled from that distribution (or the most likely one is chosen)
That token is appended to the sequence
The full sequence becomes the new input, and the process repeats
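
The steps above can be sketched as a short loop. The "model" here is a hypothetical lookup table keyed on the last token, standing in for a full GPT forward pass, and the loop uses greedy selection (always the most likely token) to keep the result deterministic.

```python
def generate(next_token_probs, prompt, max_new_tokens, stop_token=None):
    """Autoregressive generation with greedy token selection."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)   # forward pass on the full sequence
        token = max(probs, key=probs.get)  # pick the most likely token
        tokens.append(token)               # append and feed back in
        if token == stop_token:
            break
    return tokens

# Hypothetical stand-in for a trained model.
TOY_TABLE = {
    "is": {"transform": 0.7, "a": 0.3},
    "transform": {"ing": 0.9, "ed": 0.1},
    "ing": {"<eos>": 1.0},
}

def toy_model(tokens):
    return TOY_TABLE.get(tokens[-1], {"<eos>": 1.0})

result = generate(toy_model, ["Artificial", "intelligence", "is"],
                  max_new_tokens=5, stop_token="<eos>")
print(result)
```

Generation stops either when the stop token appears or when `max_new_tokens` is reached, whichever comes first.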

flowchart TD
    P["Prompt\n'Artificial intelligence is'"]
    P --> GPT1["GPT Forward Pass"]
    GPT1 --> T1["Predicted token: 'transform'"]
    T1 --> S1["New sequence:\n'Artificial intelligence is transform'"]
    S1 --> GPT2["GPT Forward Pass"]
    GPT2 --> T2["Predicted token: 'ing'"]
    T2 --> S2["'Artificial intelligence is transforming'"]
    S2 --> DOTS["... continues token by token ..."]
    DOTS --> STOP["Stop when:\nmax length reached\nor stop token generated"]

    style P fill:#1e3a5f,color:#93c5fd
    style STOP fill:#14532d,color:#86efac
    style DOTS fill:#1e1e2e,color:#6b7280

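Temperature and top-p sampling control how creative or conservative the output is. Temperature divides the logits before the softmax: values below 1 sharpen the distribution, while values above 1 flatten it and introduce more randomness and variety. A temperature of 0 is treated as greedy decoding, which always picks the most likely token (deterministic). Top-p (nucleus) sampling restricts the choice to the smallest set of tokens whose cumulative probability reaches p.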
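
A minimal sketch of both controls, assuming the model's output is a plain list of logits indexed by token ID. The function names and the example logits are illustrative, and real inference libraries differ in details such as how ties and a temperature of 0 are handled.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose probabilities sum to at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalised

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Return a token ID chosen from the adjusted distribution."""
    if temperature == 0:
        # Greedy decoding: avoids dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]

logits = [1.0, 3.0, 2.0]
print(apply_temperature(logits, 0.5))  # sharper
print(apply_temperature(logits, 2.0))  # flatter
print(sample(logits, temperature=0))   # always the highest logit
```

Lower temperatures and smaller values of p both concentrate probability on the top candidates; they can be combined, as `sample` does here.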

8. Why This Design Scales So Well

The strength of GPT comes from how all parts fit together:

flowchart LR
    A["Tokenisation\nText → usable numbers"] --> B["Embeddings\nNumbers → meaning"]
    B --> C["Positional Encoding\nMeaning + order"]
    C --> D["Self-Attention\nRelationships across tokens"]
    D --> E["Feed-Forward\nDeeper token understanding"]
    E --> F["Residual + LayerNorm\nStable deep training"]
    F --> G["96× repeated layers\nEmergent reasoning"]
    G --> H["Next token prediction\nSimple unified objective"]

    style A fill:#1e3a5f,color:#93c5fd
    style H fill:#14532d,color:#86efac

And crucially - there is only one training objective. This simplicity allows the model to scale predictably. More data and more compute reliably improve performance, which is a rare and powerful property.

Final Thought

Although GPT looks extremely complex from the outside, its core design is surprisingly structured and elegant.

At its heart, it is still based on one simple idea:

Read a sequence, understand context, and predict what comes next.

From that single principle - repeated across billions of parameters and trillions of training tokens - an entire generation of modern AI has grown.