How the Transformer Became GPT and BERT (Part 2)

This is Part 2 of a three-part series. Part 1 explains the original Transformer paper and the attention mechanism. Part 3 goes deep inside GPT’s architecture.

After the Transformer paper appeared in 2017, researchers quickly realised something interesting: you did not always need the full encoder–decoder system.

Different teams began taking parts of the Transformer and adapting them for different goals. This eventually led to two famous families of models: BERT and GPT.

How the Transformer Split Into Two Directions

The original Transformer had two halves: an encoder that read and represented the input text, and a decoder that generated the output text. Researchers soon discovered that each half could be powerful on its own.

```mermaid
flowchart TD
    T["Transformer\n2017 - Encoder + Decoder"]
    T --> B["BERT\n2018 - Encoder Only\nGoogle"]
    T --> G["GPT\n2018 - Decoder Only\nOpenAI"]
    B --> BU["Understanding Tasks\nSearch · Classification\nQuestion Answering"]
    G --> GU["Generation Tasks\nWriting · Chat · Code"]

    style T fill:#1e3a5f,color:#93c5fd
    style B fill:#14532d,color:#86efac
    style G fill:#581c87,color:#d8b4fe
    style BU fill:#1e3a2f,color:#86efac
    style GU fill:#3b1f5c,color:#d8b4fe
```

BERT: Built for Understanding Language

In 2018, researchers at Google introduced BERT, short for Bidirectional Encoder Representations from Transformers. Instead of using the full Transformer, BERT used only the encoder stack.

BERT was designed to understand language deeply.

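Unlike earlier systems that mainly read text left to right, BERT looked in both directions at once. It was trained by hiding some of the words in a sentence and asking the model to predict them from the words on both sides, which pushed it to use the full context when working out meaning.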

For example:

“The bank near the river flooded.”

BERT could better understand that “bank” means the side of a river, not a financial institution, because it considers all surrounding words together. The deciding clue, “river”, comes after “bank”, so a model reading only left to right would not yet have seen it.
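One way to picture the difference is as a question of which words each position is allowed to look at. The sketch below is illustrative only: splitting on whitespace and the helper name `visible_context` are assumptions made for this example, and real models work on subword tokens rather than whole words.

```python
# Illustrative sketch: which words a given position may "see".
# Whitespace tokenisation and the helper name are assumptions made
# for this example; they are not part of BERT itself.

def visible_context(tokens, index, bidirectional):
    """Return the tokens that position `index` is allowed to look at."""
    if bidirectional:
        # Encoder-style (BERT): every position sees the whole sentence.
        return list(tokens)
    # Left-to-right: a position sees only itself and earlier words.
    return tokens[: index + 1]


sentence = "The bank near the river flooded".split()
bank = sentence.index("bank")

left_to_right = visible_context(sentence, bank, bidirectional=False)
both_ways = visible_context(sentence, bank, bidirectional=True)

print(left_to_right)  # ['The', 'bank']
print(both_ways)      # ['The', 'bank', 'near', 'the', 'river', 'flooded']
```

With the left-to-right view, “river” is not available when “bank” is processed; with the bidirectional view it is.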

BERT became highly successful for tasks such as:

  • Search engines
  • Text classification
  • Question answering
  • Language understanding systems

GPT: Built for Generating Language

At roughly the same time, researchers at OpenAI introduced GPT, short for Generative Pre-trained Transformer.

Instead of using the encoder, GPT used only the decoder stack of the Transformer architecture. GPT had a different goal: rather than understanding language, it focused on generating language.

The model is trained to predict the next word (more precisely, the next token) given everything that came before it.

For example:

“The weather outside was cold, so I put on my…”

GPT learns that words such as “coat”, “jacket”, or “jumper” are likely next words. By repeating this prediction process billions of times across enormous amounts of text, GPT became surprisingly good at writing paragraphs, answering questions, generating code, and holding conversations.
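The idea of learning likely next words from text can be sketched with a toy model that simply counts which word follows which. This is a stand-in for illustration, not how GPT works internally: GPT uses a neural network over subword tokens and a much longer context. The miniature corpus and the function name `next_word_probabilities` are invented for this example.

```python
# Toy next-word predictor built from word-pair counts.
# A deliberately simple stand-in, NOT GPT's actual method.
from collections import Counter, defaultdict

# Hypothetical miniature corpus, invented for this example.
corpus = [
    "it was cold so i put on my coat",
    "it was cold so i put on my jacket",
    "it was raining so i put on my coat",
    "it was sunny so i put on my hat",
]

# Count which word follows each word.
following = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1


def next_word_probabilities(word):
    """Return {candidate: probability} for the word that follows `word`."""
    counts = following[word]
    total = sum(counts.values())
    return {candidate: n / total for candidate, n in counts.items()}


probs = next_word_probabilities("my")
print(probs)  # {'coat': 0.5, 'jacket': 0.25, 'hat': 0.25}
```

The toy model only looks at one previous word, so it cannot tell that “cold” makes “coat” more likely than “hat”. Using the whole preceding context is exactly what the Transformer decoder adds.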

How the Models Evolved Over Time

Neither BERT nor GPT stood still after their initial releases. Researchers improved them steadily, making each generation larger and more capable.

```mermaid
timeline
    title From Transformer to Modern AI
    2017 : Transformer - Attention Is All You Need
    2018 : BERT - Google
         : GPT-1 - OpenAI
    2019 : GPT-2 - OpenAI
    2020 : GPT-3 - OpenAI
    2022 : ChatGPT - OpenAI
    2023 : GPT-4 - OpenAI
         : Llama - Meta
```

Each version of GPT was significantly larger than the last: GPT-1 had roughly 117 million parameters, GPT-2 about 1.5 billion, and GPT-3, released in 2020, 175 billion. GPT-3 was trained on hundreds of billions of tokens of text. By the time ChatGPT appeared in late 2022, the underlying technology had become capable enough to surprise even experienced researchers.

Why the Original Paper Became Such a Success

The success of GPT and BERT also turned the original Transformer paper into one of the most cited and influential papers in modern AI.

The paper earned that status because it solved several major problems at once:

| Problem | How the Transformer Solved It |
| --- | --- |
| Slow sequential training | Parallel processing of all words simultaneously |
| Weak long-range understanding | Attention connecting any two words directly |
| Hard to scale | Architecture grows efficiently with more data and compute |
| Limited to one type of task | Flexible design adapted for understanding and generation |
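The first two solutions listed above can be made concrete with a small sketch of scaled dot-product attention, the core operation of the original paper. The vectors are made-up numbers and the function names are assumptions for this example. As a simplification, the same vectors serve as both queries and keys; a real Transformer first applies learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(vectors):
    """Scaled dot-product attention weights, one row per word.

    Simplification: the input vectors are used directly as queries
    and keys, without the learned projections of a real model.
    """
    scale = math.sqrt(len(vectors[0]))
    rows = []
    for query in vectors:
        # Score this word against every word, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in vectors
        ]
        rows.append(softmax(scores))
    return rows


# Made-up 2-dimensional vectors for a three-word sentence.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(vectors)

for row in weights:
    print([round(w, 2) for w in row])
```

Each row depends only on the input vectors, not on any other row, so all rows can be computed at the same time. Every word also receives a weight for every other word directly, however far apart they are in the sentence.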

Over time, nearly every major language model began using Transformer ideas. Today, systems like chatbots, coding assistants, search models, and image generators still build on the same foundation introduced in 2017.

The Bigger Lesson

One of the most interesting things about the Transformer paper is that it was not originally designed to create the AI tools we know today.

It started as a research idea to improve language translation.

Yet by solving one technical problem in a smarter way, it quietly changed the direction of AI research.

Sometimes the biggest technology shifts begin with a simple idea that people do not fully understand at first.

In this case, the idea was surprisingly simple:

Pay attention to what matters.

Further Reading