How the Transformer Became GPT and BERT (Part 2)
This is Part 2 of a three-part series. Part 1 explains the original Transformer paper and the attention mechanism. Part 3 goes deep inside GPT’s architecture.
After the Transformer paper appeared in 2017, researchers quickly realised something interesting: you did not always need the full encoder–decoder system.
Different teams began taking parts of the Transformer and adapting them for different goals. This eventually led to two famous families of models: BERT and GPT.
How the Transformer Split Into Two Directions
The original Transformer had two halves: an encoder that read and represented the input text, and a decoder that generated the output text. Researchers soon discovered that each half could be powerful on its own.
flowchart TD
T["Transformer\n2017 - Encoder + Decoder"]
T --> B["BERT\n2018 - Encoder Only\nGoogle"]
T --> G["GPT\n2018 - Decoder Only\nOpenAI"]
B --> BU["Understanding Tasks\nSearch · Classification\nQuestion Answering"]
G --> GU["Generation Tasks\nWriting · Chat · Code"]
style T fill:#1e3a5f,color:#93c5fd
style B fill:#14532d,color:#86efac
style G fill:#581c87,color:#d8b4fe
style BU fill:#1e3a2f,color:#86efac
style GU fill:#3b1f5c,color:#d8b4fe
BERT: Built for Understanding Language
In 2018, researchers at Google introduced BERT, short for Bidirectional Encoder Representations from Transformers. Instead of using the full Transformer, BERT used only the encoder part.
BERT was designed to understand language rather than to produce it.
Unlike earlier systems that mainly read text left to right, BERT looked in both directions at once, so each word could be interpreted using the words before it and the words after it.
For example:
“The bank near the river flooded.”
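BERT could better understand that “bank” means the side of a river, not a financial institution, because it considers all surrounding words together rather than reading in only one direction. Here the deciding clue, “river”, comes after “bank”, so a strictly left-to-right reader would not have seen it yet.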
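To make the difference between the two reading styles concrete, here is a minimal sketch in plain Python. It is not BERT itself: the function names are hypothetical, and it only models which words are visible from a given position, not how a real encoder combines them.

```python
def bidirectional_mask(n):
    """Every position may look at every other position (encoder-style)."""
    return [[True] * n for _ in range(n)]


def left_to_right_mask(n):
    """Position i may only look at positions 0..i (left-to-right reading)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_words(words, mask, target):
    """Return the words available when interpreting the first occurrence of `target`."""
    i = words.index(target)
    return [w for j, w in enumerate(words) if mask[i][j]]


words = ["The", "bank", "near", "the", "river", "flooded"]

both_ways = visible_words(words, bidirectional_mask(len(words)), "bank")
one_way = visible_words(words, left_to_right_mask(len(words)), "bank")

print(both_ways)  # the whole sentence, including "river"
print(one_way)    # only the words up to and including "bank"
```

With the bidirectional mask, “river” is available when interpreting “bank”. With the left-to-right mask it is not.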
BERT became highly successful for tasks such as:
- Search engines
- Text classification
- Question answering
- Language understanding systems
GPT: Built for Generating Language
Earlier that same year, researchers at OpenAI had introduced GPT, short for Generative Pre-trained Transformer.
Instead of using the encoder, GPT mainly used the decoder part of the Transformer architecture. GPT had a different goal: rather than understanding language, it focused on generating language.
The model is trained to predict which word should come next, given all the words so far.
For example:
“The weather outside was cold, so I put on my…”
GPT learns that words such as “coat”, “jacket”, or “jumper” are likely next words. By repeating this prediction process billions of times across enormous amounts of text, GPT became surprisingly good at writing paragraphs, answering questions, generating code, and holding conversations.
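The idea of next-word prediction can be illustrated with a deliberately tiny stand-in. The sketch below is an assumption-laden toy, not how GPT works internally: it counts which word follows which in a made-up four-sentence corpus, whereas GPT uses a neural network that conditions on all the preceding text. The corpus and function name are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = [
    "it was cold so i put on my coat",
    "it was cold so i put on my jacket",
    "it was cold so i put on my coat and hat",
    "it was warm so i took off my jumper",
]

# Count how often each word follows each other word.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for current, following in zip(tokens, tokens[1:]):
        next_word_counts[current][following] += 1


def next_word_probabilities(word):
    """Turn the follow-up counts for `word` into probabilities."""
    counts = next_word_counts.get(word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


probs = next_word_probabilities("my")
print(probs)
print(max(probs, key=probs.get))  # the most likely next word in this toy corpus
```

Even this crude counter ranks “coat”, “jacket”, and “jumper” as plausible continuations of “my”. The real model's advantage is that it takes the whole preceding sentence into account, not just the last word.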
How the Models Evolved Over Time
Neither BERT nor GPT stood still after their initial releases. Researchers improved them steadily, making each generation larger and more capable.
timeline
title From Transformer to Modern AI
2017 : Transformer - Attention Is All You Need
2018 : BERT - Google
: GPT-1 - OpenAI
2019 : GPT-2 - OpenAI
2020 : GPT-3 - OpenAI
2022 : ChatGPT - OpenAI
2023 : GPT-4 - OpenAI
: Llama - Meta
Each version of GPT was significantly larger than the last. GPT-3, released in 2020, was trained on hundreds of billions of words. By the time ChatGPT appeared in late 2022, the underlying technology had become capable enough to surprise even experienced researchers.
Why the Original Paper Became Such a Success
The success of GPT and BERT also turned the original Transformer paper into one of the most cited and influential papers in modern AI.
That influence is easier to understand once you see that the paper solved several major problems at once:
| Problem | How the Transformer Solved It |
|---|---|
| Slow sequential training | Parallel processing of all words simultaneously |
| Weak long-range understanding | Attention connecting any two words directly |
| Hard to scale | Architecture grows efficiently with more data and compute |
| Limited to one type of task | Flexible design adapted for understanding and generation |
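The first two rows of the table can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the paper's full mechanism: the two-dimensional vectors are hypothetical, the same vectors serve as queries, keys, and values, and the learned projections and multiple heads of a real Transformer are left out.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    width = len(values[0])
    outputs, all_weights = [], []
    # Each query is handled independently of the others, which is why
    # real implementations can process every position in parallel.
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical vectors for a four-word sequence: the first and last
# words are similar, and the two middle words are similar.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = attention(vectors, vectors, vectors)

print(weights[0])  # how strongly the first word attends to each position
```

In this toy input the first word gives the last word as much weight as it gives itself, even though the two sit at opposite ends of the sequence. The distance between positions plays no part in the score, which is the sense in which attention connects any two words directly.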
Over time, nearly every major language model began using Transformer ideas. Today, systems like chatbots, coding assistants, search models, and image generators still build on the same foundation introduced in 2017.
The Bigger Lesson
One of the most interesting things about the Transformer paper is that it was not originally designed to create the AI tools we know today.
It started as a research idea to improve language translation.
Yet by solving one technical problem in a smarter way, it quietly changed the direction of AI research.
Sometimes the biggest technology shifts begin with a simple idea that people do not fully understand at first.
In this case, the idea was surprisingly simple:
Pay attention to what matters.
Further Reading
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)
- GPT-1: Improving Language Understanding by Generative Pre-Training (Radford et al., 2018)
- Attention Is All You Need - Original Paper (Vaswani et al., 2017)