The Research Paper That Changed AI - The Story Behind GPT and BERT (Part 1)
Modern AI tools such as chatbots, search assistants, and text generators may feel new, but many of them can be traced back to one important research paper published in 2017: “Attention Is All You Need”. It introduced a new model architecture known as the Transformer, which later became the foundation for systems like GPT and BERT.
This is Part 1 of a series. Part 2 covers how the Transformer became GPT and BERT. Part 3 goes deep inside GPT’s architecture.
The Problem With Earlier Language Models
Before this paper, language models mostly relied on systems called RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks). These models processed words one by one in order.
While useful, they had two major problems:
- They were slow to train because each word had to wait for the previous one to be processed
- They struggled to understand relationships between words that were far apart in a sentence
flowchart LR
subgraph RNN["RNN / LSTM - Sequential"]
direction LR
r1[The] --> r2[weather] --> r3[was] --> r4[terrible]
end
subgraph TF["Transformer - Parallel with Attention"]
direction LR
t1[The] & t2[weather] & t3[was] & t4[terrible]
t1 <-.->|attention| t4
t2 <-.->|attention| t4
end
style RNN fill:#3b1f1f,color:#fca5a5
style TF fill:#1e3a2f,color:#86efac
For example, imagine the sentence:
“The football match was delayed because the weather was terrible.”
To understand what caused the delay, the model must connect “weather” to “delayed” even though several words sit in between. Older models often struggled with these long-distance relationships.
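The sequential bottleneck described above can be sketched in a few lines of Python. This is a toy illustration, not a real RNN or LSTM: the function names, the weights `w_state` and `w_input`, and the numeric stand-ins for words are all assumptions made up for this example. The point is only that each step needs the state produced by the step before it, so the loop cannot be run in parallel.

```python
import math


def rnn_step(prev_state, token_value, w_state=0.5, w_input=1.0):
    """One toy recurrent step: the new state is computed from the previous state."""
    return math.tanh(w_state * prev_state + w_input * token_value)


def run_rnn(token_values):
    """Process tokens strictly in order; step t cannot start until step t-1 is done."""
    state = 0.0
    states = []
    for value in token_values:
        state = rnn_step(state, value)  # depends on the state from the previous word
        states.append(state)
    return states


# Hypothetical numeric stand-ins for the words "The weather was terrible".
tokens = [0.1, 0.9, 0.2, 0.8]
states = run_rnn(tokens)
print(states)
```

Changing the first value in `tokens` changes every later state, including the last one, because information from early words can only reach later words by passing through each intermediate step.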
The Big Idea: Attention
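The 2017 paper put a simple but powerful idea called attention at the centre of the model. Attention had already been used as an add-on to RNN-based translation systems; the paper’s step was to drop recurrence entirely and build the whole architecture on attention.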
Instead of reading words one after another, the Transformer could look at all words in a sentence at the same time and decide which words mattered most to understanding meaning. This made training much faster and improved performance on language tasks.
In simple terms, attention works like this:
If a model reads:
“Sarah dropped the glass because it was fragile.”
The model needs to work out what “it” refers to. Attention lets the model link “it” to “glass” rather than “Sarah”, with “fragile” acting as the clue.
This ability to notice relationships helped models understand context much better.
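The “Sarah dropped the glass” example can be made concrete with a minimal sketch of scaled dot-product attention, the formula used in the paper: softmax(QKᵀ / √d) V. The two-dimensional vectors below are hand-picked assumptions chosen so that the query for “it” points in the same direction as “glass”; a real model learns its query, key, and value vectors during training and uses far more dimensions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # The output is a weighted average of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Hand-picked toy vectors (assumptions for illustration, not learned embeddings).
words = ["Sarah", "glass", "fragile"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
values = keys
query_it = [0.0, 2.0]  # hypothetical query vector for the word "it"

weights, output = attention(query_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these vectors, “glass” receives the largest weight and “Sarah” the smallest. Every word is scored against the query in one pass, with no loop that depends on a previous step, which is why the computation is easy to run in parallel.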
The Original Transformer Architecture
The original Transformer was built with two main parts:
- Encoder - reads and understands the input text
- Decoder - generates the output text
flowchart TB
IN([Input Text]) --> E
subgraph E["Encoder - Understanding"]
EA[Self-Attention] --> EF[Feed Forward]
end
E --> CTX[Encoded Context]
CTX --> D
subgraph D["Decoder - Generating"]
DA[Masked Self-Attention] --> DC[Cross-Attention] --> DF[Feed Forward]
end
D --> OUT([Output Text])
style E fill:#1e3a5f,color:#93c5fd
style D fill:#14532d,color:#86efac
style CTX fill:#3b2f0f,color:#fcd34d
The encoder and decoder worked together, especially for translation tasks such as converting English into French. Because the system processed information in parallel rather than word by word, it was both faster and easier to scale.
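The “Masked Self-Attention” step in the decoder means that, when producing output, each position may only attend to itself and to earlier positions, never to words that have not been generated yet. A minimal sketch of that idea follows; the helper names `causal_mask` and `masked_softmax` and the example scores are assumptions for illustration, not the paper’s code.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax in which disallowed positions receive a weight of exactly zero."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    highest = max(masked)  # position 0 is always allowed, so this is finite
    exps = [math.exp(s - highest) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)

# Made-up attention scores for the second output position (index 1).
row = masked_softmax([0.2, 1.0, 0.5, 0.9], mask[1])
print(row)
```

Here the second position spreads its attention over positions 0 and 1 only; positions 2 and 3 get zero weight however high their raw scores are.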
At the time, the paper was mainly focused on machine translation, not chatbots or large language models. However, the researchers also suggested that this architecture could work well for many other language tasks in the future. That prediction turned out to be very important.
Why This Paper Became So Important
Interestingly, the paper did not immediately become famous outside research circles.
At first, it was simply seen as a better method for handling language tasks. But over time, researchers realised the Transformer solved a major problem: it allowed much larger models to be trained more efficiently than earlier systems.
That single breakthrough opened the door to a completely new generation of AI.
And this is where the story of GPT and BERT begins - covered in Part 2.
Further Reading
- Attention Is All You Need - Original Paper (Vaswani et al., 2017)
- Google Research Blog