The Research Paper That Changed AI - The Story Behind GPT and BERT (Part 1)

2025-12-15

Modern AI tools such as chatbots, search assistants, and text generators may feel new, but many of them can be traced back to one important research paper published in 2017: “Attention Is All You Need”. It introduced a new model architecture known as the Transformer, which later became the foundation for systems like GPT and BERT.

This is Part 1 of a multi-part series. Part 2 covers how the Transformer became GPT and BERT. Part 3 goes deep inside GPT’s architecture.

The Problem With Earlier Language Models

Before this paper, language models mostly relied on systems called RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks). These models processed words one by one in order.

While useful, they had two major problems:

They were slow to train because each word had to wait for the previous one to be processed
They struggled to understand relationships between words that were far apart in a sentence
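
The first problem can be seen in a minimal sketch of a recurrent loop. This is a toy illustration rather than a real RNN: the function names, the scalar "word values", and the weights `w_h` and `w_x` are all made up for the example. The point is only that each step needs the hidden state produced by the step before it, so the work cannot be spread across words in parallel.

```python
import math

def rnn_step(hidden, x, w_h=0.5, w_x=1.0):
    """One toy recurrent update: the new state depends on the previous state."""
    return math.tanh(w_h * hidden + w_x * x)

def run_rnn(inputs, initial_hidden=0.0):
    """Process inputs strictly in order, returning the hidden state after each step."""
    hidden = initial_hidden
    states = []
    for x in inputs:
        # This step cannot start until the previous hidden state exists.
        hidden = rnn_step(hidden, x)
        states.append(hidden)
    return states

# Hypothetical scalar stand-ins for four word embeddings (values are made up).
word_values = [0.1, 0.4, -0.2, 0.9]
states = run_rnn(word_values)
```

Because everything the model knows about earlier words is squeezed into the single `hidden` value, information from the start of a long sentence has to survive every later update, which is one way to picture the second problem.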

flowchart LR
    subgraph RNN["RNN / LSTM - Sequential"]
        direction LR
        r1[The] --> r2[weather] --> r3[was] --> r4[terrible]
    end
    subgraph TF["Transformer - Parallel with Attention"]
        direction LR
        t1[The] & t2[weather] & t3[was] & t4[terrible]
        t1 <-.->|attention| t4
        t2 <-.->|attention| t4
    end

    style RNN fill:#3b1f1f,color:#fca5a5
    style TF fill:#1e3a2f,color:#86efac

For example, imagine the sentence:

“The football match was delayed because the weather was terrible.”

To understand what caused the delay, the model must connect “weather” to “delayed” even though several words sit in between. Older models often struggled with these long-distance relationships.

The Big Idea: Attention

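The 2017 paper built its model around a simple but powerful idea called attention. Attention had already been used alongside RNNs, but the Transformer relied on it entirely and dropped recurrence altogether.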

Instead of reading words one after another, the Transformer could look at all words in a sentence at the same time and decide which words mattered most to understanding meaning. This made training much faster and improved performance on language tasks.

In simple terms, attention works like this:

If a model reads:

“Sarah dropped the glass because it was fragile.”

The model has to work out what “it” refers to. Using attention, the model can link “it” to “glass” rather than “Sarah”, with the word “fragile” serving as a clue.

This ability to notice relationships helped models understand context much better.
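
To make this concrete, here is a minimal sketch of the scaled dot-product attention formula from the paper, softmax(QK^T / sqrt(d_k)) V, written in plain Python. The two-dimensional token vectors are hand-picked for illustration, and the same vectors are reused as queries, keys, and values. That is a simplification: a real Transformer learns separate projections for each role, so treat this as a sketch of the mechanism, not of a trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors: "it" is deliberately placed close to "glass".
tokens = ["Sarah", "glass", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]

outputs, weights = attention(vectors, vectors, vectors)
it_weights = dict(zip(tokens, weights[2]))  # how much "it" attends to each token
```

With these made-up vectors, `it_weights["glass"]` comes out larger than `it_weights["Sarah"]`. Every token's weights are computed independently of the others, which is what lets the whole sentence be handled at once.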

The Original Transformer Architecture

The original Transformer was built with two main parts:

Encoder - reads and understands the input text
Decoder - generates the output text

flowchart TB
    IN([Input Text]) --> E

    subgraph E["Encoder - Understanding"]
        EA[Self-Attention] --> EF[Feed Forward]
    end

    E --> CTX[Encoded Context]
    CTX --> D

    subgraph D["Decoder - Generating"]
        DA[Masked Self-Attention] --> DC[Cross-Attention] --> DF[Feed Forward]
    end

    D --> OUT([Output Text])

    style E fill:#1e3a5f,color:#93c5fd
    style D fill:#14532d,color:#86efac
    style CTX fill:#3b2f0f,color:#fcd34d

The encoder and decoder worked together, especially for translation tasks such as converting English into French. Because the model could process all the words of a sequence in parallel during training rather than one at a time, it was both faster to train and easier to scale.
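
The “Masked Self-Attention” box in the decoder diagram is what keeps generation honest: while producing a given output word, the decoder may look at earlier output words but not at later ones. A minimal sketch of that rule follows. The helper names and the example tokens are made up for illustration, and in a full model the mask is applied to the attention scores before the softmax rather than used to filter a word list.

```python
def causal_mask(n):
    """Row i marks which positions token i may attend to (True = allowed)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def visible_tokens(tokens):
    """For each position, list the tokens the mask leaves visible."""
    mask = causal_mask(len(tokens))
    return {i: [t for t, allowed in zip(tokens, row) if allowed]
            for i, row in enumerate(mask)}

# Hypothetical decoder output so far, for an English-to-French translation.
output_tokens = ["Le", "temps", "était", "terrible"]
mask = causal_mask(len(output_tokens))
visible = visible_tokens(output_tokens)
```

Here `visible[0]` contains only the first token, while `visible[3]` contains all four. The encoder has no such restriction, since the whole input sentence is available from the start.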

At the time, the paper was mainly focused on machine translation, not chatbots or large language models. However, the researchers also suggested that this architecture could work well for many other language tasks in the future. That prediction turned out to be very important.

Why This Paper Became So Important

Interestingly, the paper did not immediately become famous outside research circles.

At first, it was simply seen as a better method for handling language tasks. But over time, researchers realised the Transformer solved a major problem: it allowed much larger models to be trained more efficiently than earlier systems.

That single breakthrough opened the door to a completely new generation of AI.

And this is where the story of GPT and BERT begins - covered in Part 2.
