In 2017, Google researchers published a paper with a bold claim: "Attention Is All You Need." They weren't talking about meditation – they were announcing the death of sequential processing in AI.
Transformers shattered the RNN paradigm by processing entire sequences simultaneously. Instead of reading word by word like humans do, Transformers see the entire document at once and decide which parts deserve attention. Think of it like having superhuman peripheral vision – you can focus on multiple important details simultaneously without losing track of the big picture.
Self-Attention Mechanism
Input Sequence: "The cat sat on the mat"
Step 1: Create Query (Q), Key (K), Value (V) for each word
────────────────────────────────────────────────────────
Word │ Query   Key   Value
─────┼───────────────────────
The  │  q₁      k₁     v₁
cat  │  q₂      k₂     v₂
sat  │  q₃      k₃     v₃
on   │  q₄      k₄     v₄
the  │  q₅      k₅     v₅
mat  │  q₆      k₆     v₆
Step 2: Compute attention scores (Q·K relationships)
────────────────────────────────────────────────────
      │  The   cat   sat   on    the   mat
  The │  0.9   0.1   0.2   0.1   0.8   0.1   ← "The" attends to "the"
  cat │  0.1   0.9   0.3   0.1   0.1   0.6   ← "cat" attends to "mat"
  sat │  0.2   0.7   0.9   0.4   0.2   0.3   ← "sat" attends to "cat"
...
Step 3: Weighted combination of Values based on attention
─────────────────────────────────────────────────────────
Output₁ = 0.9×v₁ + 0.1×v₂ + 0.2×v₃ + ... (for "The")
Output₂ = 0.1×v₁ + 0.9×v₂ + 0.3×v₃ + ... (for "cat")
...
Result: Each word's representation includes context from ALL other words
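Before the full implementation, here is a minimal single-head sketch of Steps 1–3 in PyTorch. The weights and embeddings are randomly initialized for illustration (not the values from the diagram), and the 4-dimensional embedding size is an assumption chosen to keep the tensors small:

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative embeddings for the 6 tokens of "The cat sat on the mat"
d = 4                                   # toy embedding size (assumed)
x = torch.randn(6, d)                   # [seq_len, d]

# Step 1: project every token into a query, key, and value vector
W_q, W_k, W_v = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
Q, K, V = x @ W_q, x @ W_k, x @ W_v     # each [6, d]

# Step 2: attention scores are scaled dot products between queries and keys
scores = Q @ K.T / d ** 0.5             # [6, 6], like the matrix above
weights = F.softmax(scores, dim=-1)     # every row sums to 1

# Step 3: each output is a weighted combination of all value vectors
outputs = weights @ V                   # [6, d], context from ALL tokens

print(weights.round(decimals=2))        # who attends to whom
print(outputs.shape)                    # torch.Size([6, 4])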
Self-Attention Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        # Linear projections for Q, K, V
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.output = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, d_model = x.shape

        # Create Q, K, V matrices
        Q = self.query(x)  # [batch, seq_len, d_model]
        K = self.key(x)
        V = self.value(x)

        # Reshape for multi-head attention
        Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attention_weights = F.softmax(scores, dim=-1)

        # Apply attention to values
        attention_output = torch.matmul(attention_weights, V)

        # Concatenate heads and project
        attention_output = attention_output.transpose(1, 2).contiguous()
        attention_output = attention_output.view(batch_size, seq_len, d_model)

        return self.output(attention_output)
# Example usage
model = MultiHeadSelfAttention(d_model=512, num_heads=8)
sequence = torch.randn(32, 100, 512) # Batch=32, SeqLen=100, Features=512
output = model(sequence)
print(f"Attention output: {output.shape}") # [32, 100, 512]
Here's how the magic works: every word (or token) gets three learned representations – a query, a key, and a value. The Transformer computes attention by asking "how much should this word care about every other word in the sequence?" It does this in parallel across multiple "attention heads," allowing the model to capture different types of relationships simultaneously – grammatical, semantic, and contextual.
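If you'd rather not write the attention math by hand, PyTorch 2.0 and later also ships a fused kernel for this core step. A minimal sketch, assuming the same [batch, heads, seq_len, head_dim] layout the class above produces after reshaping:

import torch
import torch.nn.functional as F

# Same tensor layout as inside MultiHeadSelfAttention.forward
Q = torch.randn(32, 8, 100, 64)    # [batch, heads, seq_len, head_dim]
K = torch.randn(32, 8, 100, 64)
V = torch.randn(32, 8, 100, 64)

# Fused softmax(QKᵀ / sqrt(head_dim)) @ V, equivalent to the manual version
out = F.scaled_dot_product_attention(Q, K, V)
print(out.shape)                   # torch.Size([32, 8, 100, 64])

The fused call typically dispatches to a memory-efficient backend when one is available, but the arithmetic is the same as in the hand-written implementation.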
The breakthrough came from abandoning sequential processing entirely. Where RNNs crawl through sequences one step at a time, Transformers process everything in parallel. This makes them incredibly fast on modern GPUs and allows them to capture long-range dependencies that would vanish in traditional RNNs.
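The contrast is easy to see in code. Below is a deliberately simplified sketch (not a real RNN cell, and the weights are illustrative) comparing the step-by-step recurrence with attention's single batched matrix product:

import torch

x = torch.randn(100, 512)               # 100 timesteps, 512 features

# RNN-style: step t cannot start until step t-1 has finished
W = torch.randn(512, 512) * 0.01         # simplified recurrence weights (illustrative)
h = torch.zeros(512)
for t in range(x.shape[0]):
    h = torch.tanh(x[t] + h @ W)         # 100 dependent, sequential updates

# Attention-style: every pairwise interaction in one parallel matmul
scores = x @ x.T / 512 ** 0.5            # [100, 100], no step waits on another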
But there's a catch. Attention scales quadratically – double your document length, and you quadruple the computational cost. That makes long documents expensive to process. And because the model attends to everything it is given, accepting natural language input opens the door to a new class of vulnerabilities: prompt injection attacks, malicious instructions disguised as innocent text.
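To put the quadratic growth in concrete terms, here is a back-of-the-envelope sketch of the memory needed just for the attention-score matrices (one example, 8 heads, 32-bit floats; real implementations vary):

# Rough memory for the [heads, seq_len, seq_len] attention matrices, fp32
num_heads, bytes_per_float = 8, 4
for seq_len in (1_024, 2_048, 4_096, 8_192):
    mb = num_heads * seq_len * seq_len * bytes_per_float / 1e6
    print(f"{seq_len:>5} tokens -> {mb:8.1f} MB")   # 4x memory each doubling

Every doubling of the input quadruples this figure, which is exactly the scaling problem described above.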