Learning Position: Trainable Embeddings from BERT to GPT
by John Robinson @johnrobinsn

Part 2: Learned Positional Embeddings #
Why Transformers Need Help Knowing Where | A Six-Part Series
Series Navigation:
- Part 1: The Position Problem & Sinusoidal PE
- Part 2: Learned Positional Embeddings (you are here)
- Part 3: RoPE (Rotary Position Embeddings) - Coming Soon
- Part 4: ALiBi (Attention with Linear Biases) - Coming Soon
- Part 5: PoPE (Polar Coordinate Embeddings) - Coming Soon
- Part 6: Practitioner's Guide - Coming Soon
Introduction #
In Part 1, we explored sinusoidal positional encodings, a fixed, mathematical approach to encoding position. But what if we just... let the model figure it out?
Learned positional embeddings take a radically simple approach: create a trainable lookup table where each position gets its own learnable vector. This is the approach used by BERT, GPT-2, and many other influential models.
This notebook covers:
- How learned PE works (spoiler: it's just `nn.Embedding`)
- Implementation and visualization
- The critical extrapolation limitation
- When to choose learned vs. fixed encodings
Paper: BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018)
# Install dependencies (run once)
!pip install -q numpy matplotlib seaborn torch
# Core libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple
# Set visualization styles
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = [10, 6]
# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
print("โ All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
✓ All libraries imported successfully!
PyTorch version: 2.9.1+cu128
1. The Simplest Approach: Just Learn It #
Motivation #
Instead of designing a clever mathematical function to encode position, learned positional embeddings ask: Why not let the model learn the best way to represent each position?
⚠️ Key Clarification: What's Given vs. What's Learned #
This is a common point of confusion, so let's be precise:
| Component | Status | Explanation |
|---|---|---|
| Position index (0, 1, 2, ...) | Explicitly provided | The model is told "this token is at position 5"; it doesn't discover this from context |
| Embedding vector for each position | Learned via backpropagation | The model learns what values best represent "position 5" |
The position index IS the ground truth signal. When we process a sequence, we don't ask the model to figure out which position each token occupies; we explicitly look up row 0 for position 0, row 5 for position 5, etc. This mapping is hardwired.
What's learned is HOW to represent each position: the actual d-dimensional vector of floats that gets added to the token embedding.
Analogy: Think of a classroom seating chart. The seat numbers (positions 0, 1, 2...) are fixed and labeled on each desk. What's "learned" over time is the personality/reputation associated with each seat ("the front row is for eager students"). The model learns useful representations, but it's never confused about which seat is which.
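To make the split concrete, here is a minimal sketch (toy sizes; variable names like `pe_table` are illustrative, not from any library): the embedding table's weight matrix is the learned part, while the position indices are plain integers we supply.

```python
import torch
import torch.nn as nn

d_model, max_len = 8, 16

pe_table = nn.Embedding(max_len, d_model)    # LEARNED: this weight matrix is updated by backprop
position_ids = torch.arange(6).unsqueeze(0)  # GIVEN: indices 0..5 are supplied, never inferred

pos_vectors = pe_table(position_ids)         # look up one row per known position
print(pos_vectors.shape)                     # torch.Size([1, 6, 8])
print(pe_table.weight.requires_grad)         # True  -> a trainable representation
print(position_ids.requires_grad)            # False -> just an index, not a parameter
```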
How It Works #
- Create a lookup table (matrix) with one row per position, randomly initialized
- During training, use the known position index to look up the corresponding row
- Backpropagation updates the embedding values to minimize loss
The lookup-and-add step is simply `output = token_embedding + position_embedding[i]`, where `position_embedding[i]` means: "look up row `i` of the embedding table."
Mathematical Formulation #
Given:
- Maximum sequence length $L_{\max}$
- Embedding dimension $d_{\text{model}}$

The position embedding matrix $P \in \mathbb{R}^{L_{\max} \times d_{\text{model}}}$ is randomly initialized and trained by backpropagation like any other weight.

For a token at position $i$ with token embedding $x_i$, the input to the first transformer layer is $x_i + P_i$, where $P_i$ denotes row $i$ of $P$.
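As a quick sanity check on this formulation, here is a minimal sketch in raw PyTorch (toy sizes, no `nn.Module` wrapper; `P`, `x_i`, and `h_i` are just the symbols above written as tensors):

```python
import torch

L_max, d_model = 8, 4
P = torch.randn(L_max, d_model, requires_grad=True)  # the learnable table P

i = 5                        # position index: given, not learned
x_i = torch.randn(d_model)   # token embedding of the token sitting at position i
h_i = x_i + P[i]             # input to the first transformer layer

print(h_i.shape)  # torch.Size([4])
```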
What This Is NOT #
To further clarify, learned positional embeddings are not:
- Learning to infer position from surrounding context
- Discovering position through attention patterns
- Figuring out "where am I?" from the content
Those would be fascinating (and much harder!) problems. Instead, this is simply: "Given that I know this token is at position `i`, what vector best represents that position?"
2. Implementation #
Let's implement learned positional embeddings as a PyTorch module.
class LearnedPositionalEmbedding(nn.Module):
"""
Learned positional embeddings as used in BERT and GPT-2.
Each position (0, 1, 2, ..., max_len-1) has its own trainable
embedding vector of dimension d_model.
Args:
d_model: Dimension of the embeddings
max_len: Maximum sequence length
dropout: Dropout probability
"""
def __init__(self, d_model: int, max_len: int = 512, dropout: float = 0.1):
super().__init__()
self.dropout = nn.Dropout(p=dropout)
# The key component: a learnable embedding table
# Shape: (max_len, d_model)
self.position_embeddings = nn.Embedding(max_len, d_model)
# Register position indices as a buffer (not a parameter)
# This avoids creating new tensors on every forward pass
self.register_buffer(
'position_ids',
torch.arange(max_len).unsqueeze(0) # Shape: (1, max_len)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Add learned positional embeddings to input.
Args:
x: Input tensor of shape (batch_size, seq_len, d_model)
Returns:
Tensor with positional embeddings added
"""
seq_len = x.size(1)
# Get position IDs for current sequence length
position_ids = self.position_ids[:, :seq_len] # Shape: (1, seq_len)
# Look up position embeddings
position_embeds = self.position_embeddings(position_ids) # (1, seq_len, d_model)
# Add to input (broadcasts across batch dimension)
x = x + position_embeds
return self.dropout(x)
# Test the module
batch_size = 2
seq_len = 10
d_model = 64
learned_pe = LearnedPositionalEmbedding(d_model, max_len=512)
x = torch.randn(batch_size, seq_len, d_model)
output = learned_pe(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"\nPosition embedding table shape: {learned_pe.position_embeddings.weight.shape}")
print(f" โ {512} positions ร {d_model} dimensions")
print(f"\nTrainable parameters: {learned_pe.position_embeddings.weight.numel():,}")
print("\nโ LearnedPositionalEmbedding module working!")
Input shape: torch.Size([2, 10, 64])
Output shape: torch.Size([2, 10, 64])
Position embedding table shape: torch.Size([512, 64])
→ 512 positions × 64 dimensions
Trainable parameters: 32,768
✓ LearnedPositionalEmbedding module working!
3. What Do Learned Embeddings Look Like After Training? #
At initialization, learned positional embeddings are just random noise; there's no meaningful structure. However, after training on large corpora, learned embeddings typically develop smooth, structured patterns that often resemble sinusoidal encodings.
Research has shown that:
- Nearby positions develop similar embeddings: the model learns that position 5 should be more similar to position 6 than to position 100
- Smooth gradients emerge: embeddings change gradually across positions rather than randomly
- Task-specific patterns appear: unlike fixed sinusoidal PE, learned embeddings can adapt to the specific positional patterns useful for the training task
Empirical finding: When visualized as heatmaps, trained learned embeddings often show wave-like patterns similar to sinusoidal PE, suggesting the model "rediscovers" useful mathematical structure through gradient descent.
This convergence to structured patterns provides some validation that sinusoidal encodings were a reasonable design choice, but learned embeddings have the flexibility to deviate when beneficial for the task.
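If you want to see this for yourself, the sketch below pulls BERT's trained position embedding table and plots pairwise cosine similarities. It assumes the Hugging Face `transformers` package is installed and can download the `bert-base-uncased` checkpoint, neither of which is part of this notebook's setup above.

```python
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
pe = bert.embeddings.position_embeddings.weight.detach()  # shape: (512, 768)

# Cosine similarity between every pair of position vectors
pe_norm = F.normalize(pe, dim=-1)
similarity = (pe_norm @ pe_norm.T).numpy()  # shape: (512, 512)

plt.figure(figsize=(6, 5))
sns.heatmap(similarity, cmap="viridis")
plt.title("Cosine similarity of BERT's trained position embeddings")
plt.xlabel("Position")
plt.ylabel("Position")
plt.show()
```

A bright band along the diagonal (nearby positions more similar than distant ones) is the smooth structure described above.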
4. The Parameter Cost: Pros, Cons, and Trade-offs #
Advantages #
| Aspect | Benefit |
|---|---|
| ✅ Flexibility | Can adapt to task-specific positional patterns |
| ✅ Simplicity | Very easy to implement (just nn.Embedding) |
| ✅ Proven | Used successfully in BERT, GPT-2, GPT-3 |
Disadvantages #
| Aspect | Drawback |
|---|---|
| ❌ Fixed Length | Cannot handle sequences longer than max_len |
| ❌ Extrapolation | Poor generalization to unseen positions |
| ❌ Parameters | Adds max_len × d_model trainable parameters |
| ❌ Data Hungry | Needs enough data to learn good representations |
The Parameter Overhead #
Unlike sinusoidal PE (which has zero trainable parameters), learned PE adds significant overhead: max_len × d_model extra trainable parameters.
These parameters consume GPU memory and must be trained; positions that appear rarely in training data may have poorly-learned representations. For long-context models (32k, 100k+ tokens), this becomes impractical: a 100k × 4096 table would add roughly 400M parameters just for positions. This is why modern long-context models use RoPE or ALiBi instead.
Parameter Count Comparison #
# Parameter count for different configurations
configs = [
("BERT-base", 512, 768),
("GPT-2 small", 1024, 768),
("GPT-2 medium", 1024, 1024),
("GPT-2 large", 1024, 1280),
]
print("=" * 60)
print("LEARNED PE PARAMETER COUNTS")
print("=" * 60)
print(f"\n{'Model':<15} {'Max Len':<10} {'d_model':<10} {'PE Params':<15}")
print("-" * 60)
for name, max_len, d_model in configs:
params = max_len * d_model
print(f"{name:<15} {max_len:<10} {d_model:<10} {params:,}")
print("-" * 60)
print("\nNote: Sinusoidal PE has 0 trainable parameters!")
print(" These PE parameters are ~0.4-1.3M per model.")
============================================================
LEARNED PE PARAMETER COUNTS
============================================================
Model Max Len d_model PE Params
------------------------------------------------------------
BERT-base 512 768 393,216
GPT-2 small 1024 768 786,432
GPT-2 medium 1024 1024 1,048,576
GPT-2 large 1024 1280 1,310,720
------------------------------------------------------------
Note: Sinusoidal PE has 0 trainable parameters!
These PE parameters are ~0.4-1.3M per model.
5. The Extrapolation Problem #
The most critical limitation of learned positional embeddings: they cannot handle sequences longer than max_len.
This is because the embedding table has a fixed size. Position 512 in a model trained with max_len=512 simply doesn't exist: there's no vector to look up!
# Demonstrate the extrapolation limitation
print("=" * 70)
print("LEARNED PE: EXTRAPOLATION LIMITATION")
print("=" * 70)
# Create module with max_len=100
learned_pe_short = LearnedPositionalEmbedding(d_model=64, max_len=100)
test_lengths = [50, 100, 101]
print(f"\nModule created with max_len=100")
print(f"\n{'Sequence Length':<20} {'Status':<30}")
print("-" * 70)
for length in test_lengths:
try:
x_test = torch.randn(1, length, 64)
_ = learned_pe_short(x_test)
print(f"{length:<20} โ Works")
except Exception as e:
error_type = type(e).__name__
print(f"{length:<20} โ FAILS ({error_type})")
print("\n" + "=" * 70)
print("CRITICAL: Sequences longer than max_len cause errors!")
print("=" * 70)
print("\nThis is why modern LLMs (2023+) prefer RoPE or ALiBi:")
print("โข RoPE: Computes rotation on-the-fly, no position limit")
print("โข ALiBi: Computes bias on-the-fly, no position limit")
======================================================================
LEARNED PE: EXTRAPOLATION LIMITATION
======================================================================
Module created with max_len=100
Sequence Length Status
----------------------------------------------------------------------
50                   ✓ Works
100                  ✓ Works
101                  ✗ FAILS (RuntimeError)
======================================================================
CRITICAL: Sequences longer than max_len cause errors!
======================================================================
This is why modern LLMs (2023+) prefer RoPE or ALiBi:
• RoPE: Computes rotation on-the-fly, no position limit
• ALiBi: Computes bias on-the-fly, no position limit
6. Should You Use Learned PE? (Probably Not for New Projects) #
The Honest Assessment #
For new projects in 2025+, there's little reason to choose learned positional embeddings over alternatives like RoPE:
| Approach | Parameters | Extrapolation | Complexity |
|---|---|---|---|
| Learned PE | max_len × d_model | ❌ Hard limit at max_len | Simple |
| Sinusoidal PE | 0 | ⚠️ Degrades beyond training | Simple |
| RoPE | 0 | ✅ Generalizes well | Moderate |
| ALiBi | 0 | ✅ Generalizes well | Simple |
RoPE has become the de facto standard for modern LLMs (LLaMA, Mistral, GPT-4) because it offers:
- Zero additional parameters
- No sequence length limit
- Strong empirical performance
When Learned PE Still Makes Sense #
- Fine-tuning existing models: If you're using BERT, GPT-2, or other pre-trained models that already use learned PE, you'd continue with that architecture
- Fixed-length classification: Tasks like sentiment analysis where you'll never exceed `max_len` and don't need extrapolation
- Educational purposes: `nn.Embedding` is trivially simple to understand compared to RoPE's rotation matrices
- Reproducing published results: When replicating papers that used learned PE
The Bottom Line #
Recommendation: Unless you have a specific reason to use learned PE (backwards compatibility, reproducing prior work), default to RoPE for new projects. It's what the field has converged on, and the extrapolation problem alone makes learned PE a risky choice for any application where sequence lengths might grow.
We include learned PE in this series because it's historically important (BERT, GPT-2) and conceptually illuminating, but it's largely a legacy approach at this point.
7. Complete BERT-Style Embedding Layer #
Let's build a complete embedding layer that combines token embeddings with learned positional embeddings, exactly as BERT does it.
class BERTStyleEmbedding(nn.Module):
"""
Complete BERT-style embedding layer with:
- Token embeddings
- Learned positional embeddings
- (Optional) Segment/type embeddings
- Layer normalization
- Dropout
"""
def __init__(
self,
vocab_size: int,
d_model: int,
max_len: int = 512,
dropout: float = 0.1,
use_segment_embeddings: bool = False,
n_segments: int = 2
):
super().__init__()
# Token embeddings
self.token_embeddings = nn.Embedding(vocab_size, d_model)
# Position embeddings
self.position_embeddings = nn.Embedding(max_len, d_model)
# Segment embeddings (optional, for sentence pair tasks)
self.use_segment_embeddings = use_segment_embeddings
if use_segment_embeddings:
self.segment_embeddings = nn.Embedding(n_segments, d_model)
# Layer norm and dropout
self.layer_norm = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
# Position IDs buffer
self.register_buffer(
'position_ids',
torch.arange(max_len).unsqueeze(0)
)
def forward(
self,
input_ids: torch.Tensor,
segment_ids: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""
Args:
input_ids: Token indices, shape (batch, seq_len)
segment_ids: Segment indices, shape (batch, seq_len)
Returns:
Combined embeddings, shape (batch, seq_len, d_model)
"""
seq_len = input_ids.size(1)
# Get embeddings
token_emb = self.token_embeddings(input_ids)
position_emb = self.position_embeddings(self.position_ids[:, :seq_len])
# Combine
embeddings = token_emb + position_emb
if self.use_segment_embeddings and segment_ids is not None:
segment_emb = self.segment_embeddings(segment_ids)
embeddings = embeddings + segment_emb
# Normalize and dropout
embeddings = self.layer_norm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
# Test
bert_embedding = BERTStyleEmbedding(
vocab_size=30000,
d_model=768,
max_len=512
)
# Simulate input token IDs
input_ids = torch.randint(0, 30000, (2, 128))
output = bert_embedding(input_ids)
print(f"Input shape: {input_ids.shape}")
print(f"Output shape: {output.shape}")
print(f"\nTotal embedding parameters:")
print(f" Token embeddings: {30000 * 768:,}")
print(f" Position embeddings: {512 * 768:,}")
print(f" Total: {30000 * 768 + 512 * 768:,}")
print("\nโ BERT-style embedding layer working!")
Input shape: torch.Size([2, 128])
Output shape: torch.Size([2, 128, 768])
Total embedding parameters:
Token embeddings: 23,040,000
Position embeddings: 393,216
Total: 23,433,216
✓ BERT-style embedding layer working!
8. Summary and Key Takeaways #
What We Learned #
- Learned PE is Simple: Just an `nn.Embedding` table
  - Position → learnable vector
  - Added to the token embedding
- Trade-offs:

  | Pro | Con |
  |---|---|
  | Flexible, adapts to task | Fixed max length |
  | Simple implementation | Cannot extrapolate |
  | Proven in BERT, GPT-2 | More parameters |

- The Extrapolation Problem:
  - Cannot handle `seq_len > max_len`
  - This motivated RoPE and ALiBi
- Use When:
  - Fixed-length tasks (classification, NER)
  - Following BERT/GPT-2 architecture
  - Have large training data
Coming Up Next #
Part 3: RoPE (Rotary Position Embeddings)
- The rotation trick that enables unlimited sequence length
- Why LLaMA, Mistral, and GPT-4 use RoPE
- Elegant math: complex numbers and rotation matrices
References #
- Devlin, J., et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers."
- Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners." (GPT-2; learned positional embeddings with max_len=1024)
Last updated: January 2026
John Robinson © 2022-2025