Learning Position: Trainable Embeddings from BERT to GPT
by John Robinson @johnrobinsn

Part 2: Learned Positional Embeddings #
Why Transformers Need Help Knowing Where | A Six-Part Series
Series Navigation:
- Part 1: The Position Problem & Sinusoidal PE
- Part 2: Learned Positional Embeddings (you are here)
- Part 3: RoPE (Rotary Position Embeddings) - Coming Soon
- Part 4: ALiBi (Attention with Linear Biases) - Coming Soon
- Part 5: PoPE (Polar Coordinate Embeddings) - Coming Soon
- Part 6: Practitioner's Guide - Coming Soon
Introduction #
In Part 1, we explored sinusoidal positional encodings, a fixed, mathematical approach to encoding position. But what if we just... let the model figure it out?
Learned positional embeddings take a radically simple approach: create a trainable lookup table where each position gets its own learnable vector. This is the approach used by BERT, GPT-2, and many other influential models.
This notebook covers:
- How learned PE works (spoiler: it's just `nn.Embedding`)
- Implementation and visualization
- The critical extrapolation limitation
- When to choose learned vs. fixed encodings
Paper: BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018)
# Install dependencies (run once)
!pip install -q numpy matplotlib seaborn torch
# Core libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple
# Set visualization styles
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = [10, 6]
# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
print("โ All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
✓ All libraries imported successfully!
PyTorch version: 2.9.1+cu128
1. The Simplest Approach: Just Learn It #
Motivation #
Instead of designing a clever mathematical function to encode position, learned positional embeddings ask: Why not let the model learn the best way to represent each position?
⚠️ Key Clarification: What's Given vs. What's Learned #
This is a common point of confusion, so let's be precise:
| Component | Status | Explanation |
|---|---|---|
| Position index (0, 1, 2, ...) | Explicitly provided | The model is told "this token is at position 5"; it doesn't discover this from context |
| Embedding vector for each position | Learned via backpropagation | The model learns what values best represent "position 5" |
The position index IS the ground truth signal. When we process a sequence, we don't ask the model to figure out which position each token occupies; we explicitly look up row 0 for position 0, row 5 for position 5, etc. This mapping is hardwired.
What's learned is HOW to represent each position: the actual d-dimensional vector of floats that gets added to the token embedding.
Analogy: Think of a classroom seating chart. The seat numbers (positions 0, 1, 2...) are fixed and labeled on each desk. What's "learned" over time is the personality/reputation associated with each seat ("the front row is for eager students"). The model learns useful representations, but it's never confused about which seat is which.
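To make the split concrete, here is a minimal sketch (toy sizes; variable names like `pe_table` are illustrative, not from any library): the embedding table's weight matrix is the learned part, while the position indices are plain integers we supply.

```python
import torch
import torch.nn as nn

d_model, max_len = 8, 16

pe_table = nn.Embedding(max_len, d_model)    # LEARNED: this weight matrix is updated by backprop
position_ids = torch.arange(6).unsqueeze(0)  # GIVEN: indices 0..5 are supplied, never inferred

pos_vectors = pe_table(position_ids)         # look up one row per known position
print(pos_vectors.shape)                     # torch.Size([1, 6, 8])
print(pe_table.weight.requires_grad)         # True  -> a trainable representation
print(position_ids.requires_grad)            # False -> just an index, not a parameter
```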
How It Works #
- Create a lookup table (matrix) with one row per position, randomly initialized
- During training, use the known position index to look up the corresponding row
- Backpropagation updates the embedding values to minimize loss
The lookup-and-add step is simply `output = token_embedding + position_embedding[i]`, where `position_embedding[i]` means: "look up row `i` of the embedding table."
Mathematical Formulation #
Given:
- Maximum sequence length $L_{\max}$
- Embedding dimension $d_{\text{model}}$

The position embedding matrix $P \in \mathbb{R}^{L_{\max} \times d_{\text{model}}}$ is randomly initialized and trained by backpropagation like any other weight.

For a token at position $i$ with token embedding $x_i$, the input to the first transformer layer is $x_i + P_i$, where $P_i$ denotes row $i$ of $P$.
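As a quick sanity check on this formulation, here is a minimal sketch in raw PyTorch (toy sizes, no `nn.Module` wrapper; `P`, `x_i`, and `h_i` are just the symbols above written as tensors):

```python
import torch

L_max, d_model = 8, 4
P = torch.randn(L_max, d_model, requires_grad=True)  # the learnable table P

i = 5                        # position index: given, not learned
x_i = torch.randn(d_model)   # token embedding of the token sitting at position i
h_i = x_i + P[i]             # input to the first transformer layer

print(h_i.shape)  # torch.Size([4])
```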
What This Is NOT #
To further clarify, learned positional embeddings are not:
- Learning to infer position from surrounding context
- Discovering position through attention patterns
- Figuring out "where am I?" from the content
Those would be fascinating (and much harder!) problems. Instead, this is simply: "Given that I know this token is at position `i`, what vector best represents that position?"
2. Implementation #
Let's implement learned positional embeddings as a PyTorch module.
class LearnedPositionalEmbedding(nn.Module):
"""
Learned positional embeddings as used in BERT and GPT-2.
Each position (0, 1, 2, ..., max_len-1) has its own trainable
embedding vector of dimension d_model.
Args:
d_model: Dimension of the embeddings
max_len: Maximum sequence length
dropout: Dropout probability
"""
def __init__(self, d_model: int, max_len: int = 512, dropout: float = 0.1):
super().__init__()
self.dropout = nn.Dropout(p=dropout)
# The key component: a learnable embedding table
# Shape: (max_len, d_model)
self.position_embeddings = nn.Embedding(max_len, d_model)
# Register position indices as a buffer (not a parameter)
# This avoids creating new tensors on every forward pass
self.register_buffer(
'position_ids',
torch.arange(max_len).unsqueeze(0) # Shape: (1, max_len)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Add learned positional embeddings to input.
Args:
x: Input tensor of shape (batch_size, seq_len, d_model)
Returns:
Tensor with positional embeddings added
"""
seq_len = x.size(1)
# Get position IDs for current sequence length
position_ids = self.position_ids[:, :seq_len] # Shape: (1, seq_len)
# Look up position embeddings
position_embeds = self.position_embeddings(position_ids) # (1, seq_len, d_model)
# Add to input (broadcasts across batch dimension)
x = x + position_embeds
return self.dropout(x)
# Test the module
batch_size = 2
seq_len = 10
d_model = 64
learned_pe = LearnedPositionalEmbedding(d_model, max_len=512)
x = torch.randn(batch_size, seq_len, d_model)
output = learned_pe(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"\nPosition embedding table shape: {learned_pe.position_embeddings.weight.shape}")
print(f" โ {512} positions ร {d_model} dimensions")
print(f"\nTrainable parameters: {learned_pe.position_embeddings.weight.numel():,}")
print("\nโ LearnedPositionalEmbedding module working!")
Input shape: torch.Size([2, 10, 64])
Output shape: torch.Size([2, 10, 64])
Position embedding table shape: torch.Size([512, 64])
→ 512 positions × 64 dimensions
Trainable parameters: 32,768
✓ LearnedPositionalEmbedding module working!
3. What Do Learned Embeddings Look Like After Training? #
At initialization, learned positional embeddings are just random noise; there's no meaningful structure. However, after training on large corpora, learned embeddings typically develop smooth, structured patterns that often resemble sinusoidal encodings.
Research has shown that:
- Nearby positions develop similar embeddings: the model learns that position 5 should be more similar to position 6 than to position 100
- Smooth gradients emerge: embeddings change gradually across positions rather than randomly
- Task-specific patterns appear: unlike fixed sinusoidal PE, learned embeddings can adapt to the specific positional patterns useful for the training task
Empirical finding: When visualized as heatmaps, trained learned embeddings often show wave-like patterns similar to sinusoidal PE, suggesting the model "rediscovers" useful mathematical structure through gradient descent.
This convergence to structured patterns provides some validation that sinusoidal encodings were a reasonable design choice, but learned embeddings have the flexibility to deviate when beneficial for the task.
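If you want to see this for yourself, the sketch below pulls BERT's trained position embedding table and plots pairwise cosine similarities. It assumes the Hugging Face `transformers` package is installed and can download the `bert-base-uncased` checkpoint, neither of which is part of this notebook's setup above.

```python
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
pe = bert.embeddings.position_embeddings.weight.detach()  # shape: (512, 768)

# Cosine similarity between every pair of position vectors
pe_norm = F.normalize(pe, dim=-1)
similarity = (pe_norm @ pe_norm.T).numpy()  # shape: (512, 512)

plt.figure(figsize=(6, 5))
sns.heatmap(similarity, cmap="viridis")
plt.title("Cosine similarity of BERT's trained position embeddings")
plt.xlabel("Position")
plt.ylabel("Position")
plt.show()
```

A bright band along the diagonal (nearby positions more similar than distant ones) is the smooth structure described above.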
4. The Parameter Cost: Pros, Cons, and Trade-offs #
Advantages #
| Aspect | Benefit |
|---|---|
| ✅ Flexibility | Can adapt to task-specific positional patterns |
| ✅ Simplicity | Very easy to implement (just nn.Embedding) |
| ✅ Proven | Used successfully in BERT, GPT-2, GPT-3 |
Disadvantages #
| Aspect | Drawback |
|---|---|
| ❌ Fixed Length | Cannot handle sequences longer than max_len |
| ❌ Extrapolation | Poor generalization to unseen positions |
| ❌ Parameters | Adds max_len × d_model trainable parameters |
| ❌ Data Hungry | Needs enough data to learn good representations |
The Parameter Overhead #
Unlike sinusoidal PE (which has zero trainable parameters), learned PE adds significant overhead: max_len × d_model extra trainable parameters.
These parameters consume GPU memory and must be trained; positions that appear rarely in training data may have poorly-learned representations. For long-context models (32k, 100k+ tokens), this becomes impractical: a 100k × 4096 table would add roughly 400M parameters just for positions. This is why modern long-context models use RoPE or ALiBi instead.
Parameter Count Comparison #
# Parameter count for different configurations
configs = [
("BERT-base", 512, 768),
("GPT-2 small", 1024, 768),
("GPT-2 medium", 1024, 1024),
("GPT-2 large", 1024, 1280),
]
print("=" * 60)
print("LEARNED PE PARAMETER COUNTS")
print("=" * 60)
print(f"\n{'Model':<15} {'Max Len':<10} {'d_model':<10} {'PE Params':<15}")
print("-" * 60)
for name, max_len, d_model in configs:
params = max_len * d_model
print(f"{name:<15} {max_len:<10} {d_model:<10} {params:,}")
print("-" * 60)
print("\nNote: Sinusoidal PE has 0 trainable parameters!")
print(" These PE parameters are ~0.4-1.3M per model.")
============================================================
LEARNED PE PARAMETER COUNTS
============================================================
Model Max Len d_model PE Params
------------------------------------------------------------
BERT-base 512 768 393,216
GPT-2 small 1024 768 786,432
GPT-2 medium 1024 1024 1,048,576
GPT-2 large 1024 1280 1,310,720
------------------------------------------------------------
Note: Sinusoidal PE has 0 trainable parameters!
These PE parameters are ~0.4-1.3M per model.
5. The Extrapolation Problem #
The most critical limitation of learned positional embeddings: they cannot handle sequences longer than max_len.
This is because the embedding table has a fixed size. Position 512 in a model trained with max_len=512 simply doesn't exist: there's no vector to look up!
# Demonstrate the extrapolation limitation
print("=" * 70)
print("LEARNED PE: EXTRAPOLATION LIMITATION")
print("=" * 70)
# Create module with max_len=100
learned_pe_short = LearnedPositionalEmbedding(d_model=64, max_len=100)
test_lengths = [50, 100, 101]
print(f"\nModule created with max_len=100")
print(f"\n{'Sequence Length':<20} {'Status':<30}")
print("-" * 70)
for length in test_lengths:
try:
x_test = torch.randn(1, length, 64)
_ = learned_pe_short(x_test)
print(f"{length:<20} โ Works")
except Exception as e:
error_type = type(e).__name__
print(f"{length:<20} โ FAILS ({error_type})")
print("\n" + "=" * 70)
print("CRITICAL: Sequences longer than max_len cause errors!")
print("=" * 70)
print("\nThis is why modern LLMs (2023+) prefer RoPE or ALiBi:")
print("โข RoPE: Computes rotation on-the-fly, no position limit")
print("โข ALiBi: Computes bias on-the-fly, no position limit")
======================================================================
LEARNED PE: EXTRAPOLATION LIMITATION
======================================================================
Module created with max_len=100
Sequence Length Status
----------------------------------------------------------------------
50                   ✓ Works
100                  ✓ Works
101                  ✗ FAILS (RuntimeError)
======================================================================
CRITICAL: Sequences longer than max_len cause errors!
======================================================================
This is why modern LLMs (2023+) prefer RoPE or ALiBi:
• RoPE: Computes rotation on-the-fly, no position limit
• ALiBi: Computes bias on-the-fly, no position limit
6. Should You Use Learned PE? (Probably Not for New Projects) #
The Honest Assessment #
For new projects in 2025+, there's little reason to choose learned positional embeddings over alternatives like RoPE:
| Approach | Parameters | Extrapolation | Complexity |
|---|---|---|---|
| Learned PE | max_len × d_model | ❌ Hard limit at max_len | Simple |
| Sinusoidal PE | 0 | ⚠️ Degrades beyond training | Simple |
| RoPE | 0 | ✅ Generalizes well | Moderate |
| ALiBi | 0 | ✅ Generalizes well | Simple |
RoPE has become the de facto standard for modern LLMs (LLaMA, Mistral, GPT-4) because it offers:
- Zero additional parameters
- No sequence length limit
- Strong empirical performance
When Learned PE Still Makes Sense #
- Fine-tuning existing models: If you're using BERT, GPT-2, or other pre-trained models that already use learned PE, you'd continue with that architecture
- Fixed-length classification: Tasks like sentiment analysis where you'll never exceed `max_len` and don't need extrapolation
- Educational purposes: `nn.Embedding` is trivially simple to understand compared to RoPE's rotation matrices
- Reproducing published results: When replicating papers that used learned PE
The Bottom Line #
Recommendation: Unless you have a specific reason to use learned PE (backwards compatibility, reproducing prior work), default to RoPE for new projects. It's what the field has converged on, and the extrapolation problem alone makes learned PE a risky choice for any application where sequence lengths might grow.
We include learned PE in this series because it's historically important (BERT, GPT-2) and conceptually illuminating, but it's largely a legacy approach at this point.
7. Complete BERT-Style Embedding Layer #
Let's build a complete embedding layer that combines token embeddings with learned positional embeddings, exactly as BERT does it.
class BERTStyleEmbedding(nn.Module):
"""
Complete BERT-style embedding layer with:
- Token embeddings
- Learned positional embeddings
- (Optional) Segment/type embeddings
- Layer normalization
- Dropout
"""
def __init__(
self,
vocab_size: int,
d_model: int,
max_len: int = 512,
dropout: float = 0.1,
use_segment_embeddings: bool = False,
n_segments: int = 2
):
super().__init__()
# Token embeddings
self.token_embeddings = nn.Embedding(vocab_size, d_model)
# Position embeddings
self.position_embeddings = nn.Embedding(max_len, d_model)
# Segment embeddings (optional, for sentence pair tasks)
self.use_segment_embeddings = use_segment_embeddings
if use_segment_embeddings:
self.segment_embeddings = nn.Embedding(n_segments, d_model)
# Layer norm and dropout
self.layer_norm = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
# Position IDs buffer
self.register_buffer(
'position_ids',
torch.arange(max_len).unsqueeze(0)
)
def forward(
self,
input_ids: torch.Tensor,
segment_ids: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""
Args:
input_ids: Token indices, shape (batch, seq_len)
segment_ids: Segment indices, shape (batch, seq_len)
Returns:
Combined embeddings, shape (batch, seq_len, d_model)
"""
seq_len = input_ids.size(1)
# Get embeddings
token_emb = self.token_embeddings(input_ids)
position_emb = self.position_embeddings(self.position_ids[:, :seq_len])
# Combine
embeddings = token_emb + position_emb
if self.use_segment_embeddings and segment_ids is not None:
segment_emb = self.segment_embeddings(segment_ids)
embeddings = embeddings + segment_emb
# Normalize and dropout
embeddings = self.layer_norm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
# Test
bert_embedding = BERTStyleEmbedding(
vocab_size=30000,
d_model=768,
max_len=512
)
# Simulate input token IDs
input_ids = torch.randint(0, 30000, (2, 128))
output = bert_embedding(input_ids)
print(f"Input shape: {input_ids.shape}")
print(f"Output shape: {output.shape}")
print(f"\nTotal embedding parameters:")
print(f" Token embeddings: {30000 * 768:,}")
print(f" Position embeddings: {512 * 768:,}")
print(f" Total: {30000 * 768 + 512 * 768:,}")
print("\nโ BERT-style embedding layer working!")
Input shape: torch.Size([2, 128])
Output shape: torch.Size([2, 128, 768])
Total embedding parameters:
Token embeddings: 23,040,000
Position embeddings: 393,216
Total: 23,433,216
✓ BERT-style embedding layer working!
8. Summary and Key Takeaways #
What We Learned #
- Learned PE is Simple: Just an `nn.Embedding` table
  - Position → learnable vector
  - Added to the token embedding
- Trade-offs:

  | Pro | Con |
  |---|---|
  | Flexible, adapts to task | Fixed max length |
  | Simple implementation | Cannot extrapolate |
  | Proven in BERT, GPT-2 | More parameters |

- The Extrapolation Problem:
  - Cannot handle `seq_len > max_len`
  - This motivated RoPE and ALiBi
- Use When:
  - Fixed-length tasks (classification, NER)
  - Following BERT/GPT-2 architecture
  - Have large training data
Coming Up Next #
Part 3: RoPE (Rotary Position Embeddings)
- The rotation trick that enables unlimited sequence length
- Why LLaMA, Mistral, and GPT-4 use RoPE
- Elegant math: complex numbers and rotation matrices
References #
- Devlin, J., et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers."
- Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners." (GPT-2; learned positional embeddings with max_len=1024)
Last updated: January 2026
John Robinson © 2022-2025