Stormin' The Castle

positional encoding · learned embeddings · transformers

Learning Position: Trainable Embeddings from BERT to GPT

by John Robinson @johnrobinsn

Part 2: Learned Positional Embeddings

Why Transformers Need Help Knowing Where | A Six-Part Series




Introduction

In Part 1, we explored sinusoidal positional encodings: a fixed, mathematical approach to encoding position. But what if we just... let the model figure it out?

Learned positional embeddings take a radically simple approach: create a trainable lookup table where each position gets its own learnable vector. This is the approach used by BERT, GPT-2, and many other influential models.

This notebook covers:

  1. How learned PE works (spoiler: it's just nn.Embedding)
  2. Implementation and visualization
  3. The critical extrapolation limitation
  4. When to choose learned vs. fixed encodings

📄 Paper: BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018)

# Install dependencies (run once)
!pip install -q numpy matplotlib seaborn torch
# Core libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple

# Set visualization styles
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = [10, 6]

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("โœ“ All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")

✓ All libraries imported successfully!
PyTorch version: 2.9.1+cu128

1. The Simplest Approach: Just Learn It

Motivation

Instead of designing a clever mathematical function to encode position, learned positional embeddings ask: Why not let the model learn the best way to represent each position?

โš ๏ธ Key Clarification: What's Given vs. What's Learned

This is a common point of confusion, so let's be precise:

| Component | Status | Explanation |
|---|---|---|
| Position index (0, 1, 2, ...) | Explicitly provided | The model is told "this token is at position 5"; it doesn't discover this from context |
| Embedding vector for each position | Learned via backpropagation | The model learns what values best represent "position 5" |

The position index IS the ground truth signal. When we process a sequence, we don't ask the model to figure out which position each token occupies; we explicitly look up row 0 for position 0, row 5 for position 5, etc. This mapping is hardwired.

What's learned is HOW to represent each position: the actual d-dimensional vector of floats that gets added to the token embedding.

🔑 Analogy: Think of a classroom seating chart. The seat numbers (positions 0, 1, 2...) are fixed and labeled on each desk. What's "learned" over time is the personality/reputation associated with each seat ("the front row is for eager students"). The model learns useful representations, but it's never confused about which seat is which.

How It Works

  1. Create a lookup table (matrix) with one row per position, randomly initialized
  2. During training, use the known position index to look up the corresponding row
  3. Backpropagation updates the embedding values to minimize loss

$$\text{output}_i = \text{token\_embedding}_i + \text{position\_embedding}[i]$$

Where position_embedding[i] means: "look up row $i$ in the embedding table"; the index $i$ is known, the values in that row are learned.
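
To make steps 1-3 concrete, here is a minimal standalone sketch (toy sizes chosen for illustration, separate from the module we build in Section 2) of the lookup-and-add mechanics:

# Toy sketch of the lookup-and-add mechanics described above
import torch
import torch.nn as nn

d_model, max_len = 8, 16
pos_table = nn.Embedding(max_len, d_model)   # Step 1: randomly initialized lookup table

token_embeddings = torch.randn(5, d_model)   # stand-in embeddings for a 5-token sequence
positions = torch.arange(5)                  # Step 2: position indices are given, not inferred

combined = token_embeddings + pos_table(positions)  # add position vectors to token embeddings
print(combined.shape)  # torch.Size([5, 8])

Step 3 (backpropagation updating the table's values) happens once this sum feeds into a model with a loss.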

Mathematical Formulation

Given a maximum sequence length $L$ and model dimension $d$, the position embedding matrix $P \in \mathbb{R}^{L \times d}$ is initialized randomly and optimized during training:

$$P = \begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_{L-1} \end{bmatrix}$$

For a token at position $i$:

$$e_i^{\text{final}} = e_i^{\text{token}} + p_i$$

What This Is NOT

To further clarify, learned positional embeddings are not:

• A mechanism for inferring a token's position from its content or surrounding context
• A way for the model to discover the ordering of a sequence without supervision

Those would be fascinating (and much harder!) problems. Instead, this is simply: "Given that I know this token is at position $i$, what's the best vector to represent that position?"
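
To see the "learned via backpropagation" half in isolation, the sketch below (a toy loss, not a real training objective) shows that a backward pass only produces gradients for the rows that were actually looked up:

# Toy demonstration: only the looked-up position rows receive gradients
import torch
import torch.nn as nn

pos_table = nn.Embedding(10, 4)        # 10 positions, 4-dimensional vectors
positions = torch.tensor([0, 1, 2])    # a 3-token sequence uses positions 0, 1, 2

loss = pos_table(positions).sum()      # stand-in for a real training loss
loss.backward()

print(pos_table.weight.grad[:4])
# Rows 0-2 carry gradient; row 3 (and every later row) is all zeros because
# those positions never appeared in this "batch".

Positions that rarely occur in training therefore receive few updates, which is part of the "data hungry" drawback discussed in Section 4.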

2. Implementation

Let's implement learned positional embeddings as a PyTorch module.

class LearnedPositionalEmbedding(nn.Module):
    """
    Learned positional embeddings as used in BERT and GPT-2.

    Each position (0, 1, 2, ..., max_len-1) has its own trainable
    embedding vector of dimension d_model.

    Args:
        d_model: Dimension of the embeddings
        max_len: Maximum sequence length
        dropout: Dropout probability
    """

    def __init__(self, d_model: int, max_len: int = 512, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # The key component: a learnable embedding table
        # Shape: (max_len, d_model)
        self.position_embeddings = nn.Embedding(max_len, d_model)

        # Register position indices as a buffer (not a parameter)
        # This avoids creating new tensors on every forward pass
        self.register_buffer(
            'position_ids',
            torch.arange(max_len).unsqueeze(0)  # Shape: (1, max_len)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Add learned positional embeddings to input.

        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model)

        Returns:
            Tensor with positional embeddings added
        """
        seq_len = x.size(1)

        # Get position IDs for current sequence length
        position_ids = self.position_ids[:, :seq_len]  # Shape: (1, seq_len)

        # Look up position embeddings
        position_embeds = self.position_embeddings(position_ids)  # (1, seq_len, d_model)

        # Add to input (broadcasts across batch dimension)
        x = x + position_embeds

        return self.dropout(x)


# Test the module
batch_size = 2
seq_len = 10
d_model = 64

learned_pe = LearnedPositionalEmbedding(d_model, max_len=512)
x = torch.randn(batch_size, seq_len, d_model)
output = learned_pe(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"\nPosition embedding table shape: {learned_pe.position_embeddings.weight.shape}")
print(f" โ†’ {512} positions ร— {d_model} dimensions")
print(f"\nTrainable parameters: {learned_pe.position_embeddings.weight.numel():,}")
print("\nโœ“ LearnedPositionalEmbedding module working!")

Input shape: torch.Size([2, 10, 64])
Output shape: torch.Size([2, 10, 64])

Position embedding table shape: torch.Size([512, 64])
 → 512 positions × 64 dimensions

Trainable parameters: 32,768

✓ LearnedPositionalEmbedding module working!

3. What Do Learned Embeddings Look Like After Training?

At initialization, learned positional embeddings are just random noise; there's no meaningful structure. However, after training on large corpora, learned embeddings typically develop smooth, structured patterns that often resemble sinusoidal encodings.

Research examining trained checkpoints has repeatedly found such structure:

Empirical finding: When visualized as heatmaps, trained learned embeddings often show wave-like patterns similar to sinusoidal PE, suggesting the model "rediscovers" useful mathematical structure through gradient descent.

This convergence to structured patterns provides some validation that sinusoidal encodings were a reasonable design choice, but learned embeddings have the flexibility to deviate when beneficial for the task.
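
If you want to check this yourself, the sketch below loads BERT's trained position-embedding table via the Hugging Face transformers package (an extra dependency, not installed in the setup cell above) and plots position-to-position cosine similarity; trained tables typically show smooth, banded structure around the diagonal, meaning nearby positions get similar vectors:

# Sketch: inspect BERT's trained position embeddings
# Assumes `pip install transformers` has been run
import torch
import matplotlib.pyplot as plt
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
pe = model.embeddings.position_embeddings.weight.detach()  # shape: (512, 768)

# Cosine similarity between every pair of positions
pe_norm = pe / pe.norm(dim=-1, keepdim=True)
sim = pe_norm @ pe_norm.T                                   # shape: (512, 512)

plt.imshow(sim.numpy(), cmap="viridis")
plt.colorbar(label="cosine similarity")
plt.xlabel("Position")
plt.ylabel("Position")
plt.title("BERT learned position embeddings: pairwise cosine similarity")
plt.show()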

4. The Parameter Cost: Pros, Cons, and Trade-offs

Advantages

| Aspect | Benefit |
|---|---|
| ✅ Flexibility | Can adapt to task-specific positional patterns |
| ✅ Simplicity | Very easy to implement (just nn.Embedding) |
| ✅ Proven | Used successfully in BERT, GPT-2, GPT-3 |

Disadvantages

| Aspect | Drawback |
|---|---|
| ❌ Fixed Length | Cannot handle sequences longer than max_len |
| ❌ Extrapolation | Poor generalization to unseen positions |
| ❌ Parameters | Adds $L \times d$ trainable parameters |
| ❌ Data Hungry | Needs enough data to learn good representations |

The Parameter Overhead

Unlike sinusoidal PE (which has zero trainable parameters), learned PE adds significant overhead:

$$\text{PE Parameters} = \text{max\_len} \times d_{\text{model}}$$

These parameters consume GPU memory and must be trained; positions that appear rarely in training data may have poorly learned representations. For long-context models (32k, 100k+ tokens), this becomes impractical: a 100k × 4096 table would add roughly 400M parameters just for positions. This is why modern long-context models use RoPE or ALiBi instead.
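
To put a number on that long-context example (a hypothetical 100k-position, 4096-dimensional table):

# Back-of-the-envelope cost of a learned PE table for a long-context model
max_len, d_model = 100_000, 4096
params = max_len * d_model
print(f"PE parameters: {params:,}")                 # 409,600,000 (~400M)
print(f"fp16 memory:   {params * 2 / 1e9:.2f} GB")  # ~0.82 GB just for position vectors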

Parameter Count Comparison

# Parameter count for different configurations
configs = [
("BERT-base", 512, 768),
("GPT-2 small", 1024, 768),
("GPT-2 medium", 1024, 1024),
("GPT-2 large", 1024, 1280),
]

print("=" * 60)
print("LEARNED PE PARAMETER COUNTS")
print("=" * 60)
print(f"\n{'Model':<15} {'Max Len':<10} {'d_model':<10} {'PE Params':<15}")
print("-" * 60)

for name, max_len, d_model in configs:
    params = max_len * d_model
    print(f"{name:<15} {max_len:<10} {d_model:<10} {params:,}")

print("-" * 60)
print("\nNote: Sinusoidal PE has 0 trainable parameters!")
print(" These PE parameters are ~0.4-1.3M per model.")

============================================================
LEARNED PE PARAMETER COUNTS
============================================================

Model Max Len d_model PE Params
------------------------------------------------------------
BERT-base 512 768 393,216
GPT-2 small 1024 768 786,432
GPT-2 medium 1024 1024 1,048,576
GPT-2 large 1024 1280 1,310,720
------------------------------------------------------------

Note: Sinusoidal PE has 0 trainable parameters!
These PE parameters are ~0.4-1.3M per model.

5. The Extrapolation Problem

The most critical limitation of learned positional embeddings: they cannot handle sequences longer than max_len.

This is because the embedding table has a fixed size. Position 512 in a model trained with max_len=512 simply doesn't exist; there's no vector to look up!

# Demonstrate the extrapolation limitation
print("=" * 70)
print("LEARNED PE: EXTRAPOLATION LIMITATION")
print("=" * 70)

# Create module with max_len=100
learned_pe_short = LearnedPositionalEmbedding(d_model=64, max_len=100)

test_lengths = [50, 100, 101]

print(f"\nModule created with max_len=100")
print(f"\n{'Sequence Length':<20} {'Status':<30}")
print("-" * 70)

for length in test_lengths:
    try:
        x_test = torch.randn(1, length, 64)
        _ = learned_pe_short(x_test)
        print(f"{length:<20} ✓ Works")
    except Exception as e:
        error_type = type(e).__name__
        print(f"{length:<20} ✗ FAILS ({error_type})")

print("\n" + "=" * 70)
print("CRITICAL: Sequences longer than max_len cause errors!")
print("=" * 70)
print("\nThis is why modern LLMs (2023+) prefer RoPE or ALiBi:")
print("โ€ข RoPE: Computes rotation on-the-fly, no position limit")
print("โ€ข ALiBi: Computes bias on-the-fly, no position limit")

======================================================================
LEARNED PE: EXTRAPOLATION LIMITATION
======================================================================

Module created with max_len=100

Sequence Length Status
----------------------------------------------------------------------
50 ✓ Works
100 ✓ Works
101 ✗ FAILS (RuntimeError)

======================================================================
CRITICAL: Sequences longer than max_len cause errors!
======================================================================

This is why modern LLMs (2023+) prefer RoPE or ALiBi:
• RoPE: Computes rotation on-the-fly, no position limit
• ALiBi: Computes bias on-the-fly, no position limit

6. Should You Use Learned PE? (Probably Not for New Projects)

The Honest Assessment

For new projects in 2025+, there's little reason to choose learned positional embeddings over alternatives like RoPE:

| Approach | Parameters | Extrapolation | Complexity |
|---|---|---|---|
| Learned PE | $L \times d$ | ❌ Hard limit at max_len | Simple |
| Sinusoidal PE | 0 | ⚠️ Degrades beyond training | Simple |
| RoPE | 0 | ✅ Generalizes well | Moderate |
| ALiBi | 0 | ✅ Generalizes well | Simple |

RoPE has become the de facto standard for modern LLMs (LLaMA, Mistral, and most other recent open models) because it offers:

• Zero additional trainable parameters
• Relative-position information injected directly into the attention computation
• Better behavior than learned PE on sequences longer than those seen in training

When Learned PE Still Makes Sense

  1. Fine-tuning existing models: If you're using BERT, GPT-2, or other pre-trained models that already use learned PE, you'd continue with that architecture

  2. Fixed-length classification: Tasks like sentiment analysis where you'll never exceed max_len and don't need extrapolation

  3. Educational purposes: nn.Embedding is trivially simple to understand compared to RoPE's rotation matrices

  4. Reproducing published results: When replicating papers that used learned PE

The Bottom Line

🎯 Recommendation: Unless you have a specific reason to use learned PE (backwards compatibility, reproducing prior work), default to RoPE for new projects. It's what the field has converged on, and the extrapolation problem alone makes learned PE a risky choice for any application where sequence lengths might grow.

We include learned PE in this series because it's historically important (BERT, GPT-2) and conceptually illuminating, but it's largely a legacy approach at this point.

7. Complete BERT-Style Embedding Layer

Let's build a complete embedding layer that combines token embeddings with learned positional embeddings, exactly as BERT does it.

class BERTStyleEmbedding(nn.Module):
    """
    Complete BERT-style embedding layer with:
    - Token embeddings
    - Learned positional embeddings
    - (Optional) Segment/type embeddings
    - Layer normalization
    - Dropout
    """

    def __init__(
        self,
        vocab_size: int,
        d_model: int,
        max_len: int = 512,
        dropout: float = 0.1,
        use_segment_embeddings: bool = False,
        n_segments: int = 2
    ):
        super().__init__()

        # Token embeddings
        self.token_embeddings = nn.Embedding(vocab_size, d_model)

        # Position embeddings
        self.position_embeddings = nn.Embedding(max_len, d_model)

        # Segment embeddings (optional, for sentence pair tasks)
        self.use_segment_embeddings = use_segment_embeddings
        if use_segment_embeddings:
            self.segment_embeddings = nn.Embedding(n_segments, d_model)

        # Layer norm and dropout
        self.layer_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

        # Position IDs buffer
        self.register_buffer(
            'position_ids',
            torch.arange(max_len).unsqueeze(0)
        )

    def forward(
        self,
        input_ids: torch.Tensor,
        segment_ids: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Args:
            input_ids: Token indices, shape (batch, seq_len)
            segment_ids: Segment indices, shape (batch, seq_len)

        Returns:
            Combined embeddings, shape (batch, seq_len, d_model)
        """
        seq_len = input_ids.size(1)

        # Get embeddings
        token_emb = self.token_embeddings(input_ids)
        position_emb = self.position_embeddings(self.position_ids[:, :seq_len])

        # Combine
        embeddings = token_emb + position_emb

        if self.use_segment_embeddings and segment_ids is not None:
            segment_emb = self.segment_embeddings(segment_ids)
            embeddings = embeddings + segment_emb

        # Normalize and dropout
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)

        return embeddings


# Test
bert_embedding = BERTStyleEmbedding(
vocab_size=30000,
d_model=768,
max_len=512
)

# Simulate input token IDs
input_ids = torch.randint(0, 30000, (2, 128))
output = bert_embedding(input_ids)

print(f"Input shape: {input_ids.shape}")
print(f"Output shape: {output.shape}")
print(f"\nTotal embedding parameters:")
print(f" Token embeddings: {30000 * 768:,}")
print(f" Position embeddings: {512 * 768:,}")
print(f" Total: {30000 * 768 + 512 * 768:,}")
print("\nโœ“ BERT-style embedding layer working!")

Input shape: torch.Size([2, 128])
Output shape: torch.Size([2, 128, 768])

Total embedding parameters:
Token embeddings: 23,040,000
Position embeddings: 393,216
Total: 23,433,216

✓ BERT-style embedding layer working!
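
The test above leaves segment embeddings disabled; here is a brief sketch (toy lengths, reusing the class defined above) of how the optional sentence-pair path would be exercised:

# Sketch: BERT-style embeddings with segment (token-type) embeddings enabled
pair_embedding = BERTStyleEmbedding(
    vocab_size=30000,
    d_model=768,
    max_len=512,
    use_segment_embeddings=True,
    n_segments=2
)

input_ids = torch.randint(0, 30000, (2, 16))
# First 10 tokens belong to sentence A (segment 0), the rest to sentence B (segment 1)
segment_ids = torch.cat([torch.zeros(2, 10, dtype=torch.long),
                         torch.ones(2, 6, dtype=torch.long)], dim=1)

output = pair_embedding(input_ids, segment_ids=segment_ids)
print(output.shape)  # torch.Size([2, 16, 768])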

8. Summary and Key Takeaways

What We Learned

  1. Learned PE is Simple: Just an nn.Embedding table

    • Position $i$ → learnable vector $p_i$
    • Added to token embedding
  2. Trade-offs:

    | Pro | Con |
    |---|---|
    | Flexible, adapts to task | Fixed max length |
    | Simple implementation | Cannot extrapolate |
    | Proven in BERT, GPT-2 | More parameters |
  3. The Extrapolation Problem:

    • Cannot handle seq_len > max_len
    • This motivated RoPE and ALiBi
  4. Use When:

    • Fixed-length tasks (classification, NER)
    • Following BERT/GPT-2 architecture
    • Have large training data

Coming Up Next

Part 3: RoPE (Rotary Position Embeddings)


References

  1. Devlin, J., et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers."

  2. Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners." (GPT-2)

    • Learned positional embeddings with max_len=1024

Last updated: January 2026



John Robinson © 2022-2025