GEOBPE – Teaching AI to Read Protein Shapes Like Words

By James Aspinwall, co-written by Alfred (your trusted AI agent) – February 26, 2026, 11:15

A new method called GEOBPE (also written GOBPE) takes the tokenization trick that makes language models work – breaking text into reusable subword pieces – and applies it to three-dimensional protein structures. The result is a compact, interpretable vocabulary of 3D geometric motifs that lets AI systems read, generate, and reason about protein shapes the way GPT reasons about sentences.

This matters because protein function depends on shape, not sequence. And until now, AI has been mostly illiterate when it comes to 3D geometry.


The Problem: Sequence Models Can’t See Shape

Current protein language models treat proteins as strings of amino acids – sequences of letters from a 20-character alphabet. They’re good at predicting which amino acid comes next, but they largely ignore the three-dimensional structure that actually determines what a protein does.

This is like understanding English by memorizing letter frequencies without ever learning what words mean. You can generate plausible-looking sequences, but you have no understanding of the underlying structure.

Existing attempts to tokenize 3D structures – mostly VQ-VAE approaches that map geometry to discrete codebook entries – have serious problems: the learned codes are opaque vectors with no biological meaning, reconstruction errors compound along the chain, and accuracy collapses on folds outside the training distribution.

The Core Idea: BPE for 3D Geometry

Byte Pair Encoding (BPE) is the tokenization algorithm behind GPT, Claude, and most modern language models. It starts with individual characters, finds the most frequent adjacent pair, merges them into a new token, and repeats. Letters become syllables, syllables become words, words become common phrases. The vocabulary builds from the bottom up based on what actually occurs in the data.
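The merge loop is simple enough to sketch in a few lines. Here is a minimal character-level BPE trainer – toy code for illustration, not any production tokenizer:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a list of strings (toy sketch)."""
    seqs = [list(s) for s in corpus]          # start from individual characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the whole corpus
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of that pair with the new, larger token
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

merges, seqs = bpe_train(["low", "lower", "lowest", "low"], num_merges=2)
```

After two merges, "l"+"o" becomes "lo" and "lo"+"w" becomes "low" – the vocabulary grows bottom-up from whatever pairs the data repeats most.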

GEOBPE applies this exact logic to protein backbones:

Atomic units: Instead of characters, the base units are individual backbone residues described by internal coordinates – bond lengths, bond angles, and dihedral (torsion) angles. Using internal coordinates rather than Cartesian (x, y, z) positions makes the representation invariant to rotation and translation. A helix is a helix regardless of which direction it’s pointing in space.

Merging: The algorithm scans large protein structure databases (PDB, ESMAtlas) and repeatedly merges the most frequent adjacent residue patterns into larger tokens. A two-residue pattern that appears 50,000 times becomes a single token. Then three-residue patterns. Then four. The vocabulary grows hierarchically, just like text BPE.

What emerges: The learned tokens naturally correspond to biologically meaningful motifs – alpha helices, beta turns, binding pockets, structural domains. Not because anyone told the algorithm about biology, but because these are the recurring patterns in the data. The “words” of protein geometry are real structural features, not arbitrary vectors.

This is the key difference from VQ-VAE tokenizers: GEOBPE tokens are interpretable. You can render “token 48” and see that it’s a specific helical segment. You can render “token 52” and see it’s a particular loop geometry. Scientists can literally read a protein as a sequence of functional geometric motifs.
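To see how the same loop transfers to geometry, imagine coarse-binning each residue's (phi, psi) dihedrals so that similar local geometries share a discrete symbol, then counting adjacent-symbol pairs exactly as text BPE does. The fixed-width grid below is a deliberate simplification (GEOBPE handles continuous variation with clustering, not a rigid grid), but it shows why recurring secondary structure surfaces as the most mergeable pairs:

```python
from collections import Counter

def bin_residue(phi, psi, width=60.0):
    """Map a (phi, psi) dihedral pair in degrees to a coarse grid cell."""
    return (int(phi // width), int(psi // width))

# Idealized toy backbone: four helix-like residues (phi ~ -63, psi ~ -45)
# followed by three extended, beta-like residues (phi ~ -123, psi ~ 130).
# Values are illustrative, not taken from any real structure.
backbone = [(-65, -45), (-63, -47), (-64, -44), (-62, -46),
            (-121, 128), (-125, 132), (-123, 131)]

symbols = [bin_residue(phi, psi) for phi, psi in backbone]
pairs = Counter(zip(symbols, symbols[1:]))
best, count = pairs.most_common(1)[0]   # helix-helix is the top pair
```

The helix residues all land in one cell and the extended residues in another, so the most frequent adjacent pair is helix-followed-by-helix – exactly the kind of pattern a geometric BPE would merge first.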

Solving Geometric Drift: Rigid Shapes, Flexible Joints

Real proteins are wet and wiggly. The same alpha helix appears slightly differently in every protein – small variations in angles, minor thermal fluctuations, different local environments. A tokenizer that demands exact geometric matches will either have an enormous vocabulary (one token per unique instance) or force everything into rigid templates that don’t quite fit.

GEOBPE handles this with a two-part strategy:

Clustering with medoids: Multiple instances of the same motif are clustered, and the representative for each cluster is a real example from the database (a medoid), not a synthetic average. This keeps tokens grounded in physically realistic geometry.
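Medoid selection is worth seeing concretely. In this toy sketch (illustrative, not the paper's clustering code), each motif instance is a flat vector of internal coordinates, and the representative is the real instance with the smallest total distance to the others. A mean of angle vectors could describe a geometry no protein ever adopts; the medoid is always one that was actually observed:

```python
import math

def medoid_index(instances):
    """Return the index of the instance closest to all the others."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    totals = [sum(dist(a, b) for b in instances) for a in instances]
    return totals.index(min(totals))

# Hypothetical motif instances: flat vectors of dihedral angles in degrees
motif_instances = [
    [-63.0, -42.0, -60.0, -45.0],  # helix-turn instance A
    [-58.0, -47.0, -61.0, -44.0],  # instance B, close to A
    [-90.0, -10.0, -70.0, -30.0],  # a distorted outlier
]
rep = medoid_index(motif_instances)  # instance A represents the cluster
```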

Glue angles with inverse kinematics: Here’s the clever part. When reconstructing a protein from tokens, each token’s internal shape stays rigid – that’s the “what.” But the angles at the junctions between tokens are optimized using differentiable inverse kinematics – that’s the “how they connect.”

Think of it like building with LEGO: each brick has a fixed shape, but you can rotate and angle the connection points to make the overall structure curve, twist, and reach the right endpoint. The optimization adjusts torsion angles at token boundaries so the reconstructed chain stays continuous and the end-to-end 3D position is correct.

This eliminates geometric drift. Instead of errors compounding along the chain, each junction is individually optimized to maintain global geometric accuracy.
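A toy stand-in makes the rigid-shapes-flexible-joints idea concrete. This is a planar chain with plain gradient descent, not the paper's differentiable inverse kinematics: the link lengths play the role of rigid tokens and never change, while only the joint ("glue") angles between them are optimized, which is enough to steer the chain's endpoint onto a target:

```python
import math

lengths = [1.0, 1.0, 1.0]        # rigid token shapes: fixed
theta = [0.2, 0.1, -0.1]         # glue angles at the junctions: free
target = (2.0, 1.0)              # required endpoint position

def endpoint(angles):
    """Forward kinematics: walk the chain and return its endpoint."""
    x = y = a = 0.0
    for L, t in zip(lengths, angles):
        a += t                    # joint angles accumulate along the chain
        x += L * math.cos(a)
        y += L * math.sin(a)
    return x, y

def grad(angles):
    """Analytic gradient of squared endpoint error w.r.t. each glue angle."""
    ex, ey = endpoint(angles)
    dx, dy = ex - target[0], ey - target[1]
    cums, a = [], 0.0
    for t in angles:
        a += t
        cums.append(a)
    g = []
    for j in range(len(angles)):
        # rotating joint j moves every link at or after it
        px = -sum(lengths[i] * math.sin(cums[i]) for i in range(j, len(angles)))
        py = sum(lengths[i] * math.cos(cums[i]) for i in range(j, len(angles)))
        g.append(2.0 * (dx * px + dy * py))
    return g

for _ in range(500):              # simple gradient descent on the glue angles
    theta = [t - 0.05 * gi for t, gi in zip(theta, grad(theta))]

ex, ey = endpoint(theta)
err = math.hypot(ex - target[0], ey - target[1])  # residual endpoint error
```

No link was deformed at any point – only the junctions moved – yet the endpoint lands on the target, which is the essence of avoiding drift at token boundaries.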

The Numbers

Compression: GEOBPE achieves more than 10x reduction in bits per residue compared to leading tokenizers like ProTken, while maintaining similar reconstruction accuracy. This isn’t just a storage optimization – it directly expands the effective context window. A model that previously could only process a single protein domain can now handle entire multi-protein complexes in one pass.
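The arithmetic behind compression claims like this is simple (the numbers below are illustrative, not taken from the paper): a token stream costs about log2(vocabulary size) bits per token, amortized over however many residues each token covers.

```python
import math

def bits_per_residue(vocab_size, residues_per_token):
    # Cost of one token, spread over the residues it encodes
    return math.log2(vocab_size) / residues_per_token

# Hypothetical numbers for illustration only:
vq_style = bits_per_residue(512, 1)       # one codebook entry per residue
motif_style = bits_per_residue(4096, 12)  # multi-residue motif tokens
ratio = vq_style / motif_style            # compression factor
```

A bigger vocabulary costs a few more bits per token, but each token covering many residues amortizes that cost dramatically – which is also why longer proteins fit in the same context window.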

Data efficiency: The model was trained on approximately 34,000 structures. Remarkably, training on just 1% of that set (roughly 340 structures) still matches the downstream accuracy of much larger traditional models. This implies the algorithm is learning geometric and physical rules, not memorizing a statistics table. It understands the grammar of protein geometry, not just the vocabulary.

Generalization: On out-of-distribution benchmarks (CAMEO, CASP14 – standard tests for novel protein folds), VQ-VAE models see error increase by approximately 6.4x. GEOBPE error stays near 1.0-1.1x. It handles folds it has never seen almost as well as folds it trained on. This is the difference between memorization and understanding.

Interpretability: Scientists Can Read the Tokens

Because each token is an explicit 3D shape, researchers can inspect what the model is doing in biological terms: rendering any token shows the specific backbone geometry it encodes.

This is a fundamental shift from “the model predicted this structure but we don’t know why” to “the model built this structure from these specific geometric motifs, and we can see that motif 23 corresponds to the active site.” Model decisions become auditable in biological terms.

Generative Design: 99% Designable Proteins

The team trained a small transformer (called SSLM) over GEOBPE tokens to generate new protein backbones from scratch. They then tested whether inverse folding algorithms could find amino acid sequences that would actually fold into those generated shapes – a metric called “designability.”
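The general recipe behind the metric can be sketched as follows. Everything here is a hypothetical stand-in: `inverse_fold` and `refold_rmsd` are placeholders for real models (e.g. a ProteinMPNN-style inverse folder plus a structure predictor), and the 2 Å cutoff is one common convention, not necessarily the paper's exact criterion.

```python
def inverse_fold(backbone):
    # Placeholder for a real inverse-folding model
    return "HYPOTHETICALSEQ"

def refold_rmsd(backbone, sequence):
    # Placeholder: fold `sequence`, align it to `backbone`, return RMSD.
    # In this toy, each candidate carries a fake precomputed value.
    return backbone["fake_rmsd"]

def designability(backbones, rmsd_cutoff=2.0):
    """Fraction of backbones whose designed sequence refolds close enough."""
    ok = sum(
        1 for b in backbones
        if refold_rmsd(b, inverse_fold(b)) < rmsd_cutoff
    )
    return ok / len(backbones)

# Toy candidates with fake refolding errors in angstroms
candidates = [{"fake_rmsd": r} for r in (0.8, 1.5, 3.2, 1.1)]
rate = designability(candidates)   # 3 of 4 clear the cutoff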

The results: 99% of GEOBPE-generated backbones are designable. Almost every proposed structure corresponds to a thermodynamically plausible protein that could exist in nature.

Compare this to older discrete generative models that produce more structurally diverse outputs but include many physically impossible structures – proteins that look interesting on screen but could never fold in a test tube.

This is the difference between creative and reliable. For drug design and synthetic biology, reliability matters more than creativity. You need proteins that will actually fold as intended. A 99% designability rate means you can generate candidates and be confident that nearly all of them are physically realizable.

Why This Matters Beyond Proteins

The paper presents GEOBPE as a bridge between discrete token-based computation and continuous physical systems. Language models are extraordinarily powerful, but they operate on discrete tokens. Physics is continuous. Finding the right tokenization – the right way to chop a continuous system into reusable discrete pieces – is the key to applying language model architectures to physical domains.

Proteins were the proving ground, but the principle extends further. The authors speculate that similar geometric tokenization could uncover hidden “languages” in other complex continuous systems.

For any candidate system, the answer depends on whether it has reusable structure at intermediate scales – not just atomic units and not just global behavior, but recurring motifs that combine in meaningful ways. Proteins do. Whether other systems do is an open question, but now there’s a method to find out.

The Practical Takeaway

For anyone building AI systems that interact with physical or biological data, GEOBPE demonstrates a principle worth internalizing: the tokenization matters as much as the model.

GPT didn’t succeed just because of the transformer architecture. It succeeded because BPE tokenization gave the transformer the right units to work with – not individual characters (too granular) and not whole words (too rigid), but subword pieces that capture the reusable structure of language.

GEOBPE does the same thing for 3D geometry. The right tokenization turns an intractable continuous problem into a tractable discrete one, while preserving the structure that makes the domain meaningful.

The model is small. The training data is small. The results are state-of-the-art. That’s what happens when you get the representation right.