Computers only understand numbers
Before tokens make sense, one thing has to click: at the hardware level, computers only process numbers. Text, images, audio—everything is ultimately ones and zeros.
AI is no different. At its core, an AI model is an enormous mathematical function. Feed it numbers, get numbers back. It might have tens of billions of parameters and process thousands of numbers in seconds, but the fundamental contract never changes: numbers in, numbers out.
So when you type "please review this code" into Claude, how does that string of text become numbers? What about images, audio, or video? That is the problem tokens solve.
What is a token?
The naive approach is one number per letter. A=1, B=2, C=3… That works, but letters carry almost no meaning on their own. h, e, l, p are meaningless in isolation. And cutting things that fine means "helping" takes 7 slots—your context window fills up fast.
The opposite extreme is one number per word. But a single English word can take many forms (help, helps, helped, helping, helper), and once you add every language, abbreviation, neologism, and typo, the vocabulary grows without bound.
Tokens land between those two extremes: meaningful chunks of text, each assigned an integer ID.
"helping" is not 7 letters and not one whole word—it's help + ing, two tokens. "tokenization" becomes token + ization.
Every AI model ships with a vocabulary, a lookup table mapping known tokens to integer IDs:
"hello" → 15496
"help" → 1037
"ing" → 278
"你" → 7979
"好" → 1131
GPT-4's vocabulary contains roughly 100,000 tokens (100,277 in its cl100k_base encoding). Every sentence you write gets looked up in this table, converted to a sequence of integer IDs, and only then does the model start "thinking." That conversion is called tokenization.
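As a rough illustration of the lookup step, the sketch below tokenizes a string by greedy longest-match against a tiny vocabulary. The IDs are the illustrative ones from the table above. The `tokenize` helper and the greedy strategy are assumptions made for this example; real BPE tokenizers apply learned merge rules instead.

```python
# Toy vocabulary using the illustrative IDs from the table above.
VOCAB = {"hello": 15496, "help": 1037, "ing": 278, "你": 7979, "好": 1131}


def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match lookup: at each position, take the longest known token."""
    ids = []
    pos = 0
    max_len = max(len(tok) for tok in vocab)
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += length
                break
        else:
            # No vocabulary entry covers this character (hypothetical failure mode
            # of this toy; real tokenizers fall back to bytes instead).
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


print(tokenize("helping", VOCAB))  # [1037, 278]
print(tokenize("你好", VOCAB))      # [7979, 1131]
```

The model never sees the string "helping", only the list `[1037, 278]`.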
Where token boundaries come from: BPE
Vocabularies are not designed by hand. They are learned from data using an algorithm called BPE (Byte Pair Encoding).
How BPE works
Step one: start from the smallest units. Split all text into individual letters or bytes.
help → h e l p
helped → h e l p e d
helping → h e l p i n g
helper → h e l p e r
Step two: find the most frequent adjacent pair and merge it. Count all neighboring character pairs, take the top one, and merge. If h e is most common, it becomes he:
he l p
he l p e d
he l p i n g
he l p e r
Step three: repeat. Keep finding the most frequent adjacent pair and merging it: he l becomes hel, then hel p becomes help. Repeat tens of thousands of times and you have a vocabulary of letters, common roots, and common words, typically spanning from tens of thousands to hundreds of thousands of tokens.
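The three steps can be sketched in a few lines of Python. This is a minimal toy trainer, assuming a corpus of four words and character-level starting units; the function names `train_bpe` and `merge_pair` are made up for this example, and production tokenizers work on bytes, weight pairs by word frequency, and handle ties and whitespace differently.

```python
from collections import Counter


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    corpus = [list(word) for word in words]  # step one: split into characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))  # count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # step two; ties go to the first pair seen
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]  # step three: repeat
    return merges


merges = train_bpe(["help", "helped", "helping", "helper"], num_merges=3)
print(merges)  # [('h', 'e'), ('he', 'l'), ('hel', 'p')]
```

After three merges, help has become a single symbol, which matches the walkthrough above.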
What BPE gives you
Common sequences become single tokens; rare ones get split:
"the" → 1 token (very high frequency)
"helping" → 2 tokens (help + ing)
"tokenization" → 2-3 tokens (varies by model)
"Anthropic" → 2-3 tokens (proper noun, frequency varies)
"Zyxqwvutsrp" → 8+ tokens (rare, falls back to individual letters)
The result: a bounded vocabulary that can still express any text. Unknown words get broken into smaller known pieces.
Why token efficiency varies by language and code
BPE learns from training data. The more data a language has, the more efficient its tokenization tends to be—and that difference has real consequences.
English
English is where BPE feels most at home. Common roots (help, tion, ing, un-) show up frequently and get learned as efficient tokens. One English token averages around 4 characters.
Chinese
Chinese characters individually carry substantial semantic weight. Earlier tokenizers often came close to one token per character, but newer large-vocabulary tokenizers (for example, o200k_base-class encodings) more often merge common multi-character words into single tokens. As a result, Chinese token efficiency has improved, and for many everyday sentences Chinese and English land in a similar token-count range, depending on the tokenizer and the wording.
Low-resource languages
Languages like Thai or Arabic that appear less often in training data end up with weaker BPE coverage. Many words fall back to byte-level splits: a single Thai word might cost 5–8 tokens to express what one English word says. The same passage can cost 3–4× more tokens in Thai, which means faster context window exhaustion and higher API costs. This is one reason AI models perform worse on low-resource languages: beyond the shortage of data, the tokenization itself puts them at a disadvantage from the start.
Code
The most wasteful parts of code are not the logic but the whitespace, indentation, and punctuation.
def calculate_average(numbers):
    return sum(numbers) / len(numbers)
Keywords like def and return are high-frequency and token-efficient. Older tokenizers often fragmented indentation heavily, while modern ones frequently merge common 4- or 8-space patterns into fewer tokens. Even then, identifiers like calculate_average may still split into calculate + _ + average, and punctuation remains a steady source of token cost. For developers: code-heavy prompts still need budget awareness.
Embeddings: how numbers carry meaning
We have text, we have token IDs. But the model does not actually compute with those integer IDs.
Integer IDs carry no semantic information. help = 1037, dog = 2891—the numerical distance between those IDs says nothing about whether the words are semantically related.
What the model actually uses are embedding vectors: each token ID maps to a high-dimensional vector of floating-point numbers.
"help" → ID 1037 → [0.23, -0.81, 0.45, 0.12, -0.33, ...] (768 numbers)
"assist" → ID 4567 → [0.21, -0.79, 0.47, 0.11, -0.31, ...] (768 numbers)
"dog" → ID 2891 → [-0.54, 0.33, -0.12, 0.88, 0.21, ...] (768 numbers)
help and assist have nearly identical vectors. dog is far away. This lookup table, token ID to vector, is called the Embedding Table, and it is one of the largest matrices in the model.
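One common way to quantify "nearly identical" and "far away" is cosine similarity. The sketch below applies it to the first five components of the illustrative vectors above. Truncating to five numbers is an assumption made for brevity, and the values are the article's examples, not output from a real model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First five components of the illustrative vectors above.
help_vec = [0.23, -0.81, 0.45, 0.12, -0.33]
assist_vec = [0.21, -0.79, 0.47, 0.11, -0.31]
dog_vec = [-0.54, 0.33, -0.12, 0.88, 0.21]

print(cosine_similarity(help_vec, assist_vec))  # very close to 1
print(cosine_similarity(help_vec, dog_vec))     # negative
```

By contrast, the integer IDs 1037 and 4567 carry no such signal. The similarity only becomes visible after the table lookup.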
Geometry in vector space
The most surprising property of embedding vectors is that geometric relationships map to semantic ones:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
That is not a coincidence. "Gender" forms a stable direction in vector space: the displacement from king to queen is geometrically parallel to the displacement from man to woman. The same structure appears elsewhere:
"Paris" - "France" + "Japan" ≈ "Tokyo"
"walk" - "present" + "past" ≈ "walked"
The vector space contains real semantic geometry. The model has not just learned where words appear; it has learned the relational network between them.
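The arithmetic can be tried on a toy scale. In the sketch below the vectors are hand-picked two-dimensional values (one axis for "royalty", one for "gender"), which is an assumption made so the result is easy to verify by hand. Learned embeddings have hundreds of dimensions, and the analogy only holds approximately.

```python
# Hand-picked 2-D vectors: [royalty, gender]. Purely illustrative, not learned.
VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def analogy(a: str, b: str, c: str) -> str:
    """Return the word nearest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # Exclude the three input words, as analogy evaluations usually do.
    candidates = (w for w in VECTORS if w not in (a, b, c))
    return min(
        candidates,
        key=lambda w: sum((t - v) ** 2 for t, v in zip(target, VECTORS[w])),
    )


print(analogy("king", "man", "woman"))  # queen
```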
Why vectors learn to carry meaning
Here is the natural follow-up question: how do 768 numbers end up encoding semantics? Did someone hand-design each dimension? Not at all. It emerges from training.
A simple analogy
Imagine describing fruit with two numbers: sweetness and sourness.
lemon → [1, 9]
apple → [7, 4]
watermelon → [8, 1]
grapefruit → [4, 8]
Lemon and grapefruit cluster together: both sour. Watermelon and apple cluster together: both sweet. Things that mean similar things end up close in vector space. Two dimensions can only capture sweet/sour. Language has vastly more semantic dimensions: part of speech, emotional tone, abstraction level, domain, relational structure. That is why you need 768 dimensions instead of 2.
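The clustering claim is easy to check by measuring distances between the four fruit vectors. The sketch uses plain Euclidean distance, which is one reasonable choice for this toy example; the helper name `closest_fruit` is invented here.

```python
import math

# [sweetness, sourness] values from the analogy above.
FRUIT = {
    "lemon": (1, 9),
    "apple": (7, 4),
    "watermelon": (8, 1),
    "grapefruit": (4, 8),
}


def closest_fruit(name: str) -> str:
    """Return the other fruit whose vector is nearest by Euclidean distance."""
    others = (f for f in FRUIT if f != name)
    return min(others, key=lambda f: math.dist(FRUIT[name], FRUIT[f]))


print(closest_fruit("lemon"))       # grapefruit
print(closest_fruit("watermelon"))  # apple
```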
How semantics emerges from training
At the start of training, the Embedding Table is random noise—help and dog have meaningless vectors. The training objective is simple: given the tokens so far, predict the next token.
The model sees "Please _____ me with this task" and needs to guess the blank. The correct answer is help, but it guesses dog. That is wrong. Backpropagation traces which numbers caused the error and nudges them.
What does that nudging do? Across many examples, the model learns that in the context "Please ___ me," help, assist, and support are all reasonable predictions, and dog is not. So it slowly pushes the vectors for help, assist, and support closer together, and pushes dog away.
Do that hundreds of billions of times on trillions of words, and vector space organizes itself into semantic geometry—not because anyone designed it, but because words with similar meanings naturally appear in similar contexts. The training process encodes that statistical regularity into the geometry of vectors.
Nobody told the model that "help" and "assist" are synonyms. It inferred that from trillions of words.
What is an Embedding Model?
Once you understand the Embedding Table, you might wonder what text-embedding-3-small is in the OpenAI API—or why Dify has a separate "Embedding Model" setting. Is that the same thing? Not quite.
The problem with Embedding Tables: they are static
An Embedding Table is a static lookup. bank always maps to the same vector, whether you mean a financial institution or a riverbank:
"I went to the bank to deposit money" → bank = financial institution
"The river bank was covered in mud" → bank = riverbank
A static table cannot distinguish these. Same word, same vector.
Embedding Models: context-aware vectors
An Embedding Model does one more thing after the table lookup: it runs an Attention mechanism that lets each token's vector shift based on the surrounding tokens.
In the first sentence, bank's vector gets pulled toward "financial institution" by deposit and money. In the second, it gets pulled toward "riverbank" by river and mud. The result: a static token vector transformed into a context-aware semantic vector.
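The pull described above can be sketched with a single, heavily simplified attention step. The assumptions are significant: the static vectors are hand-picked 2-D values ([finance, nature]), and the sketch skips the learned query, key, and value projections, the scaling, and the multiple heads and layers that real models use.

```python
import math

# Hand-picked 2-D static vectors: [finance, nature]. Illustrative only.
STATIC = {
    "bank": [0.5, 0.5],
    "deposit": [1.0, 0.0],
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
    "mud": [0.0, 1.0],
}


def contextual_vector(word: str, sentence: list[str]) -> list[float]:
    """One simplified attention step: softmax over dot products weights a sum of vectors."""
    query = STATIC[word]
    scores = [sum(q * k for q, k in zip(query, STATIC[tok])) for tok in sentence]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * STATIC[tok][d] for w, tok in zip(weights, sentence))
        for d in range(len(query))
    ]


money_bank = contextual_vector("bank", ["bank", "deposit", "money"])
river_bank = contextual_vector("bank", ["river", "bank", "mud"])
print(money_bank)  # finance component now larger than nature component
print(river_bank)  # nature component now larger than finance component
```

The same static entry for bank produces two different output vectors, depending only on its neighbors.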
Embedding Models as standalone products
The Embedding Model you configure in Dify is a specialized version of this: it outputs vectors only—no generated text. Its main use case is RAG (Retrieval-Augmented Generation):
- Upload a document; it gets split into passages
- Each passage runs through the Embedding Model to produce a vector representing its meaning
- Those vectors get stored in a vector database
- Your question also runs through the Embedding Model
- The database finds the passages whose vectors are closest to your question's vector (semantically similar content)
- Those passages get passed to the LLM to generate an answer
That is why RAG understands the relationship between your question and the document—it is not keyword matching, it is vector proximity in semantic space. For a full walkthrough of RAG in Python, see Retrieval-Augmented Generation: Concepts and a Python Walkthrough.
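The retrieval half of that pipeline can be sketched without any external service. One loud caveat: the `embed` function below is a word-count stand-in so the example runs on its own, which means it does reduce to word overlap. A real embedding model would be called at that point and would capture meaning rather than shared words. The passages and function names are invented for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector (assumption)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], top_k: int = 1) -> list[str]:
    """Embed the question and return the passages with the closest vectors."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:top_k]]


passages = [
    "tokens are chunks of text mapped to integer ids",
    "embeddings map each token id to a vector of floats",
    "lidar produces a point cloud of 3d coordinates",
]
index = [(p, embed(p)) for p in passages]  # plays the role of the vector database
context = retrieve("what does lidar produce", index)
print(context)  # ['lidar produces a point cloud of 3d coordinates']
```

In a full pipeline, `context` would be placed into the prompt sent to the LLM.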
Embedding Table
→ Static lookup. Token ID → base vector. Context-unaware.
Embedding Model (inside a large language model)
→ Runs Attention after the table lookup. Vectors shift with context.
Embedding Model (standalone product)
→ Compresses a full passage into one semantic vector. Used in RAG and similarity search.
Tokenizing code
Code differs from natural language in a fundamental way: its structure is rigid, and a single character difference can completely change what a program does.
Indentation is semantic
In natural language, whitespace is visual. In Python, indentation controls execution flow:
if x > 0:
    return x  # inside the if
return -x  # outside the if
Two return statements, different indentation, completely different meaning. From BPE's perspective, whitespace is just another character with no structural significance. Early code models frequently got indentation wrong; modern coding tokenizers learn common indentation patterns (four spaces, tabs) as single tokens.
Token boundaries and syntax boundaries do not align
camelCase and snake_case identifiers produce highly unpredictable BPE splits. Common API names (getUserById) might have been seen enough to tokenize efficiently, but your custom long names (calculateMonthlyRevenueByRegion) get shredded:
calculateAverageScore
→ calculate / Average / Score
or
→ calc / ulate / Average / Score
Same symbol, different meaning across languages
{} TypeScript → object literal, code block
{} Python → dict literal, set literal
{} CSS → style block
A general-purpose tokenizer sees { as a high-frequency character and nothing more. The semantic distinction has to be learned by Attention from context.
Syntax correctness is not the tokenizer's job
Tokenization just segments text and assigns IDs. Syntactic validity is a statistical pattern the model learns from large amounts of code. That is why AI sometimes generates code that is almost-correct-but-missing-a-bracket: the tokenizer has no syntax checker, and the model is doing statistical inference.
Tokenizing gene sequences
Here is something that surprises most people: gene sequences and natural language are remarkably similar at the tokenization level.
The basics
DNA is built from four bases: A (adenine), T (thymine), G (guanine), C (cytosine). A stretch of human DNA looks like this:
ATCGGCTATGCAATCGGCTATGCA...
Just a very long string of ATCG. The full human genome is around 3.2 billion bases.
Why text and genes are structurally similar
Text is a sequence of letters where the meaningful unit is a word. DNA is a sequence of bases where the meaningful unit is a functional fragment. Certain letter combinations are highly frequent in text (the, ing, tion); certain base combinations are highly frequent in genomes—and those combinations typically have biological function. Both are sequences over finite symbol sets, encoding information through combinations and arrangement.
How gene sequences get tokenized
Option 1: fixed-length k-mers. Split into fixed-length fragments, e.g., 6 bases at a time (6-mers):
ATCGGCTATGCA
→ ATCGGC / TATGCA
With 4 possible bases at each of 6 positions, there are 4⁶ = 4,096 possible 6-mers. That is the vocabulary size.
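A minimal sketch of the k-mer scheme, assuming non-overlapping windows and an alphabetical ordering of the vocabulary. Both are choices made for this example, and some genomic models use overlapping k-mers instead.

```python
from itertools import product


def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split into non-overlapping k-mers; a shorter trailing fragment is kept as-is."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


# Vocabulary: every possible string of length 6 over the four bases.
VOCAB = {"".join(p): idx for idx, p in enumerate(product("ACGT", repeat=6))}

tokens = kmer_tokenize("ATCGGCTATGCA")
ids = [VOCAB[t] for t in tokens]

print(tokens)      # ['ATCGGC', 'TATGCA']
print(len(VOCAB))  # 4096
```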
Option 2: run BPE directly. Same algorithm as text. Interestingly, the tokens BPE learns from genomic data often correspond to real biological functional units—not because anyone labeled them, but because functionally important fragments appear repeatedly and therefore have high frequency.
Embeddings work on genes too
Train a Transformer on gene tokens the same way you train it on text tokens:
- Functionally similar gene fragments cluster together in vector space
- Protein-coding regions and regulatory regions form distinct clusters
- The pattern mirrors text embeddings, where help and assist cluster and king - man + woman ≈ queen holds
Meta's ESM (Evolutionary Scale Modeling) applies this idea to protein sequences (built from 20 amino acids). AlphaFold's 3D protein structure prediction likewise relies on learned representations of sequences, although its inputs and architecture differ.
The key finding: the geometric relationships learned in vector space by a model trained on gene sequences closely match patterns biologists spent decades discovering through lab experiments—with no access to any biology textbook.
Text: letters → BPE tokens → Embeddings → linguistic meaning
Genes: ATCG → k-mers / BPE → Embeddings → biological meaning
Tokenizing self-driving sensor data
Self-driving pushes tokenization into a new dimension: the input is no longer symbols but continuous physical-world sensor data.
The sensor complexity
Camera → 30 frames/sec, typically 8-12 angles
LiDAR → dozens of sweeps/sec, 360° laser scan → 3D point cloud
Radar → detects speed and distance through rain and fog
GPS → current position
IMU → acceleration, angular velocity
Each modality has a completely different data format, requiring its own tokenization strategy.
The hard problem: 3D point clouds
Camera images can be sliced into patches. But LiDAR outputs a point cloud—tens of thousands of 3D coordinate points, each recording (x, y, z) plus reflectance intensity.
(1.2, 3.4, 0.1, 0.8)
(1.3, 3.5, 0.1, 0.9)
(5.7, 2.1, 1.4, 0.3)
... (100,000 points total)
These points have no fixed ordering and no grid structure, so the clean patch-slicing approach does not apply. Two solutions exist:
Voxelization: divide 3D space into a grid of small cubes. Average the points within each occupied cube into a single vector—that vector is the token for that spatial location.
Point tokens: PointNet-style models treat each 3D point directly as an input and learn a vector for each point with a shared network. Transformer-based successors then run Attention across the points so they influence each other.
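The voxelization option can be sketched as a group-and-average step. The 1-metre voxel size and the plain averaging are assumptions made for this example, the function name `voxelize` is made up, and published detectors typically replace the average with a small learned encoder per voxel.

```python
from collections import defaultdict


def voxelize(points, voxel_size=1.0):
    """Group (x, y, z, intensity) points by voxel and average each group."""
    buckets = defaultdict(list)
    for x, y, z, intensity in points:
        # Integer grid cell that contains this point.
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append((x, y, z, intensity))
    # One averaged 4-number vector per occupied voxel; empty voxels produce no token.
    return {
        key: tuple(sum(col) / len(col) for col in zip(*group))
        for key, group in buckets.items()
    }


points = [
    (1.2, 3.4, 0.1, 0.8),
    (1.3, 3.5, 0.1, 0.9),
    (5.7, 2.1, 1.4, 0.3),
]
tokens = voxelize(points)
print(sorted(tokens))  # [(1, 3, 0), (5, 2, 1)]
```

The first two points fall into the same cube and collapse into one token, so three points become two tokens. With 100,000 points, that reduction is what makes the sequence length manageable.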
Multimodal fusion
The core challenge is fusing tokens from all sensors. Camera sees color and texture—it recognizes traffic lights and faces. LiDAR gives precise 3D depth—it knows the object ahead is 8.3 m away and 1.7 m tall. Radar penetrates rain and fog when cameras fail.
The fusion happens by running all sensor tokens through the same Attention layers. A camera "pedestrian patch token" and a LiDAR "1.7 m object token" that correspond to the same 3D location will reinforce each other: "this location has a human-shaped silhouette and a height consistent with a person—confirmed pedestrian."
The time dimension
Understanding motion requires looking across time. The solution: feed multiple past frames of sensor data together, and let Attention operate across time:
t-2 point cloud tokens (200ms ago)
t-1 point cloud tokens (100ms ago)
t point cloud tokens (now)
↓
Cross-time Attention
↓
"pedestrian ahead moving right at 1.2 m/s"
Modern self-driving models (Tesla FSD, Waymo's latest architecture) also output tokens, called behavior tokens, letting the model explain its decisions in natural language and making it easier for engineers to debug.
Tokenizing 3D models
3D modeling extends the point cloud idea, but because 3D data comes in multiple formats, there are several distinct tokenization strategies.
3D data formats
Point Cloud → a set of 3D coordinate points with no connectivity
Mesh → vertices + triangular faces defining a surface
Voxel → 3D pixels, space divided into small cubes
Implicit (NeRF / SDF) → a math function that describes occupancy at any point
Mesh tokens
Mesh is the most common format in tools like Blender and Maya: vertices and triangular faces that define a surface. Serializing a mesh into tokens means flattening the vertex coordinates and face connectivity into a sequence. MeshGPT takes this approach, treating mesh as a "language" and training a Transformer to generate valid vertex and face sequences.
Vertex list:
V1 = (0, 0, 1)
V2 = (0.7, 0, 0.7)
Face list:
F1 = (V1, V2, V3) ← one triangle from three vertices
Generating a 3D model becomes: predict the next vertex coordinate or the next face connection, the exact same mechanism as predicting the next text token.
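One simple way to turn a mesh into an integer sequence is to quantize each coordinate into a fixed number of bins and write the faces out one after another. The sketch below does that under several assumptions: coordinates lie in [-1, 1], 128 bins are used, and the third vertex is invented to complete the triangle. Published systems differ in how they order faces and in how they learn their token vocabulary.

```python
def quantize(value: float, bins: int = 128, lo: float = -1.0, hi: float = 1.0) -> int:
    """Map a coordinate in [lo, hi] to an integer bin in [0, bins - 1]."""
    clamped = min(max(value, lo), hi)
    return min(int((clamped - lo) / (hi - lo) * bins), bins - 1)


def serialize_mesh(vertices, faces, bins: int = 128) -> list[int]:
    """Flatten a triangle mesh into one integer sequence, face by face."""
    tokens = []
    for face in faces:
        for vertex_index in face:
            tokens.extend(quantize(c, bins) for c in vertices[vertex_index])
    return tokens


# Hypothetical single triangle; the third vertex is invented for the example.
vertices = [(0.0, 0.0, 1.0), (0.7, 0.0, 0.7), (0.0, 0.7, 0.7)]
faces = [(0, 1, 2)]

tokens = serialize_mesh(vertices, faces)
print(tokens)  # [64, 64, 127, 108, 64, 108, 64, 108, 108]
```

Each triangle contributes nine integers, and a generative model can then be trained to predict the next integer in that sequence.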
NeRF and 3D Gaussian implicit tokens
3D Gaussian Splatting represents a scene as hundreds of thousands of 3D Gaussian ellipsoids, each with position, scale, orientation, color, and opacity parameters. That parameter set is itself a vector:
Gaussian 1: (x, y, z, scale_x, scale_y, scale_z, rotation, r, g, b, opacity)
Gaussian 2: (...)
→ each set of parameters = one token
Recent research uses Diffusion Models to generate and edit 3D scenes directly in "Gaussian token space", the same denoising-in-latent-space logic as image generation.
Tokenizing weather
In 2023, Google DeepMind's GraphCast and Huawei's Pangu-Weather reported results that were competitive with ECMWF—the European Centre for Medium-Range Weather Forecasts, long considered a gold standard in weather prediction—with some metrics showing advantages under specific evaluation setups. ECMWF runs thousands of physical equations on supercomputers for hours, while these AI systems can produce global 10-day forecast inference much faster.
What weather data looks like
The data these models learn from is called reanalysis data. Think of Earth's surface as a grid. ERA5, the most widely used dataset, has 0.25° × 0.25° resolution: 1,440 × 721 grid points globally, 37 pressure levels per point, and hourly snapshots going back to 1940 (models such as GraphCast step through it in 6-hour increments). This is a 4D dataset: latitude × longitude × altitude × time. Each grid cell records temperature, pressure, wind speed (east-west and north-south), humidity, and geopotential height.
How weather data gets tokenized
Step one: spatial patches. Slice Earth's grid into regional blocks, just like cutting an image into patches. Flatten all the variable values in each block into a single vector—that vector is one spatial token.
Step two: add altitude. Each location has 37 atmospheric layers, each either its own token or bundled into the spatial vector.
Step three: add time. A single snapshot is not enough—weather is a dynamic system:
t-2 global grid tokens (12 hours ago)
t-1 global grid tokens (6 hours ago)
t global grid tokens (now)
↓
predict t+1 global grid state (6 hours from now)
GraphCast's twist: graph neural networks
Earth is a sphere, and flat grids distort it—equatorial cells and polar cells cover very different actual areas. GraphCast builds Earth's surface as a graph: nodes are geographic locations, edges are connections between neighbors. Information flows along true spherical adjacency, not across all tokens at once, correctly handling spherical geometry.
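The size of that distortion can be worked out directly. The sketch below computes the surface area of one 0.25° × 0.25° cell at two latitudes, assuming a perfectly spherical Earth with a mean radius of 6,371 km. The function name is invented for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical Earth is assumed


def cell_area_km2(lat_deg: float, step_deg: float = 0.25) -> float:
    """Area of a latitude/longitude cell centered at lat_deg on a sphere."""
    lat_lo = math.radians(lat_deg - step_deg / 2)
    lat_hi = math.radians(lat_deg + step_deg / 2)
    dlon = math.radians(step_deg)
    # Spherical zone area: R^2 * delta_longitude * (sin(lat_hi) - sin(lat_lo))
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(lat_hi) - math.sin(lat_lo))


equator = cell_area_km2(0.0)
near_pole = cell_area_km2(89.0)
print(equator / near_pole)  # roughly 57
```

A cell at the equator covers roughly 57 times the area of a cell at 89° latitude, even though both occupy one position in the flat grid.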
Why AI beats physics models
Physics-based forecasting works by solving simplified equations, but some phenomena (turbulence, cloud microphysics) are too complex to model exactly, and the approximation errors compound over multi-day forecasts. AI does not write any equations; it learns the mapping "current state → future state" directly from decades of historical data, capturing patterns that the equations miss.
What emerged in the trained vector space is striking: days before a typhoon forms, the token vectors over the relevant ocean region start shifting toward a "pre-typhoon" direction. Pacific sea surface temperature anomaly tokens (El Niño) and European rainfall tokens thousands of kilometers away have high Attention scores—a long-range atmospheric teleconnection that took meteorologists decades of research to establish.
A physicist's equations are humanity's understanding of the atmosphere. An AI's embeddings are the data's understanding of the atmosphere.
The shared nature of tokens
Having walked through text, code, genes, self-driving, 3D, and weather, let us put them side by side:
Text → letter sequence → BPE segments → linguistic meaning
Code → char sequence → BPE + syntax units → program structure
Images → pixel grid → patch slices → visual meaning
Audio → pressure wave → spectrogram patch → acoustic meaning
Video → image sequence → spatiotemporal patch → motion meaning
Genes → ATCG sequence → k-mers / BPE → biological meaning
Self-driving → sensor data → voxels / point tokens → spatial meaning
3D models → geometric data → vertex / face tokens → geometric meaning
Weather → Earth grid → spatiotemporal patch → atmospheric meaning
Every domain has completely different input data, but strip away the specifics and every row does the same three things.
The three things all tokenization does
One: find the smallest unit that carries independent meaning. Not the smallest physical unit (letter, pixel, single base), but the smallest cut in this domain that has its own semantic identity. Roots in text, patches in images, functional fragments in genomes. Cut too fine and meaning vanishes; cut too coarse and combinations explode. Tokenization is finding the sweet spot.
Two: convert that unit to a fixed-format vector. Whatever the raw material—symbols, pixels, coordinates, pressure values—the output must be the same format: a fixed-dimension vector of floats. Because Transformers only accept that one format. Vectors are the universal interface through which all modalities enter AI.
Three: let meaning emerge from statistical patterns. Nobody defines what any vector "means." Meaning is learned automatically from data. Units with similar function naturally appear in similar contexts; after training, their vectors end up close in space. This holds across every domain.
The deeper point
Those three things describe a single underlying concept:
Discretizing a continuous, complex world into symbolic units that AI can compute with.
Humans do this too. We discretize language into words, music into notes, pictures into pixels, genetic material into ATCG. These are human-invented tokenization systems—we just never called them that. AI tokenization is, at its core, an imitation of how humans make sense of the world: carve continuous reality into discrete symbols, then find meaning in how the symbols combine. The difference is that human symbol systems evolved culturally, while AI token systems are learned statistically from data.
The token boundary is the boundary of what AI can understand
People tend to think of tokenization as a technical implementation detail—an engineering concern that has nothing to do with whether an AI can actually understand the world. But the picture is different: how you tokenize determines what AI can and cannot understand.
Wrong tokenization, meaning disappears—one base per token, and the model cannot learn gene function. Insufficient training data, vectors fail to organize—poor Thai tokenization, weaker model. No tokenization scheme for a modality, AI cannot process it—until someone designed point cloud tokenization, AI could not "see" the 3D world.
To a large extent, what AI can understand is bounded by whether humans can design an effective tokenization and representation for that modality.
The cost of tokens: blind spots they introduce
We have focused on what tokenization enables, but it also introduces predictable limits. A common example is the "strawberry problem": ask how many r's are in "strawberry", and models sometimes miscount. The issue is not that they literally "cannot read letters," but that they first process token chunks (for example straw + berry or other splits) and reason in vector space. That representation is excellent for semantics, but less natural for exact character-level counting.
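The gap between the two views is easy to see in code. In the sketch below, the split into straw + berry and the token IDs are hypothetical, since real splits and IDs depend on the tokenizer. The point is only that the character view makes counting trivial, while the model's input is a short list of opaque integers.

```python
word = "strawberry"

char_view = list(word)           # what a character-level count needs
token_view = ["straw", "berry"]  # one possible split; real splits vary by tokenizer
token_ids = [1001, 1002]         # hypothetical IDs; this is all the model receives


def count_letter(pieces: list[str], letter: str) -> int:
    """Count a letter across a list of text pieces."""
    return sum(piece.count(letter) for piece in pieces)


print(len(char_view), len(token_ids))  # 10 2
print(count_letter(char_view, "r"))    # 3
```

Ordinary code can count the letters because it has the characters. The model has two integers and must have learned, indirectly, which letters each one stands for.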
Another example is arithmetic. Strings like 12345 are often segmented into variable-length token pieces chosen for language compression, not for place-value arithmetic. Recent research shows numerical tokenization design can materially affect addition and multiplication accuracy. That is why production systems often pair LLMs with external calculators or enforce intermediate step expansion in prompts.
Closing
From the opening question—"what is a token?"—a clear thread has run through everything: tokens are not a text-specific concept. They are a universal one: take any form of information, find the smallest meaningful unit, convert it to a vector, and AI can work with it in a unified mathematical space.
Text, images, genes, weather, 3D models, self-driving sensors—from AI's perspective, all of it is the same thing: a sequence of tokens.
And meaning? It emerges, naturally, from hundreds of billions of predictions and corrections, written into the geometry of vectors.
This is not just a technical detail. It is the way AI understands the world.
