
LLM outputs

LLMs produce outputs through a sampling process.

The internals

Before LLMs came into existence, models read text piece by piece, with a small memory window. They lost context in long texts, and failed to link pieces of text that were far apart.

Older models
By the time they reached the end of a sentence, they would already "forget" the start.

Then, transformers came along. Transformers are neural networks based on a mechanism called "attention", which allows them to process entire text sequences at once. They can see how every piece of the text sequence relates to every other piece, regardless of distance.

An LLM is made of many stacked transformer layers.
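The core of that attention computation can be sketched in a few lines. The snippet below is a minimal, single-head illustration in plain NumPy with toy sizes and no learned weights; it is not a full transformer layer, just the "every piece looks at every other piece" idea:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position attends to every other position."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                 # pairwise similarity between all positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax: attention weights per position
    return weights @ V                                      # each output mixes information from the whole sequence

# Toy example: a sequence of 5 pieces, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 8): every output draws on all 5 pieces, regardless of distance
```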

Note
A detailed discussion of the attention mechanism is out of scope here; we refer you to the original paper, "Attention Is All You Need".

Tokens

Given a text sequence, a transformer needs to split it into pieces. But what exactly is a "piece"?

In theory, these pieces could be words → The, ultimate, number, is, 16, +, 26, =, 42. They could also be characters → T, h, e, u, l, t, i, m, a, t, e, and so on.

While words and characters make sense for humans, they are not practical for a transformer.

Characters are too small. Even simple sentences will split into an unmanageably large number of pieces. This will also dilute meaning, as a single letter tells the transformer almost nothing. It also wastes memory, as the transformer has to look at many pieces to understand context.

Words are too many. A transformer learns pieces, how they relate to each other, and how they are used together. Memorizing a vocabulary of every word would make it slow and heavy. It would also have to separately learn walk, walks, walking, and walked.

So we find a middle ground. We break the text into subword pieces that are neither characters nor words, but something in between. These pieces are called tokens. We use algorithms to figure out the subword pieces that best capture the recurring patterns in natural language; while choosing tokens, these algorithms balance token length against vocabulary size.

  • understanding → under, stand, ing
  • walking → walk, ing
  • "20251201" → 2025, 12, 01
  • status=active → status=, act, ive
  • {"id": 12, "ok": true} → {, "id, ": , 12, , , "ok": , true}
Note
You can observe GPT-4's tokenization algorithm at work here.
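As a concrete illustration, here is a minimal sketch using the open-source tiktoken library (assuming it is installed via `pip install tiktoken`), which implements the byte-pair-encoding tokenizers used by OpenAI models. The exact splits depend on the tokenizer's learned vocabulary, so they may differ from the examples above:

```python
import tiktoken  # assumes: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4

for text in ["understanding", "walking", "20251201", "status=active"]:
    token_ids = enc.encode(text)                   # text -> token ids
    pieces = [enc.decode([i]) for i in token_ids]  # token ids -> the subword pieces they stand for
    print(f"{text!r} -> {pieces}")
```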

Sampling tokens

An LLM is trained on large volumes of text. Given a text sequence, it learns to predict the token most likely to follow that sequence. Along the way, it splits text sequences into tokens, identifies statistical relationships between those tokens, and learns the patterns of natural language.

When we give a prompt, the LLM:

  1. splits the prompt into a sequence of tokens.
  2. assigns a probability to every token in its vocabulary (typically tens of thousands of tokens) based on the input sequence: the likelihood of that vocabulary token being the next token.
  3. picks the next token from this probability distribution using a sampling algorithm.
  4. appends the chosen token to the sequence.
  5. uses the updated sequence to predict the next token.

This runs in a loop until the model picks the "end-of-sequence" token in a generation step. All the new tokens added along the way form the output.
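Here is a minimal sketch of that loop. The model itself is replaced by a hypothetical next_token_probs function (a real LLM would compute the distribution from the tokens seen so far), and VOCAB_SIZE and EOS_ID are assumed values:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 50_000   # assumption: number of tokens in the vocabulary
EOS_ID = 0            # assumption: id of the "end-of-sequence" token

def next_token_probs(token_ids):
    """Hypothetical stand-in for the LLM: returns a probability for every
    vocabulary token, given the token sequence so far."""
    logits = rng.normal(size=VOCAB_SIZE)               # a real model computes these from token_ids
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                             # softmax -> probability distribution

def generate(prompt_ids, max_new_tokens=20):
    tokens = list(prompt_ids)                          # 1. the prompt as a token sequence
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)               # 2. probability for every vocabulary token
        next_id = int(rng.choice(VOCAB_SIZE, p=probs)) # 3. sample the next token from the distribution
        if next_id == EOS_ID:                          #    stop once "end-of-sequence" is picked
            break
        tokens.append(next_id)                         # 4. append it, then 5. repeat with the updated sequence
    return tokens[len(prompt_ids):]                    # the newly generated tokens form the output

print(generate([101, 2009, 318]))                      # hypothetical prompt token ids
```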

Note
LLMs don't really understand natural language. They just "continue" text sequences with text likely to follow them, based on the statistical patterns of natural language learnt in training.
