Sampling methods

Constrained decoding acts as a filter that masks illegal tokens, but you still decide how the LLM selects among the remaining valid tokens.

Greedy Decoding (temperature=0)

For most tasks (data extraction, classification, function calling), it is best to use greedy decoding with temperature set to 0.

Greedy decoding is deterministic: it always selects the token with the highest logit, giving you the most "logical" path through the schema. This minimizes hallucinations where the model might otherwise pick a valid but statistically unlikely character.
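As a minimal sketch, here is what one decode step looks like when a grammar mask is combined with greedy selection; the logit values and the set of valid token ids are hypothetical:

```python
# One decode step: mask illegal tokens, then pick greedily.
import numpy as np

logits = np.array([2.1, 0.3, -1.0, 1.7, 0.9])  # raw scores for a toy 5-token vocab
valid_ids = [0, 3, 4]                           # tokens the grammar allows here

masked = np.full_like(logits, -np.inf)
masked[valid_ids] = logits[valid_ids]           # constrained decoding: mask illegal tokens

next_token = int(np.argmax(masked))             # greedy: highest remaining logit
print(next_token)                               # -> 0, deterministically, every run
```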

Temperature Sampling (temperature > 0)

Use sampling (typically temperature between 0.2 and 0.7) only when the schema contains prose fields. For example, if you are generating a JSON object for a character description, you want variety in the description string.

It can lead to "Grammar Traps." The model might sample a token that is syntactically valid but forces generation into a logical dead end later in the sequence, resulting in repetitive or nonsensical output.
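A minimal sketch of temperature sampling over the same kind of grammar-masked logits (the values are again hypothetical); lower temperature sharpens the distribution toward the greedy choice:

```python
# Temperature-scaled sampling over masked logits.
import numpy as np

rng = np.random.default_rng(0)
masked = np.array([2.1, -np.inf, -np.inf, 1.7, 0.9])  # grammar-masked logits

def sample(logits, temperature):
    z = logits / temperature
    probs = np.exp(z - np.max(z))    # softmax, numerically stable; -inf -> 0
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

print(sample(masked, 0.2))  # heavily favors token 0 (near-greedy)
print(sample(masked, 0.7))  # more variety among the valid tokens
```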

Disable Repetition Penalty

Standard penalties often break JSON generation because valid output requires repeating structural tokens (quotes, braces, colons) and keys (e.g., "id", "name").
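A minimal sketch of the failure mode, using the common divide-positive/multiply-negative penalty formulation over a hypothetical five-token vocabulary:

```python
# Why a repetition penalty hurts JSON: any token that has already appeared
# gets down-weighted, including quotes, colons, and repeated keys.
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    out = logits.copy()
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Hypothetical vocab: 0='"', 1='{', 2=':', 3='id', 4='name'
logits = np.array([3.0, 1.0, 2.5, 2.0, 1.8])
generated = [1, 0, 3, 0, 2]  # we already emitted `{"id":` ...

print(apply_repetition_penalty(logits, generated))
# The '"' and ':' logits drop even though valid JSON needs them again.
```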

Benchmark

Run your pipeline at temp=0 first to establish a baseline for accuracy. Only introduce sampling if the outputs feel too robotic for your specific use case.
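As a rough sketch of such a baseline run, assuming a hypothetical `generate(prompt, temperature)` wrapper around your inference stack and a small labeled eval set:

```python
# Measure exact-match accuracy at a given temperature; run temperature=0.0
# first to establish the deterministic baseline.
import json

def exact_match_accuracy(eval_set, temperature=0.0):
    hits = 0
    for example in eval_set:
        raw = generate(example["prompt"], temperature=temperature)  # hypothetical wrapper
        try:
            if json.loads(raw) == example["expected"]:
                hits += 1
        except json.JSONDecodeError:
            pass  # invalid JSON counts as a miss
    return hits / len(eval_set)
```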

Avoid Top-K/Top-P with strict schemas

These can sometimes prune the only valid token allowed by the grammar, leading to runtime errors or generation stalls in certain older backends.
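A minimal sketch of the failure mode with hypothetical numbers: Top-P keeps the smallest set of tokens whose cumulative mass reaches p, which can prune a low-probability token even when it is the only one the grammar allows.

```python
# Nucleus (top-p) filtering vs. a grammar mask.
import numpy as np

probs = np.array([0.55, 0.30, 0.10, 0.04, 0.01])  # sorted token probabilities
grammar_valid = {4}                                # only token 4 (e.g. '}') is legal

def top_p_ids(probs, p=0.9):
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]
    return set(int(i) for i in keep)

survivors = top_p_ids(probs) & grammar_valid
print(survivors)  # set() -> nothing left to sample; the backend stalls or errors
```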

Beam Search

Beam search explores multiple "paths" of generation simultaneously and keeps the top N most likely sequences (the "beam width").

In structured output, the locally "best" token right now might lead to a low-probability sequence later. Because beam search keeps several hypotheses alive, it can recover from such choices and find the globally most likely valid JSON or code block.

It is best for outputs like SQL queries or code snippets where structural correctness is non-negotiable but the logic is complex. Libraries such as outlines and vLLM support it, but it increases VRAM usage and latency roughly linearly with the beam width.
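A minimal sketch using the Hugging Face transformers generate API; the model choice is just a placeholder:

```python
# Beam search decoding with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("SELECT name FROM users WHERE", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,          # beam width: 4 candidate sequences kept alive
    do_sample=False,      # pure beam search, no stochastic sampling
    max_new_tokens=32,
    early_stopping=True,  # stop once all beams have finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```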

Min-P Sampling

Min-P is a modern alternative to Top-P (Nucleus) sampling. It filters out tokens whose probability falls below a fixed fraction of the top token's probability.

Top-P can be too aggressive in structured tasks, sometimes cutting off the only valid character (like a closing brace }) if its probability is low. Min-P scales with the model's confidence. If the model is 99% sure, Min-P clears the "noise" more effectively than Top-P.

It is best for maintaining creativity in "prose" fields within a JSON schema without the instability of standard temperature sampling.
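A minimal sketch of the Min-P filter itself, with hypothetical probability distributions; note how the cutoff scales with the top token's probability instead of a fixed cumulative mass:

```python
# Min-p filtering: keep tokens within a fraction of the best token's probability.
import numpy as np

def min_p_ids(probs, min_p=0.1):
    threshold = min_p * probs.max()  # e.g. keep anything >= 10% of the top token
    return set(np.flatnonzero(probs >= threshold).tolist())

confident = np.array([0.97, 0.01, 0.01, 0.005, 0.005])
uncertain = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

print(min_p_ids(confident))  # {0}: high confidence -> nearly greedy
print(min_p_ids(uncertain))  # {0, 1, 2, 3, 4}: low confidence -> keeps variety
```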

Best-of-N (Rejection Sampling)

This is a "meta-method" where you generate N independent completions (e.g., using a temperature of 0.7) and use a separate reward model or a validator to pick the best one.

Instead of trying to get the sampling right in one shot, you generate, say, five valid JSON objects and pick the one that passes a secondary check (like a linter or a unit test).

It is best for synthetic data and complex code generation.
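A minimal sketch, again assuming a hypothetical `generate(prompt, temperature)` wrapper; the `"name"` key check stands in for whatever schema validation, linting, or unit tests you actually run:

```python
# Best-of-N with a validator acting as the "reward model".
import json

def best_of_n(prompt, n=5, temperature=0.7):
    candidates = [generate(prompt, temperature=temperature) for _ in range(n)]  # hypothetical wrapper
    for raw in candidates:
        try:
            obj = json.loads(raw)  # secondary check: must parse as JSON
            if "name" in obj:      # plus any schema or unit-test checks
                return obj
        except json.JSONDecodeError:
            continue
    return None                    # all N candidates failed validation
```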

Typical Sampling

Typical sampling seeks to pick tokens that are "typical" for the model's current state, rather than just the most likely ones.

It focuses on the "information content" (entropy). In structured outputs, this helps avoid the repetitive, looping behavior often seen in greedy decoding when the model gets stuck in a pattern of whitespace or redundant keys.

It is best for long-form structured generation (such as full API documentation).
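A minimal sketch of locally typical sampling (the formulation of Meister et al.) over a hypothetical distribution: rank tokens by how close their information content is to the distribution's entropy, then keep the most typical ones up to a probability budget.

```python
# Locally typical sampling: prefer tokens near the expected information content.
import numpy as np

def typical_ids(probs, typical_p=0.9):
    info = -np.log(probs)                       # information content per token
    entropy = np.sum(probs * info)              # expected information content
    order = np.argsort(np.abs(info - entropy))  # most "typical" tokens first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, typical_p)) + 1
    return set(order[:cutoff].tolist())

probs = np.array([0.40, 0.25, 0.15, 0.12, 0.08])  # hypothetical distribution
print(typical_ids(probs))  # -> {0, 1, 2, 3}: the most surprising token is pruned
```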