The problem

It is now easy to understand why structured outputs are hard for LLMs. We'll use a quick example to demonstrate this.

The example

We deploy a chatbot in our online store. The chatbot talks to customers and places orders on their behalf. Here’s a customer message:

“Hi, I want to order three bottles of the citrus body wash and one lavender candle.”

To add this order to our database, we need to extract order details in a specific JSON format:

{
  "customer": {
    "id": "C-10322",
    "name": "Ariana Reed"
  },
  "order": {
    "items": [
      {"sku": "BW-CITRUS", "qty": 3},
      {"sku": "CANDLE-LAVENDER", "qty": 1}
    ],
    "total_usd": 54.00,
    "discount_usd": 10.00
  }
}
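
One way to make this expected structure explicit in code is a small validation model. Below is a minimal sketch using Pydantic; the field names mirror the JSON above, but the models themselves are an illustration, not part of the store's actual pipeline.

# Illustrative only: the expected order structure expressed as Pydantic models,
# so whatever the LLM returns can be validated before it touches the database.
from pydantic import BaseModel

class Customer(BaseModel):
    id: str
    name: str

class OrderItem(BaseModel):
    sku: str
    qty: int

class Order(BaseModel):
    items: list[OrderItem]
    total_usd: float
    discount_usd: float

class OrderPayload(BaseModel):
    customer: Customer
    order: Order

With Pydantic v2, calling OrderPayload.model_validate(parsed_dict) would then raise a clear error whenever a field is missing or has the wrong type.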

The solution

We decide to use an LLM, and write this prompt:

Here are my store listings:

SKU ID          | Product Name                    | Unit Price (USD)
BW-CITRUS       | Citrus Refresh Body Wash        | $12.00
BW-EUCALYPT     | Eucalyptus & Mint Body Wash     | $12.00
CANDLE-LAVENDER | Relaxing Lavender Soy Candle    | $18.00
CANDLE-SANDAL   | Sandalwood & Amber Candle       | $20.00
LOTION-SHEA     | Raw Shea Butter Hand Cream      | $15.00
SOAP-OATMEAL    | Exfoliating Oatmeal Bar Soap    | $8.00
DIFF-BERGAMOT   | Bergamot Essential Oil Diffuser | $45.00
SCRUB-SUGAR     | Brown Sugar Body Scrub          | $22.00
BALM-PEPPER     | Peppermint Lip Balm (3-pack)    | $10.00
MIST-ROSE       | Hydrating Rose Water Face Mist  | $24.00


Extract the order details from the message and return a JSON object
matching the schema below. Output only the valid JSON.

{
  "customer": {
    "id": "",
    "name": ""
  },
  "order": {
    "items": [
      {"sku": "", "qty": }
    ],
    "total_usd":
  }
}

Message = {customer_message}

The LLM produces a text output, and we see it contains the correct JSON. We parse the text output as a JSON object, and feed it into the order database. It works.
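
In code, the whole pipeline might look roughly like the sketch below. The call_llm and insert_order helpers are hypothetical stand-ins for the model call and the database write; the fragile step is the bare json.loads.

import json

def process_message(prompt: str) -> None:
    # Hypothetical helper that sends the prompt to the model and returns its raw text output.
    raw_output = call_llm(prompt)

    # Parse the text as JSON. This only works if the output is *nothing but* valid JSON.
    order = json.loads(raw_output)

    # Hypothetical helper that writes the parsed order into the order database.
    insert_order(order)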

Encouraged by this success, we ship to production. When customer messages start pouring in, the code breaks for 20% of them.

Why this fails

Upon inspecting the errors, we find that the LLM outputs were malformed and could not be parsed as JSON. Let’s understand why the LLM might have produced them:

Sure! Here is the JSON object for the order:

{
  "customer": {
    "id": "C-10322",
    "name": "Ariana Reed"
  },
  ...
}

Hope this helps!

In its training data, the LLM has seen helpful assistants preface code with polite text. The tokens Sure, !, and Here had high probabilities. The model picked them to match the pattern of a conversation. It does not know that this text breaks our parser.
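
That stray conversational text is exactly what breaks us. A minimal reproduction with nothing but the standard library:

import json

raw_output = (
    "Sure! Here is the JSON object for the order:\n\n"
    '{"customer": {"id": "C-10322", "name": "Ariana Reed"}}\n\n'
    "Hope this helps!"
)

try:
    json.loads(raw_output)
except json.JSONDecodeError as err:
    # The parser gives up on the very first character: "S" is not valid JSON.
    print(f"Parse failed: {err}")  # Expecting value: line 1 column 1 (char 0)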

What do we conclude after looking at these outputs?

LLMs are dumb

They do not think or reason. They simply:

  1. read the text in our prompt
  2. match it with patterns learnt in training
  3. pick the "most likely" words to follow

When we ask "What is the capital of France?", an LLM gives the output "Paris". It does not do this because it knows geography. It does this because “Paris” is the word that usually follows that question in books and articles. It matches the input with patterns it has seen and learnt in training.
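
A toy illustration of that "most likely next word" step (the numbers below are made up for the example, not taken from any real model):

# Hypothetical next-token probabilities after "What is the capital of France?"
next_token_probs = {
    "Paris": 0.92,
    "Lyon": 0.03,
    "France": 0.02,
    "The": 0.02,
    "Sure": 0.01,
}

# Greedy decoding: pick whichever token has the highest probability.
most_likely = max(next_token_probs, key=next_token_probs.get)
print(most_likely)  # -> Paris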

The same applies to our chatbot example. We give it the prompt, and the LLM starts producing the output one token at a time. While it is producing each token:

  • It doesn't understand our prompt.
  • It doesn't understand we asked for a specific JSON object.
  • It doesn't understand some tokens will break our parser.

It only understands it needs to pick the "most likely" token.

Research by Apple's AI scientists found no evidence of formal reasoning in LLMs and suggested that they are just sophisticated pattern matchers.