The problem

It is now easy to understand why structured outputs are hard for LLMs. We'll use a quick example to demonstrate this.

The example

We deploy a chatbot in our online store. The chatbot talks to customers and places orders on their behalf. Here’s a customer message:

“Hi, I want to order three bottles of the citrus body wash and one lavender candle.”

To add this order to our database, we need to extract order details in a specific JSON format:

{
  "customer": {
    "id": "C-10322",
    "name": "Ariana Reed"
  },
  "order": {
    "items": [
      {"sku": "BW-CITRUS", "qty": 3},
      {"sku": "CANDLE-LAVENDER", "qty": 1}
    ],
    "total_usd": 54.00,
    "discount_usd": 10.00
  }
}
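
One way to make this expected structure explicit in code is a small validation model. Below is a minimal sketch using Pydantic; the field names mirror the JSON above, but the models themselves are an illustration, not part of the store's actual pipeline.

# Illustrative only: the expected order structure expressed as Pydantic models,
# so whatever the LLM returns can be validated before it touches the database.
from pydantic import BaseModel

class Customer(BaseModel):
    id: str
    name: str

class OrderItem(BaseModel):
    sku: str
    qty: int

class Order(BaseModel):
    items: list[OrderItem]
    total_usd: float
    discount_usd: float

class OrderPayload(BaseModel):
    customer: Customer
    order: Order

With Pydantic v2, calling OrderPayload.model_validate(parsed_dict) would then raise a clear error whenever a field is missing or has the wrong type.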

The solution

We decide to use an LLM, and write this prompt:

Here are my store listings:

SKU ID          | Product Name                    | Unit Price (USD)
BW-CITRUS       | Citrus Refresh Body Wash        | $12.00
BW-EUCALYPT     | Eucalyptus & Mint Body Wash     | $12.00
CANDLE-LAVENDER | Relaxing Lavender Soy Candle    | $18.00
CANDLE-SANDAL   | Sandalwood & Amber Candle       | $20.00
LOTION-SHEA     | Raw Shea Butter Hand Cream      | $15.00
SOAP-OATMEAL    | Exfoliating Oatmeal Bar Soap    | $8.00
DIFF-BERGAMOT   | Bergamot Essential Oil Diffuser | $45.00
SCRUB-SUGAR     | Brown Sugar Body Scrub          | $22.00
BALM-PEPPER     | Peppermint Lip Balm (3-pack)    | $10.00
MIST-ROSE       | Hydrating Rose Water Face Mist  | $24.00


Extract the order details from the message and return a JSON object
matching the schema below. Output only the valid JSON.

{
  "customer": {
    "id": "",
    "name": ""
  },
  "order": {
    "items": [
      {"sku": "", "qty": }
    ],
    "total_usd":
  }
}

Message = {customer_message}

The LLM produces a text output, and we see it contains the correct JSON. We parse the text output as a JSON object, and feed it into the order database. It works.
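
In code, the whole pipeline might look roughly like the sketch below. The call_llm and insert_order helpers are hypothetical stand-ins for the model call and the database write; the fragile step is the bare json.loads.

import json

def process_message(prompt: str) -> None:
    # Hypothetical helper that sends the prompt to the model and returns its raw text output.
    raw_output = call_llm(prompt)

    # Parse the text as JSON. This only works if the output is *nothing but* valid JSON.
    order = json.loads(raw_output)

    # Hypothetical helper that writes the parsed order into the order database.
    insert_order(order)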

Encouraged by this success, we ship to production. When customer messages start pouring in, the code breaks for 20% of them.

Why this fails

Upon inspecting the errors, we find that the LLM outputs were malformed and could not be parsed as JSON. Let’s understand why the LLM might have produced them:

Sure! Here is the JSON object for the order:

{
  "customer": {
    "id": "C-10322",
    "name": "Ariana Reed"
  },
  ...
}

Hope this helps!

In its training data, the LLM has seen helpful assistants preface code with polite text. The tokens Sure, !, and Here had high probabilities. The model picked them to match the pattern of a conversation. It does not know that this text breaks our parser.
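
That stray conversational text is exactly what breaks us. A minimal reproduction with nothing but the standard library:

import json

raw_output = (
    "Sure! Here is the JSON object for the order:\n\n"
    '{"customer": {"id": "C-10322", "name": "Ariana Reed"}}\n\n'
    "Hope this helps!"
)

try:
    json.loads(raw_output)
except json.JSONDecodeError as err:
    # The parser gives up on the very first character: "S" is not valid JSON.
    print(f"Parse failed: {err}")  # Expecting value: line 1 column 1 (char 0)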

What do we conclude after looking at these outputs?

LLMs are dumb

They do not think or reason. They simply:

  1. read the text in our prompt
  2. match it with patterns learnt in training
  3. pick the "most likely" words to follow

When we ask "What is the capital of France?", an LLM gives the output "Paris". It does not do this because it knows geography. It does this because “Paris” is the word that usually follows that question in books and articles. It matches the input with patterns it has seen and learnt in training.
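
A toy illustration of that "most likely next word" step (the numbers below are made up for the example, not taken from any real model):

# Hypothetical next-token probabilities after "What is the capital of France?"
next_token_probs = {
    "Paris": 0.92,
    "Lyon": 0.03,
    "France": 0.02,
    "The": 0.02,
    "Sure": 0.01,
}

# Greedy decoding: pick whichever token has the highest probability.
most_likely = max(next_token_probs, key=next_token_probs.get)
print(most_likely)  # -> Paris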

The same applies to our chatbot example. We give it the prompt, and the LLM starts producing the output one token at a time. While it is producing each token:

  • It doesn't understand our prompt.
  • It doesn't understand we asked for a specific JSON object.
  • It doesn't understand some tokens will break our parser.

It only understands it needs to pick the "most likely" token.

Research by Apple's AI scientists found no evidence of formal reasoning in LLMs and suggested that they are just sophisticated pattern matchers.