Retrieval is critical for AI agents. To do any task correctly, an AI agent needs to retrieve all the information relevant to that task from its memory.

Enterprise AI agents need context graphs to do this. This post discusses -

  1. why AI agents need decision traces
  2. how context graphs capture them
  3. how context graphs can be used by AI agents
  4. what the quantifiable improvements are

An agent failure case

It's the last week of the quarter. A renewal agent is working a $480k account. The customer wants 20% off or they walk. The agent's policy caps renewals at 10%.

Now what?

A human doing that task will use their experience instead of the policy document.

Didn't we do this exact thing with Globex last quarter? It was a similar story. Big Fortune 500 logo, threatening to churn, and someone signed off on 18% because there'd been three SEV-1 outages on their account that month. While Finance grumbled, they approved it anyway. The risk was worth taking on a $300k account. It worked and Globex renewed shortly after.

The reasoning chain behind that decision is not written down anywhere your agent can read.

Salesforce will happily tell your agent that the Globex discount was 18%. It will not tell it that the number was an exception, who approved it, what justified it, or that a nearly identical situation is sitting in the agent's queue right now. The what is spread across six systems of record. The why lives in three people's heads and a Slack thread that already rolled off retention.

So your agent does one of two things. It does the dumb-but-compliant thing, caps at 10%, and you lose the account. Or it escalates to a human, who spends forty minutes doing Slack archaeology to reconstruct a decision the company already made once. Either way, the organization failed to realize the benefit of adopting an AI agent.

This is the gap. We have gotten extremely good at recording what changed, i.e. the parts, prices, suppliers, statuses, approvals. We systematically throw away why it changed. And "why" turns out to be the one thing your agents need most and have least.

What fills this gap is what people have started calling a context graph. Foundation Capital called it a trillion-dollar opportunity, which is the kind of phrase that usually sends people reaching for the back button. Stick around anyway. The term is new and a little overloaded, but it points at something real. If you ship or use agents, you will end up using a context graph soon.

Flat context

The bad way to give your AI agents context is to just give them all the data, records, documents, rules and policies in a flat context window.

Dump the invoice, PO, vendor record, contract, and policy document into the context. Then watch the agent fail. Why?

1. Context rot

Suppose the agent asks itself, “Can I pay invoice #842?” To answer, it has to hop through a chain of questions. Which PO does this invoice reference? Does that PO still have budget? Was the delivery received? Is the vendor on payment hold? Does $12,400 cross the approval threshold? What do the contract's payment terms say? Is the policy doc accurate and up to date? Are there undocumented nuances about this invoice payment living in Slack messages and Zoom call transcripts?

Flat retrieval tries to hand this massive amount of information to the LLM in a pile of disconnected chunks. Some invoice text here, some PO text there, some language from random docs and Slack channels, all given as a flat wall of text.

“Can I pay invoice #842?”

  • SAP Purchase Order · ME23N: 7731 · Acme Corporation Ltd · 12,400.00 · 3,600.00 · Posted 12 Mar · Open
  • policy_v4.docx (AP Approval Policy): Invoices ≥ $10,000 need manager sign-off
  • Slack #ap-ops:
      Dana, 10:21: “Acme always pays end of quarter - don’t chase them.”
      Wes, 10:22: “noted 👍 leaving #842 as-is”
  • payment_hold.csv (Vendor · On hold): Acme Corporation Ltd · No; Globex LLC · Yes; Initech · No
  • contract_acme.pdf: Payment terms: NET-60
  • Renewal call transcript, 14:02:
      Sales: “Gave Acme Net-60 this renewal.”
      AP lead: “Paid late twice… but it’s a $40k account. Approved.”

The model, in its inference run, is forced to re-derive every one of those connections from scratch on each turn. Flattening business data into loose text destroys exactly the structure it needs.

And as the context grows, LLMs start failing at their tasks. They fail to follow instructions, drop rules at random, misread the relationship between two pieces of information that sit far apart, ignore data in the middle of the context, and apply rules and constraints out of order.

Surge AI documents this in their instruction-following benchmark. The best frontier model solves <41% of such complex tasks.

2. Lack of decision traces

AI agents are bound to run into the same ambiguity humans resolve every day with learned judgment and organizational memory. But the inputs that form this judgment and organizational memory aren't stored in a flat context window.

  • Tribal knowledge. "We always give healthcare companies an extra 10% because their procurement cycles are brutal." That's not in the CRM. It's tribal knowledge passed down through onboarding and side conversations.
  • Past decisions. "We structured a similar deal for X last year where they split payments into installments, so we should offer this similar account Y the same terms." No system links the two deals to convey why Y's contract was drafted this way.
  • Context across systems of record. The account manager sees usage sliding in the product dashboard, an unpaid invoice in NetSuite, and a cold one-line reply on the renewal thread, then flags the customer as "churn risk" in the CRM. This reasoning happened in their head. The context just says "churn risk".
  • Manual approvals. A VP approves a discount on a Zoom call or in a Slack DM. The opportunity record shows the final price. It doesn't show who approved the deviation or why this one-time exception was made.

In a flat context window, the reasoning connecting data to decisions/actions isn't treated as data in the first place. 

Every developer has hit the same wall in their own codebase. Why did we pick this queue over that one in 2019? Why is there a sleep(200) in the retry path that breaks everything when you remove it? It was obvious to whoever wrote it, and it's gone now. Remember Architecture Decision Records? We invented them to fix exactly this. But most ADR folders die at three entries, because writing them is friction and nobody reads them later.

That is the real shape of the problem. Companies are good at storing what happened. They are bad at storing why, because the why is unstructured, it spans systems, and until recently nothing could read it back even if you saved it.

Both failures, the flattened web and the missing why, have the same fix.

So what is a context graph?

Two ideas, and neither is exotic.

First, store every link once. Treat the things your business cares about as nodes: invoices, POs, vendors, contracts, policies, people. Treat the relationships as edges: this invoice references that PO, that PO is fulfilled by this receipt, this vendor signed that contract. Now the agent doesn't rebuild the web each turn. It walks it. It pulls the small slice a decision touches and leaves the other ten thousand records out of the window. Context rot goes away, because the window stays small and on-point.

[Interactive figure: the graph trace for “Can I pay invoice #842?”]
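
To make "it walks it" concrete, here is a minimal sketch that assumes an in-memory edge list. The node IDs, relationship names, and the `slice_for` helper are illustrative assumptions, not taken from any real system. It does a bounded breadth-first walk from the invoice and returns only the edges a decision about that invoice could touch.

```python
from collections import deque

# Hypothetical edges: (source, relationship, target). All IDs are made up for illustration.
EDGES = [
    ("invoice:842", "references", "po:7731"),
    ("invoice:842", "governed_by", "policy:ap-approval"),
    ("po:7731", "issued_to", "vendor:acme"),
    ("po:7731", "fulfilled_by", "receipt:r-1"),
    ("vendor:acme", "signed", "contract:acme"),
    ("invoice:900", "references", "po:8000"),   # unrelated record
    ("po:8000", "issued_to", "vendor:globex"),  # unrelated record
]


def build_index(edges):
    """Group outgoing edges by source node so each hop is a dictionary lookup."""
    index = {}
    for src, rel, dst in edges:
        index.setdefault(src, []).append((rel, dst))
    return index


def slice_for(start, index, max_hops=3):
    """Breadth-first walk from `start`; returns the edges reached within `max_hops`."""
    seen = {start}
    queue = deque([(start, 0)])
    reached = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for rel, dst in index.get(node, []):
            reached.append((node, rel, dst))
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, hops + 1))
    return reached


index = build_index(EDGES)
context = slice_for("invoice:842", index)
for src, rel, dst in context:
    print(f"{src} --{rel}--> {dst}")
```

The unrelated Globex records never enter the result, which is the point: the window holds the slice, not the pile.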

Second, store the why as data. The unit here is a decision trace: a structured record of one decision and everything that led to it. Not just "Acme moved to Net-60" but the problem that triggered it, the options weighed, why the rejected ones lost, who decided, and the reasoning they used.

decision_trace · DT-118

  • outcome: Acme → Net-60
  • problem: Acme asked for longer terms on the renewal call
  • options:
      ✗ Hold at Net-30: risked losing the renewal
      ✓ Net-60: chosen
      ✗ Net-90: too much exposure given late history
  • constraints: Acme paid late twice last year · exception rule lets the AP lead sign off
  • decided_by: Priya (AP lead). A $40k account judged worth the float; cheaper than losing the renewal
  • captured_at: on the call, the moment it was decided

Every value above was captured from a real system of record, not retyped later. Each field links back to where it lives:

  • SAP Vendor Master · XK03: Acme Corporation Ltd · NET-60 · NET-30 · today · by P. Nair
  • Renewal call transcript, 14:02:
      Acme: “To renew, we’d need to move to 60-day payment terms.”
      Sales: “Understood, let me run it past finance.”
  • Slack #deal-desk:
      Sales, 14:31: “Acme wants Net-60 to renew. Options?”
      Finance, 14:33: “Net-30 is safer but they may walk. Net-90 too exposed given the late history.”
      Priya · AP lead, 14:35: “Net-60 it is. I’ll sign off under the exception rule. ✓”
  • ERP · Acme payment history (AR aging), listed as Invoice, Due / Paid / Status:
      #60214 Jan / +14 days / Late
      #58802 Nov / +9 days / Late
      #57020 Sep / on time / Paid
  • policy_v4.docx: exception rule → AP lead may sign off
  • AP Approvals · DT-118: Priya Nair · AP lead · Approved · Tue 14:06 · “$40k account, worth the float vs. losing the renewal.”
  • Calendar: Acme renewal call · Tue 14:00 – 14:30 · Zoom · Recording + AI notes saved at 14:02

This is what a decision trace stores. This is also what an employee keeps in their head. But with a context graph of decision traces, an agent (harness) can read it.

A context graph is those two things together. Entities and relationships, plus a decision trace on every choice that mattered, stitched across systems and time. Foundation Capital's one-liner for it is a "system of record for decisions" instead of objects. That lands. Most of your systems store the current state of things. A context graph stores how the state got that way.

Example schema of a decision trace:

The decision-trace schema, centered on what matters

DecisionTrace is the hub — every person, policy, part, and vehicle hangs off it by a named relationship. PK = primary key.
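
As a rough illustration of that hub shape, here is a sketch of a decision trace as Python dataclasses (Python 3.9+). The field names follow the DT-118 example above; the `Option` class, the `links` mapping, the relationship names, and the node IDs are assumptions made for this sketch, not a published schema.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    label: str
    chosen: bool
    reason: str


@dataclass
class DecisionTrace:
    trace_id: str                 # primary key
    outcome: str
    problem: str
    options: list[Option]
    constraints: list[str]
    decided_by: str
    rationale: str
    captured_at: str
    # Named relationships to other nodes: relationship name -> node IDs (hypothetical IDs).
    links: dict[str, list[str]] = field(default_factory=dict)

    def rejected(self) -> list[Option]:
        """Options that were weighed and lost, kept so the agent can see why."""
        return [o for o in self.options if not o.chosen]


dt_118 = DecisionTrace(
    trace_id="DT-118",
    outcome="Acme -> Net-60",
    problem="Acme asked for longer terms on the renewal call",
    options=[
        Option("Hold at Net-30", False, "risked losing the renewal"),
        Option("Net-60", True, "chosen"),
        Option("Net-90", False, "too much exposure given late history"),
    ],
    constraints=[
        "Acme paid late twice last year",
        "exception rule lets the AP lead sign off",
    ],
    decided_by="Priya (AP lead)",
    rationale="a $40k account judged worth the float; cheaper than losing the renewal",
    captured_at="on the call, the moment it was decided",
    links={"concerns": ["vendor:acme"], "applied": ["policy:policy_v4"]},
)

print([o.label for o in dt_118.rejected()])
```

Keeping the rejected options as first-class data is the design choice that matters here: a later agent can see not only what was picked but what lost and why.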

Now the second half is turning old decisions into something the agent can lean on.

Capture it on the way in

You capture the why at the moment the decision is made, on the write path, not by mining emails afterward. This sounds like an implementation detail, but it is the most critical aspect of your context graph. Reconstructing context after the fact is guesswork: the meeting is over, the Slack thread scrolled away, the person left.

Capture it in the flow and you get the real reason at full fidelity, for almost no extra effort. When a human overrides the agent's refund denial, that override is the moment to ask why and store the answer. Right then, while it's still true.

[Diagram: record_decision(), how one call fans out. The MCP tool orchestrates every hop; the engineer only sees the opening call and the final decision ID.]
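
A minimal sketch of what a write-path capture call could look like, assuming an in-memory store. Only the name `record_decision` and the returned decision ID come from the text above; the `DecisionStore` class, its parameters, the five-word rationale check, and the example values are assumptions for illustration, and a real MCP tool would fan out to real systems.

```python
import itertools
from datetime import datetime, timezone


class DecisionStore:
    """In-memory stand-in for the graph store (an assumption for this sketch)."""

    def __init__(self):
        self.traces = {}
        self.edges = []
        self._ids = itertools.count(1)

    def record_decision(self, outcome, rationale, decided_by, touches):
        # Hypothetical quality gate: refuse rationales too thin to serve as precedent.
        if len(rationale.split()) < 5:
            raise ValueError("rationale too thin to be useful as precedent")
        decision_id = f"DT-{next(self._ids)}"
        self.traces[decision_id] = {
            "outcome": outcome,
            "rationale": rationale,
            "decided_by": decided_by,
            # Timestamp at write time, so the trace is captured while it is still true.
            "captured_at": datetime.now(timezone.utc).isoformat(),
        }
        # Link the decision to every entity it touched.
        for node_id in touches:
            self.edges.append((decision_id, "touches", node_id))
        return decision_id


store = DecisionStore()
# Illustrative values: a human overrides the agent's refund denial and says why.
decision_id = store.record_decision(
    outcome="refund approved, overriding the agent's denial",
    rationale="Customer was double charged during the billing outage",
    decided_by="support lead",
    touches=["ticket:1", "customer:1"],
)
print(decision_id)
```

The caller sees one call and one ID; the linking and timestamping happen as a side effect of doing the work.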

This is also why agents change the economics of organizational memory. We have always known we lose the why. Wikis, Confluence, post-mortems, ADRs: every one of them tries to save it, and every one decays, for the same two reasons. Writing it down is friction, and nobody reads it back.

Agents break both at once. The agent sits in the execution path, so capture is a side effect of doing the work, not a chore bolted on afterward. And the agent is a tireless reader that will happily consult ten thousand past decisions before making the next one.

Organizational memory finally has a reader worth writing for. That flips it from a cost you nag people about into an asset that compounds.

Stored decisions become precedent

Once decisions live in the graph, search turns them into precedent.

You embed each decision and match on meaning, so "missed three deliveries" surfaces "late shipments" and "vendor SLA breach" with no shared words. A new case shows up, the agent pulls the closest past cases, sees what was decided and why, and proposes the same, with a citation.

Finding precedents by meaning, not keywords

The new decision is turned into a vector and matched by cosine similarity — surfacing past decisions that mean the same thing, ranked by closeness.

So you use vector embeddings to find semantically similar decisions. Then apply graph-based filters to narrow by entity type, supplier, category, date range, or outcome.
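
Here is a small sketch of that two-step lookup: rank by cosine similarity, then narrow with a metadata filter. The three-dimensional vectors are hand-written stand-ins for real embeddings, and the trace IDs, categories, and the `find_precedents` helper are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors stand in for embeddings produced by a real model.
TRACES = [
    {"id": "DT-A", "category": "vendor", "text": "late shipments", "vec": [0.9, 0.3, 0.1]},
    {"id": "DT-B", "category": "vendor", "text": "vendor SLA breach", "vec": [0.8, 0.1, 0.4]},
    {"id": "DT-C", "category": "pricing", "text": "healthcare quote buffer", "vec": [0.1, 0.9, 0.2]},
]


def find_precedents(query_vec, traces, category=None, top_k=2):
    """Rank traces by similarity to the query, then keep only those passing the filter."""
    ranked = sorted(traces, key=lambda t: cosine(query_vec, t["vec"]), reverse=True)
    if category is not None:
        ranked = [t for t in ranked if t["category"] == category]
    return [(t["id"], round(cosine(query_vec, t["vec"]), 2)) for t in ranked[:top_k]]


# Stand-in embedding for the new case "missed three deliveries".
query = [1.0, 0.2, 0.1]
matches = find_precedents(query, TRACES, category="vendor")
for trace_id, score in matches:
    print(trace_id, score)
```

None of the stored texts share a word with the query; they surface because their vectors point the same way, and the category filter keeps the pricing trace out.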

A pile of old decisions becomes memory the agent can actually use. This is also how an agent improves without anyone retraining it.

For example, an agent handling a refund pulls up similar past refunds and sees what was decided, and why, then uses that memory to improve against the end goal. Three more cases show the pattern:

New case: Globex wants Net-60
  • paid late once last year
  • $500k renewal at risk
  • #348 on the Fortune 500
Match: ≈ 0.86 (case match)
Precedent: decision trace 118, Acme → Net-60
  • late twice but still granted
  • $540k judged worth the risk
  • #211 on the Fortune 500. Great logo.
The agent reads the precedent and proposes to grant Net-60 under the exception rule.

New case: Quote for MediCorp
  • segment: healthcare
  • 9-month procurement cycle
  • net-new logo
Match: ≈ 0.92 (same segment)
Precedent: decision trace 07, ABC Pharma → +10% buffer
  • segment: healthcare
  • procurement cycles are brutal
  • build +10% into the quote
  • captured from the onboarding Zoom call transcript, call_id 4329
This does not show up in the CRM. It just lives in veterans’ heads. By referencing a matching decision trace, the agent adds the 10% buffer automatically and can say why whenever asked.

New case: Account Q · renewal due
  • usage −30% (Product)
  • invoice 38d overdue (NetSuite)
  • one-line cold reply
Match: ≈ 0.84 (case match)
Precedent: decision trace 151, Account Z → churn risk
  • usage slid 28% (Product)
  • unpaid invoice for 37 days (NetSuite)
  • cold renewal reply
  • CSM changed status to churn risk
The flat CRM field just says “churn risk.” The decision trace keeps the three signals, across Product, NetSuite and email, that triggered the status change. So the agent reassembles the same picture for Q and changes its status to “churn risk”.

There's a second payoff here. Because traces record exceptions, not just the clean path, you can see when a rule keeps getting overridden. If AP grants the same late-payment exception to twenty vendors, the policy is wrong, not the vendors. The graph turns that pattern into a signal to fix the underlying rule itself.

[Interactive counter: times the Net-60 exception was granted this quarter. ⚑ Pattern detected → raise the Net-60 threshold]
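
A sketch of how that signal could be computed from stored traces, assuming each trace records which rule it deviated from and for which vendor. The field names, rule names, and the threshold of twenty distinct vendors are illustrative assumptions (the threshold simply echoes the example above).

```python
from collections import defaultdict


def rules_to_review(traces, min_vendors=20):
    """Return rules whose exceptions were granted to at least `min_vendors` distinct vendors."""
    vendors_by_rule = defaultdict(set)
    for trace in traces:
        if trace["is_exception"]:
            vendors_by_rule[trace["rule"]].add(trace["vendor"])
    return {
        rule: len(vendors)
        for rule, vendors in vendors_by_rule.items()
        if len(vendors) >= min_vendors
    }


# Illustrative data: the same payment-terms exception granted to twenty vendors.
traces = [
    {"rule": "net-30-terms", "vendor": f"vendor-{i}", "is_exception": True}
    for i in range(20)
]
traces += [
    {"rule": "approval-threshold", "vendor": "vendor-0", "is_exception": True},
    {"rule": "net-30-terms", "vendor": "vendor-99", "is_exception": False},
]

flagged = rules_to_review(traces)
print(flagged)
```

Counting distinct vendors rather than raw exceptions keeps one noisy account from flagging a rule that is otherwise working.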

Over time, the context graph becomes the real source of truth for autonomy, and your company can easily audit and debug this autonomy.

The recent ACE paper, "Agentic Context Engineering", makes the mechanism concrete:

Treat the accumulated context as a playbook that grows through generation, reflection, and curation, and let real outcomes refine it. The agent gets better by editing what it knows, not by touching a single weight. A correction today becomes a rule tomorrow. A trace today becomes precedent next quarter. This feedback loop enables learning in agents.

Example of an agent using context graphs

[Interactive walkthrough in 6 steps: how the agent finds, and uses, a precedent. Step 1: the decision that needs a precedent.]

Isn't this just a knowledge graph with extra steps?

Fair question. The honest answer: the parts are old, the combination is new.

Knowledge graphs have been around since Google shipped one in 2012. Vector search and RAG are standard issue. Event sourcing, storing the sequence of events instead of just the latest state, is a pattern any backend engineer already knows. A context graph is close to event sourcing for decisions, where each event drags along its rationale and its links to everything it touched, and the store is a graph so you can walk relationships and search by meaning.
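
To show the event-sourcing analogy in miniature, here is a sketch with an append-only list of decision events. The event shape and helper names are assumptions for illustration, and the first event's rationale is invented filler; the second mirrors the Acme example.

```python
# Append-only log: each event carries the new value and the reason for it.
events = [
    {"entity": "vendor:acme", "field": "payment_terms", "value": "NET-30",
     "why": "default terms (illustrative placeholder)"},
    {"entity": "vendor:acme", "field": "payment_terms", "value": "NET-60",
     "why": "DT-118: renewal at risk, AP lead signed off under the exception rule"},
]


def current_state(log):
    """What a typical system of record stores: only the latest value per field."""
    state = {}
    for event in log:
        state[(event["entity"], event["field"])] = event["value"]
    return state


def history(log, entity, field):
    """What the decision log adds: every value the field held, with its rationale."""
    return [
        (event["value"], event["why"])
        for event in log
        if event["entity"] == entity and event["field"] == field
    ]


print(current_state(events)[("vendor:acme", "payment_terms")])
for value, why in history(events, "vendor:acme", "payment_terms"):
    print(value, "<-", why)
```

Replaying the log gives the same answer the vendor master would; the log can also answer how it got there.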

So no, there's no new primitive here. What's new is narrow and real. You capture the why on the write path, as structured data, because for the first time there's a consumer hungry enough to make it worth the trouble. The novelty isn't the graph. It's that agents finally give the why somewhere to be used.

RAG, for contrast, retrieves documents that look similar to your question. A context graph retrieves decisions, with their reasoning and their edges to everything they affected. One hands the model more text to read. The other hands it structure to walk and precedent to reason from.

Why your CRM won't just do this for you

Systems of record probably have it wrong. Salesforce released Agentforce, ServiceNow released Now Assist, Workday is doing something similar. Their reasoning is to add intelligence where the data resides.

But their agents will inherit the exact same limitations as their parents.

  1. Salesforce only stores the current state. It knows what a deal looks like today, but not what it looked like last week. When someone approves a discount, the system drops the reasons why. You cannot see the exact context of the choice.
  2. These systems also miss data. A support ticket doesn't just live in Zendesk. It needs user tiers from the CRM, SLA terms from billing, recent outages from PagerDuty, and the Slack thread flagging churn risk. No single system of record sees the whole picture.

Systems of record can build their own agents, lock down APIs (ahem ahem), and adopt egress fees, but they can't insert themselves into an orchestration layer they were never part of.

When an agent triages an escalation, responds to an incident, or decides on a discount, it pulls context from multiple systems and time periods. The orchestration layer sees the full picture: what inputs were gathered, what policies applied, what exceptions were granted, and why. Because it's executing the workflow, it can capture that context at decision time instead of bolting on governance afterwards.

This is the essence of a context graph, and that will be the single most valuable asset for your company in the era of AI.

The hard parts

Here's where it gets difficult -

Garbage in, garbage precedent. If the captured rationale is lazy ("approved, see Slack"), your precedents are landfill. The graph is worth exactly the quality of the why you put in it, and writing a good why is real work, even with the prompt sitting right there.

Who writes the trace. If a human has to type thoughtful rationale every time, you've reinvented the wiki and it will rot the same way. If the agent infers the rationale, you have to trust the inference, and "the model guessed why we did this" is a shaky base for an audit. The real answer is somewhere in between, and getting that split right is most of the engineering.

The decision swamp. We turned data lakes into data swamps by dumping everything in with no schema and no curation. A graph of millions of contradictory, half-true traces is the same failure with extra edges. Without curation, more traces make precedent search worse, not better. ACE has a name for the small-scale version of this, context collapse, where rewriting erodes detail over time. Same disease, bigger blast radius.

And the plain one: this is early. Most vendor decks make it sound shipped. It mostly isn't. The pattern is sound and the early results are good, but "good early results" is not "proven," and anyone who tells you otherwise is pitching.

The full stack

To recap, you have four layers in an AI-native workflow -

  1. At the bottom sit your systems of record. Salesforce, SAP, Zendesk, GitHub, Slack, the Zoom transcript from this morning's call. They hold the raw state of your business. Scattered and current-state-only, but they're the ground truth for what is.
  2. Above them runs the harness. It sits in the execution path and runs the reason → act → observe loop. It holds the tools, picks what goes into the model on each step, stores corrections as memory, checkpoints long runs, enforces permissions, logs every decision, and catches errors before they crash the run. This engine turns a stateless LLM into something that finishes work.
  3. As the harness runs, it builds the context graph. Every decision it routes leaves a trace: what inputs it gathered, which rule it applied, what exception it took, who approved, and why. The graph stitches those traces across entities and time. It answers "why," and over time it becomes the real source of truth.
  4. At the top sit the agents and the humans. The agents do the bulk of the work. Humans stay in the loop for the calls the agent flags as uncertain. Every correction a human makes flows back down into memory and the graph, so the next case runs cleaner.

The four layers of an AI-native company

This actually maps to the two core features of an AI-native workflow -

  • Universal context is your systems of record made queryable through the context graph. The agent doesn't re-derive the links between an invoice, a PO, a contract, and a Slack message on every turn. The graph already holds them.
  • Loops are the harness closing feedback on every run. A correction today becomes a rule tomorrow. A trace today becomes precedent next quarter.
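
A toy version of that loop, assuming the harness already has scored precedents in hand: it proposes the best match with a citation, escalates when nothing is close enough, and stores the human's call so it is available next time. The 0.8 threshold, the function names, and the sample values are assumptions for illustration.

```python
def decide(precedents, threshold=0.8):
    """Propose the best precedent's outcome, or escalate when nothing is close enough."""
    best = max(precedents, key=lambda p: p["score"], default=None)
    if best is None or best["score"] < threshold:
        return {"action": "escalate_to_human", "cites": None}
    return {"action": best["outcome"], "cites": best["id"]}


def record_correction(graph, case_id, outcome, why):
    """Store the human's call as a trace so it can be retrieved as precedent later."""
    trace = {"id": f"DT-{case_id}", "outcome": outcome, "why": why}
    graph.append(trace)
    return trace


graph = []

# First encounter: no precedent exists, so the agent escalates.
first = decide([])

# The human decides, and the correction flows back into the graph.
record_correction(graph, "globex-renewal", "grant Net-60",
                  "paid late once, large renewal at risk")

# Next similar case: the stored trace comes back as a scored precedent (score assumed).
second = decide([{**trace, "score": 0.86} for trace in graph])
print(first, second)
```

No weights change between the two calls. The only thing that changed is what the agent can look up.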

Where to start

Don't try to automate everything at once. Pick one workflow/team and prove it before you expand. Try to pick a workflow/team with one or more of these three traits -

  1. High headcount, because that labor exists to handle messy logic.
  2. Exception-heavy decisions, because precedent matters a lot there.
  3. Cross-functional roles, because they exist just to carry context that no other system holds currently.

If all three line up, that's your first target. Procurement, finance, claims, deal desk, underwriting, and escalation management are a few examples.

Don't be afraid to tokenmaxx. If your monthly AI usage bill doesn't make you uncomfortable these days, you are doing something wrong. Plus, you can more than offset today's AI bills with payroll savings tomorrow.

Conclusion

A clean way to hold all of this in your head:

The model is your brain, the agent / agentic harness is your limbs, and the context graph is the map of your specific world (company). A superb body with no map of your world stalls at every fork in the road that requires knowing the map, and enterprise processes are nothing but those forks.

So the question worth asking about your own company is small and uncomfortable. Where do you remember why things happened, and what actually worked? If the honest answer is "in a few people's heads," you already know where the risk is, and you already know what's worth building.

Capturing the “why” behind decisions is the next great leap in business intelligence.