Nanonets Research

Complex Constraints

How Nanonets handles entangled instruction following with context graphs: persistent, inspectable state for long, conditional, and conflicting requirements.

Overview

What ComplexConstraints measures

Most instruction-following benchmarks test isolated rules. ComplexConstraints, built by Surge AI, tests entangled ones: real professional tasks where satisfying one requirement changes what the others demand, and early errors cascade downstream.

75 expert-crafted prompts
1,559 evaluation rubric items
15-40 interlocking rubrics per prompt
<41% best frontier-model score

The six constraint types

Conditional

Rules that fire only under specific circumstances and stay dormant otherwise.

Planning

Finding one arrangement that satisfies many simultaneous requirements.

Multistep

Sequential reasoning where an early mistake cascades downstream.

Negative

Do-not-do constraints that must remain active through generation.

Implicit

Requirements inferred from context rather than stated as direct commands.

Format

Output-shape rules that interact with content, ordering, and exceptions.

Benchmark

Nanonets vs. the frontier.

Outputs are graded by an LLM judge against each prompt's full rubric. Nanonets satisfies 90.0% of individual constraints (rubric pass rate) and fully solves 45.0% of prompts (task pass rate), 4.6 percentage points ahead of the strongest public model's 40.4% task pass rate.

90.0% rubric pass rate: share of individual constraints satisfied
45.0% task pass rate: share of prompts fully solved (+4.6 points vs. best public model)
#1 overall on entangled instruction following
| # | Model | Developer | Task pass rate |
|---|-------|-----------|----------------|
| 1 | Nanonets context graphs | Nanonets | 45.0% |
| 2 | Gemini 3.1 Pro | Google | 40.4% |
| 3 | GPT 5.5 | OpenAI | 38.7% |
| 4 | Gemini 3.5 Flash | Google | 36.9% |
| 5 | Qwen 3.7 Max | Alibaba | 36.0% |
| 6 | Claude Opus 4.8 | Anthropic | 34.9% |
| 7 | Kimi K2.6 | Moonshot AI | 34.0% |
| 8 | Claude Opus 4.7 | Anthropic | 33.6% |
| 9 | DeepSeek V4 Pro | DeepSeek | 26.7% |
| 10 | Kimi K2.5 | Moonshot AI | 18.7% |
| 11 | Grok 4.20 Beta | xAI | 16.9% |
| 12 | DeepSeek V4 Flash | DeepSeek | 16.4% |
| 13 | Qwen 3.5 Plus | Alibaba | 16.0% |
| 14 | Ernie 5.1 | Baidu | 15.2% |
| 15 | GPT 5.4 | OpenAI | 4.9% |
| 16 | DeepSeek v3.2 | DeepSeek | 1.8% |
| 17 | Mistral Large | Mistral AI | 0.4% |
| 18 | Ernie 4.5 | Baidu | 0.0% |
| 18 | Nova 2 Pro | AWS | 0.0% |

Source: Surge AI's public ComplexConstraints model rankings (last updated 3 June 2026). The Nanonets figure is from our own internal evaluation.

Architecture

A context graph, not a longer prompt.

Frontier models hold entangled constraints in a flat context window. As the rule count climbs, constraints get silently dropped, double-counted, or applied out of order. Nanonets parses the instruction into a context graph: every constraint is a node, and the edges encode how they relate.

Flat prompt
SYSTEM: STAFF SCHEDULING, WEEK OF JUNE 22
Build a complete 7-day shift plan for a 60-cover restaurant that satisfies every constraint below. Output a table grouped by day, then a cost summary, then a compliance checklist.
1. Total weekly labour cost must not exceed $4,200, inclusive of overtime.
2. Every shift must include at least one Spanish-speaking server on the floor.
3. If Saturday dinner is fully booked, add a second line cook for that shift only.
4. Never schedule Priya on a Monday; she has a standing availability conflict.
5. No employee may work more than 5 days, or more than 40 hours, in the week.
6. Assign shift leads first, then fill support roles, then verify total cost.
7. Brunch (Sat/Sun, 9am-2pm) requires exactly 2 cooks and 3 servers.
8. If a server works a closing shift, they may not open the next day.
9. At least one certified first-aider must be present during every open hour.
10. Dinner needs a minimum of 1 host until 9pm, after which it is optional.
11. Overtime is only permitted for staff explicitly flagged “OT-eligible”.
12. Pastry must be covered whenever brunch OR dinner dessert is on the menu.
13. Do not assign trainees unless a senior cook is on the same shift.
14. Wednesday is deep-clean night; schedule 1 extra closer regardless of covers.
15. If a day's covers exceed 120, raise that day's server count by one.
16. Keep 8+ hours of rest between any employee's two consecutive shifts.
17. Holiday-pay rules apply if June 22 is a public holiday in the region.
18. The bar must be staffed by a TIPS-certified bartender while alcohol is served.
19. No more than 2 trainees may be on the floor during any single shift.
20. Prefer staff living within 5km for shifts ending after 11pm.
21. The same lead may not both open and close on one calendar day.
22. If budget (rule 1) conflicts with the extra server (rule 15), budget wins.
23. Output each assignment with name, role, start, end, and rule IDs satisfied.
… (+ 142 more constraints, conditions, exceptions and overrides)
Context graph
[Figure: context graph with nodes Parse prompt; Explicit (budget ≤ $4.2k); Conditional (if booked); Negative (not Priya/Mon); Precedence (budget wins); Implicit (≤ 5 days/wk); Verify & return]
The same instruction shown as a flat prompt vs. a traversable context graph.
  1. Extract constraints. Split the prompt into explicit rules, implied rules, forbidden actions, output requirements, and conditional branches.
  2. Link dependencies. Add edges between constraints that activate, override, narrow, or contradict one another.
  3. Draft against the graph. Generate with active constraints attached to each section of the answer, so global rules do not disappear.
  4. Verify before return. Check the completed answer against the graph and repair violations before the final response.
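
The four steps above can be sketched as a small data structure. This is a minimal illustration, not Nanonets' implementation: the names `Constraint`, `ContextGraph`, `link`, and `violations` are hypothetical, the relation labels are taken from step 2, and the checks are hand-written stand-ins for rules 1, 3, 4, 15, and 22 of the flat prompt. It also assumes a simplified override semantics in which an overridden rule is skipped outright, rather than only when the two rules actually conflict.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical names throughout; relation labels follow step 2 above.
RELATIONS = {"activates", "overrides", "narrows", "contradicts"}

@dataclass
class Constraint:
    cid: str
    kind: str                      # e.g. "explicit", "conditional", "negative"
    check: Callable[[dict], bool]  # True when the draft satisfies the rule
    active: bool = True            # set False for a conditional rule that is dormant

@dataclass
class ContextGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source id, relation, target id)

    def add(self, constraint):
        self.nodes[constraint.cid] = constraint

    def link(self, src, relation, dst):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges.append((src, relation, dst))

    def overridden(self):
        # Simplification: a rule yields whenever an active rule overrides it.
        return {dst for src, rel, dst in self.edges
                if rel == "overrides" and self.nodes[src].active}

    def violations(self, draft):
        # Step 4: check every active, non-overridden constraint against the draft.
        skip = self.overridden()
        return [c.cid for c in self.nodes.values()
                if c.active and c.cid not in skip and not c.check(draft)]

graph = ContextGraph()
graph.add(Constraint("budget", "explicit", lambda d: d["cost"] <= 4200))
graph.add(Constraint("extra_server", "conditional",
                     lambda d: d["servers"] >= d["base_servers"] + 1))
graph.add(Constraint("second_line_cook", "conditional",
                     lambda d: d["sat_dinner_cooks"] >= 2, active=False))
graph.add(Constraint("no_priya_monday", "negative",
                     lambda d: "Priya" not in d["monday_staff"]))
graph.link("budget", "overrides", "extra_server")  # rule 22: budget wins

draft = {"cost": 4150, "servers": 3, "base_servers": 3,
         "sat_dinner_cooks": 1, "monday_staff": ["Priya", "Sam"]}
result = graph.violations(draft)
print(result)  # ['no_priya_monday']
```

Keeping the relations as explicit edges means the verify step can report which constraint failed by ID, which is what makes the state inspectable rather than buried in the prompt text.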
Example

One prompt, eighteen interlocking constraints.

A restaurant-scheduling prompt from the benchmark asks a model to staff a week of shifts under dietary, language, budget, and availability rules that all interact. Here is a slice of what the graph tracks.

restaurant-scheduling.prompt (18 rubric items)
Explicit: Keep total staffing cost under the weekly budget. (tracked)
Conditional: If a VIP reservation appears, add a second cook for that service. (tracked)
Negative: Do not assign anyone with a nut allergy to pastry prep. (tracked)
Ordering: Resolve availability before applying language coverage and budget rules. (tracked)
A flat-context model can resolve the budget rule and then quietly violate it two steps later, when the conditional second cook fires. The graph instead re-checks the budget edge the moment the conditional node activates, so the constraint survives the cascade.
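
The re-check on activation can be illustrated with a few lines of Python. This is a hedged sketch under stated assumptions: the function names, the edge table `RECHECK_ON_ACTIVATION`, and the $180 cost of an extra cook shift are all hypothetical, and the $4,200 budget is borrowed from rule 1 of the flat prompt shown earlier, since this example prompt does not state a figure.

```python
# Hypothetical sketch: names, costs, and the edge table are illustrative assumptions.
WEEKLY_BUDGET = 4200      # borrowed from rule 1 of the flat prompt
SECOND_COOK_COST = 180    # assumed cost of one extra cook shift

def budget_ok(plan):
    return sum(plan["shift_costs"]) <= WEEKLY_BUDGET

# Edges from a conditional node to the checks that must be re-run when it fires.
RECHECK_ON_ACTIVATION = {"second_cook": [("budget", budget_ok)]}

def activate(node, plan, apply_change):
    """Apply a conditional rule, then immediately re-verify every linked constraint."""
    apply_change(plan)
    return [name for name, check in RECHECK_ON_ACTIVATION.get(node, [])
            if not check(plan)]

plan = {"shift_costs": [4100]}  # budget already resolved: 4100 <= 4200
broken = activate("second_cook", plan,
                  lambda p: p["shift_costs"].append(SECOND_COOK_COST))
print(broken)  # ['budget'], because 4100 + 180 = 4280 > 4200
```

The violation surfaces at the moment the VIP rule fires, so a repair (for example, trimming another shift) can happen before the rest of the answer is drafted on top of a broken budget.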
Methodology

How the benchmark is scored.

ComplexConstraints contains 75 prompts and 1,559 rubric items. Each model output is graded against the prompt's full rubric, scoring the fraction of items passed. This page reports two aggregates: the rubric pass rate (share of individual rubric items satisfied) and the task pass rate (share of prompts fully solved).
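
As a rough illustration of how the two headline metrics differ, the sketch below computes both from per-item judge verdicts. The function name `score` and the toy data are hypothetical, and it assumes the rubric pass rate is pooled over all rubric items rather than averaged per prompt; the source does not say which aggregation the benchmark uses.

```python
def score(results):
    """results maps prompt id -> list of booleans, one per rubric item (True = passed)."""
    total_items = sum(len(items) for items in results.values())
    passed_items = sum(sum(items) for items in results.values())
    # A prompt counts as solved only if every one of its rubric items passed.
    solved = sum(1 for items in results.values() if all(items))
    return {
        "rubric_pass_rate": passed_items / total_items,  # assumption: pooled over items
        "task_pass_rate": solved / len(results),
    }

toy = {
    "p1": [True, True, True, True],   # fully solved
    "p2": [True, True, True, False],  # a single miss fails the whole task
}
metrics = score(toy)
print(metrics)  # {'rubric_pass_rate': 0.875, 'task_pass_rate': 0.5}
```

The toy case shows why a high rubric pass rate can coexist with a much lower task pass rate: one missed item out of eight still halves the share of fully solved prompts.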

ComplexConstraints is released by Surge AI under CC-BY-4.0. The public dataset is available on Hugging Face, and the public model rankings used on this page were last updated on 3 June 2026.