Nanonets on ComplexConstraints

Overview

What ComplexConstraints measures

Most instruction-following benchmarks test isolated rules. ComplexConstraints, built by Surge AI, tests entangled ones: real professional tasks where satisfying one requirement changes what the others demand, and early errors cascade downstream.

75 expert-crafted prompts

1,559 evaluation rubric items

15-40 interlocking rubrics per prompt

<41% best frontier-model score

The six constraint types

Conditional

Rules that fire only under specific circumstances and stay dormant otherwise.

Planning

Finding one arrangement that satisfies many simultaneous requirements.

Multistep

Sequential reasoning where an early mistake cascades downstream.

Negative

Do-not-do constraints that must remain active through generation.

Implicit

Requirements inferred from context rather than stated as direct commands.

Format

Output-shape rules that interact with content, ordering, and exceptions.

Benchmark

Nanonets vs. the frontier.

Outputs are graded by an LLM judge against each prompt's full rubric. Nanonets satisfies 90.0% of individual constraints (rubric pass rate) and fully solves 45.0% of prompts (task pass rate), ahead of the strongest public model's 40.4% task pass rate.

90.0% rubric pass rate: share of individual constraints satisfied

45.0% task pass rate: share of prompts fully solved (+4.6 points vs. the best public model)

#1 overall on entangled instruction following

#	Model	Developer	Task pass rate
1	Nanonets context graphs	Nanonets	45.0%
2	Gemini 3.1 Pro	Google	40.4%
3	GPT 5.5	OpenAI	38.7%
4	Gemini 3.5 Flash	Google	36.9%
5	Qwen 3.7 Max	Alibaba	36.0%
6	Claude Opus 4.8	Anthropic	34.9%
7	Kimi K2.6	Moonshot AI	34.0%
8	Claude Opus 4.7	Anthropic	33.6%
9	DeepSeek V4 Pro	DeepSeek	26.7%
10	Kimi K2.5	Moonshot AI	18.7%
11	Grok 4.20 Beta	xAI	16.9%
12	DeepSeek V4 Flash	DeepSeek	16.4%
13	Qwen 3.5 Plus	Alibaba	16.0%
14	Ernie 5.1	Baidu	15.2%
15	GPT 5.4	OpenAI	4.9%
16	DeepSeek v3.2	DeepSeek	1.8%
17	Mistral Large	Mistral AI	0.4%
18	Ernie 4.5	Baidu	0.0%
18	Nova 2 Pro	AWS	0.0%

Source: Surge AI's public ComplexConstraints model rankings (last updated 3 June 2026). The Nanonets figure is from our own internal evaluation.

Architecture

A context graph, not a longer prompt.

Frontier models hold entangled constraints in a flat context window. As the rule count climbs, constraints get silently dropped, double-counted, or applied out of order. Nanonets parses the instruction into a context graph: every constraint is a node, and the edges encode how they relate.
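To make "every constraint is a node, and the edges encode how they relate" concrete, here is a minimal sketch of such a structure. It is not the Nanonets implementation: the class names, the `kind` values, and the relation labels ("overrides", "affects") are assumptions chosen for illustration. The three rules are taken from the staffing prompt shown on this page.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout: an illustrative sketch,
# not the Nanonets implementation.

@dataclass(frozen=True)
class Constraint:
    cid: str    # rule identifier, e.g. "R1"
    kind: str   # assumed label, e.g. "explicit" or "conditional"
    text: str   # the rule as written in the prompt

@dataclass
class ContextGraph:
    nodes: dict[str, Constraint] = field(default_factory=dict)
    # edges[src] is a list of (relation, dst) pairs
    edges: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def add(self, c: Constraint) -> None:
        self.nodes[c.cid] = c
        self.edges.setdefault(c.cid, [])

    def link(self, src: str, relation: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before linking")
        self.edges[src].append((relation, dst))

    def related(self, cid: str) -> list[tuple[str, str]]:
        """Outgoing (relation, target) pairs for one constraint."""
        return list(self.edges.get(cid, []))

def build_example() -> ContextGraph:
    g = ContextGraph()
    g.add(Constraint("R1", "explicit",
                     "Total weekly labour cost must not exceed $4,200."))
    g.add(Constraint("R15", "conditional",
                     "If a day's covers exceed 120, raise that day's server count by one."))
    g.add(Constraint("R22", "override",
                     "If budget (rule 1) conflicts with the extra server (rule 15), budget wins."))
    g.link("R22", "overrides", "R15")  # rule 22 settles the conflict in favour of budget
    g.link("R15", "affects", "R1")     # an extra server changes the cost that rule 1 caps
    return g

graph = build_example()
print(graph.related("R15"))
```

Storing relations as labelled edges means a traversal can ask which rules are touched when one rule fires, something a flat list of rules cannot answer without re-reading the whole prompt.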

Flat prompt

SYSTEM - STAFF SCHEDULING, WEEK OF JUNE 22
Build a complete 7-day shift plan for a 60-cover restaurant that satisfies every constraint below. Output a table grouped by day, then a cost summary, then a compliance checklist.
1. Total weekly labour cost must not exceed $4,200, inclusive of overtime.
2. Every shift must include at least one Spanish-speaking server on the floor.
3. If Saturday dinner is fully booked, add a second line cook for that shift only.
4. Never schedule Priya on a Monday; she has a standing availability conflict.
5. No employee may work more than 5 days, or more than 40 hours, in the week.
6. Assign shift leads first, then fill support roles, then verify total cost.
7. Brunch (Sat/Sun, 9am-2pm) requires exactly 2 cooks and 3 servers.
8. If a server works a closing shift, they may not open the next day.
9. At least one certified first-aider must be present during every open hour.
10. Dinner needs a minimum of 1 host until 9pm, after which it is optional.
11. Overtime is only permitted for staff explicitly flagged “OT-eligible”.
12. Pastry must be covered whenever brunch OR dinner dessert is on the menu.
13. Do not assign trainees unless a senior cook is on the same shift.
14. Wednesday is deep-clean night; schedule 1 extra closer regardless of covers.
15. If a day's covers exceed 120, raise that day's server count by one.
16. Keep 8+ hours of rest between any employee's two consecutive shifts.
17. Holiday-pay rules apply if June 22 is a public holiday in the region.
18. The bar must be staffed by a TIPS-certified bartender while alcohol is served.
19. No more than 2 trainees may be on the floor during any single shift.
20. Prefer staff living within 5km for shifts ending after 11pm.
21. The same lead may not both open and close on one calendar day.
22. If budget (rule 1) conflicts with the extra server (rule 15), budget wins.
23. Output each assignment with name, role, start, end, and rule IDs satisfied.
… (+ 142 more constraints, conditions, exceptions and overrides)

Context graph

[Interactive figure: the same instruction as a flat prompt vs. a traversable context graph.]

Extract constraints. Split the prompt into explicit rules, implied rules, forbidden actions, output requirements, and conditional branches.
Link dependencies. Add edges between constraints that activate, override, narrow, or contradict one another.
Draft against the graph. Generate with active constraints attached to each section of the answer, so global rules do not disappear.
Verify before return. Check the completed answer against the graph and repair violations before the final response.
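The four stages above can be sketched as a small pipeline. This is a toy under stated assumptions, not the production system: every function name is hypothetical, extraction and linking use plain string matching where a real system would presumably use a model, and the `draft`, `repair`, and cost figures are invented stand-ins so the loop has something to run on.

```python
import re
from typing import Callable

# Illustrative sketch only: all names are hypothetical, and the
# string matching stands in for model-driven stages.

PROMPT = """\
1. Total weekly labour cost must not exceed $4,200, inclusive of overtime.
15. If a day's covers exceed 120, raise that day's server count by one.
22. If budget (rule 1) conflicts with the extra server (rule 15), budget wins."""

def extract(prompt: str) -> dict[str, str]:
    """Stage 1: split the prompt into {rule_id: rule_text}."""
    rules = {}
    for line in prompt.strip().splitlines():
        rule_id, _, text = line.partition(". ")
        rules[rule_id.strip()] = text.strip()
    return rules

def link(rules: dict[str, str]) -> list[tuple[str, str]]:
    """Stage 2: add an edge wherever one rule names another, e.g. "(rule 15)"."""
    edges = []
    for rule_id, text in rules.items():
        for target in re.findall(r"rule (\d+)", text):
            if target in rules:
                edges.append((rule_id, target))
    return edges

def verify(answer: dict, checks: dict[str, Callable[[dict], bool]]) -> list[str]:
    """Stage 4: return the IDs of rules the answer violates."""
    return [rule_id for rule_id, passes in checks.items() if not passes(answer)]

def run(prompt, draft, repair, checks, max_rounds=3):
    rules = extract(prompt)
    edges = link(rules)
    answer = draft(rules, edges)           # Stage 3: draft against the graph
    for _ in range(max_rounds):            # Stage 4: verify, repair, re-verify
        violations = verify(answer, checks)
        if not violations:
            break
        answer = repair(answer, violations)
    return answer, verify(answer, checks)

# Toy stand-ins; all numbers are invented for the demo.
def draft(rules, edges):
    return {"servers": 4, "cost": 4300}    # over budget after adding a server

def repair(answer, violations):
    if "1" in violations:                  # budget wins (rule 22): drop the extra server
        return {"servers": answer["servers"] - 1, "cost": answer["cost"] - 150}
    return answer

checks = {"1": lambda a: a["cost"] <= 4200}

final, remaining = run(PROMPT, draft, repair, checks)
print(final, remaining)
```

The point of the sketch is the control flow: the answer is only returned after a verification pass over the tracked rules comes back empty or the repair budget runs out.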

Example

One prompt, eighteen interlocking constraints.

A restaurant-scheduling prompt from the benchmark asks a model to staff a week of shifts under dietary, language, budget, and availability rules that all interact. Here is a slice of what the graph tracks.

restaurant-scheduling.prompt - 18 rubric items

Type	Constraint	Status
Explicit	Keep total staffing cost under the weekly budget.	tracked
Conditional	If a VIP reservation appears, add a second cook for that service.	tracked
Negative	Do not assign anyone with a nut allergy to pastry prep.	tracked
Ordering	Resolve availability before applying language coverage and budget rules.	tracked

Where a flat-context model resolves the budget rule and then quietly violates it two steps later when the conditional second cook fires, the graph re-checks the budget edge the moment the conditional node activates, so the constraint survives the cascade.
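A minimal sketch of that re-check, assuming a simple lookup from each conditional node to the checks it can invalidate. The node names, the `RECHECK` table, and all dollar amounts are invented for illustration; the source does not describe how the edge is actually stored or evaluated.

```python
# Hypothetical sketch: names and dollar amounts are invented for the demo.

WEEKLY_BUDGET = 4200

class Plan:
    def __init__(self, base_cost: int):
        self.cost = base_cost
        self.log = []  # (activated node, checks that failed on re-check)

def budget_ok(plan: Plan) -> bool:
    # "under the weekly budget" is read here as strictly less than
    return plan.cost < WEEKLY_BUDGET

# Edges from a conditional node to the checks it can invalidate.
RECHECK = {"second_cook": [("budget", budget_ok)]}

def activate(plan: Plan, node: str, added_cost: int) -> list[str]:
    """Apply a conditional rule, then immediately re-run every linked check."""
    plan.cost += added_cost
    failed = [name for name, check in RECHECK.get(node, []) if not check(plan)]
    plan.log.append((node, failed))
    return failed

plan = Plan(base_cost=4100)
print(budget_ok(plan))                      # budget holds before the VIP booking
print(activate(plan, "second_cook", 180))   # the violation surfaces on activation
```

Because the check runs inside `activate`, the budget violation is reported at the step that caused it rather than being discovered, or missed, at the end.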

Methodology

How the benchmark is scored.

ComplexConstraints contains 75 prompts and 1,559 rubric items. Each model output is graded against its prompt's full rubric. This page reports two figures: the rubric pass rate, the share of individual rubric items passed, and the task pass rate, the share of prompts whose rubric is satisfied in full.
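As a worked illustration of the two pass rates reported on this page, the sketch below computes both from per-prompt rubric results. The function names and the toy data are hypothetical, and pooling rubric items across all prompts is an assumption: the source does not say whether per-prompt fractions are averaged instead.

```python
# Hypothetical helper names and toy data; the pooled aggregation is an assumption.

def rubric_pass_rate(results: list[list[bool]]) -> float:
    """Share of individual rubric items passed, pooled over all prompts."""
    total = sum(len(items) for items in results)
    passed = sum(sum(items) for items in results)
    return passed / total

def task_pass_rate(results: list[list[bool]]) -> float:
    """Share of prompts whose every rubric item passed."""
    return sum(all(items) for items in results) / len(results)

# One inner list per prompt, one boolean per rubric item.
toy_results = [
    [True, True, True, True],               # fully solved
    [True, True, False, True],              # one miss fails the whole task
    [True, True],                           # fully solved
    [False, True, True, True, True, True],  # one miss fails the whole task
]

print(rubric_pass_rate(toy_results))  # 14 of 16 items passed
print(task_pass_rate(toy_results))    # 2 of 4 prompts fully solved
```

The toy data shows why the two numbers diverge: a single missed item barely moves the rubric pass rate but zeroes out that prompt for the task pass rate.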

ComplexConstraints is released by Surge AI under CC-BY-4.0. The public dataset is available on Hugging Face, and the public model rankings cited on this page were last updated 3 June 2026.
