From Agent Swarms to Agent Control Planes
A builder’s guide to the emerging infrastructure layer that routes models, tools, memory, evaluators, policies, and execution environments.
Introduction
Agent building is changing. For the last few years, the dominant way to ship an agent was to hand-write a workflow: a chain of calls, a few prompt templates, some tool definitions, and a hope that the model behaves at runtime. That approach still works for narrow demos, but it does not scale to many agents that share models, tools, and policies. Teams are now asking the same infrastructure questions about every agent: Which model should answer this request? What happens when that model is down? Who checks the output? Where is the audit trail? How much did this cost?
The answer appearing in research, open-source tooling, and enterprise marketing is the agent control plane: a governed layer that sits between users and a pool of agents, models, tools, and policies, and decides at runtime how a request is handled. This article traces the lineage behind that idea—from Mixture-of-Experts and cost-aware cascades to test-time search, tool use, ensembling, learned orchestrators, and production gateways—and explains what a control plane actually does, why it matters, and where the engineering is still uncertain.
The claim is not that control planes are a finished product category. They are a shift in where control lives. Instead of hard-coding orchestration into every agent, teams are moving toward a shared layer that routes, observes, and governs. That shift has real consequences for builders, and it also has real gaps.
Why this matters now
Three pressures are converging.
First, model choice is no longer a one-time decision. A single product may use a cheap model for classification, a mid-size model for drafting, a reasoning model for hard problems, and a code model for execution. Picking one model per task by hand does not work when tasks overlap, models improve every quarter, and pricing changes with them.
Second, agent failures are operational failures. A bad answer in a chatbot is embarrassing; a bad tool call in an agent that can read a database or send email is an incident. Teams need fallbacks, guardrails, human handoffs, and audit logs. Those concerns belong to infrastructure, not to each agent’s prompt.
Third, the surface area of agent systems is expanding. Memory, evaluators, multi-agent debate, tool use, code execution, and policy enforcement are no longer research curiosities. They are parts of a production system. Without a control layer, the result is a tangle of hand-written workflows that are hard to test, harder to secure, and nearly impossible to reason about.
A control plane does not solve all of this. But it gives teams a place to put the cross-cutting concerns: routing, fallback, policy, memory, evaluation, observability, cost, and lifecycle governance.
Part I — Lineage: eight tributaries
The control-plane idea did not arrive from nowhere. It grew out of several research and engineering threads that are now merging.
1. Conditional computation: routing inside the model
Mixture-of-Experts (MoE) is the earliest ancestor. Shazeer et al. (2017) introduced a sparsely-gated MoE layer in which a learned gating network routes each token to a small subset of up to thousands of feed-forward experts. Fedus et al. (2021) simplified this with Switch Transformers, using a single expert per token and showing large pre-training speedups. The lesson is old: not every part of a problem needs the same compute. Modern agent routing applies the same intuition across models, tools, and agents instead of across neural experts.
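To make the gating idea concrete, here is a minimal sketch of top-k routing in plain Python. It is an illustration under simplifying assumptions, not the method from either paper: it omits the noise term and load-balancing losses used in practice, and the function names (`top_k_gate`, `moe_forward`) and toy experts are hypothetical.

```python
import math

def top_k_gate(logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router logits (ties keep the lower index first).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts, shifted for numerical stability.
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gate = top_k_gate(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gate.items())

# Hypothetical experts: plain functions standing in for feed-forward blocks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x ** 2]
router_logits = [0.1, 2.0, -1.0, 2.0]  # in a real model these come from a learned router

print(top_k_gate(router_logits, k=2))
print(moe_forward(3.0, experts, router_logits, k=2))
```

Setting `k=1` gives the single-expert routing that Switch Transformers use; an agent-level router makes the same kind of choice, but over whole models and tools rather than layers.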
2. Cost-aware cascades: which model for which query?
Dohan et al. (2022) framed chain-of-thought, verifiers, and tool use as probabilistic programs composed from language models. Chen et al. (2023) made cost an explicit objective with FrugalGPT, which learns which model combination to call for each query and reports up to 98% cost reduction compared with using GPT-4 alone while preserving accuracy. Ong et al. (RouteLLM, 2024) showed that routers trained on preference data can transfer to new model pools, and Hu et al. (RouterBench, 2024) provided a benchmark for comparing them. The engineering takeaway: on the benchmarks studied, a learned router can often match a strong model’s accuracy at a fraction of the cost.
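The cascade pattern can be sketched in a few lines. This is a simplified illustration, not FrugalGPT's or RouteLLM's actual algorithm: the scorer here is a toy stand-in for a learned quality estimator, and `Tier`, `cascade`, the costs, and the stub models are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Tier:
    name: str
    cost: float                     # hypothetical cost per call, arbitrary units
    generate: Callable[[str], str]  # stand-in for a model call

def cascade(query: str, tiers: List[Tier],
            scorer: Callable[[str, str], float],
            threshold: float = 0.8) -> Tuple[str, str, float]:
    """Try tiers from cheapest to most expensive.

    Stops at the first answer whose score clears the threshold and
    returns (answer, tier_name, total_cost_spent).
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    spent = 0.0
    answer, used = "", ""
    for tier in sorted(tiers, key=lambda t: t.cost):
        answer, used = tier.generate(query), tier.name
        spent += tier.cost
        if scorer(query, answer) >= threshold:
            break  # good enough: skip the more expensive tiers
    return answer, used, spent

# Stub models and a toy scorer, for illustration only.
small = Tier("small", 1.0, lambda q: "4" if q == "2+2?" else "unsure")
large = Tier("large", 30.0, lambda q: "4" if q == "2+2?" else "a careful answer")

def toy_scorer(query: str, answer: str) -> float:
    return 0.0 if answer == "unsure" else 1.0

print(cascade("2+2?", [large, small], toy_scorer))       # ('4', 'small', 1.0)
print(cascade("Explain X", [large, small], toy_scorer))  # ('a careful answer', 'large', 31.0)
```

Note that an escalated query pays for both calls, which is why the quality of the scorer largely determines whether a cascade saves money on a given workload.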
3. Reasoning-time search: thinking harder at test time
Wang et al. (2022) showed that self-consistency—sampling multiple reasoning paths and voting—improves chain-of-thought reasoning. Yao et al. (2023) generalized this to Tree of Thoughts, where an explicit search tree lets the model backtrack. Shinn et al. (2023) added linguistic feedback and episodic memory in Reflexion, and Du et al. (2023) showed that multi-agent debate can improve factual accuracy. These strategies expand the space of what can happen at inference time. A control plane is the natural place to decide when to spend extra compute on search, reflection, or debate.
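Self-consistency is the simplest of these strategies to sketch: sample several answers and take the majority. The snippet below is a minimal illustration with a canned sampler in place of a real model; the function name and the idea of returning the agreement ratio as a routing signal are assumptions for this example, not part of the original paper's interface.

```python
from collections import Counter

def self_consistency(sample, n=5):
    """Draw n answers from `sample()` and return (majority_answer, agreement).

    `agreement` is the fraction of samples that voted for the winner.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n

# Canned final answers standing in for five sampled reasoning paths.
canned = iter(["18", "17", "18", "18", "20"])
winner, agreement = self_consistency(lambda: next(canned), n=5)
print(winner, agreement)  # 18 0.6
```

A control plane could use the agreement ratio as a cheap signal: high agreement returns immediately, while low agreement triggers a stronger model, a search strategy, or a human review.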
4. Tool use as scaffold
Yao et al. (2022) introduced ReAct, interleaving reasoning traces with tool actions. The pattern is now everywhere: an agent thinks, acts, observes, and repeats. But tool use also multiplies risk. A control plane can decide which tools an agent may call, under what conditions, and with what approval policy.
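One way to picture tool governance is a policy check that runs before every action in the loop. The sketch below is a hedged illustration: the plan is a fixed list standing in for model-generated actions, and `ToolPolicy`, `run_agent`, and the tool names are hypothetical rather than taken from ReAct or any framework.

```python
from dataclasses import dataclass, field

@dataclass
class ToolPolicy:
    allowed: set
    needs_approval: set = field(default_factory=set)

    def decide(self, tool):
        """Return 'deny', 'ask_human', or 'allow' for a tool name."""
        if tool not in self.allowed:
            return "deny"
        if tool in self.needs_approval:
            return "ask_human"
        return "allow"

def run_agent(plan, tools, policy, approve=lambda tool, arg: False):
    """Execute (tool, arg) steps, consulting the policy before each action.

    `approve` stands in for a human approval step; by default it refuses.
    """
    observations = []
    for tool, arg in plan:
        verdict = policy.decide(tool)
        if verdict == "deny" or (verdict == "ask_human" and not approve(tool, arg)):
            observations.append((tool, "blocked"))
            continue
        observations.append((tool, tools[tool](arg)))
    return observations

# Stub tools; a real agent would produce the plan step by step.
tools = {
    "search": lambda q: f"results for {q}",
    "send_email": lambda to: f"sent to {to}",
}
policy = ToolPolicy(allowed={"search", "send_email"}, needs_approval={"send_email"})
plan = [("search", "pricing"), ("send_email", "cfo@example.com"), ("drop_table", "users")]

print(run_agent(plan, tools, policy))
```

Keeping the policy outside the agent means the same allow-list and approval rules apply to every agent that shares the tool layer.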
5. Model ensembling and fusion
LLM-Blender (Jiang et al., 2023) ranks and fuses outputs from multiple models. Mixture-of-Agents (Wang et al., 2024) layers agents so that each layer uses the previous layer’s outputs as auxiliary context. OpenRouter Fusion exposes a commercial fused endpoint. Ensembling raises the same governance question as routing: who decides which outputs are combined, and on what basis?
6. Learned orchestration: from code to scaffold generators
The most recent signal is a class of systems that learn to generate agent workflows. Sakana Fugu (Tang et al., 2026) trains orchestrator models to understand a query and dynamically generate agent teams and scaffolds. Trinity (Xu et al., 2025) uses a small coordinator to assign Thinker, Worker, and Verifier roles. Conductor (Nielsen et al., 2025) uses reinforcement learning to design communication topologies among agents. These systems do not replace human design, but they move it up a level: from writing every step to specifying objectives and constraints and letting the orchestrator propose a workflow.
This is where caution is needed. Fugu’s reported results on SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity’s Last Exam, and CharXiv Reasoning are eye-catching, but the paper is very recent and not yet independently reproduced. Benchmarks evolve, and high scores can reflect task leakage or cherry-picking. Fugu is a signal, not a settled fact.
7. Automated workflow design: search over topologies
AFlow (Zhang et al., 2024) uses Monte Carlo Tree Search over code-represented workflows. Hu et al. (Meta Agent Search, 2024) proposed an automated method to discover novel agent designs across domains. MASRouter (Yue et al., 2025) and AgentPrune (Li et al., 2024) optimize routing and communication graphs to cut cost while preserving performance. The common thread is that the workflow itself becomes an optimization target.
8. Runtime frameworks and production gateways
The implementation layer is already crowded. LangGraph, AutoGen, CrewAI, LlamaIndex Workflows, Haystack, OpenAI Agents SDK, and DSPy provide reusable primitives for building agents. Gateways such as LiteLLM, OpenRouter, Portkey, Kong, and Cloudflare AI Gateway add unified APIs, routing, fallback, caching, rate limiting, and spend tracking. Observability tools such as LangSmith, Arize Phoenix, Langfuse, and Helicone provide tracing, evaluation, and cost attribution. Enterprise platforms such as Microsoft Copilot Studio, Salesforce Agentforce, and ServiceNow AI Agents wrap these capabilities in low-code interfaces and governance consoles.
These tools are not all “control planes” in the same sense. Some are thin routers; some are thick frameworks; some are enterprise control towers. That ambiguity matters, and it is why the term needs a concrete definition.
Part II — What a control plane actually does
A useful working definition: an agent control plane is a runtime layer that makes cross-cutting decisions about how user requests are handled across a heterogeneous pool of models, tools, agents, policies, and memory systems.
The capabilities that define it can be grouped into seven areas:
| Capability | What it means in practice |
|---|---|
| Routing | Choose a model, agent, tool chain, or reasoning strategy for each request. |
| Fallback | Retry, switch models, or escalate when a call fails or violates policy. |
| Policy / guardrails | Enforce limits on outputs, tool calls, data access, and escalation. |
| Memory / state | Maintain context across turns, sessions, and agents without leaking between tenants. |
| Evaluation | Score outputs, detect hallucinations, and decide when to rerun or escalate. |
| Observability | Produce traces, cost attribution, audit logs, and latency metrics. |
| Lifecycle governance | Manage versions, deployments, approvals, and human-in-the-loop workflows. |
Not every product with “control plane” in its marketing covers all seven. In practice, three patterns dominate:
- Thin gateway. A routing and policy proxy, such as LiteLLM or OpenRouter. It unifies APIs, handles fallbacks, and tracks spend. It does not know much about the agent’s internal reasoning.
- Thick framework. A runtime such as LangGraph or AutoGen. It owns state, handoffs, and tool orchestration, but usually within a single codebase.
- Enterprise control tower. A platform such as Salesforce Agentforce or ServiceNow AI Agents. It adds governance consoles, connectors, approval workflows, and tenant isolation, often at the cost of flexibility.
A mature team will likely use more than one. The control plane is an architectural layer, not a single vendor product.
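The fallback capability in the table above can be sketched as an ordered list of providers with bounded retries. This is a minimal illustration, not any gateway's actual API: `ModelError`, `call_with_fallback`, and the stub providers are hypothetical.

```python
import time

class ModelError(Exception):
    """Raised by a provider stub when a call fails."""

def call_with_fallback(prompt, providers, retries=1, backoff_s=0.0):
    """Try each (name, call) pair in order.

    Each provider gets `retries` extra attempts with exponential backoff
    before the next provider is tried. Returns (provider_name, answer).
    """
    errors = []
    for name, call in providers:
        for attempt in range(retries + 1):
            try:
                return name, call(prompt)
            except ModelError as exc:
                errors.append(f"{name}#{attempt}: {exc}")
                time.sleep(backoff_s * (2 ** attempt))
    raise ModelError("all providers failed: " + "; ".join(errors))

# Stub providers: the primary is down, the backup works.
def flaky(prompt):
    raise ModelError("503 service unavailable")

def steady(prompt):
    return prompt.upper()

print(call_with_fallback("hi", [("primary", flaky), ("backup", steady)]))  # ('backup', 'HI')
```

Only `ModelError` is retried here; a production version would also distinguish rate limits, timeouts, and policy violations, since each calls for a different response.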
Part III — From agent swarms to governed planes
Hand-written agent workflows fail at scale for the same reason hand-written RPC routing failed at scale: every team reinvents the same cross-cutting logic, and no one can reason about the whole system. The symptoms are familiar:
- Prompt drift. Each agent has its own way of handling errors, retries, and tool failures.
- Shadow routing. A “quick fix” sends a request to a different model in one agent but not another.
- Blind spots. No one knows the total cost, latency distribution, or failure rate across agents.
- Policy gaps. Guardrails are implemented inconsistently, or not at all, in non-obvious execution paths.
Moving orchestration into a control plane does not eliminate these problems, but it centralizes them. A builder can write an agent that declares what it needs—tools, models, approval thresholds, cost budgets—and let the control plane resolve the rest.
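A small sketch shows what "declare what it needs" might look like. Everything here is hypothetical: the `AgentSpec` fields, the model catalog, and the resolution rule (cheapest model that fits) are assumptions chosen for illustration, not a schema from any product.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class AgentSpec:
    name: str
    capability: str            # e.g. "code", "reasoning", "classify"
    tools: FrozenSet[str]
    max_cost_per_call: float   # arbitrary units

# Hypothetical catalog and tool grants owned by the control plane, not the agent.
CATALOG = {
    "small":    {"capabilities": {"classify"}, "cost": 0.1},
    "coder":    {"capabilities": {"code", "classify"}, "cost": 1.0},
    "reasoner": {"capabilities": {"reasoning", "code"}, "cost": 5.0},
}
GRANTED_TOOLS = {"search", "code_exec"}

def resolve(spec, catalog=CATALOG, granted=GRANTED_TOOLS):
    """Pick the cheapest model that satisfies the spec; reject ungranted tools."""
    missing = spec.tools - granted
    if missing:
        raise PermissionError(f"tools not granted: {sorted(missing)}")
    fits = [
        (model["cost"], name)
        for name, model in catalog.items()
        if spec.capability in model["capabilities"]
        and model["cost"] <= spec.max_cost_per_call
    ]
    if not fits:
        raise LookupError("no model satisfies the spec")
    return min(fits)[1]

fixer = AgentSpec("bug-fixer", "code", frozenset({"code_exec"}), max_cost_per_call=2.0)
print(resolve(fixer))  # coder
```

Because the agent names a capability rather than a model, swapping or repricing a model is a catalog change and requires no change to the agent's code.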
Learned orchestrators such as Fugu, Trinity, and Conductor take this further. Instead of a human writing the scaffold, a smaller model generates or selects it. This is a meaningful change in abstraction, but it is also a new dependency. The orchestrator itself has latency, cost, failure modes, and training-data biases. Teams should treat it as infrastructure, not magic.
System structure: a reference architecture
The diagram below is a simplified reference architecture, not a product recommendation. It shows where a control plane sits and what it touches.
flowchart TB
    subgraph User["User / Application"]
        Req[Request]
    end
    subgraph CP["Control Plane"]
        Router[Router / Orchestrator]
        Policy[Policy Engine]
        Eval[Evaluator]
        Mem[Memory / State]
        Obs[Observability]
        Cost[Cost & Quotas]
    end
    subgraph Pool["Agent / Model Pool"]
        M1[Small model]
        M2[Reasoning model]
        M3[Code model]
        A1[Tool agent]
        A2[Human-in-the-loop]
    end
    subgraph Tools["Tool & Data Layer"]
        T1[Search / RAG]
        T2[Code execution]
        T3[Enterprise APIs]
    end
    Req --> Router
    Router --> Policy
    Router --> Eval
    Router --> Mem
    Router --> Cost
    Router --> Obs
    Router --> M1
    Router --> M2
    Router --> M3
    Router --> A1
    A1 --> T1
    A1 --> T2
    A1 --> T3
    M2 --> A2
    A2 --> User
    M1 --> Obs
    M2 --> Obs
    M3 --> Obs
    A1 --> Obs
In this architecture, the control plane is not one box; it is the set of responsibilities that connect the user to the pool of capabilities. The router may be a learned model, a rule-based cascade, or a hybrid. The policy engine enforces hard constraints. Evaluators decide whether an answer is good enough. Memory and observability make the system inspectable. Cost and quotas prevent runaway spend.
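The responsibilities described here can be wired together in a single request path. The sketch below is a deliberately small illustration with every component passed in as a plain callable; the function name, the record fields, and the single-escalation rule are assumptions, not a reference implementation.

```python
import time

def handle_request(request, *, policy, route, pool, evaluate, trace):
    """One pass through the control plane.

    policy(request) -> bool, route(request) -> key in pool,
    pool[key](request) -> answer, evaluate(request, answer) -> bool,
    trace(record) receives one dict per request for observability.
    """
    started = time.perf_counter()
    record = {"request": request, "steps": []}
    if not policy(request):
        record["outcome"] = "rejected_by_policy"
        trace(record)
        return None
    target = route(request)
    answer = pool[target](request)
    record["steps"].append(target)
    if not evaluate(request, answer):
        # Escalate once to the human-in-the-loop entry of the pool.
        target = "human"
        answer = pool[target](request)
        record["steps"].append(target)
    record["outcome"] = "answered"
    record["latency_s"] = time.perf_counter() - started
    trace(record)
    return answer

# Stub pool: the reasoning model returns an empty answer to force an escalation.
log = []
pool = {
    "small": lambda r: "short answer",
    "reasoning": lambda r: "",
    "human": lambda r: "escalated to a person",
}
answer = handle_request(
    "prove it",
    policy=lambda r: "password" not in r,
    route=lambda r: "reasoning" if "prove" in r else "small",
    pool=pool,
    evaluate=lambda r, a: bool(a),
    trace=log.append,
)
print(answer, log[-1]["steps"])
```

The router, policy, and evaluator are interchangeable here, which mirrors the point that each may be learned, rule-based, or hybrid without changing the request path.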
Part IV — Limits, cautions, and open problems
The shift toward control planes is real, but it is not without risks.
Very recent research is unproven. Fugu, Trinity, and Conductor are preprints or very recent papers. Their benchmarks are impressive but not independently reproduced. Treat them as signals of where the field is heading, not as production-ready recipes.
“Control plane” is a marketing term. Salesforce, ServiceNow, Microsoft, and infrastructure vendors use it differently. Some mean model routing; some mean agent lifecycle management; some mean enterprise governance. This article defines the term explicitly in Part II, but readers should still map each vendor claim to concrete capabilities.
Benchmarks are directional, not definitive. SWE-Bench, Humanity’s Last Exam, and similar leaderboards evolve. High scores may reflect leakage, overfitting, or task-specific tuning. Do not choose an architecture based on a leaderboard alone.
The orchestrator adds cost and latency. A learned router or scaffold generator is itself a model call. End-to-end P99 latency and dollar-cost-per-task at production scale are under-reported. Benchmark your own workloads.
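Measuring routing overhead on your own traffic does not need special tooling. The sketch below times a router callable over a sample of requests and reports nearest-rank percentiles; the helper names are hypothetical, and the keyword router is only a stand-in for whatever router is being evaluated.

```python
import math
import time

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def time_calls(fn, inputs):
    """Wall-clock seconds for each call to fn."""
    durations = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        durations.append(time.perf_counter() - start)
    return durations

def routing_overhead(route, inputs):
    """Summarize how long the routing decision itself takes."""
    timings = time_calls(route, inputs)
    return {"p50": percentile(timings, 50), "p99": percentile(timings, 99)}

# Stand-in router: a keyword rule. Replace with the router under test.
def keyword_router(request):
    return "code" if "def " in request else "small"

sample_traffic = ["def f(): pass", "hello", "summarize this"] * 100
print(routing_overhead(keyword_router, sample_traffic))
```

Run the same measurement with the full request path to see what share of end-to-end latency the routing decision accounts for.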
Governance and accountability are under-specified. When a control plane routes a request through a chain of models, tools, and policies, tracing liability, explainability, and consent is hard. There is no widely accepted standard for this yet.
Multi-tenancy is an engineering gap. Most research and open-source tooling focuses on a single user or task. Scheduling, isolation, and billing across tenants or teams are still mostly vendor-specific.
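Protocol competition is unresolved. MCP (Model Context Protocol), A2A (Agent2Agent), ACP (Agent Communication Protocol), and ANP (Agent Network Protocol) are overlapping and partly competing interoperability protocols. Their relationship to control-plane governance is still emerging.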
Practical takeaway: a builder checklist
If you are building or refactoring an agent system, consider these questions:
- Have you separated what the agent tries to do from how the request is routed, retried, and observed?
- Can you swap the model for a given task without changing the agent’s code?
- Do you have fallback rules for model outages, rate limits, and policy violations?
- Are guardrails enforced in one place, or scattered across prompts and tool wrappers?
- Can you trace the full path of a request, including cost, latency, and model/tool calls?
- Do you have evaluators that can reject or escalate an output before it reaches the user?
- Is memory scoped correctly across users, sessions, and agents?
- Have you set cost or token quotas, and do you alert when they are approached?
- Are human-in-the-loop rules defined for high-stakes tool calls or uncertain outputs?
- Have you benchmarked routing overhead on your own traffic, rather than trusting vendor claims?
If you cannot answer yes to most of these, a control plane is likely the next architectural investment worth considering.
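As one example of turning a checklist item into code, the quota question can be sketched as a small budget tracker with an alert threshold. The class name, the 80% default, and the token amounts are hypothetical choices for illustration.

```python
class Budget:
    """Track spend against a hard limit and flag when an alert threshold is reached."""

    def __init__(self, limit, alert_at=0.8):
        self.limit = limit
        self.alert_at = alert_at  # fraction of the limit that triggers an alert
        self.spent = 0

    def charge(self, amount):
        """Record spend and return 'ok' or 'alert'.

        Raises RuntimeError, without recording the spend, if the charge
        would exceed the limit.
        """
        if self.spent + amount > self.limit:
            raise RuntimeError("quota exceeded")
        self.spent += amount
        return "alert" if self.spent >= self.alert_at * self.limit else "ok"

budget = Budget(limit=1000)        # e.g. tokens per tenant per hour
print(budget.charge(500))          # ok
print(budget.charge(300))          # alert
try:
    budget.charge(300)
except RuntimeError as exc:
    print(exc)                     # quota exceeded
```

In a shared control plane, one such budget per tenant or per agent keeps a single runaway loop from consuming the whole allowance.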
Conclusion: a shift, not a product
Agent orchestration is moving from hand-written workflows toward governed control planes. That shift has deep roots in conditional computation, cost-aware routing, test-time search, tool use, ensembling, and learned orchestration. It is also being accelerated by a crowded landscape of frameworks, gateways, observability tools, and enterprise platforms.
The term “control plane” is contested, the vendors are noisy, and the research is moving fast. But the underlying idea is durable: teams need a shared layer that routes, observes, and governs agent behavior. The builders who benefit first will be the ones who treat that layer as infrastructure—defined by capabilities, not by product names—and who stay skeptical of both hype and premature standardization.
Sources
- Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” https://arxiv.org/abs/1701.06538
- Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” https://arxiv.org/abs/2101.03961
- Dohan et al., “Language Model Cascades.” https://arxiv.org/abs/2207.10342
- Chen et al., “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” https://arxiv.org/abs/2305.05176
- Ong et al., “RouteLLM: Learning to Route LLMs with Preference Data.” https://arxiv.org/abs/2406.18665
- Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” https://arxiv.org/abs/2305.10601
- Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” https://arxiv.org/abs/2210.03629
- Wang et al., “Mixture-of-Agents Enhances Large Language Model Capabilities.” https://arxiv.org/abs/2406.04692
- Tang et al., “Sakana Fugu: Orchestrator Models for Adaptive Agentic Scaffolds.” https://arxiv.org/abs/2606.21228
- Zhang et al., “AFlow: Automating Agentic Workflow Generation.” https://arxiv.org/abs/2410.10762
- LiteLLM documentation. https://docs.litellm.ai/
- LangGraph documentation (LangChain). https://www.langchain.com/langgraph
- Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” https://arxiv.org/abs/2203.11171
- Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning.” https://arxiv.org/abs/2303.11366
- Du et al., “Improving Factuality and Reasoning in Language Models through Multiagent Debate.” https://arxiv.org/abs/2305.14325
- Jiang et al., “LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion.” https://arxiv.org/abs/2306.02561
- Xu et al., “Trinity: Harmonizing Multiple Large Language Models as a Single Mind.” https://arxiv.org/abs/2512.04695
- Nielsen et al., “Conductor: Learning to Orchestrate LLM Agents via Reinforcement Learning.” https://arxiv.org/abs/2512.04388
- Hu et al., “RouterBench: A Benchmark for Multi-LLM Routing System.” https://arxiv.org/abs/2403.12031
- Hu et al., “Automated Design of Agentic Systems.” https://arxiv.org/abs/2408.08435
- Yue et al., “MASRouter: A Multiplexing LLM Agent Router.” https://arxiv.org/abs/2502.11133
- Li et al., “AgentPrune: Reducing Communication Redundancy in Multi-Agent Systems.” https://arxiv.org/abs/2410.02506
Article guide: 8 important points
- C001 core · medium Agent orchestration is shifting from hand-written workflows toward governed control planes that route across models, tools, memory, evaluators, policies, and execution environments.
- C002 core · high Mixture-of-Experts and conditional computation predate LLMs and provide the earliest architectural precedent for learned routing.
- C003 core · medium Learned routers such as FrugalGPT and RouteLLM can match or exceed single-model accuracy at a fraction of the cost, though benchmark caveats apply.
- C004 argument · medium Test-time search strategies—self-consistency, Tree of Thoughts, Reflexion, and multi-agent debate—expand what a control plane can spend compute on at runtime.
- C005 argument · medium Recent learned orchestrators such as Sakana Fugu, Trinity, and Conductor are signals of automated scaffold generation, not settled production recipes.
- C006 argument · medium Production control planes combine routing, fallback, policy, memory, evaluation, observability, and lifecycle governance, but no single vendor owns all of them.
- C007 argument · medium The term 'control plane' is contested across vendors; teams should judge products by concrete capabilities rather than marketing labels.
- C008 argument · medium Teams should treat model selection, fallback, observability, and policy enforcement as infrastructure concerns rather than per-agent code.
Sources and notes
These notes collect the sources, counterpoints, and review status behind the article's important points. Read the essay first; open this when you want to check something.
Confidence reflects how strongly the sources support the point (low / medium / high). Status describes the point's role (e.g., core, argument, landscape). Sources link to supporting material; counterpoints note boundary conditions or conflicting findings.
Agent orchestration is shifting from hand-written workflows toward governed control planes that route across models, tools, memory, evaluators, policies, and execution environments.
- Sources (4)
  - “Shazeer et al. introduced a sparsely-gated Mixture-of-Experts layer that routes each token to a subset of experts, providing an early architectural precedent for learned routing across heterogeneous compute units.” Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (background)
  - “Fedus et al. simplified MoE routing to one expert per token with Switch Transformers and showed large pre-training speedups, demonstrating that conditional computation can scale to trillion-parameter models.” Fedus et al., Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (background)
  - “LiteLLM exposes a unified proxy with model routing, fallback, spend tracking, and observability callbacks, treating cross-cutting orchestration concerns as infrastructure rather than per-agent code.” LiteLLM documentation (direct)
  - “LangGraph provides a graph-based state machine for multi-actor agent applications with persistence, handoffs, and human-in-the-loop, embodying framework-level orchestration.” LangGraph documentation (LangChain) (direct)
- Counterpoints (2)
  - The shift is a trend, not a universal migration; many production agents still run as hand-written workflows without a separate control plane.
  - No canonical definition of 'control plane' exists, so vendor claims may describe very different capability sets, from thin routers to thick frameworks to enterprise governance consoles.
Mixture-of-Experts and conditional computation predate LLMs and provide the earliest architectural precedent for learned routing.
- Sources (2)
  - “Shazeer et al. introduced a sparsely-gated MoE layer that routes each input token to a small subset of thousands of feed-forward experts, making routing a learnable part of the network.” Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (direct)
  - “Fedus et al. proposed Switch Transformers, which use a single expert per token and achieve large pre-training speedups, bridging conditional computation to modern large-model scale.” Fedus et al., Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (direct)
- Counterpoints (2)
  - MoE routing is internal to a single model and trained end-to-end, whereas agent control planes route across independent models, tools, and agents with distinct latency, cost, and failure semantics.
  - Token-level gating optimizes for training throughput and capacity, while agent routing must also enforce policies, produce audit trails, and handle human escalation.
Learned routers such as FrugalGPT and RouteLLM can match or exceed single-model accuracy at a fraction of the cost, though benchmark caveats apply.
- Sources (4)
  - “Dohan et al. frame chain-of-thought, verifiers, tool use, and selection-inference as probabilistic programs composed from language models, providing a compositional foundation for cascades and routers.” Dohan et al., Language Model Cascades (background)
  - “Chen et al. propose FrugalGPT, which learns which model combination to call for each query and reports up to 98% cost reduction compared with using GPT-4 alone while preserving accuracy.” Chen et al., FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance (direct)
  - “Ong et al. train routers on human preference data and show strong transfer to new model pools, suggesting learned routers can adapt as the underlying model landscape changes.” Ong et al., RouteLLM: Learning to Route LLMs with Preference Data (direct)
  - “Hu et al. introduce RouterBench, a standardized dataset and evaluation framework for comparing multi-LLM routing strategies on accuracy-cost tradeoffs.” Hu et al., RouterBench: A Benchmark for Multi-LLM Routing System (analogous)
- Counterpoints (2)
  - Reported cost reductions are benchmark-specific; real-world savings depend on query distribution, model pricing, and latency requirements that change frequently.
  - RouterBench and similar evaluations primarily measure accuracy and cost, often underweighting reliability, policy compliance, and latency.
Test-time search strategies (self-consistency, Tree of Thoughts, Reflexion, and multi-agent debate) expand what a control plane can spend compute on at runtime.
- Sources (4)
  - “Wang et al. show that self-consistency, which samples multiple reasoning paths and takes a majority vote, improves chain-of-thought reasoning without changing the underlying model.” Wang et al., Self-Consistency Improves Chain of Thought Reasoning in Language Models (direct)
  - “Yao et al. generalize chain-of-thought to Tree of Thoughts, where an explicit search tree over reasoning steps enables backtracking and lookahead.” Yao et al., Tree of Thoughts: Deliberate Problem Solving with Large Language Models (direct)
  - “Shinn et al. introduce Reflexion, which uses linguistic feedback and an episodic memory buffer to let agents improve during a task without gradient updates.” Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning (direct)
  - “Du et al. demonstrate that multi-agent debate, in which multiple LLM instances argue over rounds, can improve factual accuracy and reasoning.” Du et al., Improving Factuality and Reasoning in Language Models through Multiagent Debate (direct)
- Counterpoints (2)
  - Each strategy multiplies compute usage through sampling, search, reflection loops, or debate rounds, increasing latency and cost in ways that may not be justified for simple queries.
  - Performance gains are task-dependent; self-consistency and debate can degrade results when the majority of sampled or argued answers is systematically wrong.
Recent learned orchestrators such as Sakana Fugu, Trinity, and Conductor are signals of automated scaffold generation, not settled production recipes.
- Sources (4)
  - “Tang et al. train Sakana Fugu orchestrator models to understand a query and dynamically generate agent teams and scaffolds, reporting strong results across coding, reasoning, and exam benchmarks.” Tang et al., Sakana Fugu: Orchestrator Models for Adaptive Agentic Scaffolds (direct)
  - “Xu et al. propose Trinity, which uses a small coordinator to assign Thinker, Worker, and Verifier roles among LLMs and harmonize them as a single mind.” Xu et al., Trinity: Harmonizing Multiple Large Language Models as a Single Mind (direct)
  - “Nielsen et al. train Conductor with reinforcement learning to design communication topologies and prompt workers in multi-agent systems, supporting recursive self-selection.” Nielsen et al., Conductor: Learning to Orchestrate LLM Agents via Reinforcement Learning (direct)
  - “Zhang et al. use Monte Carlo Tree Search over code-represented workflows to automate agentic workflow generation, an earlier example of learned scaffold design.” Zhang et al., AFlow: Automating Agentic Workflow Generation (analogous)
- Counterpoints (2)
  - Fugu, Trinity, and Conductor are preprints or very recent papers; their benchmark results have not been independently reproduced and may reflect task leakage or cherry-picking.
  - Adding an orchestrator model introduces its own latency, cost, failure modes, and training-data biases, which are rarely reported at production scale.
Production control planes combine routing, fallback, policy, memory, evaluation, observability, and lifecycle governance, but no single vendor owns all of them.
- Sources (3)
  - “LiteLLM provides a gateway with unified API, model routing, fallback, caching, rate limiting, spend tracking, and observability callbacks.” LiteLLM documentation (direct)
  - “LangGraph offers a graph-based runtime with state, handoffs, tool orchestration, persistence, and human-in-the-loop, covering framework-level control-plane concerns.” LangGraph documentation (LangChain) (direct)
  - “Conductor shows how reinforcement learning can automate coordination topologies, illustrating a learned-routing capability that complements static gateway rules.” Nielsen et al., Conductor: Learning to Orchestrate LLM Agents via Reinforcement Learning (analogous)
- Counterpoints (2)
  - Gateways like LiteLLM know little about an agent's internal reasoning, while frameworks like LangGraph usually operate within a single codebase and lack enterprise multi-tenancy and billing.
  - Enterprise platforms bundle governance, connectors, and approval workflows but often trade flexibility for vendor lock-in, so no single product covers all seven capabilities well.
The term 'control plane' is contested across vendors; teams should judge products by concrete capabilities rather than marketing labels.
- Sources (3)
  - “LiteLLM is marketed as an open-source LLM proxy and gateway, emphasizing unified API, routing, and spend management rather than a full enterprise control plane.” LiteLLM documentation (direct)
  - “LangGraph is positioned as a framework and state machine for building agents, overlapping with but distinct from gateway-style routing and policy enforcement.” LangGraph documentation (LangChain) (direct)
  - “Sakana Fugu frames control as learned scaffold generation, a different capability emphasis from routing, policy, or observability.” Tang et al., Sakana Fugu: Orchestrator Models for Adaptive Agentic Scaffolds (analogous)
- Counterpoints (2)
  - The term is contested partly because the category is nascent; a canonical definition or standard taxonomy may emerge as the market matures.
  - Individual vendors may use the term consistently inside their own product taxonomies even if their definitions differ from one another.
Teams should treat model selection, fallback, observability, and policy enforcement as infrastructure concerns rather than per-agent code.
- Sources (4)
  - “FrugalGPT shows that model selection can be learned and centralized into a cascade, rather than hard-coded for each query in application logic.” Chen et al., FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance (direct)
  - “RouteLLM demonstrates that a router trained once can transfer to new model pools, suggesting model selection can be maintained as shared infrastructure.” Ong et al., RouteLLM: Learning to Route LLMs with Preference Data (direct)
  - “LiteLLM centralizes routing, fallback, retry, spend tracking, and observability callbacks behind a single proxy API.” LiteLLM documentation (direct)
  - “RouterBench provides a common evaluation framework for routing strategies, supporting the treatment of model selection as a reusable infrastructure component.” Hu et al., RouterBench: A Benchmark for Multi-LLM Routing System (analogous)
- Counterpoints (2)
  - Centralizing these concerns creates a new single point of failure and a potential latency bottleneck that must be engineered for resilience and observability.
  - Simple agents with stable, narrow task boundaries may not benefit enough from a separate control plane to justify the added operational complexity.
Review record: how this was made
Created 2026-06-26 by human.
Policy: policy:default v1.0.0.
✓ Approved hash matches current article
Agent runs
- drafting · kimi · 2026-06-26 (in: 852663a4…, out: 031a91bb…)
Reviews
- agent · approved · 2026-06-26. Scope: claims, tone, privacy, scope. contentHash: 98f24b29680f350d… Sibling-agent review against article-proposal-ideation eval-card. Privacy scan passed. No proprietary or personal content detected.
- human · approved · 2026-06-26. Scope: thesis, scope, tone. contentHash: 98f24b29680f350d… Human author approved the draft for website publication.