May 15, 2026 · Anton Grishko

AI Dark Factory: A 15-Persona Multi-Agent System on the Model Context Protocol

A 15-persona multi-agent system on the Model Context Protocol (MCP): six tiers, hierarchical with parallel fan-out, and a Tier-0 router that makes the bypass observable, with benchmarks from the 2025–2026 literature and concrete code from the Kuberly platform.

A coordinated team of AI agent personas, running inside a single MCP server, that turns one user request into a verified change, a cited diagnosis, or an audited refusal — without ever needing fourteen people in a Slack channel.

TL;DR

We built an AI Dark Factory on top of the Kuberly MCP server: fifteen specialist personas across six tiers, hierarchical with parallel fan-out, designed around three properties — token-aware, parallel by default, bypass observable. Every non-trivial request enters the factory through a Tier-0 router that classifies it into one of four cost classes: direct-tool, direct-workflow, single-persona, or factory. Most requests stop at the cheap classes and never load another persona prompt. The smaller fraction that genuinely needs orchestration gets the full lifecycle — plan, dispatch, execute, review, reconcile — fanned out wide and joined on a per-session filesystem. No persona pushes code; humans approve every mutation. This is what the 2025–2026 multi-agent literature describes as best practice, packaged as something a platform team can actually ship.

This is a long post. Skim the diagrams; the figures carry as much information as the text.

AI Dark Factory tiers

Figure 1 — The factory floor: six tiers, fifteen personas. Each row is one tier; each dot is one persona. Token cost is dominated by what you don't load.

Why this exists

Building AI agent systems in 2024 was easy mode: pick a base model, pick a tool surface, ship. By late 2025 it had become unmistakable that single-agent setups hit a wall on real DevOps work — anything that spanned planning, implementation, observation, and review at the same time. The instinct of every platform team was the same: bolt on a second agent. Then a third. Then, suddenly, somebody had committed a fourteen-agent system to main and nobody could explain what half of them did.

What hadn't been agreed on — and still isn't, judging by 2026 conference talks — was how the agents should talk to each other. Sequentially? Through a message bus? Through a shared scratchpad? Should one of them be a manager? Should the manager be allowed to run tools, or only delegate? And how do you keep a multi-agent system from costing 15× single-agent for the same work?

The Kuberly platform takes an opinionated position on each of those questions. This post is the long-form version of why, with citations. If you're shipping multi-agent systems in 2026 and you don't want to relitigate every design decision from scratch, the next forty minutes will probably save you a quarter.

The research, before the design

Before we built anything, we read the room. Six pieces of work shaped the design more than anything else, and three numbers from them appear repeatedly in this post:

Anthropic, "How we built our multi-agent research system" (Jun 2025, still canonical in 2026). The single most-cited number in the multi-agent debate: their multi-agent runs cost roughly 15× the tokens of a vanilla user-assistant chat, while a single agent already costs about 4× chat, which works out to 15 ÷ 4 ≈ 3.75× the tokens of a single-agent baseline doing the same work. Their own lead agent's prompt explicitly says: "Simple fact-finding requires just 1 agent with 3–10 tool calls." The multi-agent path is reserved for high-value work where the spend is justified. Source.
AWS Bedrock Multi-Agent Collaboration (GA March 2025). AWS shipped two modes side by side: full supervisor (every request goes through the orchestrator) and Supervisor with Routing, where simple requests bypass orchestration entirely and only escalate on complexity. The bypass mode wasn't a workaround — it was a first-class feature, named, documented, and made the default for cost-sensitive workloads. Source.
OpenAI Agents SDK (March 2025, replaces Swarm). The SDK formalizes the "triage agent" pattern: a triage agent routes but does not answer. It usually runs on a cheap mini model (roughly 10× cheaper than routing on a frontier model). The decision artefact (which specialist handles this) is explicit and inspectable; it isn't buried in the orchestrator's prompt.
Cognition / Devin, "Don't Build Multi-Agents" (2025). The anti-Anthropic position. The argument, paraphrased: multi-agent setups produce fragile systems from poor context sharing; single-threaded agents get you surprisingly far. Important counterweight. The conclusion we drew: when you DO build multi-agent, optimize for context sharing, not for tool fanout. The session filesystem in the Dark Factory is a direct response to this critique.
CrewAI hierarchical mode docs + community write-ups. Real production numbers from a widely-deployed framework: the manager persona adds 30–50% extra token cost over sequential execution on a five-task crew, and in practice the manager often degenerates into sequential execution anyway. Even "lightweight" orchestration is not free.
arXiv 2605.03310 (May 2026, "Why multi-agent LLM systems fail"). Empirical study of production multi-agent deployments. Failure rate: 41–87%, the authors writing that the failures are "mostly due to coordination defects rather than base-model capability." Every extra orchestration hop you add to a trivial query is buying coordination risk for zero gain.

What these six say in chorus is unambiguous:

Multi-agent works, but the value of the task has to justify the multiplier.
The bypass is mandatory, not optional. Pretending every request needs the orchestrator is how you 4× your spend.
The orchestrator's decision must be inspectable. "The manager will figure it out" is not an architecture; it's a bug report waiting to happen.
Context sharing is the hard problem, not parallelism.

Bypass vs unify in the 2025-2026 literature

Figure 5 — What the 2025–2026 multi-agent literature actually says. Anthropic's ~3.75× token multiplier and arXiv 2605.03310's 41–87% coordination-failure rate are the two numbers that anchor the design.

What the Kuberly MCP actually is

The factory rides on top of the Model Context Protocol server we run for every Kuberly customer. The MCP server is the gateway between an AI agent (Claude Code, Cursor, internal IDEs, headless agents) and a customer's cloud infrastructure. It does four things:

Identity and scoping. GitHub OAuth establishes who the human is; every tool call is then scoped to organizations and clusters that human has access to. There's no protocol-level path for an agent to read a customer's data without that customer having explicitly authorized them.
Discovery. Tools like list_orgs, list_clusters, and get_cluster give the agent the "cluster handle" before it tries to do anything. get_cluster returns the capability map of that cluster — what tools, what prompts, what proxy targets the downstream MCP exposes.
Two execution paths. Either the server runs the tool itself (the 7-tool read-only AWS observability surface — CloudWatch logs, metrics, alarms, CloudTrail) or it proxies the call into the cluster's own MCP server (call_cluster_tool / call_cluster_prompt), where things like kubectl_get_pods, Loki log queries, Prometheus metric queries, and custom Kuberly cluster ops live.
Knowledge workflows. Markdown documents fetched on demand. The set covers cluster-discovery conventions, large-result caching, the AWS observability surface, CI wiring patterns, the factory's own map, and one document per specialist persona. Agents don't memorize procedure; they fetch the document and follow it.

That's it for the MCP layer. Seventeen tools, one HTTP server, OAuth scoping, no magic. Everything else in this post is built on top of that surface — the personas are pieces of markdown the agent fetches and dispatches as subagents on its own host. The server is stateless; the personas write their work products to the client's filesystem.

This is important: the Dark Factory is a pattern, not a runtime. Nothing in the MCP server runs a persona. The pattern works because the manager (Claude Code or equivalent) has the agent-spawning primitive on its own side; the MCP just hands out the prompts.

The factory floor

Fifteen personas, six tiers. The numbers matter — they were chosen, not stumbled into. Five tiers of work; one zero-th tier whose only job is to decide whether the other five fire at all.

Tier 0 · Router

The factory's front door. Runs on every non-trivial request before any other persona prompt is loaded. It classifies the request into one of four verdicts and writes the result to routing.md:

| verdict | what the manager does | examples |
| --- | --- | --- |
| `direct-tool` | calls one MCP tool, answers the user | "list log groups in prod", "is loki running in cluster X" |
| `direct-workflow` | fetches one `get_workflow` markdown, returns content | "how do I wire CI", "explain the cluster discovery flow" |
| `single-persona` | loads one persona prompt, dispatches once | live incident → `sre`; small diff → `pr_reviewer:in-context` only |
| `factory` | runs the full lifecycle (steps 3–11) | cross-repo refactor with mutation; multi-cluster change |

The router has a hard cap of three MCP tool calls and ideally zero — it's pure classification from the request text, not investigation. Its prompt is ~80 lines.

If you're wondering whether this is just the manager's old routing decision tree wearing a hat: yes, exactly that, and the choice to lift it out was deliberate. Putting it inside the manager's prompt made the routing decision live in the manager's reply text — observable to whoever was reading the chat that turn, invisible afterwards. Lifting it into a persona puts the decision in a markdown file with a defined schema, makes it auditable months later, lets us tune the routing rules without rewriting the manager's prompt, and matches the pattern AWS Bedrock and OpenAI Agents SDK ship in production.
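The four verdicts map onto four manager actions. A minimal sketch of that mapping follows; the `Routing` record and the `plan_action` helper are hypothetical names used only for illustration, and only the verdict strings are taken from the table above.

```python
from dataclasses import dataclass
from typing import Optional

# Verdict strings come from the routing table; everything else is illustrative.
VERDICTS = ("direct-tool", "direct-workflow", "single-persona", "factory")


@dataclass(frozen=True)
class Routing:
    verdict: str
    target: str      # tool name, workflow name, or persona name
    rationale: str


def plan_action(routing: Routing) -> dict:
    """Return the manager's next step and how many extra persona prompts it loads."""
    if routing.verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {routing.verdict!r}")
    extra_personas: Optional[int]
    if routing.verdict == "direct-tool":
        action, extra_personas = "call_tool", 0
    elif routing.verdict == "direct-workflow":
        action, extra_personas = "get_workflow", 0
    elif routing.verdict == "single-persona":
        action, extra_personas = "dispatch_once", 1
    else:
        # factory: the persona set is chosen later by triage, so it stays open here.
        action, extra_personas = "run_lifecycle", None
    return {"action": action, "target": routing.target, "extra_personas": extra_personas}
```

Rejecting unknown verdicts up front is a design choice of this sketch: a malformed routing.md then fails loudly instead of silently falling through to the expensive path.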

Routing flow with four verdicts

Figure 2 — Routing flow. The router writes routing.md with one of four verdicts; the manager acts on it. ~80% of real-traffic requests stop at the cheap verdicts and never load another persona.

Tier 1 · Coordination

The shop-floor managers.

manager is the apex orchestrator. Never edits code, never runs shell commands against infrastructure. Owns the session, decides the dispatch DAG, fans out subagents concurrently, synthesizes results, decides which findings turn into follow-up work, hands off to the user.
triage classifies an incoming factory-class request into one of five sub-classes: change, incident, question, review, postmortem. Runs first on every factory-verdict request and writes triage.md with the recommended persona set. Most factory runs use 2–4 of the available personas, not the full 14.
dispatcher reads scope.md and the triage decision; emits dispatch.md — an ordered DAG of next-phase personas with explicit parallel: true / false flags. The manager fires the parallel-safe steps in one assistant message.

Tier 2 · Planning

The design room. Read-only, always.

planner turns a vague request into a precise scope.md — affected clusters, downstream MCP capabilities, AWS observability surfaces touched, and (critically) an explicit out-of-scope fence. The fence is what stops scope creep mid-run.
architect proposes the change shape — option set, chosen approach, interfaces touched, migration order. Writes design.md.
risk_assessor does the blast-radius walk: dependency analysis, rollback plan, OpenSpec / change-management gates. Writes risk.md.

These three are the canonical Phase 2 parallel triple. The manager dispatches them in one assistant message — three concurrent subagent calls — and the slowest one sets the duration.

Tier 3 · Implementation

The build floor. The only tier that mutates state.

iac_developer implements infrastructure-as-code edits (Terraform / OpenTofu / CUE / JSON / Helm) in the customer's infra repo. Verifies with pre-commit + terragrunt hclfmt + tflint. Never runs apply / init / plan itself.
ci_cd_engineer handles GitHub Actions / CodeBuild work — bootstrap, modify, troubleshoot. Operates across the infra repo AND the app repo simultaneously. Pins reusable workflows by tag or SHA, never @main. The persona's first move on any GHA work is to fetch the github_reusable_ci workflow document — the canonical contract for the reusable workflow's inputs and secrets lives there, not duplicated inside the persona prompt.
k8s_operator runs cluster-side mutations through call_cluster_tool (kubectl apply, scale, restart, label). Refuses anything not pre-approved by decisions.md.
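The ci_cd_engineer rule "pin by tag or SHA, never @main" can be checked mechanically. A rough sketch, assuming references appear on `uses:` lines of a GitHub Actions workflow file. The function names and the decision to accept only 40-character SHAs and `v`-prefixed version tags are assumptions made for illustration, since a tag and a branch cannot be told apart from the string alone.

```python
import re

# Matches `uses: owner/repo/path@ref`, optionally followed by a trailing comment.
USES_RE = re.compile(r"^\s*-?\s*uses:\s*(?P<target>[^@\s]+)@(?P<ref>\S+)(\s+#.*)?\s*$")
SHA_RE = re.compile(r"^[0-9a-f]{40}$")
TAG_RE = re.compile(r"^v\d+(\.\d+){0,2}$")  # assumption: tags look like v1, v1.2, v1.2.3


def is_pinned(ref: str) -> bool:
    """True when the ref is a full commit SHA or looks like a version tag."""
    return bool(SHA_RE.match(ref) or TAG_RE.match(ref))


def unpinned_uses(workflow_text: str) -> list[str]:
    """Return every `target@ref` whose ref is neither a SHA nor a version-like tag."""
    bad = []
    for line in workflow_text.splitlines():
        m = USES_RE.match(line)
        if m and not is_pinned(m.group("ref")):
            bad.append(f"{m.group('target')}@{m.group('ref')}")
    return bad
```

A check like this fits naturally into the pre-commit step the implementation personas already run, though where it is wired in is left open here.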

Tier 4 · Observation

The monitor wall.

sre does incident diagnosis. Crucially, the SRE pulls signals from wherever the customer keeps them: AWS-side (aws_cwlogs_*, aws_cwmetrics_*, aws_cwalarms_*, aws_cloudtrail_lookup) and in-cluster via the downstream MCP (call_cluster_tool for Loki / Prometheus / Tempo / kubectl_get_events / whatever the cluster exposes). Different customer architectures push signals to different places — EKS pods log to Loki, ECS/Lambda to CloudWatch, on-prem k8s only through the cluster MCP, hybrid setups split across both surfaces. The persona discovers the cluster's capabilities.tools first and pulls from both sources in parallel when relevant. Read-only on infra. Writes diagnosis.md with cited evidence rows.
cost_analyst is the pre-deployment FinOps gate. Estimates the cost delta of the proposed change — instance class, replicas, log retention, NAT traffic, S3 lifecycle. Writes cost.md. Read-only.

Tier 5 · Quality gates

The QA bench.

pr_reviewer reviews the diff. Two invocations run in parallel: in-context reads scope/decisions/design (the rationale); cold deliberately ignores them (its value is the absence of rationale — a fresh reader's perspective). Writes findings/in-context.md or findings/cold.md.
security_auditor does IRSA/IAM-expansion review, secret-in-code scan, public-exposure check, CVE delta on bumped images. Writes security.md.
findings_reconciler merges every findings/*.md plus security.md and cost.md into one decision-ready list — deduped, prioritized, and citing every discarded finding so nothing disappears silently. Writes findings/reconciled.md.
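The reconciler's contract (dedupe, prioritize, and never drop a finding silently) can be sketched in a few lines. The `Finding` fields, the severity labels other than must-fix, and the "same location and same issue text" duplicate rule are all assumptions for illustration; the post does not specify the reconciler's actual matching logic.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"must-fix": 0, "should-fix": 1, "nit": 2}  # hypothetical labels


@dataclass(frozen=True)
class Finding:
    source: str     # file the finding came from, e.g. "findings/cold.md"
    location: str   # e.g. "modules/db/main.tf:42"
    issue: str
    severity: str


def reconcile(findings: list[Finding]) -> tuple[list[Finding], list[tuple[Finding, str]]]:
    """Keep the most severe finding per (location, issue); record why others were dropped."""
    kept: dict[tuple[str, str], Finding] = {}
    discarded: list[tuple[Finding, str]] = []
    # Stable sort: most severe first, original order preserved within a severity.
    for f in sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity]):
        key = (f.location, f.issue.strip().lower())
        if key in kept:
            discarded.append((f, f"duplicate of {kept[key].source}"))
        else:
            kept[key] = f
    return list(kept.values()), discarded
```

Returning the discarded list alongside the kept list is what lets reconciled.md cite every dropped finding with a reason.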

Parallel by default

Sequential persona dispatch is the failure mode of every AI orchestrator we've watched fail in production. Our design assumption is the opposite: fan out wide, synthesize at the join.

The dispatcher persona labels every step parallel: true (no dependency on a sibling's output) or parallel: false (needs <file>). The manager fires all parallel-safe steps for a phase in a single assistant message with multiple Agent calls. Two canonical parallel rounds the factory uses repeatedly:

Phase 2 planning: planner + architect + risk_assessor, three subagents in one message (four when sre joins because the request is an incident).
Phase 5 review: pr_reviewer:in-context + pr_reviewer:cold + security_auditor + cost_analyst — four subagents in one message.

Locking is unnecessary because every file has exactly one assigned writer. The session filesystem is the join: no message bus, no locks, no race conditions. Sequential dispatch is allowed only when a downstream step must read an upstream file (e.g. findings_reconciler after pr_reviewer returns).
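Turning a dependency list into fan-out rounds is a small layering problem. The sketch below assumes dispatch.md has already been parsed into a mapping from step name to the steps whose files it needs; that mapping shape and the `plan_rounds` name are assumptions, since the post does not show the dispatch.md format.

```python
def plan_rounds(steps: dict[str, list[str]]) -> list[list[str]]:
    """Group steps into rounds; a step is ready once every file it needs is written."""
    done: set[str] = set()
    remaining = dict(steps)
    rounds: list[list[str]] = []
    while remaining:
        ready = sorted(name for name, needs in remaining.items() if set(needs) <= done)
        if not ready:
            raise ValueError(f"dependency cycle or unknown step among: {sorted(remaining)}")
        rounds.append(ready)   # everything in one round can go out in one message
        done.update(ready)
        for name in ready:
            del remaining[name]
    return rounds


review_phase = {
    "pr_reviewer:in-context": [],
    "pr_reviewer:cold": [],
    "security_auditor": [],
    "cost_analyst": [],
    "findings_reconciler": [
        "pr_reviewer:in-context", "pr_reviewer:cold", "security_auditor", "cost_analyst",
    ],
}
rounds = plan_rounds(review_phase)
```

For the review phase this yields two rounds: the four reviewers together, then findings_reconciler on its own.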

A round that returns N personas in parallel is one assistant turn, not N turns. Latency on a typical change-class run goes from "seven sequential turns" to "four turns, where the planning and review phases happen as single fan-outs." On real DevOps work that's a 2–3× wall-clock reduction over a naive supervisor-dispatches-each-step approach, and the cost stays linear in unique personas dispatched, not in turns.

Parallel fan-out vs sequential

Figure 3 — Parallel fan-out vs sequential dispatch. Same work; one turn per phase versus one turn per persona. The session filesystem is the join — no locks, no message bus.

The session filesystem

We took the Cognition critique seriously: context-fragility from poor sharing. The Dark Factory's response is that personas do not share context through a message bus or a structured protocol; they share it through a per-task session filesystem that the client (not the server) creates.

.kuberly/factory/<session-slug>/
├── context.md            manager — request, target cluster(s), constraints, trust phase
├── routing.md            router — verdict + target + rationale
├── triage.md             triage — request class + recommended persona set
├── scope.md              planner — affected resources, out-of-scope fence
├── design.md             architect — option set + chosen approach
├── risk.md               risk_assessor — blast radius, dependency walk, rollback
├── decisions.md          manager — irreversible calls + reasons
├── dispatch.md           dispatcher — ordered persona DAG with parallel flags
├── diagnosis.md          sre (incidents only)
├── cost.md               cost_analyst (mutations only)
├── security.md           security_auditor (mutations only)
├── findings/
│   ├── in-context.md     pr_reviewer (in-context)
│   ├── cold.md           pr_reviewer (cold)
│   └── reconciled.md     findings_reconciler
└── tasks/
    └── <NN>-<slug>.md    manager — implementation prompts for Tier 3 personas

The rules are short and load-bearing:

Read rule. Every persona may read every file in the session directory.
Write rule. Every persona writes only the file (or, for the manager, files) assigned to it. No two personas write the same file, so there's nothing to lock.
Exception. The cold pr_reviewer deliberately ignores context.md / scope.md / decisions.md / design.md. Its value to the reconciler is the absence of rationale. If we let it see the in-context information it'd just reproduce the in-context reviewer's output.
Server is stateless. The Kuberly MCP server does NOT persist this filesystem. It lives next to the developer's working copy at .kuberly/factory/<session>/, is .gitignored by convention, and stays on disk until the developer deletes it. Nothing in the server has to know about session state.
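The read and write rules are simple enough to express as two predicates. In this sketch the ownership map is transcribed from the layout above, while the helper names and the `pr_reviewer:cold` / `pr_reviewer:in-context` identifiers for the two reviewer invocations are illustrative assumptions.

```python
# Single writer per session file, transcribed from the directory layout.
WRITERS = {
    "context.md": "manager",
    "routing.md": "router",
    "triage.md": "triage",
    "scope.md": "planner",
    "design.md": "architect",
    "risk.md": "risk_assessor",
    "decisions.md": "manager",
    "dispatch.md": "dispatcher",
    "diagnosis.md": "sre",
    "cost.md": "cost_analyst",
    "security.md": "security_auditor",
    "findings/in-context.md": "pr_reviewer:in-context",
    "findings/cold.md": "pr_reviewer:cold",
    "findings/reconciled.md": "findings_reconciler",
}

# Files the cold reviewer deliberately does not read.
COLD_HIDDEN = {"context.md", "scope.md", "decisions.md", "design.md"}


def can_write(persona: str, path: str) -> bool:
    """A persona may write only a file it owns; tasks/ belongs to the manager."""
    if path.startswith("tasks/"):
        return persona == "manager"
    return WRITERS.get(path) == persona


def can_read(persona: str, path: str) -> bool:
    """Everyone reads everything, except the cold reviewer and the rationale files."""
    return not (persona == "pr_reviewer:cold" and path in COLD_HIDDEN)
```

Because each path maps to one writer, concurrent personas in the same round can never collide on a file, which is the property that removes the need for locks.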

This solves the context-sharing problem without inventing a protocol. The manager's "look up what risk_assessor wrote about the database column drop" is cat .kuberly/factory/<session>/risk.md. There is no API. There is no schema beyond markdown headings. The whole apparatus can be inspected, grepped, version-controlled by hand, copied into Slack, or pasted into a postmortem.

Session filesystem layout

Figure 4 — The per-session filesystem. Each persona writes exactly one file; everyone reads everyone else. The Kuberly MCP server is stateless about this — the directory lives on the developer's machine and is .gitignored by convention.

Just-in-time persona loading

There are fifteen personas now. Each persona prompt is ~50–150 lines of markdown, call it 1–3k tokens. Loading all fifteen on every run would burn 15–45k tokens at the start of every session for nothing.

The factory never does that. The manager loads only one persona eagerly: the router — small, fast, runs on every non-trivial request, prevents the manager from loading anything else when the request doesn't need it. The router's prompt is 77 lines.

Everything else is fetched the moment the manager is about to dispatch it, via get_workflow. The triage persona is loaded after the router returns verdict: factory. The planning trio (planner, architect, risk_assessor) is loaded after the triage class is known, in one round, dispatched in one message, then released. The review quartet (pr_reviewer ×2, security_auditor, cost_analyst) is loaded after a mutation pass completes.

A typical change-class run loads 8 persona prompts across the lifecycle, not 15. A typical incident-class run loads 2 or 3 (router, sre, and optionally findings_reconciler if there's a fix to review). A typical direct-tool verdict loads 1 (just the router). The aggregate cost over a week of mixed traffic comes out close to single-agent baseline, not 3.75×. The savings come from never loading what isn't needed, not from making each persona cheaper.
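The prompt-loading arithmetic is worth writing out. This sketch uses only the figures quoted in this section (1–3k tokens per prompt, and the prompt counts per run type) and covers prompt loading alone, not tool output or persona replies; the variable names are illustrative.

```python
TOKENS_PER_PROMPT = (1_000, 3_000)  # low and high estimate per persona prompt


def prompt_tokens(n_prompts: int) -> tuple[int, int]:
    """Low and high token estimate for loading n persona prompts."""
    low, high = TOKENS_PER_PROMPT
    return n_prompts * low, n_prompts * high


# Prompt counts per scenario, as stated in the text (incident uses the upper case).
PROMPTS_LOADED = {"eager-all": 15, "change": 8, "incident": 3, "direct-tool": 1}

budget = {name: prompt_tokens(n) for name, n in PROMPTS_LOADED.items()}
for name, (low, high) in budget.items():
    print(f"{name:12s} {low:>6,} to {high:>6,} tokens")
```

Eager loading of all fifteen prompts comes to 15,000 to 45,000 tokens, against 1,000 to 3,000 for a direct-tool request that loads only the router.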

Trust phases

Production AI agent teams in 2026 ramp autonomy in four phases. The Dark Factory supports all four — the manager simply changes how much it surfaces to the user. The phase is stated in decisions.md on every run, so an auditor reading the session weeks later sees exactly which level of autonomy was in effect.

Shadow. Personas run; the manager surfaces every output for review; no mutation is executed.
Recommend. Personas run; manager presents a concrete plan and waits for an explicit user apply.
Execute-with-approval (default). Manager auto-runs read-only and read-mostly personas; pauses for explicit approval on iac_developer / ci_cd_engineer / k8s_operator.
Full auto. Manager runs the full DAG; user is paged only on findings_reconciler blockers or persona refusals.
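One way to read the four phases is as a gate in front of the three mutating personas. The sketch below is a simplification under that reading: the enum, the `gate` function, and its three return values are hypothetical, and it deliberately ignores the difference in how much read-only output the manager surfaces in each phase.

```python
from enum import Enum


class TrustPhase(Enum):
    SHADOW = 1
    RECOMMEND = 2
    EXECUTE_WITH_APPROVAL = 3
    FULL_AUTO = 4


# The Tier 3 personas, the only ones that mutate state.
MUTATING = {"iac_developer", "ci_cd_engineer", "k8s_operator"}


def gate(phase: TrustPhase, persona: str) -> str:
    """Return 'run', 'ask', or 'skip' for dispatching a persona under a trust phase."""
    if persona not in MUTATING:
        return "run"    # read-only personas run in every phase
    if phase is TrustPhase.SHADOW:
        return "skip"   # shadow: no mutation is executed
    if phase in (TrustPhase.RECOMMEND, TrustPhase.EXECUTE_WITH_APPROVAL):
        return "ask"    # wait for an explicit user approval
    return "run"        # full auto
```

Keeping the gate a pure function of phase and persona means the decision can be logged next to the phase recorded for the run.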

Cross-cutting hard rules that apply at every phase — these don't relax even at Full auto:

No git push, no PR creation, no PR merge from any persona. Reviewers read the diff; merging is human.
No terragrunt apply / tofu apply / tofu destroy / kubectl delete <namespace|crd|...> from any persona.
Personas cannot spawn subagents. Recursive dispatch is the manager's job alone.
Read-only personas surface, not fix. Tiers 2 (planning), 4 (observation), 5 (quality) describe; Tier 3 (implementation) mutates.
Tool-use cap = 12 per persona. Going over means re-scope (split, drop, or downgrade to plain Explore mode), not "be thorough."
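Rules that never relax are good candidates for a mechanical check. A naive sketch, with several loud assumptions: the prefix list is illustrative, `gh pr create` and `gh pr merge` stand in for "PR creation" and "PR merge" on the assumption that the GitHub CLI is in use, every `kubectl delete` is blocked (stricter than the rule as written), and flags placed before the subcommand would slip past this simple prefix match.

```python
import shlex

# Command prefixes no persona may run, at any trust phase (illustrative list).
FORBIDDEN_PREFIXES = [
    ("git", "push"),
    ("gh", "pr", "create"),   # assumption: PRs would be opened via the GitHub CLI
    ("gh", "pr", "merge"),
    ("terragrunt", "apply"),
    ("tofu", "apply"),
    ("tofu", "destroy"),
    ("kubectl", "delete"),    # stricter than the rule in the text
]
TOOL_CALL_CAP = 12


def violates_hard_rule(command: str) -> bool:
    """True when the command starts with any forbidden prefix."""
    argv = shlex.split(command)
    return any(tuple(argv[: len(prefix)]) == prefix for prefix in FORBIDDEN_PREFIXES)


def over_cap(tool_calls: int) -> bool:
    """True when a persona has exceeded its tool-call budget and should re-scope."""
    return tool_calls > TOOL_CALL_CAP
```

A denylist like this is a backstop rather than the primary control; the persona prompts and the trust-phase gate remain the first line.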

Trust phases

Figure 6 — Four trust phases. The manager states the active phase in decisions.md on every run, so an auditor reading the session weeks later sees exactly which level of autonomy was in effect.

A walkthrough

Concrete is clarifying. Here are four real request shapes and what the factory does with each one.

Cheap-path: "list log groups in prod"

The manager checks the cluster pre-flight, calls the router, the router writes:

verdict: direct-tool
target: aws_cwlogs_list_log_groups
cluster_needed: true
rationale: "list log groups in prod" — single MCP call answers it.

The manager calls aws_cwlogs_list_log_groups, formats the response, returns. Total: 1 cluster pre-flight, 1 router dispatch, 1 AWS call. No specialist persona was loaded. The router's marginal cost on this kind of request is roughly the cost of a Bash echo — but the routing decision is now in a file, auditable.
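Because routing.md is plain `key: value` text, reading it back takes very little code. A minimal parser sketch, assuming the four fields shown above are all required; the `parse_routing` name and the strictness about missing fields are assumptions, not a description of the real schema check.

```python
REQUIRED_FIELDS = {"verdict", "target", "cluster_needed", "rationale"}


def parse_routing(text: str) -> dict:
    """Parse routing.md-style `key: value` lines into a dict."""
    fields: dict = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so a rationale may itself contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {line!r}")
        fields[key.strip()] = value.strip()
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"routing.md is missing fields: {sorted(missing)}")
    fields["cluster_needed"] = fields["cluster_needed"] == "true"
    return fields


example = """\
verdict: direct-tool
target: aws_cwlogs_list_log_groups
cluster_needed: true
rationale: single MCP call answers it.
"""
routing = parse_routing(example)
```

The same function works months later on an archived session directory, which is the auditability the file-based verdict is meant to provide.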

Docs-path: "how do I wire my app repo's CI to Kuberly?"

verdict: direct-workflow
target: github_reusable_ci
cluster_needed: false
rationale: how-do-I question; the workflow doc has the canonical YAML.

Manager fetches github_reusable_ci via get_workflow, returns the document. The ci_cd_engineer persona is NOT loaded — there's no work to do, just a reference to read. This is the case where keeping the bypass cheap matters most: documentation lookups are common, they shouldn't pay a multi-agent premium.

Single-persona: "auth-service is crashlooping in prod"

verdict: single-persona
target: sre
cluster_needed: true
rationale: live incident, pure diagnosis. Skip planner/architect/risk_assessor.

The manager loads the sre persona prompt, dispatches one subagent with the incident description and the resolved cluster handle. The SRE persona pulls in parallel from both AWS observability and in-cluster signals (kubectl_get_events, Loki, cwmetrics_get_data for the ALB), writes diagnosis.md with cited evidence, returns a Recommended next persona hint. Manager surfaces the diagnosis and stops. Total persona prompts loaded: 2 (router + sre).

Full factory: "introduce a new service tier across the cluster and wire its deployment"

The router returns verdict: factory, class: change. The manager dispatches triage for confirmation, then fires the Phase 2 planning trio in one message: planner + architect + risk_assessor. They write their three files concurrently. The manager reads all three, runs dispatcher → dispatch.md. The dispatch DAG calls for the implementation tier: iac_developer for the infrastructure-as-code edits, ci_cd_engineer for the CI wiring across infra and app repos, and k8s_operator for any cluster-side mutations that need to happen alongside (often none — most of the work is in the infra and app repos). The manager surfaces the plan for explicit approval (trust phase 3). User approves.

Implementation runs. Then the Phase 5 review quartet fires in one message: pr_reviewer:in-context + pr_reviewer:cold + security_auditor + cost_analyst. Each writes its file. Manager runs findings_reconciler. If reconciled.md says Verdict: clean, the manager hands off. If not, the manager turns must-fix findings into tasks/<NN>-<slug>.md prompts and re-dispatches the implementation personas. Re-run review until clean.
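The review-and-fix loop can be sketched as follows. The callables `run_review` and `run_fix` stand in for the manager dispatching the review round and the implementation personas; they, the two-digit zero-padded task numbering, and the `max_rounds` safety limit are assumptions for illustration (the text itself only says to re-run review until clean).

```python
import re
from typing import Callable


def task_filename(index: int, title: str) -> str:
    """Build a tasks/<NN>-<slug>.md path from a finding title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"tasks/{index:02d}-{slug}.md"


def review_until_clean(
    run_review: Callable[[], list[str]],
    run_fix: Callable[[list[str]], None],
    max_rounds: int = 3,
) -> int:
    """Alternate review and fix until no must-fix findings remain; return rounds used."""
    for round_no in range(1, max_rounds + 1):
        must_fix = run_review()
        if not must_fix:
            return round_no   # verdict: clean
        tasks = [task_filename(i, title) for i, title in enumerate(must_fix, start=1)]
        run_fix(tasks)
    raise RuntimeError("review still not clean after max_rounds; hand back to the user")
```

Bounding the loop is a choice of this sketch: an unbounded "until clean" loop can burn tokens indefinitely when a finding cannot be fixed automatically.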

The user sees: a plan to approve, an "implementing now" turn, a "here's the reconciled review, four issues to look at" turn, an "ok fixed those, re-reviewing" turn, a "verdict: clean" turn, and a final summary. Six user-visible turns for what was a fourteen-step lifecycle internally. The session filesystem holds the rest.

What we actually shipped

The factory ships as part of the Kuberly MCP server. Concretely:

15 persona prompts packaged with the server binary and fetched on demand via a single read-only "give me workflow X" API. Persona prompts are markdown — short, structured, prescriptive — so they can be reviewed, diffed, and version-controlled exactly like any other artefact.
1 master factory document that contains the factory floor map, the lifecycle, the dispatch rules, and the tool surface. Fetched once at the start of every multi-step session; never preloaded otherwise.
A small set of cross-cutting workflow documents covering the cluster-selector contract, large-result caching, AWS observability conventions, GitHub Actions CI wiring, and the factory's own overview. These are pattern references the manager consults before dispatching specialists.
One tool resolves them all. Both the static workflow names and the per-persona form go through the same API, with tests asserting every registered name round-trips through the tool so the registry cannot silently drift.
A Tier-0 router persona that runs on every non-trivial request and writes its verdict to the session filesystem. Small prompt, ideally zero tool calls per dispatch.
Integration tests that exercise the full surface across the real MCP transport — no unit-test shortcuts on protocol behavior.
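The "every registered name round-trips" test mentioned in the list above has a simple shape. In this sketch the two registries, their placeholder bodies, the local `get_workflow` function standing in for the real tool, and the `persona:<name>` naming form are all hypothetical; the post does not show how per-persona names are spelled.

```python
# Placeholder registries; the real documents ship with the server binary.
STATIC_WORKFLOWS = {"github_reusable_ci": "# GitHub reusable CI\n(placeholder body)"}
PERSONAS = {
    "router": "# router\n(placeholder body)",
    "sre": "# sre\n(placeholder body)",
}
PERSONA_PREFIX = "persona:"  # hypothetical naming form


def registered_names() -> list[str]:
    """Every name the tool advertises, static workflows first."""
    return sorted(STATIC_WORKFLOWS) + [PERSONA_PREFIX + p for p in sorted(PERSONAS)]


def get_workflow(name: str) -> str:
    """Local stand-in for the server's workflow-fetching tool."""
    if name.startswith(PERSONA_PREFIX):
        return PERSONAS[name[len(PERSONA_PREFIX):]]
    return STATIC_WORKFLOWS[name]


def test_registry_round_trips() -> None:
    """Fail if any advertised name does not resolve to a non-empty document."""
    for name in registered_names():
        body = get_workflow(name)
        assert body.strip(), f"{name} resolved to an empty document"
```

Driving the test from `registered_names()` rather than a hand-written list is what catches drift: a name added to the registry without a document behind it fails immediately.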

What's not in the server:

A persona runtime. Personas are pieces of markdown the manager dispatches as subagents using the client's own Agent-spawning primitive (Claude Code, Cursor, etc.). The Kuberly MCP doesn't run agents; it serves the prompts.
The session filesystem. The client creates it. The server is stateless about session state.
Session retention. The server manages none: when the developer closes their IDE the directory simply stays on disk until they rm it.

This is on purpose. The pattern is portable to any host that has a subagent primitive and a way to fetch markdown over HTTP. We chose Claude Code because its Agent tool + parallel-calls-in-one-message + Read/Edit/Write/Bash toolset is exactly what the manager needs. If a future host swaps in, the pattern doesn't change — only the dispatch syntax does.

Lessons we accumulated along the way

A handful of design decisions that look obvious in hindsight and were anything but at the time.

Don't make the orchestrator a persona. Tempting design: have the manager be one of the dispatchable personas, with its own prompt, fetched like the others. The trap is that the orchestrator has a different kind of knowledge from the specialists — it owns the dispatch primitives, the parallel-fan-out rules, the trust-phase rules. Making it a fetchable persona either leaks orchestration knowledge into every other persona (recursive dispatch chaos), or carves out a "special persona" — which is exactly what an orchestrator already is. The manager stays as the client's top-level agent; everything else is fetchable.

Read-only is the default; mutation is the exception. Twelve of the fifteen personas cannot change state (only the three Tier 3 personas can). Including the SRE. Especially the SRE. Diagnosis without authority to fix is the whole point: the diagnosis goes to iac_developer or k8s_operator, who run inside the trust-phase gate. Untangling "find the cause" from "apply the fix" was the single biggest reduction in fan-out anxiety we measured.

Persona prompts are short and prescriptive, not philosophical. Every persona prompt has the same shape: Inputs / Hard rules / Output (the assigned file) / Done. ≤150-word reply rule. ≤12 tool calls per persona. We deleted three rounds of "philosophical guidance about good engineering" from each prompt. They were padding; the rules carry the work.

The cold reviewer is not a redundant pass. Running pr_reviewer:in-context and pr_reviewer:cold in parallel felt wasteful in the first review. By the third we were finding things in the cold pass that the in-context pass had explained away. The two reviewers are not the same review run twice; they're two different reviews — one with rationale, one without. The findings_reconciler merging them is where the real signal lives.

Token cost is mostly about what you don't load. The intuitive savings target is each persona — make every prompt shorter, every reply cap stricter. That helps, but it's not where the money lives. The money lives in the router's direct-tool and direct-workflow verdicts short-circuiting the majority of requests before any specialist persona is loaded. Cheap-path traffic is the bulk of traffic. The factory is most valuable on the smaller fraction of requests that genuinely need orchestration.

The session filesystem turned out to be the design's load-bearing column. Every alternative we considered — message queues, structured outputs, supervisor aggregators, shared scratchpads with locks — was strictly worse on at least one axis (latency, complexity, observability, replay-ability). The plain-markdown-on-disk approach has none of those costs and gets you git diff and grep for free.

Where this goes next

A handful of natural extension axes the design already accommodates:

More specialist personas. The factory floor has room for additional specialists — incident postmortem writing, multi-cluster orchestration, budget-aware routing — each fitting the existing tier model. Adding personas is template work, not architecture work.
Cost-aware routing. The router classifies by intent today. A natural addition is classification by budget — "this session has burned its token allowance; downgrade everything one tier." The plumbing is in place; what's missing is the policy.
Persona-authoring tooling. Manual authoring of a dozen-plus prompts is tractable; manual authoring of three dozen is not. A wizard that takes a description of new specialist work, generates a persona prompt matching the factory's house style, and registers it with the tool surface is a force-multiplier worth building.

What's intentionally not on the roadmap, and probably won't be:

A persona that writes more personas. Persona authorship is design work, not labor. A wizard that helps a human design is fine; an agent that designs new agents introduces failure modes the literature is unambiguous about.
Autonomous merge. The factory stops at "verdict: clean" and hands off. git push, PR creation, and merge are human steps. The time savings from auto-merge are marginal; the compliance and audit risks are not.

Closing

The AI Dark Factory is what happens when you take the 2025–2026 multi-agent literature seriously instead of trying to invent a new pattern in public. The shape — hierarchical with parallel fan-out, JIT prompt loading, file-based join, explicit bypass — is what every production-quality publication converges on. The contribution isn't the shape; it's making the shape something a platform team can ship.

If you're doing similar work and want to compare notes, reach out. If you're shipping a multi-agent system to production right now and you're tempted to skip the bypass step: please read Anthropic's engineering post before you commit. That number — 3.75× tokens compared to single-agent — does not go away because your specialists are well-prompted. It's the cost of orchestration. Make the cost observable and you can manage it. Hide it inside the manager's prompt and you can't.

We chose to make it observable. Six months in, we don't regret it.