Paul Stagner · May 28, 2026 · 11 min read

Orchestrating Coding Agents at Scale: Lessons from ClusterClaw

A lone agent solving a ticket is a demo. A coordinated fleet that ships, verifies, and operates real systems is a product. Here is how ClusterClaw turns Claude Code, Codex, OpenClaw and Hermes into dependable infrastructure.

Tags: AI Orchestration, ClusterClaw, Claude Code, Kubernetes

The single biggest unlock of the last two years was a reframing: a coding agent is not autocomplete, it is a worker. And workers do not need cleverness so much as they need orchestration — queues, supervisors, retries, observability, and an unambiguous contract for what "done" means.

Once you treat agents like infrastructure, the disciplines that make distributed systems reliable are exactly the disciplines that make agent fleets reliable. That insight is the foundation of ClusterClaw, the active Kubernetes and OpenShift SRE agent I build and operate. ClusterClaw started as a Kubernetes-first fork of OpenClaw, and its purpose has not changed since: take raw model capacity and turn it into observable, dependable engineering throughput.

One agent is a demo. A fleet is a product.

A single agent fixing a ticket looks great in a screen recording and falls apart in production. The interesting engineering begins when you run many agents concurrently against a real estate of clusters and codebases — each scoped to a slice of work, each reporting back through a consistent protocol.

I route work to whichever model is strongest for the task instead of pledging allegiance to one: Claude Code for deep multi-file reasoning, Codex for tight iterative loops, OpenClaw and Hermes for specialized and self-hosted paths. The orchestrator does not care which brain answers; it cares that the answer is verifiable.
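That routing preference can be sketched as a plain lookup table. This is a minimal illustration, not ClusterClaw's actual router: the `TaskKind` labels, the `ROUTES` table, and `routeTask` are hypothetical names, and the split between OpenClaw and Hermes is an assumption, since the text groups them together.

```typescript
// Hypothetical task categories; the post does not define a taxonomy.
type TaskKind = "multi-file-refactor" | "iterative-fix" | "specialized" | "self-hosted";
type Model = "claude-code" | "codex" | "openclaw" | "hermes";

// Assumed defaults that mirror the preferences described in the text.
// Which of OpenClaw or Hermes takes which path is a guess made for illustration.
const ROUTES: Record<TaskKind, Model> = {
  "multi-file-refactor": "claude-code",
  "iterative-fix": "codex",
  "specialized": "openclaw",
  "self-hosted": "hermes",
};

// Pick a model for a task kind; per-deployment overrides win over the defaults.
function routeTask(kind: TaskKind, overrides: Partial<Record<TaskKind, Model>> = {}): Model {
  return overrides[kind] ?? ROUTES[kind];
}
```

Keeping the table as data rather than branching logic means a routing change is a one-line diff, and the orchestrator stays indifferent to which model ends up answering.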

- Scope each agent to a bounded task with explicit acceptance criteria.
- Make every agent emit structured progress, not just a final diff.
- Supervise relentlessly: detect stalls, kill hung runs, retry with more context.
- Gate every change behind the same CI a human PR must pass, with no exceptions for robots.
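The first three rules can be made concrete with a few types. What follows is a sketch under assumptions: `BoundedTask`, `ProgressEvent`, `isStalled`, and `unmetCriteria` are hypothetical names, and the event shapes are illustrative rather than ClusterClaw's real protocol.

```typescript
// Hypothetical shapes; field names are illustrative, not ClusterClaw's schema.
interface AcceptanceCriterion { id: string; description: string; }

interface BoundedTask {
  id: string;
  scope: string[];                   // paths or resources the agent may touch
  acceptance: AcceptanceCriterion[]; // the explicit definition of "done"
}

// Structured progress: every event is typed and timestamped (ms).
type ProgressEvent =
  | { type: "progress"; taskId: string; at: number; note: string }
  | { type: "criterion-met"; taskId: string; at: number; criterionId: string }
  | { type: "done"; taskId: string; at: number };

// A run counts as stalled when nothing has been heard for longer than timeoutMs.
function isStalled(events: ProgressEvent[], startedAt: number, now: number, timeoutMs: number): boolean {
  const lastActivity = events.reduce((latest, e) => Math.max(latest, e.at), startedAt);
  return now - lastActivity > timeoutMs;
}

// Ids of the acceptance criteria the agent has not yet reported as met.
function unmetCriteria(task: BoundedTask, events: ProgressEvent[]): string[] {
  const met: Record<string, boolean> = {};
  events.forEach((e) => {
    if (e.type === "criterion-met" && e.taskId === task.id) met[e.criterionId] = true;
  });
  return task.acceptance.filter((c) => !met[c.id]).map((c) => c.id);
}
```

With events shaped like this, stall detection reduces to timestamp arithmetic, and "is it done?" becomes a question the supervisor can answer without reading a diff.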

What ClusterClaw actually does

ClusterClaw is an active SRE agent, not a chatbot bolted onto kubectl. It spins up clusters, manages workloads, performs upgrades, troubleshoots incidents, and provides round-the-clock operational support across the major managed platforms (AKS, EKS, GKE, ARO, ROSA) and self-managed OpenShift.

Because it is CLI-first, it slots into the same automation surfaces engineers already trust. The agent is constrained to a well-defined operational vocabulary rather than allowed to improvise against a live control plane.
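One way to picture a constrained vocabulary is an allowlist that every agent request must parse into before anything runs. This is a minimal sketch under assumptions: the verbs in `ALLOWED_VERBS`, the two-token request format, and `parseOperation` are all hypothetical, not ClusterClaw's real command set.

```typescript
// Hypothetical vocabulary: these verbs are illustrative only.
const ALLOWED_VERBS = ["get", "describe", "scale", "upgrade", "restore"] as const;
type Verb = (typeof ALLOWED_VERBS)[number];

interface Operation { verb: Verb; target: string; }

// Accept only "<verb> <target>" requests whose verb is in the vocabulary.
// Anything else returns null instead of being passed along to a live cluster.
function parseOperation(request: string): Operation | null {
  const parts = request.trim().split(/\s+/);
  if (parts.length !== 2) return null;
  const [verb, target] = parts;
  const allowed = (ALLOWED_VERBS as readonly string[]).indexOf(verb) !== -1;
  return allowed ? { verb: verb as Verb, target } : null;
}
```

The useful property is that improvisation fails closed: a request outside the vocabulary never becomes an `Operation`, so nothing downstream has to decide whether it is safe.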

- Cluster lifecycle: spin-up, scaling, node management, Kubernetes 1.31.x and OpenShift 4.17.x upgrades.
- Troubleshooting: pod failures, networking, storage, and performance diagnostics.
- Security: RBAC, NetworkPolicies, Pod Security Standards, and Trivy vulnerability scanning.
- GitOps: ArgoCD, Flux, Kustomize and Helm workflows as first-class citizens.
- Backup and restore via Velero, plus production-ready manifest generation.

A minimal supervisor loop

The pattern that has held up best is a thin supervisor that owns the lifecycle while the model owns the thinking. The supervisor never trusts the agent blindly: it streams progress, nudges on stalls, retries on failure, and refuses to merge anything CI has not blessed.

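```typescript
// spawn, moreContext, retryWithBackoff, ci, propose, and escalate are assumed
// to be provided by the orchestrator; they are not defined in this snippet.
async function runAgent(task: Task) {
  const run = await spawn(task.model, task.prompt)
  for await (const event of run.stream()) {
    // Nudge a stalled run with more context; hand a failed run to the retry path.
    if (event.type === "stall") await run.nudge(moreContext(task))
    else if (event.type === "error") return retryWithBackoff(task)
  }
  const change = await run.result()
  // The same gates a human PR must clear: lint, types, tests, policy.
  // Await the verdict: if verify is async, an un-awaited promise is always truthy.
  const passed = await ci.verify(change)
  return passed ? propose(change) : escalate(task, change)
}
```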
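The loop leans on `retryWithBackoff` without showing it. Below is one possible shape for such a helper, assuming exponential backoff with a cap. The names `backoffDelayMs` and `withRetries`, the default delays, and the injectable `sleep` are my assumptions for illustration, not the actual ClusterClaw implementation.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `attempt` up to maxAttempts times, backing off between failures.
// `sleep` is injectable so the helper can be exercised without real waiting.
async function withRetries<T>(
  attempt: () => Promise<T>,
  maxAttempts = 3,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      // Wait only if another attempt will follow.
      if (i < maxAttempts - 1) await sleep(backoffDelayMs(i));
    }
  }
  // Every attempt failed: surface the last error so the caller can escalate.
  throw lastError;
}
```

A caller could wrap a whole agent run, for example `withRetries(() => runAgent(task))`, provided the run rejects on failure rather than retrying internally. The hard cap on attempts is what keeps a persistently failing task from looping forever.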

Why verification is the whole game

Models will keep getting better; that is not where the durable value lives. The value lives in the supervision, routing, evaluation, and guardrails you build around them. An agent that can open a pull request is interesting. An agent whose pull requests are indistinguishable from a senior engineer’s — because they pass the same gates and carry the same evidence — is transformative.

For an SRE agent the stakes are higher still: a wrong answer can take down a cluster. So ClusterClaw treats every destructive operation as a proposal that must clear policy checks, dry-runs, and (where configured) human approval before it touches production.
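The proposal flow can be sketched as an ordered chain of gates. This is an illustration under assumptions: `Proposal`, `Gates`, `Verdict`, and `gateDestructive` are hypothetical names, and the gates are synchronous predicates here, whereas real policy checks, dry-runs, and approvals would be asynchronous calls.

```typescript
// Hypothetical types: the post names the gates but not their interfaces.
interface Proposal { action: string; target: string; }

interface Gates {
  policyAllows: (p: Proposal) => boolean;
  dryRunSucceeds: (p: Proposal) => boolean;
  // Left undefined when human approval is not configured.
  humanApproves?: (p: Proposal) => boolean;
}

type Verdict =
  | { apply: true }
  | { apply: false; blockedBy: "policy" | "dry-run" | "approval" };

// Evaluate gates cheapest-first and stop at the first failure, so a human
// is only asked about proposals that already passed the automated checks.
function gateDestructive(p: Proposal, gates: Gates): Verdict {
  if (!gates.policyAllows(p)) return { apply: false, blockedBy: "policy" };
  if (!gates.dryRunSucceeds(p)) return { apply: false, blockedBy: "dry-run" };
  if (gates.humanApproves && !gates.humanApproves(p)) {
    return { apply: false, blockedBy: "approval" };
  }
  return { apply: true };
}
```

Returning the name of the blocking gate, rather than a bare boolean, gives the supervisor something concrete to log and escalate.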

The takeaway

Orchestration is the product. Treat agents as a fleet, give them a supervisor, make everything observable, and never let a robot bypass the gates your humans live by. Do that, and you stop demoing AI and start operating it.
