Chapter I · AI agent stack

Multi-Model Orchestration

Tested on
OpenAI Pro ($200/mo Codex subscription), OpenClaw built-in image generation with gpt-image-2, Codex CLI harness subagents, Claude Code via tmux relay and ACP compatibility, browser-LLM stack via Playwright + noVNC, Ollama local GPU, Ollama Pro cloud models
Last updated
2026-06-05

How to run multiple AI models in one OpenClaw setup, assign each to the right task tier, and stop burning expensive tokens on work that doesn’t need them.

Why Multi-Model Matters

Running one model for everything is like hiring a senior architect to answer phones. Your orchestrator needs to be strong enough to handle ambiguity and adversarial input. Everything else can run cheaper or free.

This isn’t only about saving money. It’s about using the right tool for each job. A local embedding model handles memory and code retrieval better than a frontier chat model wasting quota on vector work. Browser-driven LLMs handle research and UI-only workflows, while OpenClaw’s built-in image generation call handles gpt-image-2 image jobs without browser automation. Your orchestrator handles judgment and security decisions in the main loop.

What Changed in April 2026

If you’re coming from an older multi-model guide: Anthropic blocked subscription OAuth (Claude Max) from third-party harnesses in April 2026. The claude-cli backend no longer works as a main-agent backend. Opus remains available through Claude Code’s first-party harness, but only as an escalation target, not the primary orchestrator. As of the June 2026 stack notes, prefer the Claude Code tmux relay for review and keep ACPX for explicit ACP compatibility.

See claude-cli → ACP migration for the full migration runbook.

The model chain below reflects the post-block world.

The Model Chain

Always use the cheapest model that can handle the task. Escalate only when the work demands it.

Cheap vs Capable Is About Blast Radius, Not Just Budget

The tier rule has a second axis that the cost table hides. The question is never just “can a small model do this task?” It’s “what does it cost me when the small model does the task wrong?”

The incident that taught me this: a Haiku-class sub-agent on a routine cron job read a local API’s OpenAPI spec, found a DELETE /api/index endpoint, and called it three times unprompted. It wiped 71,000 indexed chunks and 28,000 LLM-generated summaries, roughly $40 of inference, on a job whose own inference cost was a fraction of a cent. The full post-mortem is in agent security hardening, and the security lesson there is real: any model with tool access will eventually exercise every endpoint it can see, so remove the destructive paths.

But there’s a model-selection lesson too. Match the tier to the judgment the lane requires, then to the blast radius the lane permits:

The rule I actually operate by: cheapest model that can handle the task, in the most fenced lane the task tolerates. Cost-per-token and blast-radius-per-mistake are both part of “can handle.”
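The rule above can be sketched as a two-key selection: filter lanes by the judgment and tool access the task needs, then take the cheapest, most fenced survivor. This is a minimal illustration only. The `Lane` type, the lane names, and the numeric judgment and blast-radius scales are hypothetical, not OpenClaw configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    name: str
    cost_rank: int         # lower is cheaper (hypothetical scale)
    judgment: int          # 1 = routine, 3 = ambiguous or adversarial input
    max_blast_radius: int  # 0 = read-only, 2 = can reach destructive endpoints


# Hypothetical lanes. Note that no low-judgment lane is given destructive access.
LANES = [
    Lane("local-readonly", cost_rank=0, judgment=1, max_blast_radius=0),
    Lane("cloud-summary", cost_rank=1, judgment=1, max_blast_radius=0),
    Lane("orchestrator", cost_rank=2, judgment=3, max_blast_radius=2),
]


def pick_lane(required_judgment: int, required_access: int, lanes=LANES) -> Lane:
    """Cheapest lane that can handle the task, preferring the most fenced one."""
    candidates = [
        lane for lane in lanes
        if lane.judgment >= required_judgment
        and lane.max_blast_radius >= required_access
    ]
    if not candidates:
        raise ValueError("no lane satisfies this task; add a fenced lane first")
    # Sort by cost first, then by how little damage the lane can do.
    return min(candidates, key=lambda lane: (lane.cost_rank, lane.max_blast_radius))
```

Under these assumed lanes, a routine read-only job lands on the local lane, while a routine job that needs destructive access is pushed up to the capable model instead of being handed to a small one.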

Tier 1: Local Models via Ollama (Free)

Zero API costs. No network round trip. Zero data leaving your machine.

Handles:

Embedding system:

The current retrieval system standardizes on qwen3-embedding:8b through Ollama’s local OpenAI-compatible endpoint.

OpenClaw memory search embeds the incoming query, then compares it against stored memory vectors. Code search uses the same embedding model, but it stores two kinds of vectors: direct code-chunk vectors and natural-language summary vectors. The summary vectors carry more of the search weight because humans usually search by intent, not by exact symbol names.

| Model | Role | Why it is used |
| --- | --- | --- |
| qwen3-embedding:8b | Embeddings for memory search, code search, and semantic similarity | Local, zero API cost, 4096-dimensional vectors, strong enough retrieval quality, and one consistent vector space across memory and code |
| qwen3-coder-next:cloud | Summary helper before embedding code chunks | Cheap structured summaries with good identifier retention. It improves semantic search, but it is not the embedding model |

Do not swap embedding models casually. If the embedding model changes, re-index the stored vectors instead of only changing config.
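To make the two-vector code search concrete, here is a minimal scoring sketch. The guide only says that summary vectors carry more of the weight; the 0.7 and 0.3 values and the function names below are assumptions for illustration, not the actual weights or code used by the search system.

```python
import math

# Assumed weights: the source states only that summaries weigh more than code.
SUMMARY_WEIGHT = 0.7
CODE_WEIGHT = 0.3


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_chunk(query_vec, code_vec, summary_vec):
    """Blend similarity against the raw code vector and the summary vector."""
    return (CODE_WEIGHT * cosine(query_vec, code_vec)
            + SUMMARY_WEIGHT * cosine(query_vec, summary_vec))
```

With these weights, a chunk whose summary matches the query outranks one where only the raw code matches, which is the behavior you want when people search by intent.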

Setup:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3-embedding:8b    # embeddings for memory + code search
ollama pull qwen3-coder:14b       # commit messages, small code tasks
ollama pull qwen3:7b              # triage/screening

OpenClaw config (memory search using Ollama as OpenAI-compatible endpoint):

{
  "agents": {
    "defaults": {
      "memorySearch": {
        "provider": "openai",
        "remote": {
          "baseUrl": "http://localhost:11434/v1/",
          "apiKey": "ollama"
        },
        "fallback": "none",
        "model": "qwen3-embedding:8b"
      }
    }
  }
}
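For a quick check outside OpenClaw, the same endpoint can be called directly. This sketch assumes the Ollama daemon is running locally with the model pulled, and uses the OpenAI-compatible embeddings route with the base URL and placeholder key from the config above. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1/"
MODEL = "qwen3-embedding:8b"


def build_embedding_request(text, base_url=BASE_URL, model=MODEL, api_key="ollama"):
    """Build (but do not send) a POST to the OpenAI-compatible embeddings route."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key, but OpenAI-style clients expect one.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def embed(text):
    """Send the request and return the embedding vector (needs a live daemon)."""
    with urllib.request.urlopen(build_embedding_request(text), timeout=30) as resp:
        payload = json.load(resp)
    return payload["data"][0]["embedding"]
```

Against a running daemon, `len(embed("where do we parse the config?"))` is a cheap way to confirm the vector dimension matches what your index was built with.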

Hardware: Any NVIDIA GPU with 8GB+ VRAM handles these models. Even a laptop GPU works for the embedding model.

Tier 1b: Ollama Cloud Pro ($20/mo)

Ollama Cloud is the middle lane between local models and frontier subscriptions. It is useful for bulk summarization, strict-format offload chores, commit message and changelog prep, and model bakeoffs where you want cheap cloud inference without moving the main orchestrator.

Current routing from our April 2026 bakeoffs:

Code-search backfill gauntlet, April 25 and April 28:

| Model | Success | Median | P95 | Key retention | Decision |
| --- | --- | --- | --- | --- | --- |
| qwen3-coder-next:cloud | 100/100 | 1.64s | 3.01s | 0.293 | Primary |
| kimi-k2.6:cloud | 100/100 | 2.59s | 4.76s | 0.296 | Fallback |
| gemma4:31b-cloud | 100/100 | 2.16s | 31.11s | 0.247 | Reject for bulk summaries |
| deepseek-v4-flash:cloud | 100/100 | 1.88s | 14.41s | 0.288 | Candidate |
| deepseek-v4-pro:cloud | 93/100 | 2.29s | 56.24s | 0.218 | Reject |
| deepseek-v3.2:cloud | 100/100 | 5.61s | 8.78s | 0.212 | Too slow |
| minimax-m2.7:cloud | 27/100 | 5.93s | 9.69s | 0.189 | Reject |
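If you rerun a bakeoff like this, the latency columns are easy to reproduce from raw timings. The sketch below uses a nearest-rank P95, which is an assumption: the guide does not say which percentile method produced the numbers above, and the function name is hypothetical.

```python
import statistics


def latency_summary(latencies_s):
    """Return the median and nearest-rank 95th percentile of latencies in seconds."""
    ordered = sorted(latencies_s)
    if not ordered:
        raise ValueError("need at least one latency sample")
    n = len(ordered)
    # Nearest-rank: smallest rank r with r/n >= 0.95, computed in integer math.
    rank = (95 * n + 99) // 100
    return {"median": statistics.median(ordered), "p95": ordered[rank - 1]}
```

A model can look fine on the median and still fail on the tail, which is why both columns matter for bulk summary work: nine fast calls and one slow one leave the median untouched but drag P95 up to the slow call.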

Setup:

ollama signin
ollama pull qwen3-coder-next:cloud
ollama pull kimi-k2.6:cloud
ollama pull deepseek-v4-flash:cloud
ollama pull deepseek-v4-pro:cloud

After ollama signin, local tools can call cloud models through the localhost Ollama daemon, which is the simpler path for most OpenClaw automation. For direct hosted calls to https://ollama.com/api, use the provider auth flow documented by Ollama.

Ollama Pro is currently $20/month, includes 50x more cloud usage than Free, and allows 3 concurrent cloud models. Ollama documents usage as infrastructure utilization rather than a fixed token cap, with session limits resetting every 5 hours and weekly limits resetting every 7 days.

Tier 2: Orchestrator: GPT 5.5 via Codex Pro ($200/mo)

Your main agent. This is what receives every message, makes every decision, and spawns sub-agents for the heavy lifting.

Why GPT 5.5 on Codex Pro:

This is the recommended $200 stack when the agent is active every day or the same subscription is also doing heavy repo work. A $100-ish setup can work for conservative usage, but only if cron work is light, local/Ollama lanes take the boring tasks, and you are not sharing the subscription with constant coding sessions.

Handles:

Config:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "openai-codex/gpt-5.5",
        "fallbacks": [
          "openai-codex/gpt-5.3-codex"
        ]
      },
      "models": {
        "openai-codex/gpt-5.5": {
          "alias": "gpt55",
          "params": { "thinking": "medium" }
        },
        "openai-codex/gpt-5.5:cron": {
          "alias": "gpt55cron",
          "params": { "thinking": "low" }
        },
        "openai-codex/gpt-5.5:high": {
          "alias": "gpt55hi",
          "params": { "thinking": "high" }
        }
      }
    }
  }
}

The :cron and :high variants are the same model with different thinking budgets. Use :cron for scheduled background tasks where latency matters more than depth. Use :high for design work and architectural decisions.

Fallback chain ordering matters. Keep the fallback chain on providers you actually use. We keep gpt-5.3-codex as the sole fallback. Both models share the Codex Pro subscription, so a fallback hop doesn’t change your billing surface. Adding providers you don’t actively run to the chain is asking for silent quality drops when the primary hiccups.
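Conceptually, a fallback chain is an ordered walk that stops at the first model that answers. OpenClaw does this internally from the config above; the sketch below only illustrates the ordering behavior, and `ModelUnavailable`, `call_with_fallbacks`, and the `call_model` callback are hypothetical names.

```python
class ModelUnavailable(Exception):
    """Raised by the caller-supplied function when a model cannot serve a request."""


def call_with_fallbacks(prompt, primary, fallbacks, call_model):
    """Try the primary, then each fallback in order; return (model_used, reply)."""
    last_error = None
    for model in [primary, *fallbacks]:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as exc:
            last_error = exc  # remember why this hop failed, then try the next one
    raise ModelUnavailable("every model in the chain failed") from last_error
```

Returning the model that actually answered makes a fallback hop visible in logs, which is the cheapest defense against the silent quality drops described above.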

Focused harness sub-agents

For serious work, do not treat every sub-agent as the same OpenClaw session with a different model label. The harness matters.

Use a focused codex-coder lane for builds and refactors. That lane should run GPT 5.5 through the Codex CLI harness rather than the default OpenClaw Pi runtime. Codex CLI gives you the right repo workflow: file edits, terminal feedback, test loops, and persistent coding context.

Use a focused opus-review lane for specialized review. That lane should run Opus through Claude Code in a named tmux session by default. Claude Code keeps the review lane inside Anthropic’s first-party harness while OpenClaw treats tmux as the controllable relay.

| Focused agent | Harness | Use it for |
| --- | --- | --- |
| main | OpenClaw default runtime | Conversation handling, routing, tool orchestration, safety decisions |
| codex-coder | Codex CLI with GPT 5.5 | Multi-file builds, refactors, test-driven fixes, repository work |
| opus-review | Claude Code tmux relay with Opus | Architecture review, security review, design critique, high-context analysis |

The model is only part of the system. The harness decides how file edits, approvals, terminal commands, session persistence, and repository context behave.

Tier 3: Browser-LLM Stack: Playwright + noVNC

Instead of fighting OAuth policy changes for research and UI-only workflows, we drive the web UIs of frontier models through Playwright. A persistent Chromium runs under Xvfb with noVNC attached so you can watch what the browser is doing on the virtual display.

Handles:

Why browser-driven instead of a CLI/API tier:

Setup sketch:

This swaps a CLI backend for a tool surface. Your orchestrator calls the browser skill like any other tool, and the response comes back as text the agent can reason over.

Tier 3a: Built-in OpenClaw Image Generation: gpt-image-2

Current OpenClaw has a first-class image generation call. Use it before browser automation for normal image jobs.

The OpenAI image provider defaults to gpt-image-2 when configured. It supports generation, edits with up to five reference images, PNG/JPEG/WebP output, and common square, portrait, landscape, and 4K sizes.

Example call shape:

image_generate({
  model: "openai/gpt-image-2",
  prompt: "Clean technical diagram of a multi-model agent stack",
  size: "2048x1152",
  outputFormat: "png"
})

Use the browser path when the job needs a web-only product feature, a logged-in UI workflow, or manual visual review. Otherwise, image_generate is cleaner, repeatable, and easier to wire into automation.

Tier 4: Escalation: Opus via Claude Code tmux relay

Opus is no longer the main agent. It is now an escalation target for specific review and reasoning tasks where the quality difference matters.

Handles:

How to invoke:

Two paths:

  1. Preferred tmux relay: Start a named Claude Code tmux session in --permission-mode plan, send a bounded review prompt, then capture the pane as a local review artifact.
  2. ACP compatibility: Use ACPX only when your OpenClaw setup needs an ACP endpoint.

The tmux relay helper lives in ../templates/ai-stack/claude-tmux-relay.sh. See Claude Code via tmux Relay for OpenClaw and Codex handoff commands. See the ACP migration guide only when you need ACPX compatibility.
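The preferred path boils down to three tmux operations: start a detached session, type a prompt into it, and capture the pane. The sketch below builds those command lines without running them. It is not the contents of claude-tmux-relay.sh, and it assumes the Claude Code binary is on PATH as `claude` and that the session is named claude-code-review.

```python
import shlex

SESSION = "claude-code-review"  # assumed session name


def relay_commands(prompt, session=SESSION):
    """Return the tmux argv lists for one bounded review round trip."""
    return [
        # Start Claude Code detached, in plan mode so it cannot edit files.
        ["tmux", "new-session", "-d", "-s", session, "claude --permission-mode plan"],
        # Type the prompt literally (-l), then press Enter as a separate key.
        ["tmux", "send-keys", "-t", session, "-l", prompt],
        ["tmux", "send-keys", "-t", session, "Enter"],
        # Print the pane contents to stdout so they can be saved as an artifact.
        ["tmux", "capture-pane", "-p", "-t", session],
    ]


if __name__ == "__main__":
    for argv in relay_commands("Review this diff. Findings first. Do not edit files."):
        print(shlex.join(argv))
```

Sending the prompt with `-l` keeps tmux from interpreting words in the prompt as key names. A real relay also has to wait for the review to finish before capturing the pane, which this sketch leaves out.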

When NOT to escalate: Code generation, file scanning, bulk edits, and mechanical ops work. Escalation is for judgment, not labor.

Example: How a Request Flows Through the Chain

1. Email arrives
   → Ollama (7B) triages: spam? SKIP. Important? ESCALATE.

2. Escalated email
   → GPT 5.5 reads it and decides the response strategy

3. "Build me a dashboard"
   → GPT 5.5 creates the PRD and component spec
   → Spawns `codex-coder` through the Codex CLI harness to build it
   → Orchestrator reviews the output and runs the verification gate

4. "Deep research this topic before I make a decision"
   → GPT 5.5 calls the browser research skill (Perplexity Pro via Playwright)
   → Skill returns structured findings, orchestrator synthesizes

5. "Review this PR for architectural soundness"
   → GPT 5.5 recognizes escalation criteria, sends the diff to Claude Code through tmux
   → Opus reviews in plan mode and returns structured findings

6. Git commit
   → Ollama generates commit message locally. Zero API cost.

7. Memory search
   → Ollama embeds query with qwen3-embedding:8b, searches local vector store. Free.

The expensive escalation model only touches step 5. Everything else stays on the subscription tiers, uses the built-in image call, runs in the browser against existing web subscriptions, or runs free.

OpenClaw Agent Configuration

Define agents in the agents.list section of your openclaw.json:

{
  "agents": {
    "list": [
      { "id": "main", "model": "openai-codex/gpt-5.5" },
      { "id": "coder", "model": "gpt55" }
    ]
  }
}

Aliases resolve against agents.defaults.models. So gpt55 above resolves to the configured GPT 5.5 Codex model.
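A pre-flight check can apply the same resolution rule to the config before anything is spawned. This is a sketch of the lookup as described here, not OpenClaw's actual resolver, and `resolve_model` and `preflight` are hypothetical helper names.

```python
def resolve_model(ref, defaults_models):
    """Resolve a full model ID or an alias against agents.defaults.models."""
    if ref in defaults_models:
        return ref
    for model_id, entry in defaults_models.items():
        if entry.get("alias") == ref:
            return model_id
    raise KeyError(f"unknown model or alias: {ref}")


def preflight(config):
    """Map every agent ID to the concrete model it would run on."""
    models = config["agents"]["defaults"]["models"]
    return {
        agent["id"]: resolve_model(agent["model"], models)
        for agent in config["agents"]["list"]
    }
```

Running `preflight` on the parsed openclaw.json before a spawn fails loudly on a misspelled alias instead of letting the job start on the wrong model.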

Research is not a separate agent in this setup. It is a skill that the main and coder agents invoke via the browser stack (see Tier 3).

Spawn focused sub-agents by harness, not just by model:

# Serious repo work through Codex CLI
sessions_spawn(runtime: "acp", agentId: "codex", task: "Build CRUD routes for this schema: ...")

# Review lane through Claude Code's tmux relay
templates/ai-stack/claude-tmux-relay.sh send \
  "Review this architecture for failure modes. Findings first. Do not edit files."

Token Optimization Patterns

Heartbeat Batching

Instead of separate cron jobs for email, calendar, and notifications, batch them into one heartbeat. One context load, multiple checks. Saves thousands of input tokens daily.
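The shape of a batched heartbeat is simple: load context once, then run every check against it. The sketch below is illustrative only. `run_heartbeat`, the `load_context` callback, and the check names are hypothetical, not OpenClaw APIs.

```python
def run_heartbeat(load_context, checks):
    """Load shared context once, then run each named check against it."""
    context = load_context()  # the expensive step, paid once per heartbeat
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check(context)
        except Exception as exc:
            # One failing check should not take down the rest of the batch.
            results[name] = f"error: {exc}"
    return results
```

Catching failures per check keeps a broken calendar integration from silencing the email and notification checks that share the same heartbeat.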

Sub-Agent Isolation

Spawn sub-agents for tasks that don’t need your main session’s context. A coder agent building a React component doesn’t need your email history or personal notes. Isolated sessions start clean.

Prompt Compression

Write tight, specific prompts for sub-agents. “Build CRUD routes for this schema” with the schema attached beats “Read all these files and figure out what to build.” Fewer input tokens, better output.

Thinking-Budget Tuning

The gpt55cron alias (openai-codex/gpt-5.5:cron, thinking: low) saves real tokens on scheduled work. A 5-minute email triage doesn’t need medium thinking. Reserve medium/high for interactive work.
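One way to keep this consistent is to pick the alias in a single place instead of per job. The aliases below are the ones defined in the Tier 2 config; the function and its two flags are a hypothetical illustration.

```python
def pick_alias(scheduled: bool, design_work: bool = False) -> str:
    """Choose a thinking-budget alias for a task."""
    if scheduled:
        return "gpt55cron"  # thinking: low, for cron and heartbeat work
    if design_work:
        return "gpt55hi"    # thinking: high, for design and architecture
    return "gpt55"          # thinking: medium, the interactive default
```

Checking `scheduled` first means a cron job never drifts onto the high budget, even if it is also tagged as design work.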

Approximate Cost Breakdown

| Tier | Monthly Cost | What It Does | % of Work |
| --- | --- | --- | --- |
| Ollama (local) | $0 | Embeddings, commits, triage | ~40% |
| Ollama Pro cloud | $20 | Bulk summaries, strict offload, cheap model bakeoffs | bursty |
| Built-in image generation | provider-backed | gpt-image-2 generation and edits | usage-based |
| Browser-LLM stack | reuse existing web subs | Research, web-only workflows, second opinions | ~10% |
| Codex Pro | $200 | Orchestration + Codex CLI build lane | ~45% |
| Opus via Claude Code tmux relay | bundled | Escalation only | ~5% |

The heavy lifter is Codex Pro. Opus through Claude Code is a quality escalation, not a workhorse, so it stays within the Max subscription’s usage envelope. Built-in image generation follows the configured provider billing. The browser-LLM stack costs whatever your existing Perplexity, Gemini, ChatGPT, or Claude.ai subscriptions already cost. There is no additional per-request billing layered on top.

Verification

Check your agent configuration:

# Verify agents are configured
jq '.agents.list | map({id, model})' ~/.openclaw/openclaw.json

# Verify primary + fallback chain
jq '.agents.defaults.model' ~/.openclaw/openclaw.json

# Verify Ollama is running with the embedding model
curl -s http://127.0.0.1:11434/api/tags | jq '.models[] | select(.name | contains("embed")) | .name'

# Verify a Claude Code tmux review session is reachable
tmux has-session -t claude-code-review

# If you use ACP compatibility, verify ACPX plugin is loaded
jq '.plugins.allow | contains(["acpx"])' ~/.openclaw/openclaw.json

Gotchas

  1. Pre-flight check your agents. Before spawning, verify the agent ID maps to the model you expect. We got burned spawning Opus for code gen because the coder agent was temporarily misconfigured. jq '.agents.list' ~/.openclaw/openclaw.json is cheap insurance.

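  2. Don’t put budget models in charge of untrusted input. Your main orchestrator will encounter prompt injections in email, web scrapes, and group chats. Anything that reads that input and can act on it needs GPT 5.5 at minimum, not a local 7B. A local model can still pre-screen mail into SKIP or ESCALATE, as in the request flow above, provided it has no tool access.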

  3. Ollama binds to 127.0.0.1 by default. This is correct. Don’t change it to 0.0.0.0 unless you have firewall rules restricting access. See the Linux hardening guide.

  4. Subscription rate limits are real. Codex Pro has weekly and hourly limits. Ollama Pro has session and weekly cloud limits, plus concurrency limits. The model chain helps: if 40% of your work runs on local Ollama, cloud bulk work goes through Ollama Pro, and 10% goes through the browser stack against your existing web subscriptions, you stay well within Codex’s envelope.

  5. OpenAI OAuth rotates refresh tokens. The Codex CLI desktop app and OpenClaw share the same refresh token. When one refreshes, the other’s stored copy is invalidated. Symptom: 401 refresh_token_reused. Fix: refresh the Codex/OpenClaw auth flow, then restart the gateway.

  6. openclaw models auth login doesn’t see openai-codex. It only surfaces plugin providers. Codex OAuth is baked into the onboard wizard. Use openclaw onboard --auth-choice openai-codex or the documented auth refresh path.

  7. ACPX binary is user-local. If you still use ACP compatibility, it is installed under OpenClaw user-local vendor storage, not in a global location. After OpenClaw upgrades, verify the plugins.entries.acpx block is still present. Upgrades have been observed to reset plugin config.

  8. Xvfb starts black. The headless X display Playwright runs against is black until Chromium actually loads a page. If you VNC in and see a black screen, that’s normal. Trigger a skill run and the browser will appear. Don’t restart Xvfb in a panic.

  9. Browser skills need per-provider flock locks. Two concurrent skill invocations on the same Chromium profile will clobber each other. A flock on /tmp/browser-<provider>.lock around the skill entry point keeps concurrent calls serialized per provider while different providers run in parallel. This is in the skill itself, not OpenClaw config. Get it right once, forget about it.
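A minimal version of that lock, assuming a Unix host and Python skill code. The lock path follows the /tmp/browser-<provider>.lock pattern from the gotcha above; `provider_lock` and the `lock_dir` parameter are hypothetical names.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def provider_lock(provider, lock_dir="/tmp"):
    """Hold an exclusive flock for one provider for the duration of the block."""
    path = f"{lock_dir}/browser-{provider}.lock"
    # Append mode creates the file if needed without truncating it.
    with open(path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield path
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Wrapping the skill entry point in `with provider_lock("perplexity"):` makes a second Perplexity run wait its turn, while a Gemini run takes its own lock file and proceeds in parallel.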