Multi-Model Orchestration
How to run multiple AI models in one OpenClaw setup, assign each to the right task tier, and stop burning expensive tokens on work that doesn’t need them.
Why Multi-Model Matters
Running one model for everything is like hiring a senior architect to answer phones. Your orchestrator needs to be strong enough to handle ambiguity and adversarial input. Everything else can run cheaper or free.
This isn’t about saving money. It’s about using the right tool for each job. A local embedding model handles memory and code retrieval better than a frontier chat model wasting quota on vector work. Browser-driven LLMs handle research and UI-only workflows, while OpenClaw’s built-in image generation call handles gpt-image-2 image jobs without browser automation. Your orchestrator handles judgment and security decisions in the main loop.
What Changed in April 2026
If you’re coming from an older multi-model guide: Anthropic blocked subscription OAuth (Claude Max) from third-party harnesses in April 2026. The claude-cli backend no longer works as a main-agent backend. Opus remains available through Claude Code’s first-party harness, but only as an escalation target, not the primary orchestrator. As of the June 2026 stack notes, prefer the Claude Code tmux relay for review and keep ACPX for explicit ACP compatibility.
See claude-cli → ACP migration for the full migration runbook.
The model chain below reflects the post-block world.
The Model Chain
Always use the cheapest model that can handle the task. Escalate up only when the work demands it.
Cheap vs Capable Is About Blast Radius, Not Just Budget
The tier rule has a second axis that the cost table hides. The question is never just “can a small model do this task?” It’s “what does it cost me when the small model does the task wrong?”
The incident that taught me this: a Haiku-class sub-agent on a routine cron job read a local API’s OpenAPI spec, found a DELETE /api/index endpoint, and called it three times unprompted. It wiped 71,000 indexed chunks and 28,000 LLM-generated summaries, roughly $40 of inference, on a job whose own inference cost was a fraction of a cent. The full post-mortem is in agent security hardening, and the security lesson there is real: any model with tool access will eventually exercise every endpoint it can see, so remove the destructive paths.
But there’s a model-selection lesson too. Match the tier to the judgment the lane requires, then to the blast radius the lane permits:
- Cheap model, read-only lane: the sweet spot. Triage, summaries, classification, embeddings. A wrong answer costs you a re-run. This is where Tier 1 earns its keep.
- Cheap model, writable lane: the trap. The savings on inference are dwarfed by one bad tool call. If a lane can mutate state, either fence the lane until a wrong action is survivable, or pay for the model that can tell a cleanup endpoint from a self-destruct button.
- Expensive model, trivial lane: the slow leak. Spawning the escalation model for work a small model handles is the same mistake in reverse. It burns quota you’ll want later and shows up as a correction in every token audit I run (see self-improving agents).
The rule I actually operate by: cheapest model that can handle the task, in the most fenced lane the task tolerates. Cost-per-token and blast-radius-per-mistake are both part of “can handle.”
Tier 1: Local Models via Ollama (Free)
Zero API costs. Zero latency. Zero data leaving your machine.
Handles:
- Semantic memory search embeddings
- Code search embeddings
- Git commit message generation
- Cron job triage (ESCALATE/SKIP decisions)
Embedding system:
The current retrieval system standardizes on qwen3-embedding:8b through Ollama’s local OpenAI-compatible endpoint.
OpenClaw memory search embeds the incoming query, then compares it against stored memory vectors. Code search uses the same embedding model, but it stores two kinds of vectors: direct code-chunk vectors and natural-language summary vectors. The summary vectors carry more of the search weight because humans usually search by intent, not by exact symbol names.
| Model | Role | Why it is used |
|---|---|---|
qwen3-embedding:8b | Embeddings for memory search, code search, and semantic similarity | Local, zero API cost, 4096-dimensional vectors, strong enough retrieval quality, and one consistent vector space across memory and code |
qwen3-coder-next:cloud | Summary helper before embedding code chunks | Cheap structured summaries with good identifier retention. It improves semantic search, but it is not the embedding model |
Do not swap embedding models casually. If the embedding model changes, re-index the stored vectors instead of only changing config.
Setup:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3-embedding:8b # embeddings for memory + code search
ollama pull qwen3-coder:14b # commit messages, small code tasks
ollama pull qwen3:7b # triage/screening
OpenClaw config (memory search using Ollama as OpenAI-compatible endpoint):
{
"agents": {
"defaults": {
"memorySearch": {
"provider": "openai",
"remote": {
"baseUrl": "http://localhost:11434/v1/",
"apiKey": "ollama"
},
"fallback": "none",
"model": "qwen3-embedding:8b"
}
}
}
}
Hardware: Any NVIDIA GPU with 8GB+ VRAM handles these models. Even a laptop GPU works for the embedding model.
Tier 1b: Ollama Cloud Pro ($20/mo)
Ollama Cloud is the middle lane between local models and frontier subscriptions. It is useful for bulk summarization, strict-format offload chores, commit message and changelog prep, and model bakeoffs where you want cheap cloud inference without moving the main orchestrator.
Current routing from our April 2026 bakeoffs:
qwen3-coder-next:cloud: best default for code-search summaries and strict structured offload.kimi-k2.6:cloud: best tested fallback for code-search summaries, strongest identifier retention, but slower and more verbose.gemma4:31b-cloud: clean generalist, but reject for bulk code-search summaries because of lower retention and severe tail latency.deepseek-v4-flash:cloud: promising, but needs thinking behavior controlled and had worse tail latency.deepseek-v4-pro:cloud: clean when it returns, but reject for bulk code-search summaries because the April 28 run had 7 failures, two 180s timeouts, low retention, and severe tail latency.minimax-m2.7:cloud: useful for deeper supervised agent work, not for terse code-search summaries.
Code-search backfill gauntlet, April 25 and April 28:
| Model | Success | Median | P95 | Key retention | Decision |
|---|---|---|---|---|---|
qwen3-coder-next:cloud | 100/100 | 1.64s | 3.01s | 0.293 | Primary |
kimi-k2.6:cloud | 100/100 | 2.59s | 4.76s | 0.296 | Fallback |
gemma4:31b-cloud | 100/100 | 2.16s | 31.11s | 0.247 | Reject for bulk summaries |
deepseek-v4-flash:cloud | 100/100 | 1.88s | 14.41s | 0.288 | Candidate |
deepseek-v4-pro:cloud | 93/100 | 2.29s | 56.24s | 0.218 | Reject |
deepseek-v3.2:cloud | 100/100 | 5.61s | 8.78s | 0.212 | Too slow |
minimax-m2.7:cloud | 27/100 | 5.93s | 9.69s | 0.189 | Reject |
Setup:
ollama signin
ollama pull qwen3-coder-next:cloud
ollama pull kimi-k2.6:cloud
ollama pull deepseek-v4-flash:cloud
ollama pull deepseek-v4-pro:cloud
Local tools can call cloud models through the localhost Ollama daemon after ollama signin. For direct hosted calls to https://ollama.com/api, use the provider auth flow documented by Ollama. For most OpenClaw automation, the simpler path is to sign in once with ollama signin and let local tools call the cloud models through the localhost Ollama daemon.
Ollama Pro is currently $20/month, includes 50x more cloud usage than Free, and allows 3 concurrent cloud models. Ollama documents usage as infrastructure utilization rather than a fixed token cap, with session limits resetting every 5 hours and weekly limits resetting every 7 days.
Tier 2: Orchestrator: GPT 5.5 via Codex Pro ($200/mo)
Your main agent. This is what receives every message, makes every decision, and spawns sub-agents for the heavy lifting.
Why GPT 5.5 on Codex Pro:
- Subscription cost is predictable ($200/mo flat)
- Codex OAuth works with OpenClaw’s primary-model slot
- GPT 5.5 is strong enough for orchestration, tool use, and delegation
:cronand:highaliases let you tune thinking depth per task
This is the recommended $200 stack when the agent is active every day or the same subscription is also doing heavy repo work. A $100-ish setup can work for conservative usage, but only if cron work is light, local/Ollama lanes take the boring tasks, and you are not sharing the subscription with constant coding sessions.
Handles:
- Main orchestration (processing incoming messages, deciding actions)
- Code generation and review (coder role is the same model)
- File scanning, grep, bulk ops
- Architecture and planning
- Security decisions (evaluating untrusted input)
- Anything touching untrusted content (email, web scraping, group chats)
Config:
{
"agents": {
"defaults": {
"model": {
"primary": "openai-codex/gpt-5.5",
"fallbacks": [
"openai-codex/gpt-5.3-codex"
]
},
"models": {
"openai-codex/gpt-5.5": {
"alias": "gpt55",
"params": { "thinking": "medium" }
},
"openai-codex/gpt-5.5:cron": {
"alias": "gpt55cron",
"params": { "thinking": "low" }
},
"openai-codex/gpt-5.5:high": {
"alias": "gpt55hi",
"params": { "thinking": "high" }
}
}
}
}
}
The :cron and :high variants are the same model with different thinking budgets. Use :cron for scheduled background tasks where latency matters more than depth. Use :high for design work and architectural decisions.
Fallback chain ordering matters. Keep the fallback chain on providers you actually use. We keep gpt-5.3-codex as the sole fallback. Both models share the Codex Pro subscription, so a fallback hop doesn’t change your billing surface. Adding providers you don’t actively run to the chain is asking for silent quality drops when the primary hiccups.
Focused harness sub-agents
For serious work, do not treat every sub-agent as the same OpenClaw session with a different model label. The harness matters.
Use a focused codex-coder lane for builds and refactors. That lane should run GPT 5.5 through the Codex CLI harness rather than the default OpenClaw Pi runtime. Codex CLI gives you the right repo workflow: file edits, terminal feedback, test loops, and persistent coding context.
Use a focused opus-review lane for specialized review. That lane should run Opus through Claude Code in a named tmux session by default. Claude Code keeps the review lane inside Anthropic’s first-party harness while OpenClaw treats tmux as the controllable relay.
| Focused agent | Harness | Use it for |
|---|---|---|
main | OpenClaw default runtime | Conversation handling, routing, tool orchestration, safety decisions |
codex-coder | Codex CLI with GPT 5.5 | Multi-file builds, refactors, test-driven fixes, repository work |
opus-review | Claude Code tmux relay with Opus | Architecture review, security review, design critique, high-context analysis |
The model is only part of the system. The harness decides how file edits, approvals, terminal commands, session persistence, and repository context behave.
Tier 3: Browser-LLM Stack: Playwright + noVNC
Instead of fighting OAuth policy changes for research and UI-only workflows, we drive the web UIs of frontier models through Playwright. A persistent Chromium runs under Xvfb with noVNC attached so you can see what the headless browser is doing.
Handles:
- Deep research via Perplexity Pro
- Long-context analysis via the Gemini web UI (search grounding, file uploads)
- Web-only visual workflows that the built-in image call does not cover
- “Second-opinion” passes against Claude.ai web or Gemini web from the orchestrator
Why browser-driven instead of a CLI/API tier:
- Uses the subscriptions you already pay for without API-billing complications
- Survives provider policy changes that break direct-API or OAuth paths (Anthropic’s April 2026 move is exactly this class of problem)
- Gets features that aren’t on the API (Perplexity’s research mode, search-grounded responses, UI-only rendering paths)
- Concurrent sessions work via separate Chromium profiles and
flock-based locking
Setup sketch:
- Xvfb + Chromium running headed-but-hidden
- x11vnc → noVNC so you (or an agent) can pop the session open in a browser
- Playwright-based skill in your workspace that launches against the persistent profile, performs the task, returns text/screenshots
- One profile per provider;
flockon a per-provider lockfile so concurrent skill invocations serialize cleanly
This swaps a CLI backend for a tool surface. Your orchestrator calls the browser skill like any other tool, and the response comes back as text the agent can reason over.
Tier 3a: Built-in OpenClaw Image Generation: gpt-image-2
Current OpenClaw has a first-class image generation call. Use it before browser automation for normal image jobs.
The OpenAI image provider defaults to gpt-image-2 when configured. It supports generation, edits with up to five reference images, PNG/JPEG/WebP output, and common square, portrait, landscape, and 4K sizes.
Example call shape:
image_generate({
model: "openai/gpt-image-2",
prompt: "Clean technical diagram of a multi-model agent stack",
size: "2048x1152",
outputFormat: "png"
})
Use the browser path when the job needs a web-only product feature, a logged-in UI workflow, or manual visual review. Otherwise, image_generate is cleaner, repeatable, and easier to wire into automation.
Tier 4: Escalation: Opus via Claude Code tmux relay
Opus is no longer the main agent. It is now an escalation target for specific review and reasoning tasks where the quality difference matters.
Handles:
- Architecture review
- Security review
- Design critique
- Long-form reasoning and research-heavy analysis
- PR review when a second opinion is worth the quota
How to invoke:
Two paths:
- Preferred tmux relay: Start a named Claude Code tmux session in
--permission-mode plan, send a bounded review prompt, then capture the pane as a local review artifact. - ACP compatibility: Use ACPX only when your OpenClaw setup needs an ACP endpoint.
The tmux relay helper lives in ../templates/ai-stack/claude-tmux-relay.sh. See Claude Code via tmux Relay for OpenClaw and Codex handoff commands. See the ACP migration guide only when you need ACPX compatibility.
When NOT to escalate: Code generation, file scanning, bulk edits, and mechanical ops work. Escalation is for judgment, not labor.
Example: How a Request Flows Through the Chain
1. Email arrives
→ Ollama (7B) triages: spam? SKIP. Important? ESCALATE.
2. Escalated email
→ GPT 5.5 reads it and decides the response strategy
3. "Build me a dashboard"
→ GPT 5.5 creates the PRD and component spec
→ Spawns `codex-coder` through the Codex CLI harness to build it
→ Orchestrator reviews the output and runs the verification gate
4. "Deep research this topic before I make a decision"
→ GPT 5.5 calls the browser research skill (Perplexity Pro via Playwright)
→ Skill returns structured findings, orchestrator synthesizes
5. "Review this PR for architectural soundness"
→ GPT 5.5 recognizes escalation criteria, sends the diff to Claude Code through tmux
→ Opus reviews in plan mode and returns structured findings
6. Git commit
→ Ollama generates commit message locally. Zero API cost.
7. Memory search
→ Ollama embeds query with qwen3-embedding:8b, searches local vector store. Free.
The expensive escalation model only touches step 5. Everything else stays on the subscription tiers, uses the built-in image call, runs in the browser against existing web subscriptions, or runs free.
OpenClaw Agent Configuration
Define agents in the agents.list section of your openclaw.json:
{
"agents": {
"list": [
{ "id": "main", "model": "openai-codex/gpt-5.5" },
{ "id": "coder", "model": "gpt55" }
]
}
}
Aliases resolve against agents.defaults.models. So gpt55 above resolves to the configured GPT 5.5 Codex model.
Research is not a separate agent in this setup. It is a skill the main/coder invoke via the browser stack (see Tier 3).
Spawn focused sub-agents by harness, not just by model:
# Serious repo work through Codex CLI
sessions_spawn(runtime: "acp", agentId: "codex", task: "Build CRUD routes for this schema: ...")
# Review lane through Claude Code's tmux relay
templates/ai-stack/claude-tmux-relay.sh send \
"Review this architecture for failure modes. Findings first. Do not edit files."
Token Optimization Patterns
Heartbeat Batching
Instead of separate cron jobs for email, calendar, and notifications, batch them into one heartbeat. One context load, multiple checks. Saves thousands of input tokens daily.
Sub-Agent Isolation
Spawn sub-agents for tasks that don’t need your main session’s context. A coder agent building a React component doesn’t need your email history or personal notes. Isolated sessions start clean.
Prompt Compression
Write tight, specific prompts for sub-agents. “Build CRUD routes for this schema” with the schema attached beats “Read all these files and figure out what to build.” Less input tokens, better output.
Thinking-Budget Tuning
The gpt-5.5:cron alias with thinking: low saves real tokens on scheduled work. A 5-minute email triage doesn’t need medium thinking. Reserve medium/high for interactive work.
Approximate Cost Breakdown
| Tier | Monthly Cost | What It Does | % of Work |
|---|---|---|---|
| Ollama (local) | $0 | Embeddings, commits, triage | ~40% |
| Ollama Pro cloud | $20 | Bulk summaries, strict offload, cheap model bakeoffs | bursty |
| Built-in image generation | provider-backed | gpt-image-2 generation and edits | usage-based |
| Browser-LLM stack | reuse existing web subs | Research, web-only workflows, second opinions | ~10% |
| Codex Pro | $200 | Orchestration + Codex CLI build lane | ~45% |
| Opus via Claude Code tmux relay | bundled | Escalation only | ~5% |
The heavy lifter is Codex Pro. Opus through Claude Code is a quality escalation, not a workhorse, so it stays within the Max subscription’s usage envelope. Built-in image generation follows the configured provider billing. The browser-LLM stack costs whatever your existing Perplexity, Gemini, ChatGPT, or Claude.ai subscriptions already cost. There is no additional per-request billing layered on top.
Verification
Check your agent configuration:
# Verify agents are configured
jq '.agents.list | map({id, model})' ~/.openclaw/openclaw.json
# Verify primary + fallback chain
jq '.agents.defaults.model' ~/.openclaw/openclaw.json
# Verify Ollama is running with the embedding model
curl -s http://127.0.0.1:11434/api/tags | jq '.models[] | select(.name | contains("embed")) | .name'
# Verify a Claude Code tmux review session is reachable
tmux has-session -t claude-code-review
# If you use ACP compatibility, verify ACPX plugin is loaded
jq '.plugins.allow | contains(["acpx"])' ~/.openclaw/openclaw.json
Gotchas
-
Pre-flight check your agents. Before spawning, verify the agent ID maps to the model you expect. We got burned spawning Opus for code gen because the coder agent was temporarily misconfigured.
jq '.agents.list' ~/.openclaw/openclaw.jsonis cheap insurance. -
Don’t put budget models on untrusted input. Your main orchestrator will encounter prompt injections in email, web scrapes, and group chats. That needs GPT 5.5 at minimum, not a local 7B.
-
Ollama binds to 127.0.0.1 by default. This is correct. Don’t change it to 0.0.0.0 unless you have firewall rules restricting access. See the Linux hardening guide.
-
Subscription rate limits are real. Codex Pro has weekly and hourly limits. Ollama Pro has session and weekly cloud limits, plus concurrency limits. The model chain helps: if 40% of your work runs on local Ollama, cloud bulk work goes through Ollama Pro, and 10% goes through the browser stack against your existing web subscriptions, you stay well within Codex’s envelope.
-
OpenAI OAuth rotating refresh tokens. The Codex CLI desktop app and OpenClaw share the same refresh token. When one refreshes, the other’s stored copy is invalidated. Symptom:
401 refresh_token_reused. Fix: refresh the Codex/OpenClaw auth flow, then restart the gateway. -
openclaw models auth logindoesn’t see openai-codex. It only surfaces plugin providers. Codex OAuth is baked into the onboard wizard. Useopenclaw onboard --auth-choice openai-codexor the documented auth refresh path. -
ACPX binary is user-local. If you still use ACP compatibility, it is installed under OpenClaw user-local vendor storage, not in a global location. After OpenClaw upgrades, verify the
plugins.entries.acpxblock is still present. Upgrades have been observed to reset plugin config. -
Xvfb starts black. The headless X display Playwright runs against is black until Chromium actually loads a page. If you VNC in and see a black screen, that’s normal. Trigger a skill run and the browser will appear. Don’t restart Xvfb in a panic.
-
Browser skills need per-provider flock locks. Two concurrent skill invocations on the same Chromium profile will clobber each other. A
flockon/tmp/browser-<provider>.lockaround the skill entry point keeps concurrent calls serialized per provider while different providers run in parallel. This is in the skill itself, not OpenClaw config. Get it right once, forget about it.