Memory Management & Token Optimization
How to build an AI agent memory system that doesn’t eat your context window alive. From raw conversation history to semantic search with local embeddings.
The Problem
Every OpenClaw session starts fresh. Your agent has no memory of previous conversations unless you build a system for it. The naive approach (dump everything into one file, load it every session) works until it doesn’t.
Here’s what happens without memory management:
- Token burn escalates. Each message carries the full memory payload. A 50K token memory file means 50K tokens of overhead on every single interaction.
- Quality degrades. Models get worse at following instructions when stuffed with irrelevant context. Important details get buried under noise.
- Sessions break. Long contexts cause coherence loss. The model repeats itself, hallucinates about old tasks, or loses track of what’s current.
We went through three iterations of memory architecture before landing on what works.
Architecture: Three-Tier Memory
Tier 1: Master Index (MEMORY.md)
A slim file (~2KB) loaded every session. Contains:
- Identity and quick context
- Agent architecture overview
- Links to where detailed info lives
- Category index of knowledge cards
Rules:
- Keep under 2KB. This goes into your system prompt every turn.
- Never dump raw logs here. Distill and point to cards.
- Edit only at session boundaries (edits invalidate prompt cache, see below).
Tier 2: Knowledge Cards (memory/cards/*.md)
Atomic files, one topic per card, ~350 tokens each. Searched semantically, loaded on demand.
memory/cards/
├── hardware-specs.md
├── active-ports.md
├── model-chain-rules.md
├── career-seu-intel.md
├── security-posture.md
└── ... (~40 cards)
Each card has YAML frontmatter for search:
---
topic: Security Audit & Hardening Status
category: security
tags: [security, firewall, ssh, audit]
created: 2026-02-20
updated: 2026-03-11
---
Rules:
- One topic per card. If you’re writing about two things, make two cards.
- ~350 tokens max. If it’s longer, split it.
- Update in place when information changes (new port assigned, project status change).
- Search first, load second. Use memory_search to find relevant cards instead of loading everything.
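The card rules above can be checked mechanically. The following is a minimal sketch, not part of OpenClaw: it assumes cards use flat `key: value` frontmatter like the example, and it estimates tokens with a rough 4-characters-per-token heuristic. The function names, the heuristic, and the choice of required fields (`topic` and `tags`) are assumptions made for illustration.

```python
from pathlib import Path

MAX_TOKENS = 350                    # card budget from the rules above
CHARS_PER_TOKEN = 4                 # assumption: rough heuristic for English text
REQUIRED_FIELDS = ("topic", "tags")  # assumption: the fields search relies on


def split_card(text: str) -> tuple[dict[str, str], str]:
    """Split a card into (frontmatter fields, body). Handles flat key: value pairs only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return fields, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat the whole file as body


def lint_card(text: str) -> list[str]:
    """Return a list of human-readable problems for one card (empty if clean)."""
    fields, body = split_card(text)
    problems = [f"missing frontmatter field: {name}"
                for name in REQUIRED_FIELDS if name not in fields]
    estimated = len(body) // CHARS_PER_TOKEN
    if estimated > MAX_TOKENS:
        problems.append(f"body is ~{estimated} tokens (budget {MAX_TOKENS}); consider splitting")
    return problems


def lint_cards(cards_dir: Path) -> dict[str, list[str]]:
    """Lint every *.md file in cards_dir and report only the cards with problems."""
    report = {}
    for path in sorted(cards_dir.glob("*.md")):
        problems = lint_card(path.read_text(encoding="utf-8"))
        if problems:
            report[path.name] = problems
    return report
```

Running `lint_cards(Path("memory/cards"))` during maintenance would list the cards that have outgrown the budget or lost their frontmatter. A real tokenizer would give more accurate counts than the character heuristic.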
Tier 3: Daily Logs (memory/YYYY-MM-DD.md)
Raw session notes. What happened, what was decided, what broke.
memory/
├── 2026-03-17.md
├── 2026-03-16.md
├── 2026-03-15.md
└── ...
On session start, the agent skims today’s and yesterday’s logs for recent context. Older logs are only accessed through semantic search.
Rules:
- Write freely. These are journals, not polished docs.
- Periodically review and promote important findings to knowledge cards.
- Don’t load more than 2 days of logs into context.
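The two-day window can be expressed as a small helper. This is a sketch under the assumption that logs sit directly in the memory directory and follow the `YYYY-MM-DD.md` naming shown above; `recent_log_paths` is a hypothetical name, not an OpenClaw function.

```python
from datetime import date, timedelta
from pathlib import Path


def recent_log_paths(memory_dir: Path, today: date, days: int = 2) -> list[Path]:
    """Return the existing daily logs for the last `days` days, newest first."""
    candidates = (
        memory_dir / f"{(today - timedelta(days=offset)).isoformat()}.md"
        for offset in range(days)
    )
    # Skip days with no log instead of failing on a missing file.
    return [path for path in candidates if path.is_file()]
```

Passing `today` explicitly, rather than calling `date.today()` inside, keeps the helper easy to test and avoids surprises around midnight or time zones.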
Semantic Memory Search with Ollama
The key to making this work: instead of loading entire files into context, search for what’s relevant and load only those chunks.
Setup
Install Ollama and pull an embedding model:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3-embedding:8b
Configure OpenClaw to use it (nested under agents.defaults.memorySearch):
{
  "agents": {
    "defaults": {
      "memorySearch": {
        "provider": "openai",
        "remote": {
          "baseUrl": "http://localhost:11434/v1/",
          "apiKey": "ollama"
        },
        "fallback": "none",
        "model": "qwen3-embedding:8b"
      }
    }
  }
}
Why qwen3-embedding:8b over nomic-embed-text (the 2026-03 recommendation): Qwen3 embeddings gave us noticeably better ranking on memory cards with mixed domains (security, infra, career, code). Nomic is still fine if 8GB VRAM is tight - it’s ~1.6GB on disk vs. ~5GB for qwen3-embedding.
How It Works
- Memory files get indexed by the embedding model (automatic on startup).
- Agent calls memory_search(query="what did we decide about the API architecture?").
- The search returns ranked results with file paths and line numbers.
- Agent calls memory_get(path, from, lines) to pull just the relevant chunk.
Instead of loading 50K+ characters of memory, the agent loads maybe 500-1000 characters of exactly what it needs.
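The ranking step can be illustrated with plain cosine similarity. This is a sketch of the general embed-and-rank idea, not OpenClaw's implementation of memory_search. The `embed` argument stands in for a call to the local embedding model, and `toy_embed` is a deliberately crude word-count stand-in so the example runs without Ollama. All names here are hypothetical.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(
    query: str,
    chunks: dict[str, str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Score every chunk against the query and return the top_k (id, score) pairs."""
    query_vec = embed(query)
    scored = [(chunk_id, cosine(query_vec, embed(text))) for chunk_id, text in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder: counts a few fixed words. A real model returns dense vectors."""
    vocab = ["api", "port", "firewall"]
    words = text.lower().split()
    return [float(words.count(word)) for word in vocab]


if __name__ == "__main__":
    cards = {
        "api-architecture.md": "api design notes for the api gateway",
        "active-ports.md": "firewall rules and port assignments",
    }
    for chunk_id, score in rank_chunks("api", cards, toy_embed):
        print(f"{score:.2f}  {chunk_id}")
```

In practice the chunk embeddings would be computed once at indexing time and stored, so only the query needs embedding on each search.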
Before and After
| Metric | Before (Full Load) | After (Semantic Search) |
|---|---|---|
| Memory tokens per turn | 50-100K | 500-2K |
| Session quality | Degrades after 30 min | Consistent all day |
| Search accuracy | Manual (agent reads everything) | Ranked by relevance |
| API cost for memory | $0.25-0.50/turn (Opus) | ~$0.005/turn |
| Local embedding cost | N/A | $0 (Ollama) |
That’s a 50-100x reduction in memory-related token usage.
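The arithmetic behind the table is simple enough to reproduce. The sketch below assumes an input price of $5 per million tokens, which is the rate implied by the table's own figures ($0.25 for 50K tokens); substitute your provider's current price.

```python
USD_PER_MILLION_INPUT = 5.0  # assumption: rate implied by the table ($0.25 per 50K tokens)


def memory_cost_per_turn(memory_tokens: int, usd_per_million: float = USD_PER_MILLION_INPUT) -> float:
    """Input cost in USD of carrying `memory_tokens` of memory on one turn."""
    return memory_tokens * usd_per_million / 1_000_000


def reduction_factor(tokens_before: int, tokens_after: int) -> float:
    """How many times smaller the per-turn memory payload became."""
    return tokens_before / tokens_after


if __name__ == "__main__":
    for before, after in [(50_000, 500), (100_000, 2_000)]:
        print(
            f"{before:>7} -> {after:>5} tokens: "
            f"${memory_cost_per_turn(before):.3f} -> ${memory_cost_per_turn(after):.4f} per turn, "
            f"{reduction_factor(before, after):.0f}x smaller"
        )
```

The two rows pair the low ends and the high ends of the table's ranges, which is where the 50-100x span comes from.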
Prompt Caching: Don’t Break It
Anthropic caches your system prompt prefix server-side. Cached tokens cost 90% less than uncached. OpenClaw handles this automatically, but you can break it.
What Gets Cached
Your bootstrap files load in this order (hardcoded in OpenClaw):
1. IDENTITY.md
2. SOUL.md
3. AGENTS.md
4. MEMORY.md
5. TOOLS.md
6. USER.md
7. HEARTBEAT.md
8. BOOTSTRAP.md
9. Hook-injected files
10. Skills prompt
11. Tool definitions
12. User message + conversation
Everything from 1-11 forms the cacheable prefix. If ANY byte changes, the cache invalidates and you pay full price for the entire prefix on the next turn.
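One way to notice an accidental prefix change is to fingerprint the bootstrap files at session start and compare later. This is a sketch, not an OpenClaw feature: it covers only the eight file-based items in the list, it cannot see hook-injected content, skills, or tool definitions, and it assumes the files live together in one workspace directory.

```python
import hashlib
from pathlib import Path

# File-based items 1-8 from the load order above.
BOOTSTRAP_ORDER = (
    "IDENTITY.md", "SOUL.md", "AGENTS.md", "MEMORY.md",
    "TOOLS.md", "USER.md", "HEARTBEAT.md", "BOOTSTRAP.md",
)


def prefix_fingerprint(workspace: Path, names: tuple[str, ...] = BOOTSTRAP_ORDER) -> str:
    """SHA-256 over the bootstrap files in load order. Missing files hash as empty."""
    digest = hashlib.sha256()
    for name in names:
        path = workspace / name
        data = path.read_bytes() if path.is_file() else b""
        # Include the name and length so content cannot shift between files unnoticed.
        digest.update(name.encode("utf-8") + b"\0" + str(len(data)).encode("ascii") + b"\0")
        digest.update(data)
    return digest.hexdigest()
```

Recording the fingerprint when a session opens and comparing it before each turn (or in a heartbeat) gives an early warning that something edited a bootstrap file mid-session.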
Cache Hygiene Rules
- Never edit SOUL.md, AGENTS.md, or TOOLS.md mid-session. These form the cached prefix. Edit only at session boundaries.
- Keep MEMORY.md slim. Every edit invalidates the prefix cache. Write to knowledge cards instead (they’re searched, not loaded into the prompt).
- Hook-injected files must be deterministic. No timestamps, no per-request dynamic content. Static strings only.
- Don’t add/remove skills mid-session. The skill list is part of the prefix.
Anti-Patterns
| Pattern | Why It Breaks Cache | Fix |
|---|---|---|
| Edit SOUL.md mid-session | Prefix bytes change | Use system messages instead |
| Add timestamps to bootstrap files | Different every request | Move to user message |
| Add/remove skills mid-session | Tool list changes prefix | Keep stable from session start |
| Edit files to communicate state | File content in prefix changes | Use tool results/messages |
Cost Impact
Two failure modes depending on your provider:
Pay-per-token (direct Anthropic API): Prefix cost drops ~90% with caching. A 10K-token prefix with caching runs ~$0.005/turn; without caching, ~$0.05/turn. One mid-session bootstrap edit at turn 25 costs ~$3.51 in extra spend over the remaining session.
Subscription (Codex Pro, Claude Max via ACP): You don’t see dollars - you see rate-limit headroom. A session that used to last 4 hours hits the cap at 2.5 hours if you keep invalidating the prefix. Same pain, different dashboard. See prompt caching for provider-specific detail.
Memory Maintenance
Schedule periodic maintenance (we do it during heartbeats every few days):
- Read recent daily logs.
- Identify significant events, lessons, or decisions.
- Create or update knowledge cards for anything worth keeping.
- Remove outdated info from MEMORY.md.
- Archive daily logs older than 30 days if desired.
Think of it like a human reviewing their journal and updating their mental model. Daily files are raw notes. Knowledge cards are curated reference material.
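The archiving step lends itself to a short script. This is a minimal sketch: the `archive` subfolder and the function names are hypothetical, and whether your memory indexer also scans that subfolder is something to confirm before relying on it.

```python
import re
import shutil
from datetime import date, timedelta
from pathlib import Path

LOG_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})\.md$")


def logs_older_than(memory_dir: Path, today: date, max_age_days: int = 30) -> list[Path]:
    """Return daily logs (YYYY-MM-DD.md) dated more than max_age_days before today."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for path in sorted(memory_dir.glob("*.md")):  # non-recursive: skips the archive folder
        match = LOG_NAME.match(path.name)
        if not match:
            continue  # not a daily log
        try:
            logged = date(*map(int, match.groups()))
        except ValueError:
            continue  # looks like a date but is not a valid one
        if logged < cutoff:
            stale.append(path)
    return stale


def archive_logs(memory_dir: Path, today: date, max_age_days: int = 30) -> list[Path]:
    """Move stale logs into memory_dir/archive and return their new paths."""
    archive_dir = memory_dir / "archive"  # hypothetical location
    archive_dir.mkdir(exist_ok=True)
    moved = []
    for path in logs_older_than(memory_dir, today, max_age_days):
        target = archive_dir / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Calling `logs_older_than` on its own gives a dry run, which is worth doing once before letting the move run unattended.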
Operational Cadence: Sweep, Ingest, Decay
The memory stack works best when you separate three jobs instead of stuffing them into one vague “maintenance” task.
1. Memory sweep
This is the judgment pass. Review recent sessions, pull durable decisions and corrections, update cards, and promote recurring mistakes into rules.
{
  "name": "memory-sweep",
  "schedule": {
    "kind": "cron",
    "expr": "0 */6 * * *",
    "tz": "America/New_York"
  },
  "payload": {
    "kind": "agentTurn",
    "message": "Review recent sessions. Update daily logs. Create or update knowledge cards for durable decisions, corrections, and lessons. Skip trivial chatter.",
    "model": "openai-codex/gpt-5.5"
  },
  "delivery": { "mode": "none" },
  "sessionTarget": "isolated"
}
2. Handoff ingest
If other machines or Claude Code sessions emit memory handoffs, ingest them on a short cadence so they do not rot in a folder.
*/30 * * * * bash ~/.openclaw/workspace/scripts/run-memory-handoff-ingest.sh
3. Card staleness scan
This is the decay pass. It does not create new knowledge. It finds cards whose claims are aging out and queues them for refresh or manual review.
0 4 * * * python3 ~/.openclaw/workspace/scripts/card-decay-scanner.py
A clean division of labor helps:
| Job | Purpose | Typical cadence |
|---|---|---|
| Memory sweep | Distill recent work into durable memory | Every 6-12 hours |
| Handoff ingest | Pull machine-local memory proposals into canonical storage | Every 15-30 minutes |
| Card staleness scan | Flag aging or suspicious cards for refresh | Daily |
If you blur those together, you usually get one of two bad outcomes: either the sweep becomes bloated and slow, or stale cards sit around forever because nobody owns the decay loop.
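To make the decay pass concrete, here is one possible shape for a staleness scan. It is not the card-decay-scanner.py referenced above, whose contents are not shown here. It is a sketch that assumes cards carry an `updated: YYYY-MM-DD` frontmatter field as in the earlier example, and the 60-day threshold is an arbitrary illustration.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def card_updated(text: str) -> Optional[date]:
    """Read the `updated` date from a card's frontmatter, or None if absent or invalid."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        if sep and key.strip() == "updated":
            try:
                return date.fromisoformat(value.strip())
            except ValueError:
                return None
    return None


def stale_cards(
    cards_dir: Path, today: date, max_age_days: int = 60
) -> list[tuple[str, Optional[int]]]:
    """Return (filename, age in days) for cards past the threshold.

    Cards with no readable `updated` date are flagged with age None.
    """
    flagged = []
    for path in sorted(cards_dir.glob("*.md")):
        updated = card_updated(path.read_text(encoding="utf-8"))
        age = (today - updated).days if updated else None
        if age is None or age > max_age_days:
            flagged.append((path.name, age))
    return flagged
```

The scan only flags cards. Deciding whether a flagged card is wrong, merely old, or still accurate is left to the refresh or manual review step, which keeps the decay job cheap and separate from the sweep.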
Verification
# Check Ollama is running with embedding model
curl -s http://127.0.0.1:11434/api/tags | jq -r '.models[].name' | grep -i embed
# Expected: qwen3-embedding:8b (or whatever you configured)
# Check memory file structure
echo "=== Master Index ==="
wc -c ~/.openclaw/workspace/MEMORY.md
echo ""
echo "=== Knowledge Cards ==="
ls ~/.openclaw/workspace/memory/cards/ | wc -l
echo "cards"
echo ""
echo "=== Daily Logs ==="
ls ~/.openclaw/workspace/memory/20*.md 2>/dev/null | wc -l
echo "daily log files"
echo ""
echo "=== MEMORY.md Size Check ==="
SIZE=$(wc -c < ~/.openclaw/workspace/MEMORY.md)
if [ "$SIZE" -gt 2048 ]; then
  echo "⚠ MEMORY.md is ${SIZE} bytes. Consider trimming (target: <2KB)"
else
  echo "✓ MEMORY.md is ${SIZE} bytes (healthy)"
fi
Gotchas
- Local embeddings are more than good enough. qwen3-embedding:8b (5GB) or nomic-embed-text (274M / 1.6GB) both beat round-tripping to OpenAI’s embedding API for memory search. You need good enough relevance ranking, not SOTA - and the round-trip latency alone makes cloud embeddings a worse experience.
- Don’t load the backup. If you migrated from a monolithic MEMORY.md, the backup file might be 50-60KB. Never load it in a session. It exists for reference only.
- Card frontmatter matters for search. The tags and topic fields in YAML frontmatter improve semantic search accuracy. Don’t skip them.
- Memory search before answering. Make it a habit: before your agent answers questions about past decisions, dates, or people, it should search memory first. This catches things the agent “forgot” because they weren’t in today’s context.