Chapter V · Knowledge management

Memory Management & Token Optimization

Tested on: OpenClaw 2026.4.x with Ollama (qwen3-embedding:8b), 64GB RAM host
Last updated: 2026-04-19

How to build an AI agent memory system that doesn’t eat your context window alive. From raw conversation history to semantic search with local embeddings.

The Problem

Every OpenClaw session starts fresh. Your agent has no memory of previous conversations unless you build a system for it. The naive approach (dump everything into one file, load it every session) works until it doesn’t.

Here’s what happens without memory management:

We went through three iterations of memory architecture before landing on what works.

Architecture: Three-Tier Memory

Tier 1: Master Index (MEMORY.md)

A slim file (~2KB) loaded every session. Contains:

Rules:

Tier 2: Knowledge Cards (memory/cards/*.md)

Atomic files, one topic per card, ~350 tokens each. Searched semantically, loaded on demand.

memory/cards/
├── hardware-specs.md
├── active-ports.md
├── model-chain-rules.md
├── career-seu-intel.md
├── security-posture.md
└── ... (~40 cards)

Each card has YAML frontmatter for search:

---
topic: Security Audit & Hardening Status
category: security
tags: [security, firewall, ssh, audit]
created: 2026-02-20
updated: 2026-03-11
---
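To make the frontmatter fields concrete, here is a minimal sketch of reading them from a card. It is not OpenClaw's parser: `parse_frontmatter` is a hypothetical helper, and it only handles the flat `key: value` lines and inline `[a, b]` lists shown in the example above, not full YAML.

```python
def parse_frontmatter(text):
    """Split a card into (metadata dict, body).

    Handles only flat `key: value` lines and inline `[a, b]` lists,
    which is all the example card uses. Not a general YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not key: value
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        meta[key.strip()] = value
    return {}, text  # no closing delimiter: treat the file as plain text


card = """---
topic: Security Audit & Hardening Status
category: security
tags: [security, firewall, ssh, audit]
created: 2026-02-20
updated: 2026-03-11
---
Card body goes here.
"""
meta, body = parse_frontmatter(card)
print(meta["topic"], meta["tags"])
```

For anything beyond this flat shape (nested keys, multi-line values), a real YAML library is the safer choice.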

Rules:

Tier 3: Daily Logs (memory/YYYY-MM-DD.md)

Raw session notes. What happened, what was decided, what broke.

memory/
├── 2026-03-17.md
├── 2026-03-16.md
├── 2026-03-15.md
└── ...

On session start, the agent skims today’s and yesterday’s logs for recent context. Older logs are only accessed through semantic search.
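The "today and yesterday" lookup is simple date arithmetic on the filename pattern. A small sketch, assuming the `memory/YYYY-MM-DD.md` layout shown above; `recent_log_paths` is a hypothetical name, not an OpenClaw function:

```python
from datetime import date, timedelta
from pathlib import Path


def recent_log_paths(memory_dir, today=None, days=2):
    """Return paths for the last `days` daily logs, newest first."""
    today = today or date.today()
    return [
        Path(memory_dir) / f"{(today - timedelta(days=n)).isoformat()}.md"
        for n in range(days)
    ]


paths = recent_log_paths("memory", today=date(2026, 3, 17))
print([p.name for p in paths])
```

The function only builds paths. Whether a given day's file exists is left to the caller, since a quiet day may have no log.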

Rules:

Semantic Memory Search with Ollama

The key to making this work: instead of loading entire files into context, search for what’s relevant and load only those chunks.

Setup

Install Ollama and pull an embedding model:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3-embedding:8b

Configure OpenClaw to use it (nested under agents.defaults.memorySearch):

{
  "agents": {
    "defaults": {
      "memorySearch": {
        "provider": "openai",
        "remote": {
          "baseUrl": "http://localhost:11434/v1/",
          "apiKey": "ollama"
        },
        "fallback": "none",
        "model": "qwen3-embedding:8b"
      }
    }
  }
}

Why qwen3-embedding:8b over nomic-embed-text (the 2026-03 recommendation): Qwen3 embeddings gave us noticeably better ranking on memory cards with mixed domains (security, infra, career, code). Nomic is still fine if 8GB VRAM is tight - it’s ~1.6GB on disk vs. ~5GB for qwen3-embedding.

How It Works

  1. Memory files get indexed by the embedding model (automatic on startup).
  2. Agent calls memory_search(query="what did we decide about the API architecture?").
  3. Returns ranked results with file paths and line numbers.
  4. Agent calls memory_get(path, from, lines) to pull just the relevant chunk.

Instead of loading 50K+ characters of memory, the agent loads maybe 500-1000 characters of exactly what it needs.
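The ranking step can be sketched in a few lines. This is an illustration of the idea, not OpenClaw's `memory_search` implementation. `embed_with_ollama` posts to Ollama's OpenAI-compatible `/v1/embeddings` endpoint at the base URL from the config above; `rank_chunks`, `cosine`, and `toy_embed` are hypothetical helpers, and `toy_embed` is a word-count stand-in so the example runs without a model.

```python
import json
import math
import urllib.request


def embed_with_ollama(texts, model="qwen3-embedding:8b",
                      base_url="http://localhost:11434/v1"):
    """Embed texts via Ollama's OpenAI-compatible embeddings endpoint."""
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return [item["embedding"] for item in payload["data"]]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(query, chunks, embed, top_k=3):
    """Return the top_k (score, chunk) pairs, best match first."""
    vectors = embed([query] + chunks)
    query_vec, chunk_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, v), c) for v, c in zip(chunk_vecs, chunks)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


def toy_embed(texts, vocab=("api", "firewall", "ssh", "rest", "gpu")):
    """Stand-in embedder (word counts) so the sketch runs without Ollama."""
    return [[t.lower().split().count(w) for w in vocab] for t in texts]


chunks = [
    "we chose rest for the api",
    "firewall and ssh hardening",
    "gpu has 24 gb",
]
best = rank_chunks("api architecture", chunks, toy_embed, top_k=1)
print(best[0][1])
```

Swapping `toy_embed` for `embed_with_ollama` gives real semantic ranking, provided Ollama is running and the model has been pulled.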

Before and After

| Metric | Before (Full Load) | After (Semantic Search) |
|---|---|---|
| Memory tokens per turn | 50-100K | 500-2K |
| Session quality | Degrades after 30 min | Consistent all day |
| Search accuracy | Manual (agent reads everything) | Ranked by relevance |
| API cost for memory | $0.25-0.50/turn (Opus) | ~$0.005/turn |
| Local embedding cost | N/A | $0 (Ollama) |

That’s a 50-100x reduction in memory-related token usage.

Prompt Caching: Don’t Break It

Anthropic caches your system prompt prefix server-side. Cached tokens cost 90% less than uncached. OpenClaw handles this automatically, but you can break it.

What Gets Cached

Your bootstrap files load in this order (hardcoded in OpenClaw):

1. IDENTITY.md
2. SOUL.md
3. AGENTS.md
4. MEMORY.md
5. TOOLS.md
6. USER.md
7. HEARTBEAT.md
8. BOOTSTRAP.md
9. Hook-injected files
10. Skills prompt
11. Tool definitions
12. User message + conversation

Everything from 1-11 forms the cacheable prefix. If ANY byte changes, the cache invalidates and you pay full price for the entire prefix on the next turn.

Cache Hygiene Rules

  1. Never edit SOUL.md, AGENTS.md, or TOOLS.md mid-session. These form the cached prefix. Edit only at session boundaries.

  2. Keep MEMORY.md slim. Every edit invalidates the prefix cache. Write to knowledge cards instead (they’re searched, not loaded into the prompt).

  3. Hook-injected files must be deterministic. No timestamps, no per-request dynamic content. Static strings only.

  4. Don’t add/remove skills mid-session. The skill list is part of the prefix.
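One way to catch accidental mid-session edits is to fingerprint the bootstrap files at session start and compare later. A minimal sketch, assuming the files live together in one workspace directory (the verification section uses `~/.openclaw/workspace` for MEMORY.md; the location of the others is an assumption). `prefix_fingerprint` is a hypothetical helper, and it covers only the eight named files, not hook output, skills, or tool definitions.

```python
import hashlib
from pathlib import Path

# Load order as listed above (items 1-8).
BOOTSTRAP_FILES = ["IDENTITY.md", "SOUL.md", "AGENTS.md", "MEMORY.md",
                   "TOOLS.md", "USER.md", "HEARTBEAT.md", "BOOTSTRAP.md"]


def prefix_fingerprint(workspace, names=BOOTSTRAP_FILES):
    """Hash bootstrap files in load order; missing files hash as empty."""
    digest = hashlib.sha256()
    for name in names:
        path = Path(workspace) / name
        data = path.read_bytes() if path.exists() else b""
        # Include the name and separators so content cannot shift
        # between files without changing the hash.
        digest.update(name.encode() + b"\0" + data + b"\0")
    return digest.hexdigest()
```

Record `prefix_fingerprint(workspace)` when the session starts. If a later call returns a different value, something in the prefix changed and the next turn will likely miss the cache.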

Anti-Patterns

| Pattern | Why It Breaks Cache | Fix |
|---|---|---|
| Edit SOUL.md mid-session | Prefix bytes change | Use system messages instead |
| Add timestamps to bootstrap files | Different every request | Move to user message |
| Add/remove skills mid-session | Tool list changes prefix | Keep stable from session start |
| Edit files to communicate state | File content in prefix changes | Use tool results/messages |

Cost Impact

A broken cache shows up in one of two ways, depending on how you pay:

Pay-per-token (direct Anthropic API): Prefix cost drops ~90% with caching. A 10K-token prefix with caching runs ~$0.005/turn; without caching, ~$0.05/turn. One mid-session bootstrap edit at turn 25 costs ~$3.51 in extra spend over the remaining session.

Subscription (Codex Pro, Claude Max via ACP): You don’t see dollars - you see rate-limit headroom. A session that used to last 4 hours hits the cap at 2.5 hours if you keep invalidating the prefix. Same pain, different dashboard. See prompt caching for provider-specific detail.
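The per-turn figures in the pay-per-token case follow from simple arithmetic. The sketch below assumes a base input price of $5 per million tokens (back-derived from the ~$0.05 figure, not taken from a price list) and ignores any surcharge a provider may apply to cache writes. It does not attempt to reproduce the ~$3.51 figure, which depends on session details not given here.

```python
PRICE_PER_MTOK = 5.00      # assumed base input price, USD per million tokens
CACHED_READ_FACTOR = 0.10  # cached tokens cost 90% less than uncached


def prefix_cost(tokens, cached):
    """Cost in USD of sending `tokens` prefix tokens on one turn."""
    factor = CACHED_READ_FACTOR if cached else 1.0
    return tokens / 1_000_000 * PRICE_PER_MTOK * factor


cached = prefix_cost(10_000, cached=True)
uncached = prefix_cost(10_000, cached=False)
print(f"cached: ${cached:.3f}/turn, uncached: ${uncached:.3f}/turn")
print(f"over 50 turns: ${50 * cached:.2f} vs ${50 * uncached:.2f}")
```

Change `PRICE_PER_MTOK` and the token count to match your own model and prefix size.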

Memory Maintenance

Schedule periodic maintenance (we do it during heartbeats every few days):

  1. Read recent daily logs.
  2. Identify significant events, lessons, or decisions.
  3. Create or update knowledge cards for anything worth keeping.
  4. Remove outdated info from MEMORY.md.
  5. Archive daily logs older than 30 days if desired.

Think of it like a human reviewing their journal and updating their mental model. Daily files are raw notes. Knowledge cards are curated reference material.
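Step 5 can be sketched as a filter over log filenames. This assumes the `YYYY-MM-DD.md` naming from Tier 3; `logs_to_archive` is a hypothetical helper that only selects files, leaving the actual move or compression to you.

```python
import re
from datetime import date, timedelta

LOG_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})\.md$")


def logs_to_archive(filenames, today, max_age_days=30):
    """Return daily-log filenames dated more than max_age_days before today."""
    cutoff = today - timedelta(days=max_age_days)
    old = []
    for name in filenames:
        match = LOG_NAME.match(name)
        if not match:
            continue  # not a daily log (e.g. the cards directory)
        try:
            logged = date(*map(int, match.groups()))
        except ValueError:
            continue  # looks like a date but is not a valid one
        if logged < cutoff:
            old.append(name)
    return sorted(old)


names = ["2026-03-17.md", "2026-02-01.md", "cards", "2026-02-15.md"]
print(logs_to_archive(names, today=date(2026, 3, 17)))
```

A log dated exactly 30 days ago is kept; only strictly older files are selected.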

Operational Cadence: Sweep, Ingest, Decay

The memory stack works best when you separate three jobs instead of stuffing them into one vague “maintenance” task.

1. Memory sweep

This is the judgment pass. Review recent sessions, pull durable decisions and corrections, update cards, and promote recurring mistakes into rules.

{
  "name": "memory-sweep",
  "schedule": {
    "kind": "cron",
    "expr": "0 */6 * * *",
    "tz": "America/New_York"
  },
  "payload": {
    "kind": "agentTurn",
    "message": "Review recent sessions. Update daily logs. Create or update knowledge cards for durable decisions, corrections, and lessons. Skip trivial chatter.",
    "model": "openai-codex/gpt-5.5"
  },
  "delivery": { "mode": "none" },
  "sessionTarget": "isolated"
}

2. Handoff ingest

If other machines or Claude Code sessions emit memory handoffs, ingest them on a short cadence so they do not rot in a folder.

*/30 * * * * bash ~/.openclaw/workspace/scripts/run-memory-handoff-ingest.sh

3. Card staleness scan

This is the decay pass. It does not create new knowledge. It finds cards whose claims are aging out and queues them for refresh or manual review.

0 4 * * * python3 ~/.openclaw/workspace/scripts/card-decay-scanner.py
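The contents of card-decay-scanner.py are not shown here, so the following is only a guess at the simplest useful version: flag cards whose `updated` frontmatter date is older than a threshold. `stale_cards` and the 60-day default are assumptions for illustration.

```python
import re
from datetime import date

# Matches the `updated: YYYY-MM-DD` line anywhere in the card text.
UPDATED = re.compile(r"^updated:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def stale_cards(cards, today, max_age_days=60):
    """Flag cards for review.

    `cards` maps filename -> card text. Returns (filename, age_days) pairs
    for cards older than max_age_days, oldest first. Cards with no valid
    `updated` date are listed first with age None.
    """
    flagged = []
    for name, text in cards.items():
        match = UPDATED.search(text)
        try:
            updated = date.fromisoformat(match.group(1)) if match else None
        except ValueError:
            updated = None  # malformed date counts as missing
        if updated is None:
            flagged.append((name, None))
            continue
        age = (today - updated).days
        if age > max_age_days:
            flagged.append((name, age))
    return sorted(flagged, key=lambda item: (item[1] is not None, -(item[1] or 0)))


cards = {
    "a.md": "---\nupdated: 2026-03-11\n---\n",
    "b.md": "---\nupdated: 2025-12-01\n---\n",
    "c.md": "no frontmatter",
}
print(stale_cards(cards, today=date(2026, 4, 19)))
```

Age alone is a crude signal. A real scanner would probably weight by category, since a hardware-specs card ages more slowly than an active-ports card.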

A clean division of labor helps:

| Job | Purpose | Typical cadence |
|---|---|---|
| Memory sweep | Distill recent work into durable memory | Every 6-12 hours |
| Handoff ingest | Pull machine-local memory proposals into canonical storage | Every 15-30 minutes |
| Card staleness scan | Flag aging or suspicious cards for refresh | Daily |

If you blur those together, you usually get one of two bad outcomes: either the sweep becomes bloated and slow, or stale cards sit around forever because nobody owns the decay loop.

Verification

# Check Ollama is running with embedding model
curl -s http://127.0.0.1:11434/api/tags | jq -r '.models[].name' | grep -i embed
# Expected: qwen3-embedding:8b (or whatever you configured)

# Check memory file structure
echo "=== Master Index ==="
wc -c ~/.openclaw/workspace/MEMORY.md

echo ""
echo "=== Knowledge Cards ==="
# Print count and label on one line; stay quiet if the directory is missing
echo "$(ls ~/.openclaw/workspace/memory/cards/ 2>/dev/null | wc -l) cards"

echo ""
echo "=== Daily Logs ==="
echo "$(ls ~/.openclaw/workspace/memory/20*.md 2>/dev/null | wc -l) daily log files"

echo ""
echo "=== MEMORY.md Size Check ==="
SIZE=$(wc -c < ~/.openclaw/workspace/MEMORY.md)
if [ "$SIZE" -gt 4096 ]; then
  echo "⚠ MEMORY.md is ${SIZE} bytes. Consider trimming (target: <2KB)"
else
  echo "✓ MEMORY.md is ${SIZE} bytes (healthy)"
fi

Gotchas

  1. Local embeddings are more than good enough. qwen3-embedding:8b (5GB) or nomic-embed-text (274M / 1.6GB) both beat round-tripping to OpenAI’s embedding API for memory search. You need good enough relevance ranking, not SOTA - and the round-trip latency alone makes cloud embeddings a worse experience.

  2. Don’t load the backup. If you migrated from a monolithic MEMORY.md, the backup file might be 50-60KB. Never load it in a session. It exists for reference only.

  3. Card frontmatter matters for search. The tags and topic fields in YAML frontmatter improve semantic search accuracy. Don’t skip them.

  4. Memory search before answering. Make it a habit: before your agent answers questions about past decisions, dates, or people, it should search memory first. This catches things the agent “forgot” because they weren’t in today’s context.