Publish-Time Scrubbing
Scrub at the moment artifacts leave the private workspace, not after they are already public.
What this is
Publish-time scrubbing is the outbound boundary for a private agent stack. It catches private hostnames, local endpoints, filesystem paths, operator identifiers, tokens, and account details before notes, guides, drafts, exports, or commits leave the host.
The rule is simple: private workspaces can contain private facts, public artifacts cannot. Scrub and scan the artifact at the boundary where it moves from private to public.
Why this way
Trying to scrub every private chat message makes the agent worse. Sometimes the operator needs the exact service name, path, port, or hostname inside a private session. Scrubbing there destroys useful context and creates false confidence.
Waiting until after publish is worse. Once a repo, post, release note, screenshot, or exported bundle is public, cleanup becomes incident response.
The reliable middle ground is a publish boundary:
- private workspace stays useful
- staged artifact gets deterministic replacements
- scanner blocks known leak classes
- human reviews the scrubbed diff
- release proceeds only after the artifact is public-safe
This is a workflow, not a prompt. The model may help write, but deterministic tools decide what crosses the boundary.
Prerequisites
- A deterministic scrubber such as ../templates/scrubbers/scrub-content.sh
- A scanner such as content-guard
- A staging directory, git branch, draft folder, or export bundle that represents the outbound artifact
- A replacement policy for each leak class
- Someone willing to review the scrubbed diff before publishing
Before / After
Before:
- Private notes are copied into public docs by hand.
- Screenshots and exports may include paths, account labels, or internal endpoints.
- Scrubbing happens as a vague “look it over” step.
- Git hooks catch some issues, but only when the artifact is already in a repo.
- A model rewrite can remove too much or miss exact identifiers.
After:
- Every public artifact passes through a staging boundary.
- Regex replacements are previewed before being applied.
- content-guard blocks hard leaks and reports warnings.
- Human review checks meaning, tone, and false positives.
- The final artifact is safe enough to commit, publish, attach, or send.
Implementation
1. Define the artifact boundary
Pick one folder or branch where content becomes publishable. The boundary should be boring and visible:
drafts/private/    source notes, transcripts, raw agent output
staging/public/    scrubbed artifact, ready for review
published/         final copies, release bundle, or committed files
For a git repo, the boundary can be the working tree plus a pre-push hook. For blog posts, newsletters, screenshots, or exported PDFs, use a staging folder before upload.
The point is to avoid asking “did we scrub this?” after the fact. If it leaves through staging/public/, it gets scrubbed.
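A minimal sketch of enforcing that rule in a publish script, assuming the staging/public/ layout shown above. The helper name `is_staged` is hypothetical and not part of the templates; it only answers whether a path resolves to somewhere inside the staging folder.

```python
from pathlib import Path

# Assumed location of the outbound boundary, taken from the layout above.
STAGING = Path("staging/public")


def is_staged(candidate, staging=STAGING):
    """Return True only if candidate resolves to a path inside the staging folder."""
    root = Path(staging).resolve()
    target = Path(candidate).resolve()
    try:
        # relative_to raises ValueError when target is not under root.
        target.relative_to(root)
    except ValueError:
        return False
    return True
```

Resolving both paths first means a path that climbs out of staging with ".." segments is rejected even though it starts with the staging prefix.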
2. Choose leak classes
Start with leak classes that have mechanical patterns:
| Leak class | Examples to catch | Replacement |
|---|---|---|
| private infrastructure | hostnames, private service names, local endpoints | [redacted-service] |
| network details | private IPs, loopback URLs, internal ports | [redacted-endpoint] |
| operator identity | emails, handles, account names, phone numbers | [redacted-identity] |
| filesystem paths | home paths, repo paths, auth paths, profile paths | [redacted-path] |
| channels and chat targets | channel IDs, webhook names, routing labels | [redacted-channel] |
| secrets | tokens, keys, cookies, auth headers | [redacted-secret] |
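Do not collapse every leak class into one generic placeholder. A stable placeholder per class preserves meaning while hiding the private detail: the reader can still tell a path from an endpoint from a secret.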
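One way to picture per-class placeholders is a small ordered table of patterns. This is a sketch, not the template scrubber: the three regexes below are illustrative assumptions, and real rules would be tuned per project.

```python
import re

# (leak class, pattern, placeholder). Patterns are illustrative assumptions.
LEAK_CLASSES = [
    ("secrets", re.compile(r"\b(?:token|key)=[A-Za-z0-9_-]{8,}"), "[redacted-secret]"),
    ("network details",
     re.compile(r"https?://(?:localhost|127\.0\.0\.1)(?::\d+)?[^\s)]*"),
     "[redacted-endpoint]"),
    ("filesystem paths",
     re.compile(r"/home/[A-Za-z0-9._-]+/[A-Za-z0-9._/-]+"),
     "[redacted-path]"),
]


def scrub(text):
    """Apply each class in order; return the scrubbed text and per-class hit counts."""
    hits = []
    for name, pattern, placeholder in LEAK_CLASSES:
        text, count = pattern.subn(placeholder, text)
        if count:
            hits.append((name, count))
    return text, hits


scrubbed, hits = scrub("see http://localhost:8080/admin and /home/alice/notes/todo.md")
# scrubbed == "see [redacted-endpoint] and [redacted-path]"
```

Returning the hit counts alongside the text gives the reviewer a quick summary of which classes fired.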
3. Keep rules deterministic
Use explicit rules, not a model rewrite pass, for first-line scrubbing. The template rule file is tab-separated:
# pattern	replacement
user@[A-Za-z0-9._-]+	[redacted-target]
channel:[A-Za-z0-9_-]+	[redacted-channel]
/home/[A-Za-z0-9._-]+/[A-Za-z0-9._/-]+	[redacted-path]
Project-specific rules should live outside public templates when they contain private terms:
export SCRUB_RULES="$HOME/.config/publication-scrub/rules.tsv"
Keep the public template generic. Keep private names in local config.
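To make the rule format concrete, here is a rough Python sketch of reading and applying tab-separated rules. It is not how scrub-content.sh is implemented; `parse_rules` and `apply_rules` are hypothetical names, and Python's `re` dialect may differ from whatever the shell template uses.

```python
import re


def parse_rules(text):
    """Parse 'pattern<TAB>replacement' lines, skipping blanks and # comments."""
    rules = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 tab-separated fields, got {len(fields)}"
            )
        pattern, replacement = fields
        rules.append((re.compile(pattern), replacement))
    return rules


def apply_rules(rules, text):
    """Apply every rule in file order and return the scrubbed text."""
    for pattern, replacement in rules:
        # A function replacement keeps the placeholder literal (no escape handling).
        text = pattern.sub(lambda _m, r=replacement: r, text)
    return text
```

Rejecting malformed lines at load time keeps a rule that lost its tab from silently becoming a no-op.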
4. Preview before applying
Run the scrubber in preview mode first:
templates/scrubbers/scrub-content.sh staging/public/
Review the diff. If it is correct, apply it:
templates/scrubbers/scrub-content.sh --apply staging/public/
Then review the git diff or file diff again. A scrubber that destroys meaning is a different kind of bug.
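The preview-then-apply idea can be sketched with the standard library: compute the scrubbed text in memory, show a unified diff, and write nothing. This is an illustration of the workflow, not the template's implementation; `preview` is a hypothetical helper and the rule shown is an example.

```python
import difflib
import re


def preview(name, original, rules):
    """Return (scrubbed_text, unified_diff) without touching any file."""
    scrubbed = original
    for pattern, replacement in rules:
        scrubbed = re.sub(pattern, lambda _m, r=replacement: r, scrubbed)
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        scrubbed.splitlines(keepends=True),
        fromfile=f"{name} (original)",
        tofile=f"{name} (scrubbed)",
    )
    return scrubbed, "".join(diff)


rules = [(r"/home/[A-Za-z0-9._-]+/[A-Za-z0-9._/-]+", "[redacted-path]")]
scrubbed, diff = preview(
    "guide.md", "run from /home/sam/repo/tool.sh\nno change here\n", rules
)
print(diff)
```

An empty diff means no rule matched, which is worth a second look when the source notes are known to contain private details.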
5. Run the scanner after the scrubber
The scrubber normalizes known patterns. The scanner then checks the result for leak classes the rules missed.
PYTHONPATH="$CONTENT_GUARD_DIR/src" \
python3 -m content_guard scan "$PWD" \
--policy "$CONTENT_GUARD_DIR/policies/public-repo.json"
Use three scanner result classes:
| Result | Action |
|---|---|
| blocker | fix before publish |
| warning | review, then fix or allow with a narrow comment |
| clean | proceed to human review |
Allow comments should be rare and local to the line. Prefer rewriting examples with placeholders.
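The table above amounts to a small decision function. The sketch below assumes findings are available as dictionaries with a `severity` key and an optional `reviewed` flag; that shape is hypothetical and is not content-guard's actual output format.

```python
def gate(findings):
    """Map scanner findings to a publish decision.

    Each finding is assumed to look like
    {"severity": "blocker" | "warning", "reviewed": bool}.
    """
    blockers = [f for f in findings if f["severity"] == "blocker"]
    open_warnings = [
        f for f in findings
        if f["severity"] == "warning" and not f.get("reviewed")
    ]
    if blockers:
        return "blocked"        # fix before publish
    if open_warnings:
        return "needs-review"   # review, then fix or allow narrowly
    return "proceed"            # clean: hand off to human review
```

A single blocker wins over any number of reviewed warnings, so the gate cannot be cleared by review alone.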
6. Add a git pre-push gate
For public repos, run the scanner again at push time by pointing git at the directory that holds the pre-push hook:
git config core.hooksPath hooks
The pre-push hook is not a substitute for staging-time scrubbing. It is the final guardrail for mistakes that slipped through.
7. Scrub screenshots and generated media
Text scanners do not protect screenshots, PDFs, image exports, browser captures, or terminal recordings.
Use a manual media checklist before publishing:
- browser address bar hidden or generic
- account switchers and profile avatars hidden
- terminal prompt does not reveal host or user
- visible paths are generic
- chat channel names and IDs are hidden
- QR codes, invite links, and tokens are absent
- image alt text does not reintroduce the private detail
If an image needs real UI structure, redraw the sensitive parts or use a generated public-safe mock. Blurring can be recoverable or visually sloppy; replacement is better.
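Most of the media checklist is manual, but the alt-text item is plain text and can be checked mechanically. A rough sketch, assuming markdown image syntax and reusing scrub-style patterns; `alt_text_leaks` is a hypothetical helper, and the image regex covers only the common inline form.

```python
import re

# Matches inline markdown images: ![alt](target "optional title")
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def alt_text_leaks(markdown, patterns):
    """Return (image target, alt text) pairs whose alt text matches any pattern."""
    leaks = []
    for match in IMAGE.finditer(markdown):
        alt, target = match.group(1), match.group(2)
        if any(re.search(pattern, alt) for pattern in patterns):
            leaks.append((target, alt))
    return leaks
```

This does not inspect the image itself; the pixels still need the manual checklist.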
8. Keep a publish log
For repeated release flows, write a short log entry:
artifact: cookbook guide
source: drafts/private/browser-stack-notes.md
staged: staging/public/browser-stack.md
scrubber: passed
scanner: passed, warnings reviewed
media: not applicable
published: public repo commit
The log is not for bureaucracy. It makes leaks easier to investigate because you can see which boundary ran and when.
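A small sketch of producing such an entry from a release script. The field names come from the example above; the leading `date` line and the `format_entry` helper are assumptions added so the entry records when the boundary ran.

```python
from datetime import datetime, timezone

# Field order mirrors the example entry above.
FIELDS = ("artifact", "source", "staged", "scrubber", "scanner", "media", "published")


def format_entry(values, when=None):
    """Render one log entry; refuse to write an entry with missing fields."""
    missing = [name for name in FIELDS if name not in values]
    if missing:
        raise ValueError(f"missing log fields: {', '.join(missing)}")
    when = when or datetime.now(timezone.utc)
    lines = [f"date: {when.isoformat(timespec='seconds')}"]
    lines += [f"{name}: {values[name]}" for name in FIELDS]
    return "\n".join(lines) + "\n"
```

Failing on a missing field is deliberate: an entry that omits the scanner or media line is the case an investigation would most want to know about.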
Verification
Run the scrubber fixture:
templates/scrubbers/scrub-content.sh templates/scrubbers/fixtures/input.txt
Expected result: the command previews replacements and leaves the fixture unchanged.
Validate the rule file has tab-separated pairs:
awk -F '\t' 'NF && $1 !~ /^#/ && NF != 2 { print "bad rule:", NR; bad=1 } END { exit bad }' \
templates/scrubbers/rules.example.tsv
Run content-guard against the repo:
PYTHONPATH="$CONTENT_GUARD_DIR/src" \
python3 -m content_guard scan "$PWD" \
--policy "$CONTENT_GUARD_DIR/policies/public-repo.json"
Check the git hook is active:
git config --get core.hooksPath
test -x hooks/pre-push
For a staged artifact, the minimum publish gate is:
templates/scrubbers/scrub-content.sh staging/public/
PYTHONPATH="$CONTENT_GUARD_DIR/src" python3 -m content_guard scan staging/public/ --policy "$CONTENT_GUARD_DIR/policies/public-repo.json"
Expected result: no blockers, warnings reviewed, and no unreviewed private identifiers in the final diff.
Gotchas
- Scrub at the artifact boundary, not every chat reply. Private operator messages often need exact local details. Scrubbing them makes the agent less useful and still does not prove the exported artifact is safe.
- Generated screenshots are artifacts too. A perfect markdown scrub does nothing for an image that shows a terminal prompt, account dropdown, or real channel name.
- Model rewrites are not scrubbers. A model can help make prose public-safe, but deterministic rules and scanners should decide whether known leak patterns remain.
- Warnings need ownership. A scanner warning that everyone ignores is just decorative noise. Either rewrite the line, add a narrow allow comment, or tune the policy.
- Do not commit private scrub rules. Public templates should show generic patterns. Real hostnames, account names, and service labels belong in local config.
- Redaction can break examples. Replace with meaningful placeholders, not blank strings. The public reader still needs to understand what kind of value goes there.
- Pre-push is late. It protects the remote, not the blog editor, chat upload, newsletter draft, or release bundle. Run scrubbers before the artifact reaches any downstream system.
Templates
- ../templates/scrubbers/scrub-content.sh - deterministic scrubber with preview and apply modes
- ../templates/scrubbers/rules.example.tsv - public-safe example rule file
- ../templates/scrubbers/fixtures/ - fixture shape for testing replacement behavior
- ../templates/hooks/pre-push - final git boundary guard
Related
- ../automation/hooks.md - where publish-boundary hooks fit in the three-layer hook model
- ../security/agent-security-hardening.md - defense in depth for agents with real system access
- ../ai-stack/browser-llm-stack.md - browser artifacts and screenshot review
- ../ai-stack/skills-development.md - sanitizing private skills into public reusable patterns
- ../CONTRIBUTING.md - repo-level hostname scrub rule