How do you keep your AI agent skills from rotting silently?

Q: How does versioning work when skills are stored as memory documents?

Each skill is a document tagged with type:skill, topic:<name>, and date:YYYY-MM-DD. When the skill needs updating, you don't edit the existing document; you write a new one with today's date. A search for type:skill + topic:<name> returns every version sorted newest first, and you take the top one. The older versions stay searchable for audit, rollback, and how-did-this-procedure-evolve reviews, with no destructive edits and no v1/v2 filename juggling.

Last updated: May 2026

Storing AI agent skills as files in a folder does not scale: Claude Code skills, your claude.md instructions, and every other slice of your Claude Code memory eventually rot, conflict, or eat the context window. A better pattern is to store each skill as a document in a semantic memory store, tagged by topic and date. Searching by tag returns the latest version; older versions stay for audit and rollback. This post walks through the pattern with mesh-memory, an open-source implementation, and shows how to federate skills across multiple agents: Claude Code, Cursor, anything MCP-aware.

File-based skills vs memory-based skills

Approach	Versioning	Deprecation	Discoverability	Bloat at scale
Skills as files	Manual filename suffix (v1, v2)	Risky delete	Eager preload or directory walk	Linear, eats context
Skills as memory documents	Automatic via date tag	Free via supersession	One semantic search call	Bounded by query, not catalog

After running AI agents seriously for a while, take stock of what's accumulated. Skill files. Instruction files in nearly every subdirectory. A claude.md (or equivalent) that's tripled or quadrupled in size from where it started.

You built this gradually. Every time the agent got something wrong, you added a rule. Every time it forgot something useful, you wrote a memory file. Every repeating pattern, you turned into a skill.

It works — until you actually look at it. Then you see it: a system that grows linearly forever and has no safe way to remove anything.

This article is about a different pattern. It treats agent skills not as files in a folder, but as documents in a versioned memory store. I'll use mesh-memory — an open-source tool I built for this — as the concrete example, but the pattern works with anything that gives you semantic search over tagged documents.

The problem, concretely

Say you teach your agent how to clean up old Docker containers on a server. Six steps, two gotchas. You write a server-cleanup skill — a markdown file with the procedure, dropped into your Claude Code skills directory.

Three months later, the underlying CLI changes. The skill is still valid for 80% of cases, but step 4 needs updating. You have three options:

1. Edit the file in place. You lose the old version. If the new one breaks, there's no clean rollback.
2. Create server-cleanup-v2.md. Now you have two files. Which one does the agent use? Both? You add logic to pick. Filenames start carrying version semantics.
3. Don't touch it. The skill silently rots. Next time you use it, you waste twenty minutes debugging the change you never made.

None of these is good. The first loses history. The second creates a maintenance pipeline you didn't sign up for. The third creates dead code that's indistinguishable from live code.

Why file-based skills (and claude.md) fail at scale

There are two ways people typically wire up file-based skills — Anthropic's Skills feature in Claude Code, Cursor's command system, the rules section of your claude.md, or hand-rolled markdown files in any agent framework — and both fail differently.

Eager preload. Load every skill description into the agent's system prompt. Fast lookup, no extra reads. At 50 skills, that's 5-8K tokens eaten before any actual work starts. At 200 skills, the same rate comes to 20-32K tokens, which on a smaller-context model can be half the window. And every single prompt pays this cost.
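The arithmetic behind the eager-preload cost can be sketched in a few lines. The per-skill range of 100-160 tokens is an assumption obtained by dividing the 5-8K figure by 50 skills, and the context window is a parameter you would set for your own model; neither is a measurement.

```python
def preload_cost(n_skills, tokens_per_skill=(100, 160)):
    """Return the (low, high) token cost of preloading every skill description."""
    low, high = tokens_per_skill
    return n_skills * low, n_skills * high


def window_share(n_skills, context_window, tokens_per_skill=(100, 160)):
    """Return the (low, high) fraction of the context window spent on preload."""
    low, high = preload_cost(n_skills, tokens_per_skill)
    return low / context_window, high / context_window


# Print the estimated preload cost for two catalog sizes.
for n in (50, 200):
    low, high = preload_cost(n)
    print(f"{n} skills: {low}-{high} tokens before any work starts")
```

The useful property is that the cost depends only on catalog size, not on the task at hand, which is exactly why every prompt pays it.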

Lazy load through an index. Keep a flat index of skill names, only read the body when needed. Better on tokens, but now every task triggers one to three exploration reads. The agent reads the index, picks the wrong skill, reads another, reads the index again. I've watched it happen. It's not pretty.

Neither approach has a "gardener" mode. Nobody removes old skills. Nobody flags duplicates. Six months in you've got 70 skills, 30 are stale, 20 overlap — and there's no system to tell you which.

The pattern: skills as memory documents

Instead of files, store each skill as a document in a semantic memory store. Tag it:

type:skill
topic:server-cleanup
date:2026-05-22
version:1

The document body is the skill — preconditions, steps, postconditions, gotchas.
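Because "latest wins" depends entirely on the date tag, it helps to build and check tags in one place. The sketch below is a hypothetical pair of helpers (skill_tags and date_of are illustrative names, not part of mesh-memory) that assumes the type/topic/date convention shown above.

```python
import re
from datetime import date

# Strict YYYY-MM-DD so that string order matches chronological order.
DATE_TAG = re.compile(r"^date:(\d{4})-(\d{2})-(\d{2})$")


def skill_tags(topic, on=None):
    """Build the tag list for a skill document, dated today unless `on` is given."""
    on = on or date.today()
    return ["type:skill", f"topic:{topic}", f"date:{on.isoformat()}"]


def date_of(tags):
    """Return the date encoded in a tag list; raise ValueError if missing or malformed."""
    for tag in tags:
        match = DATE_TAG.match(tag)
        if match:
            # date() itself rejects impossible values such as month 13.
            return date(*map(int, match.groups()))
    raise ValueError("no well-formed date:YYYY-MM-DD tag")
```

Zero-padded ISO dates are the design choice that matters here: a tag like date:2026-5-2 would sort incorrectly as a string, so the helper rejects it.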

When you need the skill, search by tags:

mesh_bytag(tags=["type:skill", "topic:server-cleanup"])

You get every version, sorted newest first. Take the latest.

When the skill needs updating, you don't edit. You add a new document with today's date. The old one stays — visible in history, available for rollback — but no longer surfaces first.
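To make the append-only behavior concrete, here is a minimal in-memory stand-in for a tagged store. TagStore and its methods are hypothetical and only mimic the add / search-by-tag shape described above; this is not mesh-memory's implementation, and it does plain tag matching rather than semantic search.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    content: str
    tags: list


@dataclass
class TagStore:
    docs: list = field(default_factory=list)

    def add(self, content, tags):
        """Append a new document; existing documents are never modified."""
        doc = Doc(content, list(tags))
        self.docs.append(doc)
        return doc

    def bytag(self, tags):
        """Return documents carrying every requested tag, newest date tag first."""
        hits = [d for d in self.docs if set(tags) <= set(d.tags)]
        return sorted(hits, key=_date_tag, reverse=True)


def _date_tag(doc):
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return next((t for t in doc.tags if t.startswith("date:")), "")


store = TagStore()
store.add("step 4: old CLI flag",
          ["type:skill", "topic:server-cleanup", "date:2026-02-20"])
store.add("step 4: new CLI flag",
          ["type:skill", "topic:server-cleanup", "date:2026-05-22"])

versions = store.bytag(["type:skill", "topic:server-cleanup"])
latest = versions[0]
```

Updating a skill is just another add call, so the older document remains in versions behind the new one.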

What this solves

Versioning is automatic. Date in the tag, latest wins. No v1/v2/v3 filename juggling.

Deprecation is free. Old skills aren't deleted, just superseded. Zero risk of removing something that turns out to still be needed.

Discoverability scales. Semantic search across thousands of documents costs one query. No agent walks an index. No agent reads files it doesn't need.

Conflict detection. If two skills cover overlapping ground, your search returns both. Duplicates surface naturally.

Audit history is built in. Every version of every skill is in the store. Want to know how a procedure evolved? Filter by topic, sort by date — done.
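The "filter by topic, sort by date" review can be sketched with the standard library alone. The example assumes you have already fetched the versions of one topic as (date, body) pairs; that shape is an assumption for illustration, not the store's return format.

```python
import difflib


def history(versions):
    """Sort (date, body) pairs oldest first. ISO dates sort correctly as strings."""
    return sorted(versions, key=lambda v: v[0])


def evolution(versions):
    """Yield one unified diff per consecutive pair of versions."""
    ordered = history(versions)
    for (old_date, old_body), (new_date, new_body) in zip(ordered, ordered[1:]):
        yield "\n".join(difflib.unified_diff(
            old_body.splitlines(), new_body.splitlines(),
            fromfile=old_date, tofile=new_date, lineterm=""))


versions = [
    ("2026-05-22", "1. stop containers\n2. prune images"),
    ("2026-02-20", "1. stop containers\n2. remove images"),
]
diffs = list(evolution(versions))
```

Each diff is labeled with the two dates it spans, so a reviewer can see what changed between versions without opening every document.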

Federation: one workspace per agent

This is where the pattern gets interesting.

If you run multiple agents — say one Claude Code instance for development, a separate one for ops, and a third for content — give each its own workspace in your memory store:

workspaces/
  dev/        skills + snapshots for development
  ops/        skills + snapshots for ops
  content/    skills + snapshots for content
  shared/     cross-agent knowledge

Each agent owns its own garden. The ops agent doesn't pollute the dev agent's skills, and vice versa. When a skill is genuinely shared (say, a deploy routine that both the dev and ops agents use), you put it in the shared workspace and tag it with the project name.

This federation has a property file-based approaches don't: the gardener problem becomes tractable. Maintaining a centralized repo of 200 skills is nobody's job. Maintaining your own personal workspace of 30 skills is achievable.
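One way to resolve a skill across workspaces is to look in the agent's own workspace first and fall back to shared. That lookup order is an assumption on my part for this sketch, as are the find_skill name and the nested-dict layout standing in for the store.

```python
def find_skill(workspaces, agent, topic, shared="shared"):
    """Return (workspace, body) for the newest version of `topic`, or None.

    The agent's own workspace is checked before the shared one.
    """
    for name in (agent, shared):
        versions = workspaces.get(name, {}).get(topic, [])
        if versions:
            # Each version is a (date, body) pair; ISO dates compare as strings.
            return name, max(versions, key=lambda v: v[0])[1]
    return None


workspaces = {
    "dev": {"run-tests": [("2026-05-22", "run pytest -q")]},
    "ops": {"server-cleanup": [("2026-05-22", "prune containers")]},
    "shared": {"web-deploy": [("2026-05-22", "deploy v1"),
                              ("2026-11-15", "deploy v2")]},
}
```

With this order, an agent can override a shared skill locally without touching the shared copy, and it never sees another agent's private skills.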

A concrete example, end to end

Let's say you want your Claude Code agent to have a skill for a generic web app deploy. We'll use mesh-memory; the API works the same whether you call it from Python, an MCP client, or the CLI.

Step 1 — write the skill the first time:

mesh_add(
  content="""# Web App Deploy

Preconditions:
  - branch merged to main
  - release tag created

Steps:
  1. Pass deploy config via env, not from your shell profile.
     This keeps the deploy reproducible across machines.
  2. Allow ~10-15s for the app container to fully boot
     before smoke-testing the health endpoint.
  3. If you see 502s during the boot window, check container
     status before assuming a bug.

Postconditions:
  - All targeted instances return 200 on /health
""",
  tags=["type:skill", "topic:web-deploy", "date:2026-05-22"]
)

Step 2 — in your agent's instructions (for Claude Code, that's your claude.md), write a thin pointer:

When working on web-deploy:
  search mesh_bytag for type:skill + topic:web-deploy
  use the most recent version

That's it. No more rules in your global instructions. The agent fetches the live procedure when it needs it.

Step 3 — six months later, a fix lands. Write v2:

mesh_add(
  content="""# Web App Deploy v2

(step 2 now: container boot reduced to ~3s after image rebuild;
 502 grace window collapsed accordingly)
...
""",
  tags=["type:skill", "topic:web-deploy",
        "date:2026-11-15", "supersedes:doc_xyz"]
)

Same mesh_bytag call now returns v2 first. v1 is still there, still searchable, still available for rollback if needed. History preserved.
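Since the newest date always wins, one way to roll back without editing or deleting anything is to re-publish the earlier body as a new document. This is a sketch of that idea under my own assumptions: the dict layout, the id scheme, and the restored_from field are hypothetical, and mesh-memory may offer a different mechanism.

```python
from datetime import date


def newest_first(docs):
    """Sort documents by date, newest first (ISO dates sort as strings)."""
    return sorted(docs, key=lambda d: d["date"], reverse=True)


def roll_back(docs, today=None):
    """Re-publish the second-newest version as a new document dated today.

    Nothing is edited or deleted: the faulty version stays in history.
    """
    ordered = newest_first(docs)
    if len(ordered) < 2:
        raise ValueError("no earlier version to roll back to")
    bad, good = ordered[0], ordered[1]
    restored = {
        "id": f"doc_{len(docs) + 1}",        # hypothetical id scheme
        "date": (today or date.today()).isoformat(),
        "content": good["content"],
        "supersedes": bad["id"],
        "restored_from": good["id"],          # hypothetical field
    }
    return docs + [restored]                  # returns a new list


docs = [
    {"id": "doc_1", "date": "2026-05-22", "content": "boot wait ~10-15s"},
    {"id": "doc_2", "date": "2026-11-15", "content": "boot wait ~3s",
     "supersedes": "doc_1"},
]
after = roll_back(docs, today=date(2026, 11, 20))
```

The rollback itself becomes part of the audit trail: the store then holds three documents, and the newest one records both what it replaced and where its content came from.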

What this means if you're building agents

If you're noticing that your claude.md (or equivalent instructions file) is growing past the point where you can read it in one sitting, that's the signal. The files pile up linearly, but the cost of maintaining them scales catastrophically.

The pattern above is one way out. There are others. The key shift is treating accumulated agent memory as data, not as code — versioned, tagged, searchable, deprecatable through supersession rather than deletion.

You can try this today with mesh-memory: MIT-licensed, runs in Docker, ships with an MCP server (so Claude Code, Cursor, and other MCP-aware agents can talk to it directly), and gives you mesh_add / mesh_bytag / mesh_search out of the box.

Repository: github.com/dklymentiev/mesh-memory

Frequently asked questions

What are Claude Code skills? Skills are a Claude Code feature from Anthropic that lets you define reusable procedures the agent can invoke, for example a deploy routine, a server cleanup checklist, or a debugging playbook. They are stored as markdown files (often alongside instruction blocks in your claude.md). Cursor and OpenAI's Agents SDK expose comparable abstractions under different names. The format varies; the accumulation problem is shared.

Why do file-based agent skills (and claude.md) fail at scale? There are two patterns and both break. Eager preload loads every skill into the system prompt — fine at 10 skills, but at 50 it consumes 5-8K tokens of the context window, and every prompt pays the cost. Lazy load through an index makes the agent walk a directory on each task, which triggers extra reads and wrong-skill selection. Neither approach has a gardener mode: nobody removes old skills, nobody detects duplicates, and stale procedures (or a growing claude.md) silently rot.

How does versioning work when skills are stored as memory documents? Each skill is a document tagged with type:skill, topic:<name>, and date:YYYY-MM-DD. When the skill needs updating, you don't edit the existing document — you write a new one with today's date. A search for type:skill + topic:<name> returns every version sorted newest first, and you take the top one. The older versions stay searchable for audit, rollback, and how-did-this-procedure-evolve reviews — no destructive edits, no v1/v2 filename juggling.

What is mesh-memory and do I have to use it? Mesh-memory is the open-source semantic memory store I built and use across my own agents. It is MIT-licensed, runs in Docker, ships with an MCP server so Claude Code and Cursor can talk to it directly, and exposes mesh_add / mesh_bytag / mesh_search primitives. The pattern in this article works with any tool that gives you semantic search over tagged documents — Weaviate, Pinecone, Postgres + pgvector, Qdrant. Mesh just happens to be tuned for this exact use case.

Does this work with Claude Code, Cursor, and other MCP clients? Yes. Mesh-memory exposes an MCP server, so any MCP-aware agent (Claude Code, Cursor, Zed, custom clients built with the Anthropic SDK or the OpenAI Agents SDK) can call the search and write primitives natively. The agent only needs a thin pointer in its instruction file. For example, in your claude.md write: "before working on a deploy, search mesh for type:skill + topic:deploy and use the latest version." That single rule replaces a folder full of brittle skill files.

If you've solved this differently, file an issue or drop a note. The pattern works, but there are likely angles worth comparing.