Authoring a Stack
A stack is a git repository that teaches an AI agent how to operate in a specific domain. When an agent reads your stack, it should become an expert operator — capable of deploying, managing, troubleshooting, and upgrading the target software.
Anatomy of a Stack
Every stack has these files at the root:
my-stack/
├── README.md # Repo landing page
├── CLAUDE.md # Agent entry point — persona, rules, routing
├── stack.yaml # Machine-readable manifest
└── skills/ # Operational knowledge, organized by phase
Scaffold one with:
agentic-stacks create my-org/my-stack
Step 1: Design the Skill Hierarchy
Skills are directories of markdown files that teach the agent specific operations. Organize by what the operator is trying to do:
| Phase | Purpose | Examples |
|---|---|---|
| Foundation | Understanding and setup | Architecture, configuration, provisioning |
| Deploy | Initial deployment | Bootstrap, networking, storage |
| Platform | Platform layer | GitOps, ingress, monitoring, security |
| Operations | Day-two management | Health checks, scaling, upgrades, backup |
| Diagnose | Troubleshooting | Symptom-based decision trees |
| Reference | Cross-cutting lookups | Known issues, compatibility, decision guides |
For complex stacks (10+ skills), use phase/domain nesting:
skills/
├── foundation/
│   ├── concepts/
│   └── infrastructure/
│       ├── README.md      # Overview + index
│       ├── aws.md         # Platform-specific
│       └── gcp.md
├── deploy/
│   ├── bootstrap/
│   ├── networking/
│   │   ├── README.md      # Decision matrix
│   │   ├── cilium.md      # Option deep dive
│   │   └── flannel.md
│   └── storage/
└── operations/
    ├── health-check/
    ├── upgrades/
    └── backup-restore/
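As an illustration of the phase/domain nesting, the sketch below creates the directories from the tree above and gives each one a README.md stub. It is not what agentic-stacks create does; the LAYOUT mapping and the scaffold_skills helper are hypothetical names used only for this example.

```python
from pathlib import Path

# Hypothetical phase -> domains mapping mirroring the example tree.
LAYOUT = {
    "foundation": ["concepts", "infrastructure"],
    "deploy": ["bootstrap", "networking", "storage"],
    "operations": ["health-check", "upgrades", "backup-restore"],
}


def scaffold_skills(root: Path, layout: dict[str, list[str]] = LAYOUT) -> list[Path]:
    """Create skills/<phase>/<domain>/README.md stubs; return the READMEs created."""
    created = []
    for phase, domains in layout.items():
        for domain in domains:
            skill_dir = root / "skills" / phase / domain
            skill_dir.mkdir(parents=True, exist_ok=True)
            readme = skill_dir / "README.md"
            if not readme.exists():  # never overwrite existing content
                readme.write_text(f"# {domain}\n", encoding="utf-8")
                created.append(readme)
    return created
```

Running it twice is safe: the second call finds every README.md in place and returns an empty list.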
Step 2: Write CLAUDE.md
CLAUDE.md is the agent's brain. It sets identity, enforces safety, and routes to skills.
# [Stack Name] — Agentic Stack
## Identity
[1-2 sentences establishing the agent's expertise]
## Critical Rules
[Numbered list of hard safety guardrails]
## Routing Table
| Operator Need | Skill | Entry Point |
|---|---|---|
| Deploy the cluster | bootstrap | skills/deploy/bootstrap |
| Troubleshoot issues | troubleshooting | skills/diagnose/troubleshooting |
## Workflows
### New Deployment
[Linear path through skills for first-time setup]
### Existing Deployment
[How to jump to the right skill for ongoing operations]
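To make the routing idea concrete, here is a minimal sketch that turns the routing table into a lookup from operator need to entry point. The parse_routing_table helper is hypothetical and assumes the three-column layout from the template; nothing here implies that agents or the tooling parse the table this way.

```python
def parse_routing_table(markdown: str) -> dict[str, str]:
    """Map each 'Operator Need' cell to its 'Entry Point' cell."""
    routes = {}
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3:
            continue
        # Skip the header row and the |---|---|---| separator row.
        if cells[0] == "Operator Need" or set(cells[0]) <= {"-", ":"}:
            continue
        need, _skill, entry = cells
        routes[need] = entry
    return routes
```

A lookup like this is a quick way to confirm that every operator need in the table resolves to exactly one skill directory.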
Writing Critical Rules
Critical rules prevent the agent from doing damage. Good rules are:
- Specific: "Never run talosctl reset without operator approval" not "be careful"
- Actionable: the agent can check compliance unambiguously
- Justified: explain why — "etcd quorum loss means cluster down"
- Minimal: 5-10 rules. Too many and the agent ignores them.
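The 5-10 guideline is easy to check mechanically. The sketch below counts top-level numbered items under the Critical Rules heading; the function names are hypothetical, and it assumes the section uses a "## Critical Rules" heading and "1." style numbering as in the template.

```python
import re


def count_critical_rules(claude_md: str) -> int:
    """Count top-level numbered list items under '## Critical Rules'."""
    in_section = False
    count = 0
    for line in claude_md.splitlines():
        if line.startswith("## "):
            # Entering any second-level heading starts or ends the section.
            in_section = line.strip() == "## Critical Rules"
            continue
        if in_section and re.match(r"\d+\.\s+\S", line):
            count += 1
    return count


def rules_within_budget(claude_md: str, low: int = 5, high: int = 10) -> bool:
    """True when the rule count falls inside the recommended range."""
    return low <= count_critical_rules(claude_md) <= high
```

The count says nothing about rule quality; whether a rule is specific, actionable, and justified still needs a human read.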
Step 3: Write stack.yaml
name: my-stack
owner: my-org
version: 0.1.0
description: >
  One paragraph describing what this stack teaches agents to operate.
repository: https://github.com/my-org/my-stack
target:
  software: target-software-name
  versions: ["1.x"]
skills:
  - name: skill-name
    entry: skills/path/to/skill
    description: One-line description
project:
  structure:
    - file-or-dir-in-operator-project
requires:
  tools:
    - name: tool-name
      description: What it's used for
depends_on: []
Tips: the entry field points to a directory, not a file; the agent starts from that directory's README.md. The description field should help an agent decide whether to read the skill.
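The directory rule for entry can be verified with a short script. This is a sketch under assumptions: check_skill_entries is a hypothetical helper, and it takes the skills list as plain dictionaries, already loaded from stack.yaml with whatever YAML parser you use.

```python
from pathlib import Path


def check_skill_entries(stack_root: Path, skills: list[dict]) -> list[str]:
    """Return one message per skill whose entry is not a directory with a README.md."""
    problems = []
    for skill in skills:
        name = skill.get("name", "<unnamed>")
        entry = skill.get("entry")
        if not entry:
            problems.append(f"{name}: missing entry")
            continue
        path = stack_root / entry
        if not path.is_dir():
            problems.append(f"{name}: entry '{entry}' is not a directory")
        elif not (path / "README.md").is_file():
            problems.append(f"{name}: '{entry}' has no README.md")
    return problems
```

An empty result means every listed skill resolves to a directory the agent can enter through its README.md.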
Step 4: Research and Verify
A stack is only as good as its accuracy. Before writing any skill:
- Fetch the target software's official documentation index (/llms.txt, /sitemap.xml, or GitHub source)
- Copy exact commands from the docs; do not reconstruct them from memory
- Verify YAML field names, CLI flags, and config structure
- Note version-specific behavior
- Cross-reference with release notes and GitHub issues
Step 5: Write Skill Content
Optimize for how agents process information:
- Imperative headings: "Install Cilium", "Verify Health" — not "About Cilium Installation"
- Exact commands: full copy-pasteable commands with realistic example values
- Decision trees: "If X fails -> check Y -> if Y is true -> do Z"
- Tables for reference: comparison matrices, port requirements
- Safety warnings: explicit callouts before any destructive operation
- Full YAML/config examples: valid snippets, not fragments
Known Issues Pattern
Version-specific bugs get their own files in skills/reference/known-issues/:
### [Short Description]
**Symptom:** What the operator sees
**Cause:** Why it happens
**Workaround:** Exact steps to fix it
**Affected versions:** x.y.z through x.y.w
**Status:** Open / Fixed in x.y.w
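Because every known-issue entry follows the same five-field template, completeness can be checked with a simple string test. The sketch below is illustrative; REQUIRED_FIELDS and missing_fields are hypothetical names, and the check assumes the bold "**Field:**" labels shown in the template.

```python
# Field labels taken from the known-issue template.
REQUIRED_FIELDS = ("Symptom", "Cause", "Workaround", "Affected versions", "Status")


def missing_fields(entry: str) -> list[str]:
    """Return the template fields that do not appear in a known-issue entry."""
    return [field for field in REQUIRED_FIELDS if f"**{field}:**" not in entry]
```

An entry is complete when the function returns an empty list.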
Step 6: Decision Guides and Compatibility
For stacks where operators must choose between components, provide structured decision aids in skills/reference/decision-guides/:
- Comparison tables with features, complexity, and performance
- Recommendations by use case (production, development, cloud-native)
- Migration paths — can you change this decision later?
And compatibility matrices in skills/reference/compatibility/ mapping which versions of components work together.
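One way to picture a compatibility matrix is as a nested mapping from platform version to component to supported versions. The sketch below is only a mental model: the component names and version numbers are invented placeholders, not data about any real software, and is_compatible is a hypothetical helper.

```python
# Placeholder data: platform version -> component -> compatible component versions.
COMPATIBILITY = {
    "1.x": {
        "component-a": ["2.0", "2.1"],
        "component-b": ["0.9"],
    },
}


def is_compatible(platform: str, component: str, version: str,
                  matrix: dict = COMPATIBILITY) -> bool:
    """True when the matrix lists this component version for the platform version."""
    return version in matrix.get(platform, {}).get(component, [])
```

In the stack itself the same information would live as a markdown table, so an agent can read it without running any code.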
Step 7: Validate Your Stack
agentic-stacks doctor
Before publishing, check:
- CLAUDE.md has identity, critical rules, routing table, and workflows
- stack.yaml lists all skills with correct entry paths
- Every skill directory has a README.md
- All commands are exact and copy-pasteable
- No placeholders (TBD, TODO, FIXME) remain
- Known issues are documented for supported versions
- The stack has been tested by having an agent use it end-to-end
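Some items on this checklist lend themselves to a quick script. As one example, the sketch below scans the stack's markdown files for leftover placeholders. It is an independent illustration and makes no claim about what agentic-stacks doctor checks; find_placeholders is a hypothetical helper.

```python
import re
from pathlib import Path

# Markers named in the checklist, matched as whole words.
PLACEHOLDER = re.compile(r"\b(TBD|TODO|FIXME)\b")


def find_placeholders(stack_root: Path) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, marker) for each placeholder found."""
    hits = []
    for path in sorted(stack_root.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for match in PLACEHOLDER.finditer(line):
                rel = path.relative_to(stack_root).as_posix()
                hits.append((rel, lineno, match.group(1)))
    return hits
```

The scan covers only .md files; extend the glob if your stack keeps YAML or script examples outside markdown.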
Designing for Composition
Operators compose multiple stacks in a single project. To make your stack compose well:
- Stay in your domain. A hardware stack shouldn't reimplement networking concepts that a platform stack covers.
- Use depends_on to declare stacks that pair well with yours.
- Avoid conflicting file outputs. Document what files your stack creates in project.structure.
- Name skills distinctively. When an agent loads multiple stacks, skill names should make the domain clear.
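Conflicting file outputs can be spotted by comparing the project.structure lists of the stacks being composed. The sketch below assumes those lists have already been read from each stack.yaml; find_output_conflicts and the stack names in the usage example are hypothetical.

```python
def find_output_conflicts(structures: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each path claimed by more than one stack to the stacks that claim it."""
    owners: dict[str, set[str]] = {}
    for stack, paths in structures.items():
        for path in paths:
            # Treat "dir/" and "dir" as the same path.
            owners.setdefault(path.rstrip("/"), set()).add(stack)
    return {path: sorted(stacks) for path, stacks in owners.items() if len(stacks) > 1}


conflicts = find_output_conflicts({
    "hardware-stack": ["inventory/", "config.yaml"],
    "platform-stack": ["config.yaml", "manifests/"],
})
print(conflicts)
```

Any path that shows up in the result is a candidate for renaming or for moving under a stack-specific directory.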
Reference Implementations
| Stack | Complexity | Pattern |
|---|---|---|
| openstack-kolla | Simple | Flat phase-based (8 skills) |
| kubernetes-talos | Comprehensive | Two-layer phase/domain (20 skills) |