An autonomous experiment loop that improves AI skills and codebases while you sleep. Works with Claude Cowork and OpenClaw. Optimize SKILL.md files or any codebase, running fully autonomously or stepping through each decision in guided mode.
Improve a Skill's SKILL.md by running evals, grading outputs, and scoring with a composite metric. Works with Claude Cowork and OpenClaw skills.
Improve any file against any shell command that returns a number: test coverage, bundle size, Lighthouse, Docker image size, lint errors.
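Concretely, the metric can be any shell command that prints a single number on stdout. A minimal sketch (the directory, file, and TODO-count metric are illustrative examples, not anything Skill Forge ships):

```shell
# Illustrative metric: count TODO markers in a source tree (lower is better).
# The directory and file here are made up for demonstration.
mkdir -p /tmp/skill_forge_demo
printf 'x = 1  # TODO fix\ny = 2\n' > /tmp/skill_forge_demo/app.py

# grep -rc prints "file:count" per file; awk sums the counts into one number.
grep -rc "TODO" /tmp/skill_forge_demo | awk -F: '{ sum += $2 } END { print sum }'
```

Anything with the same contract works: `wc -l` on a lint report, `stat` on a bundle file, or a script that extracts a Lighthouse score.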
Fully autonomous loop. Start it, go to sleep, wake up to a morning report with measurable improvements.
Interactive loop with 5 checkpoints where you review, adjust, and approve each experiment before it runs.
In guided mode, you stay in control while the agents do the heavy lifting.
See generated test cases, adjust assertions, change weights, add your own edge cases.
The agent proposes what to change and why. Accept, modify, or provide your own direction.
See exactly what will change before it's applied. Edit, approve, or skip.
See the score before and after. Override the automatic recommendation if you know better.
See progress so far and decide: another round, N more rounds, or generate the final report.
Each experiment cycle is driven by three specialized agents working in sequence.
Analyzes failures, consults the coverage matrix, and produces a testable hypothesis targeting the highest-impact root cause.
Applies one minimal, focused change. Works on SKILL.md instructions or code files. Documents every change with category tags.
An LLM-as-Judge that evaluates output quality across four dimensions. In Generic Mode, the shell command provides the score directly.
Every mutation is measured against a composite metric. No vibes, no gut feeling — just numbers.
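A composite metric can be as simple as a weighted sum of per-dimension judge scores. A sketch in awk; the dimension names and weights below are hypothetical examples, not Skill Forge's actual defaults:

```shell
# Illustrative composite: weighted sum of four judge scores (0-10 each).
# Dimension names and weights are hypothetical, not the tool's defaults.
printf 'correctness 8\ncompleteness 7\nclarity 9\nsafety 10\n' |
awk '
  BEGIN { w["correctness"]=0.4; w["completeness"]=0.3; w["clarity"]=0.2; w["safety"]=0.1 }
  { total += w[$1] * $2 }
  END { printf "%.1f\n", total }   # 0.4*8 + 0.3*7 + 0.2*9 + 0.1*10 = 8.1
'
```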
Tested on production skills with measurable before/after improvements.
Production-grade safeguards so you can run it unattended overnight.
6-step wizard with validation gates. Each step must pass before proceeding.
Mandatory pre-flight check. The loop only starts if evals and metrics actually work.
A 60/40 split of eval cases into optimization and holdout sets prevents overfitting to specific cases.
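The idea behind the split can be sketched in a few lines of shell (file paths and case names are illustrative; a real split would shuffle first, e.g. with `shuf`):

```shell
# Sketch of a 60/40 split over ten eval cases (paths are examples).
printf 'case%d\n' 1 2 3 4 5 6 7 8 9 10 > /tmp/evals.txt
head -n 6 /tmp/evals.txt > /tmp/evals_train.txt    # 60%: drives mutations
tail -n 4 /tmp/evals.txt > /tmp/evals_holdout.txt  # 40%: scoring only
wc -l < /tmp/evals_train.txt
```

Scoring the accepted mutation only on the holdout set is what catches changes that merely memorize the optimization cases.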
Tracks which improvement categories have been tried, with saturation detection.
One-line-per-experiment flat log. Quick monitoring with tail, grep, awk.
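With one line per experiment, standard text tools cover most monitoring needs. The log format below is a hypothetical example, not Skill Forge's actual schema:

```shell
# Hypothetical one-line-per-experiment log (fields are made up).
printf '%s\n' \
  '02:10 exp=1 cat=clarity score=6.8 accepted=yes' \
  '02:25 exp=2 cat=examples score=6.5 accepted=no' \
  '02:40 exp=3 cat=structure score=7.1 accepted=yes' > /tmp/forge.log

tail -n 1 /tmp/forge.log                # latest experiment
grep -c 'accepted=yes' /tmp/forge.log   # how many mutations stuck
# Best score so far: split "score=6.8" on "=" and track the numeric max.
awk '{ split($4, s, "="); if (s[2] + 0 > max + 0) max = s[2] } END { print max }' /tmp/forge.log
```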
Classifies errors, retries once, then skips. Three consecutive crashes pause the loop.
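The retry-then-skip policy can be sketched as follows; `run_experiment` is a stand-in stub, and the counter logic is an assumption about how such a guard is typically wired, not Skill Forge's actual implementation:

```shell
# Hypothetical sketch of the retry-then-skip policy.
consecutive_crashes=0

run_with_retry() {
  run_experiment "$1" && return 0   # first attempt
  run_experiment "$1" && return 0   # one retry
  return 1                          # give up: caller skips this experiment
}

run_experiment() { [ "$1" -ne 2 ]; }  # stub: experiment 2 always crashes

for exp in 1 2 3; do
  if run_with_retry "$exp"; then
    consecutive_crashes=0           # a success resets the crash counter
  else
    consecutive_crashes=$((consecutive_crashes + 1))
    [ "$consecutive_crashes" -ge 3 ] && { echo "paused"; break; }
    echo "skipped exp $exp"         # prints "skipped exp 2"
  fi
done
```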
Early rounds explore broadly; late rounds focus on the categories with the best results.
Fresh eval queries after 5 experiments to maintain diversity.
Skill Forge uses the standard SKILL.md format and runs wherever AI skills are supported.