v2 · Open Source · MIT License · Claude Cowork · OpenClaw

Skill Forge

An autonomous experiment loop that makes AI skills and codebases better while you sleep. Works with Claude Cowork and OpenClaw. Run it fully autonomously or step through each decision in guided mode.

Wizard → Dry-Run → Hypothesize → Mutate → Evaluate → Score → Keep / Revert
Setup Wizard with Validation Gates → Experiment Loop → Morning Report

Two Domains, Two Execution Modes

Optimize SKILL.md files or any codebase. Run fully autonomously or step through each decision.

Skill Mode

Optimize AI Skills

Improve a Skill's SKILL.md by running evals, grading outputs, and scoring with a composite metric. Works with Claude Cowork and OpenClaw skills.

  • Assertion-based scoring
  • Optional LLM-as-Judge
  • Train/test split for overfitting protection
Generic Mode

Optimize Any Metric

Improve any file against any shell command that returns a number: test coverage, bundle size, Lighthouse, Docker image size, lint errors.

  • Any shell command as metric
  • Higher-is-better or lower-is-better
  • Glob pattern for scope
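As a rough sketch of how such a shell-command metric might be wired up — run the command, pull the last number from its output, compare to the baseline in the configured direction. The command, numbers, and extraction logic below are illustrative assumptions, not Skill Forge internals:

```shell
# Stand-in for any real metric command (lint, coverage, bundle size, ...)
metric_cmd='echo "lint errors: 42"'

# Extract the last number the command prints
score=$(eval "$metric_cmd" | grep -Eo '[0-9]+([.][0-9]+)?' | tail -n 1)

baseline=57
direction="lower"   # this metric counts errors, so lower is better

# Keep the mutation only if the score moved in the right direction
if [ "$direction" = "lower" ]; then
  decision=$(awk -v s="$score" -v b="$baseline" \
    'BEGIN { if (s + 0 < b + 0) print "keep"; else print "revert" }')
else
  decision=$(awk -v s="$score" -v b="$baseline" \
    'BEGIN { if (s + 0 > b + 0) print "keep"; else print "revert" }')
fi
echo "$decision"   # keep, since 42 < 57
```

Anything that prints a number works the same way, which is what makes the mode generic.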
Auto Mode

Run Unattended

Fully autonomous loop. Start it, go to sleep, wake up to a morning report with measurable improvements.

  • Perfect for Scheduled Tasks
  • Automatic keep/revert decisions
  • No user interaction needed
Guided Mode

Decide at Every Step

Interactive loop with 5 checkpoints where you review, adjust, and approve each experiment before it runs.

  • Review and edit evals
  • Approve or redirect hypotheses
  • Override keep/revert decisions

Guided Mode: 5 Checkpoints

In guided mode, you stay in control while the agents do the heavy lifting.

1

Review Evals

See generated test cases, adjust assertions, change weights, add your own edge cases.

2

Approve Hypothesis

The agent proposes what to change and why. Accept, modify, or provide your own direction.

3

Review Mutation Diff

See exactly what will change before it's applied. Edit, approve, or skip.

4

Evaluate Results

See the score before and after. Override the automatic recommendation if you know better.

5

Continue or Stop

See progress so far and decide: another round, N more rounds, or generate the final report.

Three Agents, One Loop

Each experiment cycle is driven by three specialized agents working in sequence.

🔬

Hypothesis Agent

The Scientist

Analyzes failures, consults the coverage matrix, and produces a testable hypothesis targeting the highest-impact root cause.

🩹

Mutator Agent

The Surgeon

Applies one minimal, focused change. Works on SKILL.md instructions or code files. Documents every change with category tags.

⚖️

Scorer Agent

The Judge

LLM-as-Judge evaluating output quality across four dimensions. In Generic Mode, the shell command provides the score directly.

Objective Scoring

Every mutation is measured against a composite metric. No vibes, no gut feeling — just numbers.

Skill Mode — Composite Score
assertion_pass_rate × 0.80  +  efficiency_score × 0.20
Skill Mode — With LLM-as-Judge
assertions × 0.50  +  llm_judge × 0.30  +  efficiency × 0.20
Generic Mode — Direct Metric
shell_command → extract_number → compare_to_baseline → keep/revert
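As a hedged sketch, the LLM-as-Judge composite above can be computed with plain awk arithmetic. The variable names and values here are made up for illustration:

```shell
# Component scores, each normalized to 0..1 (illustrative values)
assertion_pass_rate=0.80
llm_judge=0.90
efficiency=0.70

# Weighted sum per the Skill Mode formula: 0.50 / 0.30 / 0.20
composite=$(awk -v a="$assertion_pass_rate" -v j="$llm_judge" -v e="$efficiency" \
  'BEGIN { printf "%.2f", a * 0.50 + j * 0.30 + e * 0.20 }')
echo "$composite"   # 0.81
```

A mutation is kept only when this number beats the previous baseline.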

Real-World Results

Tested on production skills with measurable before/after improvements.

humanizer

Text humanization · 3 experiments
Before: 0.74 → After: 0.90 (+21.6%)
Personality as a dedicated workflow step with concrete criteria.

fachbuch-lektorat

German technical book editing · 3 experiments
Before: 87% → After: 100%
Worked examples beat abstract rules for edge-case handling.

was-bisher-geschah

AI news briefing · 1 experiment
Before: 93% → After: 100%
Explicit constraints (char limits, required fields) fix format drift.

Built to be Robust

Production-grade safeguards so you can run it unattended overnight.

Setup Wizard

6-step wizard with validation gates. Each step must pass before proceeding.

Dry-Run Gate

Mandatory pre-flight check. The loop only starts if evals and metrics actually work.

Train/Test Split

60/40 split prevents overfitting to specific eval cases.
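A minimal sketch of what a 60/40 split over eval cases looks like (file names are assumptions): mutations are scored on the train set, and the held-out test set checks that gains generalize rather than overfit.

```shell
# Five hypothetical eval cases
evals="eval_01 eval_02 eval_03 eval_04 eval_05"

n=$(echo "$evals" | wc -w)
split=$(( n * 60 / 100 ))   # 60% of 5 cases -> 3 train cases

train=$(echo "$evals" | cut -d' ' -f1-"$split")
test_set=$(echo "$evals" | cut -d' ' -f"$((split + 1))"-)

echo "train: $train"      # train: eval_01 eval_02 eval_03
echo "test:  $test_set"   # test:  eval_04 eval_05
```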

Coverage Matrix

Tracks which improvement categories have been tried, with saturation detection.

TSV Experiment Log

One-line-per-experiment flat log. Quick monitoring with tail, grep, awk.
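A flat TSV is exactly what standard Unix tools want. The rows and column layout below are an assumption for the sketch, not the tool's exact schema:

```shell
# Hypothetical one-line-per-experiment log
printf 'round\tcategory\tbefore\tafter\tdecision\n' >  forge-log.tsv
printf '1\tadd_examples\t0.74\t0.81\tkeep\n'        >> forge-log.tsv
printf '2\ttighten_wording\t0.81\t0.79\trevert\n'   >> forge-log.tsv

# Live monitoring while the loop runs:  tail -f forge-log.tsv

# Count experiments that were kept
kept=$(awk -F'\t' '$5 == "keep" { n++ } END { print n + 0 }' forge-log.tsv)
echo "experiments kept: $kept"   # experiments kept: 1
```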

Crash Recovery

Classifies errors, retries once, then skips. 3 consecutive crashes pause the loop.

Exploration-Exploitation

Early rounds explore broadly; later rounds focus on the categories with the best results.

Eval Rotation

Fresh eval queries after 5 experiments to maintain diversity.

Quick Start

Terminal
# Clone the repo
git clone https://github.com/GodModeAI2025/skill-forge.git

# Copy into your skills folder (Cowork or OpenClaw)
cp -r skill-forge/ ~/.skills/skills/skill-forge/

# Auto mode: let it run overnight
"Use skill-forge to improve my linkedin-content skill"

# Guided mode: decide at every step
"Use skill-forge in guided mode to improve my humanizer skill"

# Generic mode: optimize any metric
"Use skill-forge to optimize train.py — metric: python train.py --eval"

Works with your platform

Skill Forge uses the standard SKILL.md format and runs wherever AI skills are supported.

Claude Cowork
OpenClaw
Any SKILL.md host