An autonomous experiment loop that improves AI skills and codebases while you sleep. Works with Claude Cowork and OpenClaw. Optimize SKILL.md files or any codebase, running fully autonomously or stepping through each decision in guided mode.
Improve a Skill's SKILL.md by running evals, grading outputs, and scoring with a composite metric. Works with Claude Cowork and OpenClaw skills.
Improve any file against any shell command that returns a number: test coverage, bundle size, Lighthouse, Docker image size, lint errors.
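Concretely, the metric can be any shell command that prints a single number on stdout. A minimal sketch (the directory, file, and TODO-count metric are illustrative examples, not anything Skill Forge ships):

```shell
# Illustrative metric: count TODO markers in a source tree (lower is better).
# The directory and file here are made up for demonstration.
mkdir -p /tmp/skill_forge_demo
printf 'x = 1  # TODO fix\ny = 2\n' > /tmp/skill_forge_demo/app.py

# grep -rc prints "file:count" per file; awk sums the counts into one number.
grep -rc "TODO" /tmp/skill_forge_demo | awk -F: '{ sum += $2 } END { print sum }'
```

Anything with the same contract works: `wc -l` on a lint report, `stat` on a bundle file, or a script that extracts a Lighthouse score.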
Fully autonomous loop. Start it, go to sleep, wake up to a morning report with measurable improvements.
Interactive loop with 5 checkpoints where you review, adjust, and approve each experiment before it runs.
In guided mode, you stay in control while the agents do the heavy lifting.
See generated test cases, adjust assertions, change weights, add your own edge cases.
The agent proposes what to change and why. Accept, modify, or provide your own direction.
See exactly what will change before it's applied. Edit, approve, or skip.
See the score before and after. Override the automatic recommendation if you know better.
See progress so far and decide: another round, N more rounds, or generate the final report.
Each experiment cycle is driven by three specialized agents working in sequence.
Analyzes failures, consults the coverage matrix, and produces a testable hypothesis targeting the highest-impact root cause.
Applies one minimal, focused change. Works on SKILL.md instructions or code files. Documents every change with category tags.
An LLM-as-Judge that evaluates output quality across four dimensions. In Generic Mode, the shell command provides the score directly.
Every mutation is measured against a composite metric. No vibes, no gut feeling — just numbers.
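A composite metric can be as simple as a weighted sum of per-dimension judge scores. A sketch in awk; the dimension names and weights below are hypothetical examples, not Skill Forge's actual defaults:

```shell
# Illustrative composite: weighted sum of four judge scores (0-10 each).
# Dimension names and weights are hypothetical, not the tool's defaults.
printf 'correctness 8\ncompleteness 7\nclarity 9\nsafety 10\n' |
awk '
  BEGIN { w["correctness"]=0.4; w["completeness"]=0.3; w["clarity"]=0.2; w["safety"]=0.1 }
  { total += w[$1] * $2 }
  END { printf "%.1f\n", total }   # 0.4*8 + 0.3*7 + 0.2*9 + 0.1*10 = 8.1
'
```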
Tested on production skills with measurable before/after improvements.
Production-grade safeguards so you can run it unattended overnight.
6-step wizard with validation gates. Each step must pass before proceeding.
Mandatory pre-flight check. The loop only starts if evals and metrics actually work.
A 60/40 split of eval cases into optimization and holdout sets prevents overfitting to specific cases.
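The idea behind the split can be sketched in a few lines of shell (file paths and case names are illustrative; a real split would shuffle first, e.g. with `shuf`):

```shell
# Sketch of a 60/40 split over ten eval cases (paths are examples).
printf 'case%d\n' 1 2 3 4 5 6 7 8 9 10 > /tmp/evals.txt
head -n 6 /tmp/evals.txt > /tmp/evals_train.txt    # 60%: drives mutations
tail -n 4 /tmp/evals.txt > /tmp/evals_holdout.txt  # 40%: scoring only
wc -l < /tmp/evals_train.txt
```

Scoring the accepted mutation only on the holdout set is what catches changes that merely memorize the optimization cases.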
Tracks which improvement categories have been tried, with saturation detection.
One-line-per-experiment flat log. Quick monitoring with tail, grep, awk.
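With one line per experiment, standard text tools cover most monitoring needs. The log format below is a hypothetical example, not Skill Forge's actual schema:

```shell
# Hypothetical one-line-per-experiment log (fields are made up).
printf '%s\n' \
  '02:10 exp=1 cat=clarity score=6.8 accepted=yes' \
  '02:25 exp=2 cat=examples score=6.5 accepted=no' \
  '02:40 exp=3 cat=structure score=7.1 accepted=yes' > /tmp/forge.log

tail -n 1 /tmp/forge.log                # latest experiment
grep -c 'accepted=yes' /tmp/forge.log   # how many mutations stuck
# Best score so far: split "score=6.8" on "=" and track the numeric max.
awk '{ split($4, s, "="); if (s[2] + 0 > max + 0) max = s[2] } END { print max }' /tmp/forge.log
```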
Classifies errors, retries once, then skips. Three consecutive crashes pause the loop.
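The retry-then-skip policy can be sketched as follows; `run_experiment` is a stand-in stub, and the counter logic is an assumption about how such a guard is typically wired, not Skill Forge's actual implementation:

```shell
# Hypothetical sketch of the retry-then-skip policy.
consecutive_crashes=0

run_with_retry() {
  run_experiment "$1" && return 0   # first attempt
  run_experiment "$1" && return 0   # one retry
  return 1                          # give up: caller skips this experiment
}

run_experiment() { [ "$1" -ne 2 ]; }  # stub: experiment 2 always crashes

for exp in 1 2 3; do
  if run_with_retry "$exp"; then
    consecutive_crashes=0           # a success resets the crash counter
  else
    consecutive_crashes=$((consecutive_crashes + 1))
    [ "$consecutive_crashes" -ge 3 ] && { echo "paused"; break; }
    echo "skipped exp $exp"         # prints "skipped exp 2"
  fi
done
```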
Early rounds explore broadly; late rounds focus on the categories with the best results.
Fresh eval queries after 5 experiments to maintain diversity.
Skill Forge uses the standard SKILL.md format and runs wherever AI skills are supported.