Getting Started with Skill Forge

Your AI assistant gets better overnight — all on its own. Learn how to use Skill Forge.

Autonomous Improvement · No Coding Required · Run It Overnight


01

What is Skill Forge?

Your personal improvement engine for AI skills and code

The Core Idea

Imagine you've written instructions for your AI — a so-called Skill. It works okay, but not perfectly.

Skill Forge takes that skill, tests it automatically, finds weaknesses, and improves it — step by step. And the best part: you can let it run overnight.

Like a Workshop for AI

1. Analyze: Find the weakness
2. Improve: Apply a targeted change
3. Test: Measure if it's better
4. Decide: Keep or revert
💡 Key Insight
This principle comes from AI researcher Andrej Karpathy: automatically test small changes, only keep improvements. Skill Forge applies this to your work.
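The keep-or-revert idea can be sketched in a few lines of Python. This is only an illustration, not Skill Forge's actual code: the function name `forge_loop`, the `proposals` list, and the toy scoring function are all assumptions made up for this example.

```python
from typing import Callable, Iterable


def forge_loop(
    skill: str,
    proposals: Iterable[Callable[[str], str]],
    score: Callable[[str], float],
) -> tuple[str, float]:
    """Try each proposed change and keep it only if the score goes up."""
    best, best_score = skill, score(skill)
    for propose in proposals:
        candidate = propose(best)            # Improve: apply one targeted change
        candidate_score = score(candidate)   # Test: measure the result
        if candidate_score > best_score:     # Decide: keep the change...
            best, best_score = candidate, candidate_score
        # ...otherwise `best` stays untouched, which is the revert
    return best, best_score


# Toy stand-in for a real test suite: reward skills that ask for an example.
def toy_score(text: str) -> float:
    return 1.0 if "example" in text else 0.5


improved, final_score = forge_loop(
    "Write a post.",
    [
        lambda s: s + " Include an example.",   # helps, so it is kept
        lambda s: s.replace("example", ""),     # hurts, so it is reverted
    ],
    toy_score,
)
print(improved)      # Write a post. Include an example.
print(final_score)   # 1.0
```

Because a rejected candidate never replaces `best`, the measured score in this sketch cannot go down from one round to the next.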

What Skill Forge Has Already Achieved

✍️
Humanizer Skill
From 74% to 90% quality in just 3 experiments. Texts now sound more natural and human.
📖
Book Editing
From 87% to 100% — all test cases passed after 3 rounds of automatic improvement.
📰
AI News Briefing
From 93% to 100% in a single experiment. The LinkedIn character limit is now enforced automatically.
Quiz: What happens when a change makes the skill worse?
A. The change stays anyway
B. The change is reverted
C. The entire skill is deleted
Answer: B. Skill Forge automatically reverts any regression, so a change that lowers the score never replaces the previous version, and your original is always safe.
02

The Two Worlds

Skill Forge has two modes — and you choose how much control you want

What Do You Want to Improve?

🎯
Skill Mode
Optimizes a SKILL.md — the instructions that control your AI. Ideal when you want to improve a Claude skill.
⚙️
Generic Mode
Optimizes any files against a measurable number — e.g. test coverage, file size, or performance score. For developer projects.
ℹ️ Automatic Detection
You don't need to choose the mode yourself. Mention a skill name and Forge picks Skill Mode. Mention a metric and it picks Generic Mode.

How Much Control Do You Want?

             🚀 Auto Mode                     🎛️ Guided Mode
Flow         Runs on its own                  You decide at each step
Ideal for    Running overnight                First-time use
Hypotheses   Applied automatically            You see the proposal first
Result       Read the report in the morning   Experience it step by step

Three Helpers Behind the Scenes

Skill Forge uses three specialized agents that work together:

🔬
The Scientist
Analyzes what's going wrong and forms a hypothesis: "I believe the problem is here."
🔧
The Surgeon
Implements the hypothesis as a minimal change. Change as little as possible, as much as necessary.
⚖️
The Judge
Evaluates the result objectively. Was the change an improvement? Or did it make things worse?
Quiz: You want to improve your LinkedIn post skill overnight. Which mode do you use?
A. Generic Mode, Guided
B. Skill Mode, Auto
C. Generic Mode, Auto
D. Skill Mode, Guided
Answer: B. A LinkedIn post skill is a SKILL.md, so Skill Mode applies. Running overnight calls for Auto Mode. So: Skill Mode, Auto.
03

The Experiment Loop

The core feature, step by step

How a Round Works

Every improvement round follows the same pattern:

🧙 Wizard → 🔬 Hypothesis → 🔧 Mutation → 📊 Measure → ⚖️ Keep/Revert

Setup Wizard: Before anything starts, an assistant walks you through 6 simple steps, including choosing the mode, setting the target, creating test cases, and running a trial.

The Wizard — Your Starting Point

Before the loop starts, the Wizard sets everything up:

6 Steps to Launch
1. Choose Mode: Auto or Guided? Skill or Generic?
2. Find Target: Which file should be improved?
3. Create Test Cases: 3–5 automatic tests, split into training and validation
4. Set Direction: Higher is better? Or lower?
5. Dry Run: A trial run to make sure everything works
6. Confirm & Launch: Everything looks good? Let's go!
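To make the purpose of the dry run in step 5 concrete, here is a rough sketch of what such a check could look like. The function name `dry_run` and the idea of passing a `measure` callable are assumptions for illustration; the wizard's real checks may differ.

```python
import math
from typing import Callable, Sequence


def dry_run(measure: Callable[[str], float], test_cases: Sequence[str]) -> list[float]:
    """Run every test case once and fail fast if any score is unusable."""
    if not test_cases:
        raise ValueError("no test cases defined")
    scores = []
    for case in test_cases:
        value = measure(case)
        # A score must be a real number, otherwise the loop cannot compare rounds.
        if not isinstance(value, (int, float)) or math.isnan(value):
            raise ValueError(f"test case {case!r} did not produce a numeric score")
        scores.append(float(value))
    return scores


# Hypothetical measurement: score a test case by its length.
print(dry_run(lambda case: len(case) / 10, ["short", "a bit longer"]))  # [0.5, 1.2]
```

Catching a broken measurement here costs seconds, whereas discovering it in the morning costs the whole overnight run.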

When Does the Loop Stop?

🎯
Goal Reached
Score of 95% or higher — the skill is excellent.
🔄
Limit Hit
10 experiments completed (or your custom limit).
📈
Plateau Detected
After 3 consecutive rounds without improvement, Forge concludes that further gains are unlikely and stops.
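The three stop conditions can be expressed as one small check. The sketch below is an assumption about how they might combine, not the tool's real logic: `stop_reason` and its parameter names are invented, and `best_scores` is assumed to hold the baseline followed by the best score after each experiment.

```python
from typing import Optional, Sequence


def stop_reason(
    best_scores: Sequence[float],
    goal: float = 0.95,
    max_experiments: int = 10,
    patience: int = 3,
) -> Optional[str]:
    """Return why the loop should stop, or None if it should continue.

    best_scores[0] is the baseline; each later entry is the best score so far
    after one more experiment, so the sequence never decreases.
    """
    experiments = len(best_scores) - 1
    if best_scores[-1] >= goal:
        return "goal reached"
    if experiments >= max_experiments:
        return "limit hit"
    # No gain across the last `patience` experiments means a plateau.
    if experiments >= patience and best_scores[-1] <= best_scores[-1 - patience]:
        return "plateau detected"
    return None


print(stop_reason([0.74, 0.82, 0.90]))              # None
print(stop_reason([0.93, 1.0]))                     # goal reached
print(stop_reason([0.74, 0.80, 0.80, 0.80, 0.80]))  # plateau detected
```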
Why does Skill Forge run a "Dry Run" before the actual loop?
To already improve the skill once
To show you the result in advance
To ensure the measurement works
Exactly! The dry run checks that tests run correctly and a score can be calculated. This prevents errors in the overnight run.
Not quite. The dry run is a safety check: it ensures the measurement works before the loop starts. This keeps everything running smoothly.
04

Tips & Settings

Get the most out of Skill Forge

Quick Start — Three Sentences to Go

💬
Just say:
"Use Skill Forge to improve my linkedin-content skill"
🎛️
With control:
"Use Skill Forge in guided mode for my humanizer skill — I want to decide at each step"

Key Settings

You can adjust these values at launch or keep the defaults:

Experiments
Maximum 10 rounds (default). Increase to 15–20 for complex skills.
Time Budget
120 minutes (default). Set to 360+ for overnight runs.
Threshold
A change needs at least 2% improvement to be kept.
Crash Limit
After 3 consecutive errors, the loop stops. Prevents endless loops.
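As a way to picture how these four settings interact, here is a small sketch. Everything in it is hypothetical: the `ForgeSettings` class, its field names, and the helper functions are not Skill Forge's real configuration format. It also assumes the 2% threshold means two percentage points on a 0 to 1 score scale, which the text does not specify.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable


@dataclass(frozen=True)
class ForgeSettings:
    """Hypothetical settings container; defaults mirror the values listed above."""
    max_experiments: int = 10        # Experiments
    time_budget_minutes: int = 120   # Time Budget
    min_improvement: float = 0.02    # Threshold (assumed: 2 percentage points)
    crash_limit: int = 3             # Crash Limit


def is_kept(old_score: float, new_score: float, settings: ForgeSettings) -> bool:
    """A change is kept only if it clears the improvement threshold."""
    return new_score - old_score >= settings.min_improvement


def run_experiments(
    experiments: Iterable[Callable[[], None]], settings: ForgeSettings
) -> str:
    """Run experiments until the list ends, the limit is hit, or crashes pile up."""
    crashes_in_a_row = 0
    for index, experiment in enumerate(experiments):
        if index >= settings.max_experiments:
            return "limit hit"
        try:
            experiment()
            crashes_in_a_row = 0  # any successful round resets the counter
        except Exception:
            crashes_in_a_row += 1
            if crashes_in_a_row >= settings.crash_limit:
                return "stopped after repeated crashes"
    return "finished"


# An overnight profile: more rounds and a longer time budget.
overnight = replace(ForgeSettings(), max_experiments=20, time_budget_minutes=360)
```

Resetting the counter on every success is what makes the limit apply to consecutive errors only, so a single flaky round does not end the run.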

The Morning Report

After the overnight run, you'll find a summary of everything that happened:

What's in the Report
📊
Score Progression
Where you started and where you are now
✅
Top Improvements
The 3 most impactful changes in detail
❌
What Didn't Work
Failed hypotheses — and why
💡
Recommendations
What you could do next
🛡️ Safety Net
Skill Forge creates a backup (Snapshot) before every change. Even after 10 experiments, you can always return to the original.
05

Frequently Asked Questions

Answers to the most important questions — and what to do when things go wrong

The Key Questions

▸ Can Skill Forge make my skill worse?
No. Every change that lowers the score is automatically reverted. Plus, snapshots let you return to the original state at any time.
▸ What is "overfitting" and why should I care?
Overfitting means your skill aces the test cases but fails on new tasks. Skill Forge protects against this with a train/test split: 60% of the test cases are used to optimize the skill, and the remaining 40% are held back to verify the result on cases the optimizer never saw.
▸ How long does a run take?
Each individual experiment takes 5–15 minutes. A full run with 10 experiments typically takes 1–2 hours. You can adjust the time budget in the wizard.
▸ What do I need to get started?
Just a Claude Cowork account and the skill-forge folder in your skills directory. Then simply say: "Use Skill Forge to improve my [skill-name]" — and the wizard guides you through the rest.
▸ What if the loop crashes?
Skill Forge has built-in crash detection. Individual errors are skipped. Only after 3 consecutive failures does the loop stop and report the issue. Your skill always stays safe.
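The 60/40 train/test split mentioned in the overfitting answer can be illustrated with a short sketch. The function name `split_cases`, the fixed seed, and the rounding rule are assumptions for this example; how Skill Forge actually assigns cases is not described here.

```python
import random
from typing import Sequence


def split_cases(
    cases: Sequence[str], train_fraction: float = 0.6, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Shuffle the test cases and split them into training and held-out sets."""
    if len(cases) < 2:
        raise ValueError("need at least two test cases to split")
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    cut = max(1, min(cut, len(shuffled) - 1))  # keep both sides non-empty
    return shuffled[:cut], shuffled[cut:]


train, held_out = split_cases(["case-1", "case-2", "case-3", "case-4", "case-5"])
print(len(train), len(held_out))  # 3 2
```

With the 3 to 5 test cases the wizard creates, the held-out set is small, so a held-out score is a sanity check rather than a precise estimate.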
Quiz: What does the "Train/Test Split" in Skill Forge protect against?
A. Data loss during crashes
B. The skill being optimized only for the test cases
C. Excessive runtime
Answer: B. The train/test split protects against overfitting: 60% of the tests are used to optimize, and 40% remain blind, verifying that improvements also work on new, unseen tasks.

Ready to Get Started?

Copy the skill-forge/ folder into your skills directory and tell Claude:

"Use Skill Forge to improve my [skill-name] skill"
🎉 You're All Set!
You now know everything you need. The wizard handles the rest. Good luck with your improved skill!