Getting Started with Skill Forge

Your AI assistant gets better overnight — all on its own. Learn how to use Skill Forge.

Autonomous Improvement · No Coding Required · Run It Overnight

01

What is Skill Forge?

Your personal improvement engine for AI skills and code

The Core Idea

Imagine you've written a set of instructions for your AI, known as a Skill. It works okay, but not perfectly.

Skill Forge takes that skill, tests it automatically, finds weaknesses, and improves it — step by step. And the best part: you can let it run overnight.

Like a Workshop for AI

1

Analyze

Find the weakness

→

2

Improve

Apply a targeted change

→

3

Test

Measure if it's better

→

4

Decide

Keep or revert

💡 Key Insight

This principle comes from AI researcher Andrej Karpathy: automatically test small changes, only keep improvements. Skill Forge applies this to your work.
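
The four steps above can be pictured as a simple keep-or-revert loop. The following is only a minimal sketch of that idea, not Skill Forge's actual implementation: the names `improve`, `propose`, and `score` are hypothetical, and the toy scoring function merely stands in for real test cases.

```python
import random
from typing import Callable, Tuple

def improve(candidate: str,
            propose: Callable[[str], str],
            score: Callable[[str], float],
            rounds: int = 10) -> Tuple[str, float]:
    """Keep-or-revert loop: a change survives only if it scores higher."""
    best, best_score = candidate, score(candidate)
    for _ in range(rounds):
        trial = propose(best)          # steps 1-2: pick a weakness, apply one change
        trial_score = score(trial)     # step 3: measure
        if trial_score > best_score:   # step 4: decide
            best, best_score = trial, trial_score   # keep
        # otherwise revert: `best` is simply left untouched
    return best, best_score

# Toy stand-ins (assumptions for illustration only).
TARGET = "clear and short"
rng = random.Random(0)

def toy_score(text: str) -> float:
    """Fraction of characters that already match the target."""
    return sum(a == b for a, b in zip(text, TARGET)) / len(TARGET)

def toy_propose(text: str) -> str:
    """Change exactly one character at a random position."""
    i = rng.randrange(len(text))
    return text[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz ") + text[i + 1:]

start = "x" * len(TARGET)
final, final_score = improve(start, toy_propose, toy_score, rounds=200)
```

Because a trial is only adopted when it beats the current best, `final_score` can never fall below the starting score, which is the property the safety net relies on.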

What Skill Forge Has Already Achieved

✍️

Humanizer Skill

From 74% to 90% quality in just 3 experiments. Texts now sound more natural and human.

📖

Book Editing

From 87% to 100% — all test cases passed after 3 rounds of automatic improvement.

📰

AI News Briefing

From 93% to 100% in a single experiment. LinkedIn character limit was automatically enforced.

What happens when a change makes the skill worse?

The change stays anyway

The change is reverted

The entire skill is deleted

Exactly! Skill Forge automatically reverts any regression. Your skill can only get better, never worse.

Not quite. Skill Forge has a safety net: any worsening is automatically reverted. Your original is always safe.

02

The Two Worlds

Skill Forge has two modes — and you choose how much control you want

What Do You Want to Improve?

🎯

Skill Mode

Optimizes a SKILL.md — the instructions that control your AI. Ideal when you want to improve a Claude skill.

⚙️

Generic Mode

Optimizes any set of files against a measurable metric, e.g. test coverage, file size, or a performance score. For developer projects.

ℹ️ Automatic Detection

You don't need to choose the mode yourself. Mention a skill name and Forge picks Skill Mode. Mention a metric and it picks Generic Mode.

How Much Control Do You Want?

	🚀 Auto Mode	🎛️ Guided Mode
Flow	Runs on its own	You decide at each step
Ideal for	Running overnight	First-time use
Hypotheses	Applied automatically	You see the proposal first
Result	Read the report in the morning	Experience it step by step

Three Helpers Behind the Scenes

Skill Forge uses three specialized Agents that work together:

🔬

The Scientist

Analyzes what's going wrong and forms a hypothesis: "I believe the problem is here."

🔧

The Surgeon

Implements the hypothesis as a minimal change. Change as little as possible, as much as necessary.

⚖️

The Judge

Evaluates the result objectively. Was the change an improvement? Or did it make things worse?

You want to improve your LinkedIn post skill overnight. Which mode do you use?

Generic Mode, Guided

Skill Mode, Auto

Generic Mode, Auto

Skill Mode, Guided

Correct! A LinkedIn post skill is a SKILL.md → Skill Mode. Running overnight → Auto Mode. Perfect combo.

Close! Since it's a skill (SKILL.md) → Skill Mode. Since it runs overnight → Auto Mode. So: Skill Mode, Auto.

03

The Experiment Loop

The core feature step by step — click through the loop

How a Round Works

Every improvement round follows the same pattern. Click a step to learn more:

🧙 Wizard

🔬 Hypothesis

🔧 Mutation

📊 Measure

⚖️ Keep/Revert

Setup Wizard: Before anything starts, an assistant walks you through 6 simple steps, including choosing the mode, setting the target, creating test cases, and running a trial.

The Wizard — Your Starting Point

Before the loop starts, the Wizard sets everything up:

6 Steps to Launch

1

Choose Mode

Auto or Guided? Skill or Generic?

2

Find Target

Which file should be improved?

3

Create Test Cases

3–5 automatic tests, split into training and validation

4

Set Direction

Higher is better? Or lower?

5

Dry Run

A trial run to make sure everything works

6

Confirm & Launch

Everything looks good? Let's go!

When Does the Loop Stop?

🎯

Goal Reached

Score of 95% or higher — the skill is excellent.

🔄

Limit Hit

10 experiments completed (or your custom limit).

📈

Plateau Detected

3 consecutive rounds without improvement? Forge treats this as a plateau and stops, since further experiments are unlikely to help.
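
The three stop conditions can be expressed as one small check. This is a hedged sketch rather than the tool's real logic: the function name `stop_reason` and the score-history format (the baseline first, then the best score after each experiment, as fractions between 0 and 1) are assumptions made for illustration.

```python
from typing import List, Optional

def stop_reason(best_scores: List[float],
                goal: float = 0.95,
                max_experiments: int = 10,
                patience: int = 3) -> Optional[str]:
    """Return why the loop should stop, or None if it should continue.

    best_scores[0] is the baseline; each later entry is the best score
    seen so far after one more experiment (so the list never decreases).
    """
    experiments = len(best_scores) - 1
    if best_scores[-1] >= goal:
        return "goal reached"
    if experiments >= max_experiments:
        return "limit hit"
    # Plateau: the best score has not moved during the last `patience` experiments.
    if experiments >= patience and best_scores[-1] <= best_scores[-1 - patience]:
        return "plateau detected"
    return None

print(stop_reason([0.74, 0.80]))                    # still running
print(stop_reason([0.87, 0.93, 0.96]))              # goal reached
print(stop_reason([0.74, 0.80, 0.80, 0.80, 0.80]))  # plateau detected
```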

Why does Skill Forge run a "Dry Run" before the actual loop?

To already improve the skill once

To show you the result in advance

To ensure the measurement works

Exactly! The dry run checks that tests run correctly and a score can be calculated. This prevents errors in the overnight run.

Not quite. The dry run is a safety check: it ensures the measurement works before the loop starts. This keeps everything running smoothly.

04

Tips & Settings

Get the most out of Skill Forge

Quick Start — Three Sentences to Go

💬

Just say:

"Use Skill Forge to improve my linkedin-content skill"

🎛️

With control:

"Use Skill Forge in guided mode for my humanizer skill — I want to decide at each step"

Key Settings

You can adjust these values at start — or use the defaults:

Experiments

Maximum 10 rounds (default). Increase to 15–20 for complex skills.

Time Budget

120 minutes (default). Set to 360+ for overnight runs.

Threshold

A change needs at least 2% improvement to be kept.

Crash Limit

After 3 consecutive errors, the loop stops. Prevents endless loops.
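
To make the four settings concrete, here is a small sketch that collects the documented defaults in one place. The class and field names are hypothetical (the real settings are entered through the wizard), and reading the 2% threshold as percentage points is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ForgeSettings:
    """Hypothetical container for the defaults listed above."""
    max_experiments: int = 10        # raise to 15-20 for complex skills
    time_budget_minutes: int = 120   # 360+ for overnight runs
    min_improvement: float = 2.0     # assumed to mean percentage points
    crash_limit: int = 3             # consecutive errors before stopping

def should_keep(old_score: float, new_score: float,
                settings: ForgeSettings) -> bool:
    """A change is kept only if it clears the improvement threshold."""
    return new_score - old_score >= settings.min_improvement

defaults = ForgeSettings()
overnight = ForgeSettings(max_experiments=20, time_budget_minutes=360)

print(should_keep(74, 76, defaults))  # gain of 2 points clears the threshold
print(should_keep(74, 75, defaults))  # gain of 1 point does not
```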

The Morning Report

After the overnight run, you'll find a summary of everything that happened:

What's in the Report

📊

Score Progression

Where you started and where you are now

✅

Top Improvements

The 3 most impactful changes in detail

❌

What Didn't Work

Failed hypotheses — and why

💡

Recommendations

What you could do next

🛡️ Safety Net

Skill Forge creates a backup (Snapshot) before every change. Even after 10 experiments, you can always return to the original.
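
The snapshot idea boils down to copying the file before touching it and copying it back when needed. The sketch below illustrates that with the Python standard library only; the function names, the snapshot folder, and the label format are assumptions, not the layout Skill Forge actually uses.

```python
import shutil
import tempfile
from pathlib import Path

def take_snapshot(target: Path, snapshot_dir: Path, label: str) -> Path:
    """Copy the target file into the snapshot folder and return the copy's path."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    snapshot = snapshot_dir / f"{label}-{target.name}"
    shutil.copy2(target, snapshot)
    return snapshot

def restore_snapshot(snapshot: Path, target: Path) -> None:
    """Overwrite the target file with the saved copy."""
    shutil.copy2(snapshot, target)

# Demo in a throwaway directory so no real skill is touched.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "SKILL.md"
    skill.write_text("original instructions\n", encoding="utf-8")

    snap = take_snapshot(skill, Path(tmp) / "snapshots", "before-exp-01")
    skill.write_text("experimental instructions\n", encoding="utf-8")

    restore_snapshot(snap, skill)
    restored = skill.read_text(encoding="utf-8")
```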

05

Frequently Asked Questions

Answers to the most important questions — and what to do when things go wrong

The Key Questions

▸ Can Skill Forge make my skill worse?

No. Every change that lowers the score is automatically reverted. Plus, snapshots let you return to the original state at any time.

▸ What is "overfitting" and why should I care?

Overfitting means your skill aces the test cases it was tuned on but fails on new tasks. Skill Forge protects against this with a train/test split: 60% of the test cases drive the optimization, and the remaining 40% are held back and used only to verify the result.

▸ How long does a run take?

Each individual experiment takes 5–15 minutes. A full run with 10 experiments typically takes 1–2 hours. You can adjust the time budget in the wizard.

▸ What do I need to get started?

Just a Claude Cowork account and the skill-forge folder in your skills directory. Then simply say: "Use Skill Forge to improve my [skill-name]" — and the wizard guides you through the rest.

▸ What if the loop crashes?

Skill Forge has built-in crash detection. Individual errors are skipped. Only after 3 consecutive failures does the loop stop and report the issue. Your skill always stays safe.
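
The crash rule described in the last answer (skip single errors, stop after 3 in a row) can be sketched with a counter that resets on every success. The function name and return format are hypothetical and only illustrate the behavior described above.

```python
from typing import Callable, List, Tuple

def run_with_crash_limit(experiments: List[Callable[[], float]],
                         crash_limit: int = 3) -> Tuple[List[float], int, str]:
    """Run experiments in order, skipping failures.

    Returns (scores of completed experiments, number skipped, final status).
    """
    completed: List[float] = []
    skipped = 0
    consecutive = 0
    for experiment in experiments:
        try:
            completed.append(experiment())
            consecutive = 0                # any success resets the streak
        except Exception:
            skipped += 1
            consecutive += 1
            if consecutive >= crash_limit:
                return completed, skipped, "stopped: crash limit reached"
    return completed, skipped, "finished"

def ok() -> float:
    return 0.8

def boom() -> float:
    raise RuntimeError("simulated failure")

scattered = run_with_crash_limit([ok, boom, ok, boom, boom, ok])
streak = run_with_crash_limit([ok, boom, boom, boom, ok])
```

In the first call the failures never reach three in a row, so the run finishes; in the second, the third consecutive failure ends the run before the last experiment starts.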

What does the "Train/Test Split" in Skill Forge protect against?

Data loss during crashes

The skill being optimized only for the test cases

Excessive runtime

Right! The split ensures 40% of tests remain blind — verifying that improvements also work on new, unseen tasks.

Not quite. The train/test split protects against overfitting: 60% of tests optimize, 40% remain blind — ensuring improvements generalize.
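
A 60/40 split like the one described here might look as follows. This is an illustrative sketch under stated assumptions: the function name, the fixed shuffle seed, and the rounding rule for small test sets are not taken from Skill Forge itself.

```python
import random
from typing import List, Sequence, Tuple

def split_tests(cases: Sequence[str],
                train_fraction: float = 0.6,
                seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the test cases and split them into (training, held-out).

    Needs at least two cases so that both parts are non-empty.
    """
    if len(cases) < 2:
        raise ValueError("need at least two test cases to split")
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_fraction)
    n_train = max(1, min(n_train, len(shuffled) - 1))
    return shuffled[:n_train], shuffled[n_train:]

cases = ["case-1", "case-2", "case-3", "case-4", "case-5"]
train, held_out = split_tests(cases)
print(len(train), len(held_out))  # 3 training cases, 2 held out
```

Only the training cases would guide the changes; the held-out cases are scored afterwards to check that an improvement carries over to inputs the loop never optimized against.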

Ready to Get Started?

Copy the skill-forge/ folder into your skills directory and tell Claude:

"Use Skill Forge to improve my [skill-name] skill"

🎉 You're All Set!

You now know everything you need. The wizard handles the rest. Good luck with your improved skill!