Aceforge

Self-evolving skill engine for OpenClaw agents — Phase 1 core + Phase 2 proactive intelligence + Phase 3 self-validation

Install

openclaw plugins install clawhub:aceforge

AceForge

Your agent keeps doing the same things. AceForge turns those patterns into permanent skills.

It watches what tools your agent calls, what fails, and what you correct — then proposes validated, human-approved skills so your agent never has to figure it out from scratch again. Nothing deploys without your say-so.

A self-evolving skill engine for OpenClaw agents.

AceForge watches how your agent works — what tools it calls, what fails, what you correct — and turns
those patterns into permanent, auditable, human-approved skills. Nothing deploys without your approval.


Why AceForge Exists

Agent skill libraries have a quality problem. SkillsBench (arXiv:2602.12670) reports that 56% of agent skills are never invoked because their descriptions don't match how users actually phrase requests. Community skill marketplaces show a pronounced supply-demand imbalance, with many low-effort skills that underserve users (Ling et al., arXiv:2602.08004). Bad skills also do active harm: in 16 of 84 SkillsBench tasks, poor skills made performance worse. Meanwhile, the ClawHavoc campaign exposed 1,184 malicious skills across ClawHub that targeted agent identity files for persistence attacks.

AceForge addresses these problems by generating skills from real operational data — not templates, not guesswork — and continuously validating that those skills actually work, stay relevant, and remain secure.

What AceForge Is

AceForge is the skill generation and lifecycle layer for OpenClaw agents. It sits between your agent's raw tool usage and its permanent skill library, converting observed behavior into externalized, auditable SKILL.md files through a research-grounded dual-model LLM pipeline.

What AceForge Is Not

  • Not auto-deploying. Every proposed skill, upgrade, and retirement requires human approval.
  • Not a context engine. Compatible with OpenViking, lossless-claw, or OpenClaw's built-in engine. AceForge generates skills; it doesn't own context.
  • Not ClawHub-hostile. If a ClawHub skill serves your agent well, AceForge leaves it alone. It only proposes upgrades when trace data shows the skill is underperforming.
  • Not a fine-tuning system. Skills are externalized artifacts — inspectable, editable, version-controlled. Not model weights.

Observation-only mode: Set ACEFORGE_DRY_RUN=true to log what skills would be proposed without writing anything to disk. Perfect for evaluating AceForge before committing.

Bounded exception: Correction-driven micro-revisions (anti-pattern appends, instruction notes) self-apply without approval. Full rewrites always require approval.


How It Works

AceForge operates as a 12-stage pipeline that runs continuously alongside your agent:

  1. Observe          2. Detect             3. Generate
  after_tool_call     Group by tool         Generator LLM
  Traces: tool,       Threshold: 3x (5x    Reviewer LLM (CoT)
  args, result,       at 20+ skills)        APPROVE / REVISE / REJECT
  corrections                                       ↓
  4. Validate         5. Score              6. Approve
  23 attack checks    Structural (0-100)    All 25+ channels
  SOUL.md detection   Coverage (0-100)      /forge approve <n>
  Credential scan     LLM judge (40-70)     /forge reject <n>
  Path traversal                                    ↓
  7. Deploy           8. Evolve             9. Retire
  skills/ directory   SRLR distillation     Watchdog flags
  Baseline recorded   at 500/2000/5000      A/B compares versions
  Token-budgeted      /forge evolve + diff  Underperformers flagged
                              ↓
  10. Propagate       11. Compose           12. Validate
  Cross-session       Co-activation         Health tests (CLI/path/URL)
  Capability tree     detection for         Grounded challenges
  Description opt     future composition    Adversarial mutations

Every stage is grounded in peer-reviewed research. See Research Basis for the full citation table.


Observation & Pattern Detection

Every tool call your agent makes is logged with full context: arguments, results, success/failure, session identifier, timing, and duration. Corrections from you (phrases like "no, actually..." or "that's wrong") are captured separately and linked to the nearest tool call by temporal proximity.

Multi-tool chains — 3+ distinct tools called within 60 seconds — are detected automatically and logged with sequence order. Session tool history persists to disk, so chain detection survives gateway restarts.
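
The chain rule above can be sketched as a sliding window over one session's tool calls. This is a minimal illustration, not the plugin's capture code: the `ToolCall` shape and the `findFirstChain` name are hypothetical, and the sketch assumes the 60-second window is measured from the first call in the chain.

```typescript
// Hypothetical trace shape; real trace records carry more fields.
interface ToolCall {
  tool: string;
  timestampMs: number;
}

const CHAIN_WINDOW_MS = 60_000; // "within 60 seconds"
const CHAIN_MIN_DISTINCT_TOOLS = 3; // "3+ distinct tools"

// Returns the ordered, de-duplicated tool sequence of the first chain found,
// or null when no 60-second window contains enough distinct tools.
function findFirstChain(calls: ToolCall[]): string[] | null {
  const sorted = [...calls].sort((a, b) => a.timestampMs - b.timestampMs);
  for (let start = 0; start < sorted.length; start++) {
    const sequence: string[] = [];
    for (const call of sorted.slice(start)) {
      // Stop once the call falls outside the window opened at `start`.
      if (call.timestampMs - sorted[start].timestampMs > CHAIN_WINDOW_MS) break;
      if (!sequence.includes(call.tool)) sequence.push(call.tool);
      if (sequence.length >= CHAIN_MIN_DISTINCT_TOOLS) return sequence;
    }
  }
  return null;
}
```

Repeated calls to the same tool do not count twice, so a read, exec, read, write burst yields the sequence read, exec, write.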

When a tool crosses the crystallization threshold (3x occurrences, escalating to 5x at 20+ deployed skills to prevent library bloat), it becomes a generation candidate.

Design intent: The escalating threshold implements the diminishing returns finding from Single-Agent scaling (arXiv:2601.04748) — more skills don't always help, and eventually selection quality degrades. AceForge gates quantity to preserve quality.
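
As a rough sketch of the escalating threshold described in this section: the function names are hypothetical, and the sketch assumes a tool qualifies once its occurrence count reaches the threshold (inclusive).

```typescript
const BASE_THRESHOLD = 3; // 3x occurrences
const ESCALATED_THRESHOLD = 5; // 5x once the library is large
const ESCALATION_SKILL_COUNT = 20; // "at 20+ deployed skills"

function crystallizationThreshold(deployedSkillCount: number): number {
  return deployedSkillCount >= ESCALATION_SKILL_COUNT
    ? ESCALATED_THRESHOLD
    : BASE_THRESHOLD;
}

// Assumption: reaching the threshold exactly is enough to become a candidate.
function isGenerationCandidate(occurrences: number, deployedSkillCount: number): boolean {
  return occurrences >= crystallizationThreshold(deployedSkillCount);
}
```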


Dual-Model LLM Pipeline

Skill generation uses two independent LLMs working in sequence:

  1. Generator (default: MiniMax M2.7) writes the SKILL.md from real trace data — actual arguments, actual failures, actual corrections. The prompt enforces progressive disclosure structure (When to Use → Pre-flight → Instructions → Error Recovery → Anti-Patterns) based on SkillsBench's finding that focused skills with 2-3 modules outperform comprehensive documentation.

  2. Reviewer (default: DeepSeek Chat) critiques the generated skill against structured criteria. It evaluates trigger precision, instruction specificity, anti-pattern grounding, and security. Verdict: APPROVE, REVISE (one retry), or REJECT.

Both models are provider-agnostic — any OpenAI-compatible /chat/completions or Anthropic-native /v1/messages endpoint works. Format auto-detected. Rate-limited to 8 calls per cycle with 2-second intervals.

Design intent: The proposer/judge dual-model loop is validated by Multi-Agent Evolve (arXiv:2510.23595). Independent review with structured criteria (not open-ended judging) delivers 8-11% accuracy improvement per DeepVerifier (arXiv:2601.15808).
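
The generate-then-review loop can be sketched as follows. Both model calls are injected as plain functions, so nothing here depends on a particular provider. The names `proposeSkill`, `GenerateFn`, and `ReviewFn` are hypothetical, and the sketch assumes that a second REVISE verdict after the single retry is treated as a rejection.

```typescript
type Verdict = "APPROVE" | "REVISE" | "REJECT";

interface Review {
  verdict: Verdict;
  feedback: string;
}

// Injected model calls; real implementations would hit an LLM endpoint.
type GenerateFn = (traceSummary: string, feedback?: string) => Promise<string>;
type ReviewFn = (skillMarkdown: string) => Promise<Review>;

// Returns the approved SKILL.md text, or null when the proposal is dropped.
async function proposeSkill(
  traceSummary: string,
  generate: GenerateFn,
  review: ReviewFn,
): Promise<string | null> {
  let draft = await generate(traceSummary);
  let result = await review(draft);
  if (result.verdict === "REVISE") {
    // One retry only: regenerate with the reviewer's feedback, then re-review.
    draft = await generate(traceSummary, result.feedback);
    result = await review(draft);
  }
  return result.verdict === "APPROVE" ? draft : null;
}
```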


Quality Scoring & Hybrid LLM Judge

When a skill already exists for a tool, AceForge doesn't skip it. It scores the existing skill on two dimensions:

Structural quality (40% weight): Trigger clarity, progressive disclosure sections, procedural depth, anti-pattern grounding, conciseness, metadata completeness, security hygiene. Pure text analysis — no LLM calls, runs in milliseconds.

Coverage (60% weight): Argument pattern coverage vs. your actual traces, failure coverage vs. your observed errors, correction coverage vs. your user fixes, usage recency, success improvement since deployment.

| Combined Score | Action | Method |
|---|---|---|
| < 40 | Auto-propose upgrade | Deterministic only (zero LLM cost) |
| 40–70 | LLM judge evaluates | Hybrid: 50% deterministic + 50% semantic |
| > 70 | Leave it alone | Skill is adequate |

Design intent: The hybrid approach reserves expensive LLM calls for genuinely ambiguous cases. The 40-70 "ambiguous zone" is where deterministic scoring can't confidently decide, so the reviewer LLM provides semantic evaluation against actual trace samples.
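
The weighting and the three score bands can be written out as a short sketch. The function names are hypothetical. The weights (40% structural, 60% coverage) and the band edges (40 and 70) come from this section; how the two sub-scores are computed is out of scope here.

```typescript
type ScoreAction = "propose-upgrade" | "llm-judge" | "leave-alone";

// Structural quality carries 40% of the weight, coverage 60%.
function combinedScore(structural: number, coverage: number): number {
  return 0.4 * structural + 0.6 * coverage;
}

// Below 40: upgrade without an LLM call. Above 70: leave the skill alone.
// Everything in between goes to the LLM judge.
function actionFor(score: number): ScoreAction {
  if (score < 40) return "propose-upgrade";
  if (score > 70) return "leave-alone";
  return "llm-judge";
}

// In the ambiguous zone the final score is half deterministic, half semantic.
function hybridScore(deterministic: number, semantic: number): number {
  return 0.5 * deterministic + 0.5 * semantic;
}
```

For example, a skill with a structural score of 80 and a coverage score of 30 combines to 50, which lands in the ambiguous zone and is sent to the judge.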


Skill Evolution & Lifecycle

Three Analysis Paths

Every agent_end cycle evaluates tools through three explicit paths:

  1. Evolution — Deployed skill with 50+ new traces → revise with new data
  2. Upgrade — Deployed skill scoring below 60 → propose a replacement
  3. New proposal — No existing skill → generate from scratch

This ensures no tool falls through the cracks regardless of its deployment state.
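
A minimal sketch of the three-way routing, using a hypothetical `ToolState` shape. The README does not say which path wins when a deployed skill has both 50+ new traces and a score below 60. This sketch assumes evolution is checked first, and it leaves out the crystallization threshold that gates new proposals.

```typescript
type AnalysisPath = "evolution" | "upgrade" | "new-proposal" | "none";

// Hypothetical per-tool summary assembled at the end of an agent cycle.
interface ToolState {
  hasDeployedSkill: boolean;
  newTracesSinceDeploy: number;
  qualityScore: number; // combined score, 0-100
}

function selectPath(state: ToolState): AnalysisPath {
  if (!state.hasDeployedSkill) return "new-proposal";
  // Assumption: evolution takes precedence over upgrade when both apply.
  if (state.newTracesSinceDeploy >= 50) return "evolution";
  if (state.qualityScore < 60) return "upgrade";
  return "none";
}
```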

Evolution Over Regeneration

After 50+ new traces accumulate since deployment, AceForge revises existing skills rather than regenerating from scratch. The revision prompt includes only the new data — new success patterns, new failures, new corrections — and instructs the generator to preserve what works while updating what doesn't.

Design intent: Trajectory-level revision outperforms full regeneration per SE-Agent (arXiv:2508.02085). Skills accumulate operational wisdom over time rather than losing it to rewrites.

Milestone Distillation

At activation milestones (500, 2000, 5000), AceForge runs a Summarize–Reflect–Locate–Revise (SRLR) cycle on each deployed skill's trace corpus. The distillation computes what argument patterns, failure modes, and user corrections emerged since deployment, identifies divergences between the skill's instructions and actual usage, and surfaces actionable recommendations. Meaningful divergences trigger a notification; /forge distill <skill> shows the full report anytime.

Design intent: The SRLR loop follows K2-Agent (arXiv:2603.00676)'s knowledge refinement approach. Milestone-based checkpoints (not continuous mutation) come from SAGE (arXiv:2512.17102)'s Sequential Rollout — skills accumulate operational wisdom at each tier rather than evolving reactively.
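
Detecting when a skill crosses an activation milestone can be sketched as below. The function name is hypothetical, and the sketch assumes a distillation run is triggered once per milestone, at the moment the activation count first reaches it.

```typescript
const DISTILL_MILESTONES = [500, 2000, 5000];

// Returns the milestones crossed when the activation count moves from
// `previous` to `current`. Usually zero or one, more after a large jump.
function milestonesCrossed(previous: number, current: number): number[] {
  return DISTILL_MILESTONES.filter((m) => previous < m && current >= m);
}
```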

Novel Success Capture

When the agent successfully uses a tool for the first time — one that has no existing skill, no pending proposal, and no prior failures — AceForge captures it as a novel one-shot success. These captures queue for human review via /forge captures and can be promoted to lower the crystallization threshold for that tool, or dismissed.

Design intent: Inspired by Voyager (arXiv:2305.16291)'s self-verification before adding skills to the library. Each capture passes a 6-check novelty filter before being recorded.

Maturity Stages

Skills progress through maturity stages based on real-world performance:

  • Proposed → Deployed → Committed (50+ activations, 75%+ success, 14+ days) → Mature
  • Apoptosis detection flags skills with sustained low activation or degraded success rates
  • Version history — every deploy, upgrade, micro-revision, rollback, retire, and reinstate is recorded with full SKILL.md content. /forge history shows the timeline; /forge diff shows what changed between versions. Zero dependencies — LCS-based diff engine built in.
  • Effectiveness watchdog runs A/B comparisons when upgrades are deployed

Design intent: Memento-Skills (arXiv:2603.18743) write phase — the agent updates and expands its skill library based on new experience. Micro-revisions are the fast path; full rewrites are the deliberate path.
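
The version history above mentions a dependency-free, LCS-based diff. The sketch below shows the general technique on lines of text. It is an illustration of LCS diffing under that assumption, not the plugin's own engine, and it emits a flat list of kept, removed, and added lines rather than unified-diff hunks.

```typescript
type DiffLine = { op: " " | "+" | "-"; text: string };

function diffLines(oldText: string, newText: string): DiffLine[] {
  const a = oldText.split("\n");
  const b = newText.split("\n");
  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..].
  const lcs: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table from the top-left, emitting kept, removed, and added lines.
  const out: DiffLine[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push({ op: " ", text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      out.push({ op: "-", text: a[i] });
      i++;
    } else {
      out.push({ op: "+", text: b[j] });
      j++;
    }
  }
  while (i < a.length) out.push({ op: "-", text: a[i++] });
  while (j < b.length) out.push({ op: "+", text: b[j++] });
  return out;
}
```

The table costs memory proportional to the product of the two line counts, which is acceptable for files the size of a SKILL.md.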


Intelligence

Six modules run on every agent_end hook via setImmediate (non-blocking), continuously identifying where the agent needs improvement, propagating learning across sessions, and autonomously adjusting skills from corrections.

Capability Tree

All skills are organized into a hierarchical capability tree with gap scoring per domain. Domains are categorized recursively — exec-docker, exec-ssh, read-code each fall under their parent tool's domain. Gap scores increment on every detected fallback, deferral, or infrastructure failure.

Design intent: AgentSkillOS (arXiv:2603.02176) found that DAG-based pipelines substantially outperform flat invocation even with identical skill sets. AceForge's capability tree provides the structural foundation for ecosystem-level management.

Cross-Session Propagation

Pattern data aggregates across all communication channels (Telegram, Slack, Discord, iMessage) into a persistent JSON state. Tools that recur across sessions but haven't crystallized into skills are flagged as cross-session candidates.

Design intent: Memento-Skills (arXiv:2603.18743) — skills persist across sessions as evolving procedural memory. Cross-session state is designed for integration with memory-augmented MDP systems per Memento (arXiv:2508.16153).

Skill Composition Detection

When two skills co-activate in >50% of sessions across 3+ sessions, AceForge detects the co-activation pattern and reports it as a composition candidate. The detection uses per-session tool matching against active skill prefixes.

Design intent: AgentSkillOS (arXiv:2603.02176) found that DAG-based pipelines substantially outperform flat invocation even with identical skill sets. AceForge's composition detection identifies the candidates; DAG orchestration is the target for future composition generation.
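
The co-activation rule can be sketched as pair counting over sessions. The input shape (one set of active skill names per session) and the function name are hypothetical. The sketch assumes that ">50% of sessions" is measured against all observed sessions and that "3+ sessions" means the pair co-occurred in at least three of them.

```typescript
const MIN_SHARED_SESSIONS = 3;
const MIN_SHARED_RATIO = 0.5;

// sessions: one Set of active skill names per session (assumed input shape).
function compositionCandidates(sessions: Set<string>[]): [string, string][] {
  const pairCounts = new Map<string, number>();
  for (const active of sessions) {
    const skills = Array.from(active).sort();
    for (let i = 0; i < skills.length; i++) {
      for (let j = i + 1; j < skills.length; j++) {
        // Sorted order makes the key independent of activation order.
        const key = JSON.stringify([skills[i], skills[j]]);
        pairCounts.set(key, (pairCounts.get(key) ?? 0) + 1);
      }
    }
  }
  const candidates: [string, string][] = [];
  pairCounts.forEach((count, key) => {
    if (count >= MIN_SHARED_SESSIONS && count / sessions.length > MIN_SHARED_RATIO) {
      candidates.push(JSON.parse(key) as [string, string]);
    }
  });
  return candidates;
}
```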

Proactive Gap Detection

On every agent_end, AceForge analyzes pattern data for four behavior categories that indicate capability gaps:

| Pattern | What It Detects | Example |
|---|---|---|
| Fallback | Agent can't perform a task | "I can't do that" / "you'll need to manually" |
| Deferral | Agent asks permission when it should act | "let me know if you want me to..." |
| Uncertainty | Agent lacks confidence | "I think" / "I'm not sure" |
| Infrastructure | Missing tools or access | "requires installation" / "not found" |

Each detection increments the relevant domain's gap score in the capability tree. Critical gaps (5+ occurrences) trigger notifications.

Design intent: EvoSkill (arXiv:2603.02766) demonstrates failure-driven skill discovery through a Proposer agent that analyzes failure traces and suggests improvements. AceForge implements this as continuous passive monitoring rather than active probing.
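
A rough sketch of phrase-based gap detection, using only the example phrases listed for the four categories in this section. The names `detectGaps` and `recordGaps` are hypothetical, a real detector would need a much broader phrase set, and the sketch assumes each matched category adds one to the domain's gap score.

```typescript
type GapCategory = "fallback" | "deferral" | "uncertainty" | "infrastructure";

// Phrases taken from the examples in this section; illustrative only.
const GAP_PATTERNS: Record<GapCategory, RegExp[]> = {
  fallback: [/\bI can't do that\b/i, /\byou'll need to manually\b/i],
  deferral: [/\blet me know if you want me to\b/i],
  uncertainty: [/\bI think\b/i, /\bI'm not sure\b/i],
  infrastructure: [/\brequires installation\b/i, /\bnot found\b/i],
};

// Returns every category with at least one matching phrase.
function detectGaps(agentMessage: string): GapCategory[] {
  return (Object.keys(GAP_PATTERNS) as GapCategory[]).filter((category) =>
    GAP_PATTERNS[category].some((pattern) => pattern.test(agentMessage)),
  );
}

const CRITICAL_GAP_OCCURRENCES = 5; // "Critical gaps (5+ occurrences)"

// Increments the domain's gap score and reports whether it is now critical.
function recordGaps(gapScores: Map<string, number>, domain: string, message: string): boolean {
  const hits = detectGaps(message).length;
  if (hits === 0) return false;
  const updated = (gapScores.get(domain) ?? 0) + hits;
  gapScores.set(domain, updated);
  return updated >= CRITICAL_GAP_OCCURRENCES;
}
```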

Description Optimization

AceForge periodically compares each skill's description against actual conversation language using token overlap analysis. Skills with <30% overlap between their trigger description and how you actually phrase related requests are flagged, because the description is the discovery mechanism.

Design intent: SkillsBench (arXiv:2602.12670) found that 56% of skills are never invoked because descriptions don't match user intent. This module ensures skills stay findable as your language evolves over time.
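
Token overlap can be computed in several ways. The sketch below assumes one simple definition: the share of distinct description tokens that also appear somewhere in the user's phrasing. The function names and the tokenizer (lowercase alphanumeric runs) are hypothetical choices, not the plugin's.

```typescript
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Share of description tokens that also occur in the user's phrasing (0 to 1).
function descriptionOverlap(description: string, userPhrases: string[]): number {
  const descTokens = tokenSet(description);
  if (descTokens.size === 0) return 0;
  const userTokens = tokenSet(userPhrases.join(" "));
  let shared = 0;
  descTokens.forEach((token) => {
    if (userTokens.has(token)) shared++;
  });
  return shared / descTokens.size;
}

// Flags descriptions with less than 30% overlap.
function needsDescriptionReview(description: string, userPhrases: string[]): boolean {
  return descriptionOverlap(description, userPhrases) < 0.3;
}
```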

Autonomous Skill Adjustment

When corrections are detected, AceForge matches them to the active skill by temporal proximity and applies micro-revisions immediately (no approval needed):

  • Anti-pattern append — "User correction for exec: use --rm flag (original: docker run nginx)"
  • Instruction addendum — adds a note to the instructions section
  • Correction log — HTML comment with full correction context

After 3+ micro-revisions in 30 days, AceForge triggers a full LLM rewrite proposal (with approval).

Design intent: Memento-Skills (arXiv:2603.18743) write phase — the agent updates and expands its skill library based on new experience. Micro-revisions are the fast path; full rewrites are the deliberate path.
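
Two pieces of this flow are easy to sketch: formatting an anti-pattern entry in the shape of the example above, and deciding when accumulated micro-revisions should escalate to a full rewrite proposal. Function names are hypothetical, and the 30-day window is assumed to be a rolling window ending at the current time.

```typescript
const REWRITE_REVISION_COUNT = 3;
const REWRITE_WINDOW_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// Builds an anti-pattern line in the format shown in the example above.
function formatAntiPattern(tool: string, correction: string, original: string): string {
  return `User correction for ${tool}: ${correction} (original: ${original})`;
}

// True when 3+ micro-revisions fall inside the trailing 30-day window.
function shouldProposeRewrite(revisionTimestampsMs: number[], nowMs: number): boolean {
  const recent = revisionTimestampsMs.filter(
    (t) => t <= nowMs && nowMs - t <= REWRITE_WINDOW_MS,
  );
  return recent.length >= REWRITE_REVISION_COUNT;
}
```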


Validation

Three modules ensure deployed skills actually work, generate realistic test scenarios, and verify the security validator itself.

Skill Health Testing

Periodic validation that installed skills reference real, working infrastructure:

  • CLI commands — extracted from SKILL.md, verified via which (e.g., ssh, docker, git)
  • File paths — extracted from backtick references, verified via existsSync
  • API endpoints — extracted from URLs, health-checked via HEAD request (5s timeout)

Skills that fail health tests are flagged with specific failure reasons.

Design intent: EvoSkill (arXiv:2603.02766) retains only skills that improve held-out validation performance. Health testing is the infrastructure-level equivalent — ensuring skills don't reference binaries that have been uninstalled or endpoints that have moved.
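
A simplified sketch of the CLI and file-path checks. The extraction heuristics here are assumptions made for illustration: backtick spans that start with `/`, `~/`, or `./` are treated as paths, and single lowercase words are treated as commands. The probes are injected, so in practice they could be backed by `which` and `existsSync` as described above. The endpoint HEAD check is omitted.

```typescript
// Injected probes keep the sketch free of filesystem and process access.
interface HealthProbes {
  commandExists: (cmd: string) => boolean;
  pathExists: (path: string) => boolean;
}

interface HealthFailure {
  kind: "cli" | "path";
  target: string;
}

// Pulls backtick-quoted spans out of SKILL.md text.
function backtickSpans(skillMarkdown: string): string[] {
  return (skillMarkdown.match(/`[^`\n]+`/g) ?? []).map((s) => s.slice(1, -1));
}

function checkSkillHealth(skillMarkdown: string, probes: HealthProbes): HealthFailure[] {
  const failures: HealthFailure[] = [];
  for (const span of backtickSpans(skillMarkdown)) {
    if (/^(\/|~\/|\.\/)/.test(span)) {
      // Looks like a file path.
      if (!probes.pathExists(span)) failures.push({ kind: "path", target: span });
    } else if (/^[a-z][a-z0-9_-]*$/.test(span)) {
      // A single bare word is assumed to be a CLI command.
      if (!probes.commandExists(span)) failures.push({ kind: "cli", target: span });
    }
  }
  return failures;
}
```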

Grounded Challenges

Generates realistic test scenarios from operational context:

  1. Query OpenViking for recent context related to each skill's tool domain
  2. Generate task prompts grounded in real operational data
  3. Fall back to pattern-based generation when Viking is unavailable

Challenges are logged for tracking skill activation patterns over time.

Design intent: SE-Agent (arXiv:2508.02085) demonstrates curriculum generation for progressive testing of agent capabilities. Grounded challenges prevent the "teaching to the test" problem by generating scenarios from real-world context.

Adversarial Robustness

Mutation testing against the security validator with 23 attack variants:

| Category | Mutations |
|---|---|
| Prompt injection | ignore-instructions, disregard-prior, you-are-now, forget-everything, multiline-split |
| Credential exfil | API key, password, long token, env var exfiltration |
| Persistence attacks | SOUL.md write, MEMORY.md write, IDENTITY.md write |
| Evasion | Base64-encoded payload, homoglyph/IDN domain |
| Structural | Path traversal, overlength, missing name, missing description, unknown domain |
| Credential harvesting | Bare tilde path (~/.ssh), git credential URL, bash history read, Telegram bot token |

The adversarial suite runs at startup. Results are displayed in the startup dashboard. Current: 23/23 caught.

Design intent: Chen et al. (arXiv:2602.12430) found a 26.1% vulnerability rate in community-contributed skills. The ClawHavoc campaign validated that SOUL.md/MEMORY.md targeting is the primary real-world attack vector. AceForge's adversarial suite is specifically designed around these threat models.
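
Mutation testing against a validator can be sketched as: inject each attack payload into a clean skill, run the validator, and count how many variants it flags. The `toyValidator` below has three illustrative regex rules and stands in for the real validator, which covers far more cases. All names and payloads are hypothetical.

```typescript
interface Mutation {
  label: string;
  payload: string;
}

// A validator returns the list of violations it found (empty means clean).
type Validator = (skillMarkdown: string) => string[];

// Toy rules for illustration only.
const toyValidator: Validator = (md) => {
  const violations: string[] = [];
  if (/ignore (all )?(previous|prior) instructions/i.test(md)) violations.push("prompt-injection");
  if (/(write|append|echo)[^\n]*\b(SOUL|MEMORY|IDENTITY)\.md\b/i.test(md)) violations.push("identity-file-write");
  if (/~\/\.(ssh|bash_history)\b/.test(md)) violations.push("sensitive-home-path");
  return violations;
};

// Appends each payload to a clean skill and records which ones slip through.
function runMutations(baseSkill: string, mutations: Mutation[], validate: Validator) {
  const missed = mutations
    .filter((m) => validate(`${baseSkill}\n${m.payload}`).length === 0)
    .map((m) => m.label);
  return { total: mutations.length, caught: mutations.length - missed.length, missed };
}
```

A result where `caught` equals `total` corresponds to the "23/23 caught" line in the startup dashboard.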


Security

Every generated skill passes through the security validator before you ever see it:

  • Prompt injection detection — catches "ignore previous instructions" and variants, including multiline split injection across numbered lists
  • Credential scanning — flags API keys, tokens, passwords in plaintext
  • Base64 payload detection — catches encoded payloads piped to shell/eval
  • Homoglyph/IDN domain detection — catches Cyrillic and other confusable characters in domain names
  • Environment variable exfiltration — detects $SECRET_KEY in URL contexts
  • Bare tilde path detection — catches ~/.ssh, ~/.bash_history and similar sensitive home-relative paths
  • Git credential URL detection — flags embedded tokens in git clone URLs (e.g., ghp_...@github.com)
  • Bash history read detection — catches credential harvesting from shell history files
  • Telegram bot token detection — flags bot tokens embedded in skill instructions
  • Path traversal prevention — resolves paths against workspace boundary, including backtick-wrapped paths
  • SOUL.md/MEMORY.md/IDENTITY.md write detection — the primary ClawHavoc attack vector
  • Skill conflict detection — Jaccard+bigram hybrid similarity blocks 95%+ description overlap, warns at 80%+. Proposal dedup checks name prefix, bundledTools, and existing proposals for the same tool
  • ClawHub dedup — checks if a skill already exists on ClawHub before proposing
  • Network domain allowlist — warns on unrecognized domains
  • LLM output size limit — generated skills capped at 50KB
  • Skill name validation — names with path characters rejected at proposal time
  • Upgrade validation — upgrades pass through the full validator before the old skill is retired
  • Rollback safety — retired versions are validated before the active version is deleted
  • LLM rate limiting — 2s interval, 8 calls/cycle max
  • Trace data sanitization — pattern data is sanitized before injection into LLM prompts
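
The skill conflict check in the list above combines Jaccard similarity with bigrams. The README does not say how the two are blended, so the sketch below assumes an equal-weight average of word-level and bigram-level Jaccard similarity. The function names are hypothetical. The 95% block and 80% warn thresholds are from the list.

```typescript
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // two empty sets are identical
  let shared = 0;
  a.forEach((item) => {
    if (b.has(item)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

function wordsOf(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bigramsOf(tokens: string[]): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i + 1 < tokens.length; i++) out.add(`${tokens[i]} ${tokens[i + 1]}`);
  return out;
}

// Assumption: equal-weight blend of word-level and bigram-level Jaccard.
function descriptionSimilarity(a: string, b: string): number {
  const wa = wordsOf(a);
  const wb = wordsOf(b);
  return 0.5 * jaccard(new Set(wa), new Set(wb)) + 0.5 * jaccard(bigramsOf(wa), bigramsOf(wb));
}

function conflictLevel(similarity: number): "block" | "warn" | "ok" {
  if (similarity >= 0.95) return "block";
  if (similarity >= 0.8) return "warn";
  return "ok";
}
```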

Commands

AceForge uses a single /forge command with subcommands:

Core

| Command | Description |
|---|---|
| /forge | Dashboard: skills, proposals, patterns, gaps |
| /forge approve <n> | Deploy a proposed skill |
| /forge reject <n> | Reject a proposal (or reject all) |
| /forge upgrade <n> | Deploy upgrade, retire old (with validation) |
| /forge rollback <n> | Undo an upgrade (with validation) |
| /forge retire <n> | Retire an active skill |
| /forge reinstate <n> | Bring back a retired skill |

Diagnostics

| Command | Description |
|---|---|
| /forge list | Full inventory: active, proposed, retired |
| /forge quality <n> | Score a skill against actual usage data |
| /forge gaps | All capability gaps: tool failures + behavior + cross-session |
| /forge watchdog | Effectiveness check; flags underperformers |
| /forge filtered | What quality gates suppressed and why |
| /forge preview <n> | Human-readable skill brief before approving |

Intelligence

| Command | Description |
|---|---|
| /forge tree | Capability tree with gap scores per domain |
| /forge cross_session | Cross-session pattern analysis |
| /forge compose | Skill co-activation analysis |
| /forge behavior_gaps | Fallback / deferral / uncertainty detection |
| /forge optimize | Description-language mismatch report |

Evolution

| Command | Description |
|---|---|
| /forge evolve <n> | LLM-powered skill revision with trace delta + unified diff |
| /forge distill <n> | SRLR trace distillation report (no LLM revision) |
| /forge captures | List novel one-shot success captures |
| /forge capture promote <tool> | Promote a capture for crystallization |
| /forge capture dismiss <tool> | Dismiss a capture |

History

| Command | Description |
|---|---|
| /forge history <n> | Version history timeline |
| /forge diff <n> [v] | Unified diff between versions |

Validation

| Command | Description |
|---|---|
| /forge test | Health tests on all deployed skills |
| /forge challenge | Grounded challenge scenario generation |
| /forge adversarial | Adversarial mutation suite (23 variants) |

Agent-Callable Tools

These tools are registered for programmatic use by the agent itself: forge, forge_reflect, forge_propose, forge_approve_skill, forge_reject_skill, forge_quality, forge_registry, forge_rewards, forge_tree, forge_gaps


Installation

One command:

openclaw plugins install aceforge

Then restart your gateway:

openclaw gateway restart

Verify:

openclaw plugins list | grep aceforge

Alternative install methods:

# From npm directly
npm install aceforge

# From source (for development)
git clone https://github.com/sudokrang/aceforge.git ~/.openclaw/extensions/aceforge
cd ~/.openclaw/extensions/aceforge && npm install

Configuration

Provider Agnostic

Both generator and reviewer support OpenAI-compatible (/chat/completions) and Anthropic-native (/v1/messages) endpoints. Format auto-detected from openclaw.json or provider name. Any provider works:

| Provider | Base URL | Notes |
|---|---|---|
| MiniMax (default generator) | https://api.minimax.io/v1 | M2.7, strong structured output |
| DeepSeek (default reviewer) | https://api.deepseek.com | Chat, structured rubric review |
| OpenAI | https://api.openai.com/v1 | GPT-4o or GPT-5.4 |
| Anthropic | https://api.anthropic.com | Claude via /v1/messages, auto-detected |
| OpenRouter | https://openrouter.ai/api/v1 | Claude, Gemini, Llama, etc. |
| Together | https://api.together.xyz/v1 | Llama, Mixtral, open models |
| Groq | https://api.groq.com/openai/v1 | Fast inference: Llama, Gemma |
| Cerebras | https://api.cerebras.ai/v1 | Wafer-scale inference |
| Hugging Face | https://api-inference.huggingface.co/v1 | Any HF Inference model |
| Kimi (Moonshot) | https://api.moonshot.cn/v1 | Kimi K2.5 |
| Ollama | http://127.0.0.1:11434/v1 | Local, fully offline |
| LM Studio | http://127.0.0.1:1234/v1 | Local, fully offline |
| vLLM | http://127.0.0.1:8000/v1 | Local, high-throughput serving |

Channel Agnostic

Notifications work across all 25+ OpenClaw channels. The formatting layer operates on format types, not channel names:

| Format | Channels | Bold | Code |
|---|---|---|---|
| html | Telegram, email | <b> | <code> |
| mrkdwn | Slack | *single* | ` |
| markdown | Discord, Matrix | **double** | ` |
| plain | Everything else | passthrough | passthrough |

Plain text with Unicode + emoji is the primary design target — rich formatting is a polish layer. Adding a new channel: one line in FORMAT_MAP.
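
The format-type idea can be sketched as two small lookup tables: one from channel to format type, one from format type to markup. `FORMAT_MAP` is named in this README, but its shape here is an assumption, as are the helper names. HTML escaping of the wrapped text is left out.

```typescript
type FormatType = "html" | "mrkdwn" | "markdown" | "plain";

// Channel name -> format type. Adding a channel is one line here.
const FORMAT_MAP: Record<string, FormatType> = {
  telegram: "html",
  email: "html",
  slack: "mrkdwn",
  discord: "markdown",
  matrix: "markdown",
};

const boldAs: Record<FormatType, (s: string) => string> = {
  html: (s) => `<b>${s}</b>`,
  mrkdwn: (s) => `*${s}*`,
  markdown: (s) => `**${s}**`,
  plain: (s) => s,
};

const codeAs: Record<FormatType, (s: string) => string> = {
  html: (s) => `<code>${s}</code>`,
  mrkdwn: (s) => "`" + s + "`",
  markdown: (s) => "`" + s + "`",
  plain: (s) => s,
};

// Channels missing from the map fall back to plain text.
function formatFor(channel: string): FormatType {
  return FORMAT_MAP[channel.toLowerCase()] ?? "plain";
}
```

Because the formatters are keyed by format type, a new channel that speaks an existing format needs no new formatting code.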

OpenViking Compatible

AceForge is fully compatible with OpenViking for context-enriched challenge generation. Circuit breaker: 5s timeout, 3 failures → open for 10 min.
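
The circuit breaker parameters above (3 failures, open for 10 minutes) can be sketched as a small state machine. The class and method names are hypothetical. The sketch assumes the three failures are consecutive and that the caller reports a 5-second timeout as a failure. Time is passed in explicitly so the logic is easy to test.

```typescript
const FAILURE_LIMIT = 3;
const OPEN_DURATION_MS = 10 * 60 * 1000; // open for 10 minutes

class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;

  // True when a request may be attempted at time `nowMs`.
  canRequest(nowMs: number): boolean {
    if (this.openedAtMs === null) return true;
    if (nowMs - this.openedAtMs >= OPEN_DURATION_MS) {
      // Cool-down elapsed: close the breaker and start counting afresh.
      this.openedAtMs = null;
      this.consecutiveFailures = 0;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // Timeouts should be reported here as failures by the caller.
  recordFailure(nowMs: number): void {
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= FAILURE_LIMIT) this.openedAtMs = nowMs;
  }
}
```

While the breaker is open, a caller would skip the OpenViking query and use the pattern-based fallback instead.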

Environment Variables

| Variable | Default | Description |
|---|---|---|
| ACEFORGE_GENERATOR_PROVIDER | minimax | Provider for skill generation |
| ACEFORGE_GENERATOR_API_KEY | from openclaw.json | API key override |
| ACEFORGE_GENERATOR_MODEL | MiniMax-M2.7 | Model override |
| ACEFORGE_GENERATOR_URL | https://api.minimax.io/v1 | Base URL override |
| ACEFORGE_REVIEWER_PROVIDER | deepseek | Provider for skill review + LLM judge |
| ACEFORGE_REVIEWER_API_KEY | from openclaw.json | API key override |
| ACEFORGE_REVIEWER_MODEL | deepseek-chat | Model override |
| ACEFORGE_REVIEWER_URL | https://api.deepseek.com | Base URL override |
| ACEFORGE_NOTIFICATION_CHANNEL | auto-detect | Force: telegram, slack, log |
| ACEFORGE_TELEGRAM_BOT_TOKEN | from openclaw.json | Telegram bot token |
| ACEFORGE_OWNER_CHAT_ID | from openclaw.json | Telegram chat ID |
| ACEFORGE_SLACK_WEBHOOK_URL | | Slack incoming webhook |
| ACEFORGE_VIKING_URL | http://127.0.0.1:1933 | OpenViking URL (optional) |
| ACEFORGE_DRY_RUN | false | Observation-only mode: log proposals without writing to disk |
| ACEFORGE_SHARED_SKILLS | false | Deploy approved skills to ~/.openclaw/skills/ (shared across all agents) |

Quick Start Examples

OpenAI + Slack:

export ACEFORGE_GENERATOR_PROVIDER=openai
export ACEFORGE_GENERATOR_API_KEY=sk-...
export ACEFORGE_REVIEWER_PROVIDER=openai
export ACEFORGE_REVIEWER_MODEL=gpt-4o
export ACEFORGE_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...

Anthropic (Claude):

export ACEFORGE_GENERATOR_PROVIDER=anthropic
export ACEFORGE_GENERATOR_API_KEY=sk-ant-...
export ACEFORGE_REVIEWER_PROVIDER=anthropic
export ACEFORGE_REVIEWER_API_KEY=sk-ant-...

Local models via LM Studio:

export ACEFORGE_GENERATOR_PROVIDER=lmstudio
export ACEFORGE_GENERATOR_URL=http://127.0.0.1:1234/v1
export ACEFORGE_GENERATOR_API_KEY=not-needed
export ACEFORGE_REVIEWER_PROVIDER=lmstudio
export ACEFORGE_REVIEWER_URL=http://127.0.0.1:1234/v1
export ACEFORGE_REVIEWER_API_KEY=not-needed

Architecture

File Structure
~/.openclaw/extensions/aceforge/
├── openclaw.plugin.json        # Plugin manifest + configSchema
├── index.ts                    # Entry — hooks, tools, /forge router, startup
├── tests/
│   └── test-validator.ts       # 412 assertions — validator, quality, adversarial, drift detection
└── src/
    ├── notify.ts               # Transport layer (Telegram / Slack / log) + HTML sanitizer
    ├── notify-format.ts        # FORMAT_MAP architecture — format types, not channel names
    ├── pattern/
    │   ├── constants.ts        # Canonical blocklists — TOOL, CAPTURE, NATIVE_TOOLS, SELF_TOOLS
    │   ├── store.ts            # JSONL with rotation (10K lines, 30 days, gzip)
    │   ├── capture.ts          # after_tool_call — trace + chain logging + session persistence
    │   ├── detect.ts           # Correction detection from user messages
    │   ├── analyze.ts          # Pattern analysis orchestrator — 3-path loop
    │   ├── analyze-utils.ts    # Filesystem helpers — dedup checks, file readers
    │   ├── analyze-native.ts   # Native tool sub-pattern clustering + domain extraction
    │   ├── analyze-chains.ts   # Workflow chain analysis → multi-tool skill proposals
    │   └── gap-detect.ts       # Gap analysis engine (tool-level)
    ├── skill/
    │   ├── generator.ts        # Template fallback generator
    │   ├── llm-generator.ts    # Dual-model pipeline + workflow + remediation + upgrade
    │   ├── llm-judge.ts        # LLM-as-judge for ambiguous quality scores (40-70)
    │   ├── quality-score.ts    # Deterministic structural + coverage scoring
    │   ├── validator.ts        # Security gate — 23 attack patterns + similarity + SOUL.md
    │   ├── history.ts          # Version history — recordRevision, LCS diff, timeline
    │   ├── lifecycle.ts        # Activation tracking, health cache, A/B, watchdog, baselines
    │   └── index.ts            # Skill index — metadata-only context injection (3K token budget)
    ├── intelligence/
    │   ├── capability-tree.ts  # Recursive domain categorization + gap scoring
    │   ├── cross-session.ts    # Cross-session pattern aggregation
    │   ├── composition.ts      # Co-activation detection
    │   ├── proactive-gaps.ts   # Fallback/deferral/uncertainty/infrastructure detection
    │   ├── description-optimizer.ts  # Token overlap analysis for trigger optimization
    │   └── auto-adjust.ts      # Micro-revisions from corrections
    ├── validation/
    │   ├── health-test.ts      # Verify CLIs, paths, endpoints
    │   ├── grounded-challenges.ts  # Test scenarios from Viking/patterns
    │   └── adversarial.ts      # 23 mutation variants against validator
    ├── evolution/
    │   ├── distill.ts          # SRLR trace distillation at milestones (500/2000/5000)
    │   ├── capture-novel.ts    # Novel one-shot success capture
    │   └── evolve-command.ts   # /forge evolve — LLM revision + unified diff
    └── viking/
        └── client.ts           # OpenViking context engine client (circuit breaker)

RL & Ecosystem Integration

AceForge exposes machine-readable interfaces designed for integration with frontier agentic research systems:

MetaClaw & OpenClaw-RL

The forge_registry and forge_rewards tools provide structured data for reinforcement learning training loops:

  • forge_registry — machine-readable skill catalog with per-skill success rates, activation counts, deployment paths, and source attribution
  • forge_rewards — per-skill reward signals (success rate, count, last updated) formatted for direct consumption by RL training pipelines

These interfaces are designed to support MetaClaw (arXiv:2603.17187) proxy-based meta-learning and OpenClaw-RL (arXiv:2603.10165) reinforcement learning from deployment feedback.

Capability Tree as Ecosystem Signal

The forge_tree tool returns a structured JSON capability tree with gap scores per domain. This enables ecosystem-level management: which domains need attention, where to allocate development effort, and which skills are driving the most value.

The tree structure is directly compatible with AgentSkillOS (arXiv:2603.02176)'s recursive categorization model, enabling future integration with multi-agent skill sharing and orchestration systems.

Cross-Session State

The cross-session pattern state (cross-session-patterns.json) provides a persistent view of tool usage across all communication channels. This data surface is designed for integration with memory-augmented MDP systems per Memento (arXiv:2508.16153), enabling case-based skill selection from deployment experience.


Research Basis

Every major design decision in AceForge is grounded in peer-reviewed research. The full citation table:

35 citations across 14 research areas

| Concept | Paper | How AceForge Uses It |
|---|---|---|
| Skills fail without proper triggers | SkillsBench (Feb 2026) | Description-first prompt design; 56% invocation failure validates trigger optimization |
| Bad skills hurt performance | SkillsBench (Feb 2026) | Quality scoring engine; upgrade proposals when skills score < 60/100 |
| Focused > comprehensive | SkillsBench (Feb 2026) | 150-line limit; 2-3 dominant pattern focus in generator prompt |
| LLM skills can degrade | IoT-SkillsBench (Mar 2026) | Effectiveness watchdog; baseline comparison; auto-flagging |
| Hierarchical skill organization | SkillRL (Feb 2026) | Category metadata in frontmatter; domain classification |
| Controller-Executor-Designer | MemSkill (Feb 2026) | Analyze (controller) → Generate (executor) → Evolve (designer) |
| Skill co-evolution with context | MCE (Jan 2026) | Skills evolve from new trace data; trajectory-level revision |
| Selection degrades at scale | Single-Agent scaling (Jan 2026) | Escalating threshold; quality gating prevents library bloat |
| Proposer/Judge dual-model | Multi-Agent Evolve (Oct 2025) | Generator + independent Reviewer pipeline |
| Rubric-guided verification | DeepVerifier (Jan 2026) | Structured review criteria in reviewer prompt (8-11% improvement) |
| Cumulative skill creation | CASCADE (Dec 2025) | Self-evolving skill framework with human-gated deployment |
| Trajectory-level revision | SE-Agent (2025) | Skills revised from new data, not regenerated from scratch |
| Hierarchical procedural memory | MACLA, AAMAS 2026 | Chain-to-workflow composition for multi-tool sequences |
| Skill vulnerability prevalence | Chen et al. (Feb 2026) | 26.1% vulnerability rate validates adversarial testing approach |
| Progressive disclosure | Chen et al. (Feb 2026) | 3-level architecture: metadata-only → instructions → scripts |
| Learned → externalized skills | Chen et al. (Feb 2026) | AceForge bridges implicit tool patterns to explicit SKILL.md files |
| Marketplace skill imbalance | Ling et al. (Feb 2026) | Quality scoring + upgrade proposals for underperforming skills |
| Proxy-based meta-learning | MetaClaw (Mar 2026) | Registry + rewards tools for MetaClaw/OpenClaw-RL integration |
| Inter-task skill evolution | Fang et al. Survey (Aug 2025) | Workflow consolidation across sessions |
| Procedural + semantic memory | Jeunen et al. (May 2025) | Gap analysis augments with failure-driven awareness |
| Supply chain attack at scale | ClawHavoc / Antiy CERT (Feb 2026) | 1,184 malicious skills; SOUL.md write detection + adversarial testing |
| Capability tree at ecosystem scale | AgentSkillOS (Mar 2026) | Recursive categorization; tree-based retrieval; gap scoring |
| Read-Write Reflective Learning | Memento-Skills (Mar 2026) | Cross-session propagation; autonomous skill adjustment |
| Failure-driven skill discovery | EvoSkill (Mar 2026) | Proactive gap detection; health validation |
| Memory-augmented MDP | Memento (2025) | Case-based reasoning for skill selection from deployment experience |
| Self-evolving agent framework | Self-Evolving Agents Survey (Jul 2025) | Comprehensive framework: environment, experience, self evolution |
| RL from deployment feedback | OpenClaw-RL (Mar 2026) | forge_rewards tool provides RL-compatible reward signals |
| DAG-based pipeline composition | AgentSkillOS (Mar 2026) | Co-activation detection for future DAG orchestration |
| Multi-agent skill sharing | AgentSkillOS (Mar 2026) | Capability tree structure for multi-agent coordination |
| Skill persistence as memory | Memento-Skills (Mar 2026) | Skills persist across sessions as evolving procedural memory |
| Milestone-based skill accumulation | SAGE (Dec 2025) | Sequential Rollout: distillation at 500/2000/5000 activation milestones |
| Summarize–Reflect–Locate–Revise | K2-Agent (Mar 2026) | SRLR loop for trace distillation and knowledge refinement |
| On-policy skill preservation | SDFT (Jan 2026) | Self-distillation preserves prior capabilities during evolution |
| Self-verification before library add | Voyager (May 2023) | Novel capture validates first-time successes before queuing |
| Autonomous experiential learning | SEAgent (Aug 2025) | Specialist-to-generalist training; curriculum-based task generation |

What AceForge Is

AceForge is a skill engine. It generates, validates, and manages SKILL.md files — permanent, auditable artifacts crystallized from your agent's actual behavior.

It is not a memory system, a prompt optimizer, or an RL trainer. AceForge produces one thing: validated SKILL.md files crystallized from your agent's real operational patterns.


Requirements

  • OpenClaw 2026.3.22 or later
  • Node.js 22+
  • At least one OpenAI-compatible LLM API key

Traces auto-rotate at 10K lines or 30 days (whichever comes first) with gzip archival. No manual cleanup needed.
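
The rotation rule reduces to a two-condition check. The function name is hypothetical, and the sketch assumes the 30-day age is measured from the oldest entry in the current trace file.

```typescript
const MAX_TRACE_LINES = 10_000;
const MAX_TRACE_AGE_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// Rotate when either limit is reached, whichever comes first.
function shouldRotate(lineCount: number, oldestEntryMs: number, nowMs: number): boolean {
  return lineCount >= MAX_TRACE_LINES || nowMs - oldestEntryMs >= MAX_TRACE_AGE_MS;
}
```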


Contributing

Contributions are welcome. Please open an issue first to discuss what you'd like to change.

If you're running the test suite:

npx tsx tests/test-validator.ts

All tool blocklists are defined in src/pattern/constants.ts. The test suite enforces zero-drift: if you add a new tool to any blocklist, all source files must import from constants.ts or tests will fail. This is intentional.


License

MIT — see LICENSE


Built by sudokrang · Grounded in peer-reviewed research · Nothing deploys without your approval