Aceforge

Self-evolving skill engine for OpenClaw agents — Phase 1 core + Phase 2 proactive intelligence + Phase 3 self-validation

Install

openclaw plugins install clawhub:aceforge

AceForge

Your agent keeps doing the same things. AceForge turns those patterns into permanent skills.

It watches what tools your agent calls, what fails, and what you correct — then proposes validated, human-approved skills so your agent never has to figure it out from scratch again. Nothing deploys without your say-so.



Design Philosophy
Why AceForge Exists
How It Works
Observation & Pattern Detection
Dual-Model LLM Pipeline
Quality Scoring & Hybrid LLM Judge
Skill Evolution & Lifecycle
Intelligence
Validation
Evolution
Security
Commands
Installation
Configuration
Architecture
RL & Ecosystem Integration
Research Basis
Requirements
Contributing
License

Why AceForge Exists

Agent skill libraries have a quality problem. SkillsBench (arXiv:2602.12670) reports that 56% of agent skills are never invoked because their descriptions don't match how users actually phrase requests. Community skill marketplaces have a pronounced supply-demand imbalance, with many low-effort skills that underserve users (Ling et al., arXiv:2602.08004). Bad skills also do more than fail to help: 16 of 84 benchmark tasks showed negative performance from poor skills (SkillsBench). Meanwhile, the ClawHavoc campaign exposed 1,184 malicious skills across ClawHub, targeting agent identity files for persistence attacks.

AceForge addresses these problems by generating skills from real operational data — not templates, not guesswork — and continuously validating that those skills actually work, stay relevant, and remain secure.

What AceForge Is

AceForge is the skill generation and lifecycle layer for OpenClaw agents. It sits between your agent's raw tool usage and its permanent skill library, converting observed behavior into externalized, auditable SKILL.md files through a research-grounded dual-model LLM pipeline.

What AceForge Is Not

Not auto-deploying. Every proposed skill, upgrade, and retirement requires human approval.
Not a context engine. Compatible with OpenViking, lossless-claw, or OpenClaw's built-in engine. AceForge generates skills; it doesn't own context.
Not ClawHub-hostile. If a ClawHub skill serves your agent well, AceForge leaves it alone. It only proposes upgrades when trace data shows the skill is underperforming.
Not a fine-tuning system. Skills are externalized artifacts — inspectable, editable, version-controlled. Not model weights.

Observation-only mode: Set ACEFORGE_DRY_RUN=true to log what skills would be proposed without writing anything to disk. Perfect for evaluating AceForge before committing.

Bounded exception: Correction-driven micro-revisions (anti-pattern appends, instruction notes) self-apply without approval. Full rewrites always require approval.

How It Works

AceForge operates as a 12-stage pipeline that runs continuously alongside your agent:

  1. Observe: after_tool_call hook; traces record tool, args, result, and corrections
  2. Detect: group by tool; threshold 3x (5x at 20+ skills)
  3. Generate: Generator LLM, then Reviewer LLM (CoT); verdict APPROVE / REVISE / REJECT
  4. Validate: 23 attack checks; SOUL.md detection; credential scan; path traversal
  5. Score: structural (0-100); coverage (0-100); LLM judge (40-70)
  6. Approve: all 25+ channels; /forge approve <n>; /forge reject <n>
  7. Deploy: skills/ directory; baseline recorded; token-budgeted
  8. Evolve: SRLR distillation at 500/2000/5000; /forge evolve + diff
  9. Retire: watchdog flags; A/B compares versions; underperformers flagged
  10. Propagate: cross-session; capability tree; description optimization
  11. Compose: co-activation detection for future composition
  12. Validate: health tests (CLI/path/URL); grounded challenges; adversarial mutations

Every stage is grounded in peer-reviewed research. See Research Basis for the full citation table.

Observation & Pattern Detection

Every tool call your agent makes is logged with full context: arguments, results, success/failure, session identifier, timing, and duration. Corrections from you (phrases like "no, actually..." or "that's wrong") are captured separately and linked to the nearest tool call by temporal proximity.

Multi-tool chains — 3+ distinct tools called within 60 seconds — are detected automatically and logged with sequence order. Session tool history persists to disk, so chain detection survives gateway restarts.
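A minimal sketch of the chain rule described above (3+ distinct tools within 60 seconds). The `ToolCall` shape and the `detectChain` name are assumptions for illustration, not AceForge's actual trace schema or API.

```typescript
// Hypothetical trace shape; the real AceForge trace schema may differ.
interface ToolCall {
  tool: string;
  timestampMs: number;
}

const CHAIN_WINDOW_MS = 60_000; // "within 60 seconds"
const CHAIN_MIN_DISTINCT = 3; // "3+ distinct tools"

// Returns the distinct tools (in first-seen order) of the first window that
// qualifies as a chain, or null when no window does.
// Assumes `calls` is sorted by ascending timestamp.
function detectChain(calls: ToolCall[]): string[] | null {
  for (let start = 0; start < calls.length; start++) {
    const sequence: string[] = [];
    for (let i = start; i < calls.length; i++) {
      // Stop once the call falls outside the 60-second window.
      if (calls[i].timestampMs - calls[start].timestampMs > CHAIN_WINDOW_MS) break;
      if (!sequence.includes(calls[i].tool)) sequence.push(calls[i].tool);
    }
    if (sequence.length >= CHAIN_MIN_DISTINCT) return sequence;
  }
  return null;
}
```

Repeated calls to the same tool do not count toward the three-tool minimum, which matches the "distinct tools" wording.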

When a tool crosses the crystallization threshold (3x occurrences, escalating to 5x at 20+ deployed skills to prevent library bloat), it becomes a generation candidate.

Design intent: The escalating threshold implements the diminishing returns finding from Single-Agent scaling (arXiv:2601.04748) — more skills don't always help, and eventually selection quality degrades. AceForge gates quantity to preserve quality.
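The escalating threshold can be sketched as follows. The function names are hypothetical, and treating "crosses the threshold" as "occurrences greater than or equal to the threshold" is an assumption.

```typescript
// 3 occurrences by default, 5 once the library holds 20 or more deployed skills.
function crystallizationThreshold(deployedSkills: number): number {
  return deployedSkills >= 20 ? 5 : 3;
}

// Assumption: a tool qualifies as soon as its count reaches the threshold.
function isGenerationCandidate(occurrences: number, deployedSkills: number): boolean {
  return occurrences >= crystallizationThreshold(deployedSkills);
}
```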

Dual-Model LLM Pipeline

Skill generation uses two independent LLMs working in sequence:

Generator (default: MiniMax M2.7) writes the SKILL.md from real trace data — actual arguments, actual failures, actual corrections. The prompt enforces progressive disclosure structure (When to Use → Pre-flight → Instructions → Error Recovery → Anti-Patterns) based on SkillsBench's finding that focused skills with 2-3 modules outperform comprehensive documentation.
Reviewer (default: DeepSeek Chat) critiques the generated skill against structured criteria. It evaluates trigger precision, instruction specificity, anti-pattern grounding, and security. Verdict: APPROVE, REVISE (one retry), or REJECT.

Both models are provider-agnostic — any OpenAI-compatible /chat/completions or Anthropic-native /v1/messages endpoint works. Format auto-detected. Rate-limited to 8 calls per cycle with 2-second intervals.
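To illustrate the two endpoint shapes, here is a sketch of how a request URL might be chosen. The detection heuristic (provider name or Anthropic host) and the function names are assumptions; AceForge's actual auto-detection logic is not shown in this README.

```typescript
type ApiFormat = "openai" | "anthropic";

// Hypothetical heuristic: treat the provider as Anthropic-native when its
// name or base URL says so; everything else is assumed OpenAI-compatible.
function detectFormat(provider: string, baseUrl: string): ApiFormat {
  if (provider.toLowerCase() === "anthropic" || baseUrl.includes("api.anthropic.com")) {
    return "anthropic";
  }
  return "openai";
}

// Builds the request URL for the detected format.
function endpointFor(baseUrl: string, format: ApiFormat): string {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return format === "anthropic" ? `${base}/v1/messages` : `${base}/chat/completions`;
}
```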

Design intent: The proposer/judge dual-model loop is validated by Multi-Agent Evolve (arXiv:2510.23595). Independent review with structured criteria (not open-ended judging) delivers 8-11% accuracy improvement per DeepVerifier (arXiv:2601.15808).

Quality Scoring & Hybrid LLM Judge

When a skill already exists for a tool, AceForge doesn't skip it. It scores the existing skill on two dimensions:

Structural quality (40% weight): Trigger clarity, progressive disclosure sections, procedural depth, anti-pattern grounding, conciseness, metadata completeness, security hygiene. Pure text analysis — no LLM calls, runs in milliseconds.

Coverage (60% weight): Argument pattern coverage vs. your actual traces, failure coverage vs. your observed errors, correction coverage vs. your user fixes, usage recency, success improvement since deployment.

Combined Score	Action	Method
< 40	Auto-propose upgrade	Deterministic only (zero LLM cost)
40–70	LLM judge evaluates	Hybrid: 50% deterministic + 50% semantic
> 70	Leave it alone	Skill is adequate

Design intent: The hybrid approach reserves expensive LLM calls for genuinely ambiguous cases. The 40-70 "ambiguous zone" is where deterministic scoring can't confidently decide, so the reviewer LLM provides semantic evaluation against actual trace samples.
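The weighting and routing above can be sketched as below. Function names are illustrative, and treating both 40 and 70 as part of the judge zone is an assumption based on the "40–70" row.

```typescript
type ScoreAction = "propose-upgrade" | "llm-judge" | "leave-alone";

// Weighted combination: 40% structural, 60% coverage (both on a 0-100 scale).
function combinedScore(structural: number, coverage: number): number {
  return (40 * structural + 60 * coverage) / 100;
}

// Routes a combined score to one of the three actions in the table.
function routeByScore(score: number): ScoreAction {
  if (score < 40) return "propose-upgrade"; // deterministic only, zero LLM cost
  if (score <= 70) return "llm-judge"; // ambiguous zone
  return "leave-alone"; // skill is adequate
}

// In the ambiguous zone, deterministic and semantic scores count equally.
function hybridScore(deterministic: number, semantic: number): number {
  return (deterministic + semantic) / 2;
}
```

For example, a skill with structural 20 and coverage 30 combines to 26 and is routed straight to an upgrade proposal without any LLM call.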

Skill Evolution & Lifecycle

Three Analysis Paths

Every agent_end cycle evaluates tools through three explicit paths:

Evolution — Deployed skill with 50+ new traces → revise with new data
Upgrade — Deployed skill scoring below 60 → propose a replacement
New proposal — No existing skill → generate from scratch

This ensures no tool falls through the cracks regardless of its deployment state.

Evolution Over Regeneration

After 50+ new traces accumulate since deployment, AceForge revises existing skills rather than regenerating from scratch. The revision prompt includes only the new data — new success patterns, new failures, new corrections — and instructs the generator to preserve what works while updating what doesn't.

Design intent: Trajectory-level revision outperforms full regeneration per SE-Agent (arXiv:2508.02085). Skills accumulate operational wisdom over time rather than losing it to rewrites.

Milestone Distillation

At activation milestones (500, 2000, 5000), AceForge runs a Summarize–Reflect–Locate–Revise (SRLR) cycle on each deployed skill's trace corpus. The distillation computes what argument patterns, failure modes, and user corrections emerged since deployment, identifies divergences between the skill's instructions and actual usage, and surfaces actionable recommendations. Meaningful divergences trigger a notification; /forge distill <skill> shows the full report anytime.

Design intent: The SRLR loop follows K2-Agent (arXiv:2603.00676)'s knowledge refinement approach. Milestone-based checkpoints (not continuous mutation) come from SAGE (arXiv:2512.17102)'s Sequential Rollout — skills accumulate operational wisdom at each tier rather than evolving reactively.
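A small sketch of the milestone check, assuming distillation fires once when the activation count moves from below a milestone to at or above it. The `milestoneCrossed` helper is hypothetical.

```typescript
const DISTILL_MILESTONES = [500, 2000, 5000];

// Returns the first milestone crossed between two activation counts,
// or null when none was crossed.
function milestoneCrossed(previous: number, current: number): number | null {
  for (const milestone of DISTILL_MILESTONES) {
    if (previous < milestone && current >= milestone) return milestone;
  }
  return null;
}
```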

Novel Success Capture

When the agent successfully uses a tool for the first time — one that has no existing skill, no pending proposal, and no prior failures — AceForge captures it as a novel one-shot success. These captures queue for human review via /forge captures and can be promoted to lower the crystallization threshold for that tool, or dismissed.

Design intent: Inspired by Voyager (arXiv:2305.16291)'s self-verification before adding skills to the library. Each capture passes a 6-check novelty filter before being recorded.

Maturity Stages

Skills progress through maturity stages based on real-world performance:

Proposed → Deployed → Committed (50+ activations, 75%+ success, 14+ days) → Mature
Apoptosis detection flags skills with sustained low activation or degraded success rates
Version history — every deploy, upgrade, micro-revision, rollback, retire, and reinstate is recorded with full SKILL.md content. /forge history shows the timeline; /forge diff shows what changed between versions. Zero dependencies — LCS-based diff engine built in.
Effectiveness watchdog runs A/B comparisons when upgrades are deployed

Design intent: This mirrors the write phase of Memento-Skills (arXiv:2603.18743), in which the agent updates and expands its skill library based on new experience. Micro-revisions are the fast path; full rewrites are the deliberate path.
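The Committed criteria listed above (50+ activations, 75%+ success, 14+ days) could be checked like this. The `SkillStats` shape is an assumption; AceForge's lifecycle records may track these values differently.

```typescript
// Hypothetical per-skill statistics.
interface SkillStats {
  activations: number;
  successes: number;
  deployedAtMs: number;
}

const DAY_IN_MS = 24 * 60 * 60 * 1000;

// True when a deployed skill meets all three Committed criteria.
function isCommitted(stats: SkillStats, nowMs: number): boolean {
  if (stats.activations < 50) return false;
  const successRate = stats.successes / stats.activations;
  const ageDays = (nowMs - stats.deployedAtMs) / DAY_IN_MS;
  return successRate >= 0.75 && ageDays >= 14;
}
```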

Intelligence

Six modules run on every agent_end hook via setImmediate (non-blocking), continuously identifying where the agent needs improvement, propagating learning across sessions, and autonomously adjusting skills from corrections.

Capability Tree

All skills are organized into a hierarchical capability tree with gap scoring per domain. Domains are categorized recursively — exec-docker, exec-ssh, read-code each fall under their parent tool's domain. Gap scores increment on every detected fallback, deferral, or infrastructure failure.

Design intent: AgentSkillOS (arXiv:2603.02176) found that DAG-based pipelines substantially outperform flat invocation even with identical skill sets. AceForge's capability tree provides the structural foundation for ecosystem-level management.

Cross-Session Propagation

Pattern data aggregates across all communication channels (Telegram, Slack, Discord, iMessage) into a persistent JSON state. Tools that recur across sessions but haven't crystallized into skills are flagged as cross-session candidates.

Design intent: Following Memento-Skills (arXiv:2603.18743), skills persist across sessions as evolving procedural memory. Cross-session state is designed for integration with memory-augmented MDP systems per Memento (arXiv:2508.16153).

Skill Composition Detection

When two skills co-activate in >50% of sessions across 3+ sessions, AceForge detects the co-activation pattern and reports it as a composition candidate. The detection uses per-session tool matching against active skill prefixes.

Design intent: AgentSkillOS (arXiv:2603.02176) found that DAG-based pipelines substantially outperform flat invocation even with identical skill sets. AceForge's composition detection identifies the candidates; DAG orchestration is the target for future composition generation.
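One way to read the co-activation rule is sketched below. It assumes each session is reduced to a list of active skill names, that "3+ sessions" means the pair appears together in at least three sessions, and that the 50% ratio is taken over sessions where either skill was active. All names are hypothetical.

```typescript
interface CompositionCandidate {
  pair: [string, string];
  sessions: number; // sessions in which both skills were active
  ratio: number; // share of sessions with either skill that had both
}

// Each inner array lists the skills active in one session.
function findCompositionCandidates(sessions: string[][]): CompositionCandidate[] {
  const all = ([] as string[]).concat(...sessions);
  const skills = Array.from(new Set(all)).sort();
  const candidates: CompositionCandidate[] = [];

  for (let i = 0; i < skills.length; i++) {
    for (let j = i + 1; j < skills.length; j++) {
      const a = skills[i];
      const b = skills[j];
      const both = sessions.filter((s) => s.includes(a) && s.includes(b)).length;
      const either = sessions.filter((s) => s.includes(a) || s.includes(b)).length;
      const ratio = either === 0 ? 0 : both / either;
      if (both >= 3 && ratio > 0.5) {
        candidates.push({ pair: [a, b], sessions: both, ratio });
      }
    }
  }
  return candidates;
}
```

The pairwise scan is quadratic in the number of skills, which should be acceptable for the library sizes the escalating threshold is meant to maintain.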

Proactive Gap Detection

On every agent_end, AceForge analyzes pattern data for four behavior categories that indicate capability gaps:

Pattern	What It Detects	Example
Fallback	Agent can't perform a task	"I can't do that" / "you'll need to manually"
Deferral	Agent asks permission when it should act	"let me know if you want me to..."
Uncertainty	Agent lacks confidence	"I think" / "I'm not sure"
Infrastructure	Missing tools or access	"requires installation" / "not found"

Each detection increments the relevant domain's gap score in the capability tree. Critical gaps (5+ occurrences) trigger notifications.
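A rough sketch of the four-category detection, using only the example phrases from the table above. The pattern lists, `classifyGaps`, and `recordGaps` are illustrative assumptions; a production detector would likely need a broader, tuned phrase set.

```typescript
type GapCategory = "fallback" | "deferral" | "uncertainty" | "infrastructure";

// Phrases taken from the examples in the table above.
const GAP_PATTERNS: Record<GapCategory, RegExp[]> = {
  fallback: [/i can'?t do that/i, /you'?ll need to manually/i],
  deferral: [/let me know if you want me to/i],
  uncertainty: [/\bi think\b/i, /i'?m not sure/i],
  infrastructure: [/requires installation/i, /not found/i],
};

// Returns every category whose patterns match the agent message.
function classifyGaps(message: string): GapCategory[] {
  return (Object.keys(GAP_PATTERNS) as GapCategory[]).filter((category) =>
    GAP_PATTERNS[category].some((pattern) => pattern.test(message)),
  );
}

const CRITICAL_GAP_THRESHOLD = 5; // "5+ occurrences"

// Adds detections to the domain's gap score; returns true once it is critical.
function recordGaps(scores: Map<string, number>, domain: string, message: string): boolean {
  const next = (scores.get(domain) ?? 0) + classifyGaps(message).length;
  scores.set(domain, next);
  return next >= CRITICAL_GAP_THRESHOLD;
}
```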

Design intent: EvoSkill (arXiv:2603.02766) demonstrates failure-driven skill discovery through a Proposer agent that analyzes failure traces and suggests improvements. AceForge implements this as continuous passive monitoring rather than active probing.

Description Optimization

AceForge periodically compares each skill's description against actual conversation language using token overlap analysis. Skills with <30% overlap between their trigger description and how you actually phrase related requests are flagged, because the description is the discovery mechanism.

Design intent: SkillsBench (arXiv:2602.12670) found that 56% of skills are never invoked because descriptions don't match user intent. This module ensures skills stay findable as your language evolves over time.
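Token overlap can be measured in several ways. The sketch below assumes overlap is the share of description tokens that also appear in the user's phrasing; the README does not specify the exact formula, and the function names are hypothetical.

```typescript
// Lowercased alphanumeric tokens, de-duplicated.
function tokenize(text: string): Set<string> {
  return new Set<string>(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Fraction of description tokens that also occur in the user phrases.
function descriptionOverlap(description: string, userPhrases: string[]): number {
  const descTokens = tokenize(description);
  if (descTokens.size === 0) return 0;
  const userTokens = tokenize(userPhrases.join(" "));
  let shared = 0;
  descTokens.forEach((token) => {
    if (userTokens.has(token)) shared++;
  });
  return shared / descTokens.size;
}

// Flags descriptions with less than 30% overlap.
function needsDescriptionUpdate(description: string, userPhrases: string[]): boolean {
  return descriptionOverlap(description, userPhrases) < 0.3;
}
```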

Autonomous Skill Adjustment

When corrections are detected, AceForge matches them to the active skill by temporal proximity and applies micro-revisions immediately (no approval needed):

Anti-pattern append — "User correction for exec: use --rm flag (original: docker run nginx)"
Instruction addendum — adds a note to the instructions section
Correction log — HTML comment with full correction context

After 3+ micro-revisions in 30 days, AceForge triggers a full LLM rewrite proposal (with approval).
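A sketch of the anti-pattern line format and the rewrite trigger. Both helper names are hypothetical; the output string follows the example given in the list above.

```typescript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Formats an anti-pattern entry in the style of the example above.
function formatAntiPattern(tool: string, correction: string, original: string): string {
  return `User correction for ${tool}: ${correction} (original: ${original})`;
}

// True when 3 or more micro-revisions fall inside the trailing 30-day window.
function shouldProposeRewrite(revisionTimesMs: number[], nowMs: number): boolean {
  const recent = revisionTimesMs.filter((t) => {
    const age = nowMs - t;
    return age >= 0 && age <= THIRTY_DAYS_MS;
  });
  return recent.length >= 3;
}
```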

Validation

Three modules ensure deployed skills actually work, generate realistic test scenarios, and verify the security validator itself.

Skill Health Testing

Periodic validation that installed skills reference real, working infrastructure:

CLI commands — extracted from SKILL.md, verified via which (e.g., ssh, docker, git)
File paths — extracted from backtick references, verified via existsSync
API endpoints — extracted from URLs, health-checked via HEAD request (5s timeout)

Skills that fail health tests are flagged with specific failure reasons.

Design intent: EvoSkill (arXiv:2603.02766) retains only skills that improve held-out validation performance. Health testing is the infrastructure-level equivalent — ensuring skills don't reference binaries that have been uninstalled or endpoints that have moved.

Grounded Challenges

Generates realistic test scenarios from operational context:

Query OpenViking for recent context related to each skill's tool domain
Generate task prompts grounded in real operational data
Fall back to pattern-based generation when OpenViking is unavailable

Challenges are logged for tracking skill activation patterns over time.

Design intent: SE-Agent (arXiv:2508.02085) demonstrates curriculum generation for progressive testing of agent capabilities. Grounded challenges prevent the "teaching to the test" problem by generating scenarios from real-world context.

Adversarial Robustness

Mutation testing against the security validator with 23 attack variants:

Category	Mutations
Prompt injection	ignore-instructions, disregard-prior, you-are-now, forget-everything, multiline-split
Credential exfil	API key, password, long token, env var exfiltration
Persistence attacks	SOUL.md write, MEMORY.md write, IDENTITY.md write
Evasion	Base64-encoded payload, homoglyph/IDN domain
Structural	Path traversal, overlength, missing name, missing description, unknown domain
Credential harvesting	Bare tilde path (~/.ssh), git credential URL, bash history read, Telegram bot token

The adversarial suite runs at startup. Results are displayed in the startup dashboard. Current: 23/23 caught.

Design intent: Chen et al. (arXiv:2602.12430) found a 26.1% vulnerability rate in community-contributed skills. The ClawHavoc campaign validated that SOUL.md/MEMORY.md targeting is the primary real-world attack vector. AceForge's adversarial suite is specifically designed around these threat models.

Security

Every generated skill passes through the security validator before you ever see it:

Prompt injection detection — catches "ignore previous instructions" and variants, including multiline split injection across numbered lists
Credential scanning — flags API keys, tokens, passwords in plaintext
Base64 payload detection — catches encoded payloads piped to shell/eval
Homoglyph/IDN domain detection — catches Cyrillic and other confusable characters in domain names
Environment variable exfiltration — detects $SECRET_KEY in URL contexts
Bare tilde path detection — catches ~/.ssh, ~/.bash_history and similar sensitive home-relative paths
Git credential URL detection — flags embedded tokens in git clone URLs (e.g., ghp_...@github.com)
Bash history read detection — catches credential harvesting from shell history files
Telegram bot token detection — flags bot tokens embedded in skill instructions
Path traversal prevention — resolves paths against workspace boundary, including backtick-wrapped paths
SOUL.md/MEMORY.md/IDENTITY.md write detection — the primary ClawHavoc attack vector
Skill conflict detection — Jaccard+bigram hybrid similarity blocks 95%+ description overlap, warns at 80%+. Proposal dedup checks name prefix, bundledTools, and existing proposals for the same tool
ClawHub dedup — checks if a skill already exists on ClawHub before proposing
Network domain allowlist — warns on unrecognized domains
LLM output size limit — generated skills capped at 50KB
Skill name validation — names with path characters rejected at proposal time
Upgrade validation — upgrades pass through the full validator before the old skill is retired
Rollback safety — retired versions are validated before the active version is deleted
LLM rate limiting — 2s interval, 8 calls/cycle max
Trace data sanitization — pattern data is sanitized before injection into LLM prompts
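The skill conflict check in the list above combines Jaccard and bigram similarity. The sketch below assumes the hybrid is the mean of word-level and bigram-level Jaccard scores, falling back to word-level only when a description is too short to have bigrams. The real combination in validator.ts may differ.

```typescript
function words(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Adjacent word pairs, e.g. ["run docker", "docker containers"].
function bigrams(tokens: string[]): string[] {
  const out: string[] = [];
  for (let i = 0; i + 1 < tokens.length; i++) out.push(`${tokens[i]} ${tokens[i + 1]}`);
  return out;
}

// Jaccard index: |A ∩ B| / |A ∪ B| over de-duplicated items.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let intersection = 0;
  setA.forEach((item) => {
    if (setB.has(item)) intersection++;
  });
  return intersection / (setA.size + setB.size - intersection);
}

// Assumed hybrid: mean of word-level and bigram-level Jaccard.
function hybridSimilarity(a: string, b: string): number {
  const wordsA = words(a);
  const wordsB = words(b);
  const wordScore = jaccard(wordsA, wordsB);
  const bigramsA = bigrams(wordsA);
  const bigramsB = bigrams(wordsB);
  if (bigramsA.length === 0 || bigramsB.length === 0) return wordScore;
  return (wordScore + jaccard(bigramsA, bigramsB)) / 2;
}

type ConflictVerdict = "block" | "warn" | "ok";

// Blocks at 95%+ similarity, warns at 80%+.
function checkConflict(a: string, b: string): ConflictVerdict {
  const similarity = hybridSimilarity(a, b);
  if (similarity >= 0.95) return "block";
  if (similarity >= 0.8) return "warn";
  return "ok";
}
```

Including bigrams makes the score sensitive to word order, so two descriptions that share vocabulary but describe different procedures score lower than they would on word-level Jaccard alone.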

Commands

AceForge uses a single /forge command with subcommands:

Core

Command	Description
`/forge`	Dashboard — skills, proposals, patterns, gaps
`/forge approve <n>`	Deploy a proposed skill
`/forge reject <n>`	Reject a proposal (or `reject all`)
`/forge upgrade <n>`	Deploy upgrade, retire old (with validation)
`/forge rollback <n>`	Undo an upgrade (with validation)
`/forge retire <n>`	Retire an active skill
`/forge reinstate <n>`	Bring back a retired skill

Diagnostics

Command	Description
`/forge list`	Full inventory — active, proposed, retired
`/forge quality <n>`	Score a skill against actual usage data
`/forge gaps`	All capability gaps — tool failures + behavior + cross-session
`/forge watchdog`	Effectiveness check — flags underperformers
`/forge filtered`	What quality gates suppressed and why
`/forge preview <n>`	Human-readable skill brief before approving

Intelligence

Command	Description
`/forge tree`	Capability tree with gap scores per domain
`/forge cross_session`	Cross-session pattern analysis
`/forge compose`	Skill co-activation analysis
`/forge behavior_gaps`	Fallback / deferral / uncertainty detection
`/forge optimize`	Description-language mismatch report

Evolution

Command	Description
`/forge evolve <n>`	LLM-powered skill revision with trace delta + unified diff
`/forge distill <n>`	SRLR trace distillation report (no LLM revision)
`/forge captures`	List novel one-shot success captures
`/forge capture promote <tool>`	Promote a capture for crystallization
`/forge capture dismiss <tool>`	Dismiss a capture

History

Command	Description
`/forge history <n>`	Version history timeline
`/forge diff <n> [v]`	Unified diff between versions

Validation

Command	Description
`/forge test`	Health tests on all deployed skills
`/forge challenge`	Grounded challenge scenario generation
`/forge adversarial`	Adversarial mutation suite (23 variants)

Agent-Callable Tools

These tools are registered for programmatic use by the agent itself: forge, forge_reflect, forge_propose, forge_approve_skill, forge_reject_skill, forge_quality, forge_registry, forge_rewards, forge_tree, forge_gaps

Installation

One command:

openclaw plugins install aceforge

Then restart your gateway:

openclaw gateway restart

Verify:

openclaw plugins list | grep aceforge

Alternative install methods:

# From npm directly
npm install aceforge

# From source (for development)
git clone https://github.com/sudokrang/aceforge.git ~/.openclaw/extensions/aceforge
cd ~/.openclaw/extensions/aceforge && npm install

Configuration

Provider Agnostic

Both generator and reviewer support OpenAI-compatible (/chat/completions) and Anthropic-native (/v1/messages) endpoints. Format auto-detected from openclaw.json or provider name. Any provider works:

Provider	Base URL	Notes
MiniMax (default generator)	`https://api.minimax.io/v1`	M2.7 — strong structured output
DeepSeek (default reviewer)	`https://api.deepseek.com`	Chat — structured rubric review
OpenAI	`https://api.openai.com/v1`	GPT-4o or GPT-5.4
Anthropic	`https://api.anthropic.com`	Claude via `/v1/messages` — auto-detected
OpenRouter	`https://openrouter.ai/api/v1`	Claude, Gemini, Llama, etc.
Together	`https://api.together.xyz/v1`	Llama, Mixtral, open models
Groq	`https://api.groq.com/openai/v1`	Fast inference — Llama, Gemma
Cerebras	`https://api.cerebras.ai/v1`	Wafer-scale inference
Hugging Face	`https://api-inference.huggingface.co/v1`	Any HF Inference model
Kimi (Moonshot)	`https://api.moonshot.cn/v1`	Kimi K2.5
Ollama	`http://127.0.0.1:11434/v1`	Local — fully offline
LM Studio	`http://127.0.0.1:1234/v1`	Local — fully offline
vLLM	`http://127.0.0.1:8000/v1`	Local — high-throughput serving

Channel Agnostic

Notifications work across all 25+ OpenClaw channels. The formatting layer operates on format types, not channel names:

Format	Channels	Bold	Code
`html`	Telegram, email	`<b>`	`<code>`
`mrkdwn`	Slack	`*single*`	`` ` ``
`markdown`	Discord, Matrix	`**double**`	`` ` ``
`plain`	Everything else	passthrough	passthrough

Plain text with Unicode + emoji is the primary design target — rich formatting is a polish layer. Adding a new channel: one line in FORMAT_MAP.
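The format-type idea can be sketched as follows. `FORMAT_MAP` is named in this README, but its shape here, along with `formatFor` and `bold`, is an assumption. HTML escaping of the text is omitted for brevity.

```typescript
type FormatType = "html" | "mrkdwn" | "markdown" | "plain";

// Channel name -> format type. Unlisted channels fall back to "plain".
// Supporting a new channel means adding one line here.
const FORMAT_MAP: Record<string, FormatType> = {
  telegram: "html",
  email: "html",
  slack: "mrkdwn",
  discord: "markdown",
  matrix: "markdown",
};

function formatFor(channel: string): FormatType {
  return FORMAT_MAP[channel.toLowerCase()] ?? "plain";
}

// Renders bold text for the channel's format type.
function bold(text: string, channel: string): string {
  switch (formatFor(channel)) {
    case "html":
      return `<b>${text}</b>`;
    case "mrkdwn":
      return `*${text}*`; // Slack uses single asterisks
    case "markdown":
      return `**${text}**`; // Markdown uses double asterisks
    default:
      return text; // plain: passthrough
  }
}
```

Because the renderer switches on format type rather than channel name, the rendering code stays unchanged as channels are added.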

OpenViking Compatible

AceForge is fully compatible with OpenViking for context-enriched challenge generation. Circuit breaker: 5s timeout, 3 failures → open for 10 min.
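The circuit breaker's state handling might look like the sketch below. The class and method names are hypothetical, and the caller is assumed to call `recordFailure()` whenever a request errors or exceeds the 5-second timeout. The clock is injectable so the cool-down can be tested without waiting.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAtMs: number | null = null;
  private readonly maxFailures: number;
  private readonly openForMs: number;
  private readonly now: () => number;

  // Defaults follow the README: 3 failures open the breaker for 10 minutes.
  constructor(maxFailures = 3, openForMs = 10 * 60 * 1000, now: () => number = Date.now) {
    this.maxFailures = maxFailures;
    this.openForMs = openForMs;
    this.now = now;
  }

  // While open, callers should skip the request and use their fallback.
  isOpen(): boolean {
    if (this.openedAtMs === null) return false;
    if (this.now() - this.openedAtMs >= this.openForMs) {
      // Cool-down elapsed: close the breaker and start counting afresh.
      this.openedAtMs = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  // Call on any error or timeout; opens the breaker at the failure limit.
  recordFailure(): void {
    this.failures++;
    if (this.failures >= this.maxFailures) this.openedAtMs = this.now();
  }
}
```

When the breaker is open, grounded challenge generation would use the pattern-based fallback instead of querying OpenViking.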

Environment Variables

Variable	Default	Description
`ACEFORGE_GENERATOR_PROVIDER`	`minimax`	Provider for skill generation
`ACEFORGE_GENERATOR_API_KEY`	from openclaw.json	API key override
`ACEFORGE_GENERATOR_MODEL`	`MiniMax-M2.7`	Model override
`ACEFORGE_GENERATOR_URL`	`https://api.minimax.io/v1`	Base URL override
`ACEFORGE_REVIEWER_PROVIDER`	`deepseek`	Provider for skill review + LLM judge
`ACEFORGE_REVIEWER_API_KEY`	from openclaw.json	API key override
`ACEFORGE_REVIEWER_MODEL`	`deepseek-chat`	Model override
`ACEFORGE_REVIEWER_URL`	`https://api.deepseek.com`	Base URL override
`ACEFORGE_NOTIFICATION_CHANNEL`	auto-detect	Force: `telegram`, `slack`, `log`
`ACEFORGE_TELEGRAM_BOT_TOKEN`	from openclaw.json	Telegram bot token
`ACEFORGE_OWNER_CHAT_ID`	from openclaw.json	Telegram chat ID
`ACEFORGE_SLACK_WEBHOOK_URL`	—	Slack incoming webhook
`ACEFORGE_VIKING_URL`	`http://127.0.0.1:1933`	OpenViking URL (optional)
`ACEFORGE_DRY_RUN`	`false`	Observation-only mode — log proposals without writing to disk
`ACEFORGE_SHARED_SKILLS`	`false`	Deploy approved skills to `~/.openclaw/skills/` (shared across all agents)

Quick Start Examples

OpenAI + Slack:

export ACEFORGE_GENERATOR_PROVIDER=openai
export ACEFORGE_GENERATOR_API_KEY=sk-...
export ACEFORGE_REVIEWER_PROVIDER=openai
export ACEFORGE_REVIEWER_MODEL=gpt-4o
export ACEFORGE_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...

Anthropic (Claude):

export ACEFORGE_GENERATOR_PROVIDER=anthropic
export ACEFORGE_GENERATOR_API_KEY=sk-ant-...
export ACEFORGE_REVIEWER_PROVIDER=anthropic
export ACEFORGE_REVIEWER_API_KEY=sk-ant-...

Local models via LM Studio:

export ACEFORGE_GENERATOR_PROVIDER=lmstudio
export ACEFORGE_GENERATOR_URL=http://127.0.0.1:1234/v1
export ACEFORGE_GENERATOR_API_KEY=not-needed
export ACEFORGE_REVIEWER_PROVIDER=lmstudio
export ACEFORGE_REVIEWER_URL=http://127.0.0.1:1234/v1
export ACEFORGE_REVIEWER_API_KEY=not-needed

Architecture


File Structure

~/.openclaw/extensions/aceforge/
├── openclaw.plugin.json        # Plugin manifest + configSchema
├── index.ts                    # Entry — hooks, tools, /forge router, startup
├── tests/
│   └── test-validator.ts       # 412 assertions — validator, quality, adversarial, drift detection
└── src/
    ├── notify.ts               # Transport layer (Telegram / Slack / log) + HTML sanitizer
    ├── notify-format.ts        # FORMAT_MAP architecture — format types, not channel names
    ├── pattern/
    │   ├── constants.ts        # Canonical blocklists — TOOL, CAPTURE, NATIVE_TOOLS, SELF_TOOLS
    │   ├── store.ts            # JSONL with rotation (10K lines, 30 days, gzip)
    │   ├── capture.ts          # after_tool_call — trace + chain logging + session persistence
    │   ├── detect.ts           # Correction detection from user messages
    │   ├── analyze.ts          # Pattern analysis orchestrator — 3-path loop
    │   ├── analyze-utils.ts    # Filesystem helpers — dedup checks, file readers
    │   ├── analyze-native.ts   # Native tool sub-pattern clustering + domain extraction
    │   ├── analyze-chains.ts   # Workflow chain analysis → multi-tool skill proposals
    │   └── gap-detect.ts       # Gap analysis engine (tool-level)
    ├── skill/
    │   ├── generator.ts        # Template fallback generator
    │   ├── llm-generator.ts    # Dual-model pipeline + workflow + remediation + upgrade
    │   ├── llm-judge.ts        # LLM-as-judge for ambiguous quality scores (40-70)
    │   ├── quality-score.ts    # Deterministic structural + coverage scoring
    │   ├── validator.ts        # Security gate — 23 attack patterns + similarity + SOUL.md
    │   ├── history.ts          # Version history — recordRevision, LCS diff, timeline
    │   ├── lifecycle.ts        # Activation tracking, health cache, A/B, watchdog, baselines
    │   └── index.ts            # Skill index — metadata-only context injection (3K token budget)
    ├── intelligence/
    │   ├── capability-tree.ts  # Recursive domain categorization + gap scoring
    │   ├── cross-session.ts    # Cross-session pattern aggregation
    │   ├── composition.ts      # Co-activation detection
    │   ├── proactive-gaps.ts   # Fallback/deferral/uncertainty/infrastructure detection
    │   ├── description-optimizer.ts  # Token overlap analysis for trigger optimization
    │   └── auto-adjust.ts      # Micro-revisions from corrections
    ├── validation/
    │   ├── health-test.ts      # Verify CLIs, paths, endpoints
    │   ├── grounded-challenges.ts  # Test scenarios from Viking/patterns
    │   └── adversarial.ts      # 23 mutation variants against validator
    ├── evolution/
    │   ├── distill.ts          # SRLR trace distillation at milestones (500/2000/5000)
    │   ├── capture-novel.ts    # Novel one-shot success capture
    │   └── evolve-command.ts   # /forge evolve — LLM revision + unified diff
    └── viking/
        └── client.ts           # OpenViking context engine client (circuit breaker)

RL & Ecosystem Integration

AceForge exposes machine-readable interfaces designed for integration with frontier agentic research systems:

MetaClaw & OpenClaw-RL

The forge_registry and forge_rewards tools provide structured data for reinforcement learning training loops:

forge_registry — machine-readable skill catalog with per-skill success rates, activation counts, deployment paths, and source attribution
forge_rewards — per-skill reward signals (success rate, count, last updated) formatted for direct consumption by RL training pipelines

These interfaces are designed to support MetaClaw (arXiv:2603.17187) proxy-based meta-learning and OpenClaw-RL (arXiv:2603.10165) reinforcement learning from deployment feedback.

Capability Tree as Ecosystem Signal

The forge_tree tool returns a structured JSON capability tree with gap scores per domain. This enables ecosystem-level management: which domains need attention, where to allocate development effort, and which skills are driving the most value.

The tree structure is directly compatible with AgentSkillOS (arXiv:2603.02176)'s recursive categorization model, enabling future integration with multi-agent skill sharing and orchestration systems.

Cross-Session State

The cross-session pattern state (cross-session-patterns.json) provides a persistent view of tool usage across all communication channels. This data surface is designed for integration with memory-augmented MDP systems per Memento (arXiv:2508.16153), enabling case-based skill selection from deployment experience.

Research Basis

Every major design decision in AceForge is grounded in peer-reviewed research. The full citation table:

35 citations across 14 research areas

Concept	Paper	How AceForge Uses It
Skills fail without proper triggers	SkillsBench (Feb 2026)	Description-first prompt design; 56% invocation failure validates trigger optimization
Bad skills hurt performance	SkillsBench (Feb 2026)	Quality scoring engine; upgrade proposals when skills score < 60/100
Focused > comprehensive	SkillsBench (Feb 2026)	150-line limit; 2-3 dominant pattern focus in generator prompt
LLM skills can degrade	IoT-SkillsBench (Mar 2026)	Effectiveness watchdog; baseline comparison; auto-flagging
Hierarchical skill organization	SkillRL (Feb 2026)	Category metadata in frontmatter; domain classification
Controller-Executor-Designer	MemSkill (Feb 2026)	Analyze (controller) → Generate (executor) → Evolve (designer)
Skill co-evolution with context	MCE (Jan 2026)	Skills evolve from new trace data; trajectory-level revision
Selection degrades at scale	Single-Agent scaling (Jan 2026)	Escalating threshold; quality gating prevents library bloat
Proposer/Judge dual-model	Multi-Agent Evolve (Oct 2025)	Generator + independent Reviewer pipeline
Rubric-guided verification	DeepVerifier (Jan 2026)	Structured review criteria in reviewer prompt (8-11% improvement)
Cumulative skill creation	CASCADE (Dec 2025)	Self-evolving skill framework with human-gated deployment
Trajectory-level revision	SE-Agent (2025)	Skills revised from new data, not regenerated from scratch
Hierarchical procedural memory	MACLA, AAMAS 2026	Chain-to-workflow composition for multi-tool sequences
Skill vulnerability prevalence	Chen et al. (Feb 2026)	26.1% vulnerability rate validates adversarial testing approach
Progressive disclosure	Chen et al. (Feb 2026)	3-level architecture: metadata-only → instructions → scripts
Learned → externalized skills	Chen et al. (Feb 2026)	AceForge bridges implicit tool patterns to explicit SKILL.md files
Marketplace skill imbalance	Ling et al. (Feb 2026)	Quality scoring + upgrade proposals for underperforming skills
Proxy-based meta-learning	MetaClaw (Mar 2026)	Registry + rewards tools for MetaClaw/OpenClaw-RL integration
Inter-task skill evolution	Fang et al. Survey (Aug 2025)	Workflow consolidation across sessions
Procedural + semantic memory	Jeunen et al. (May 2025)	Gap analysis augments with failure-driven awareness
Supply chain attack at scale	ClawHavoc / Antiy CERT (Feb 2026)	1,184 malicious skills; SOUL.md write detection + adversarial testing
Capability tree at ecosystem scale	AgentSkillOS (Mar 2026)	Recursive categorization; tree-based retrieval; gap scoring
Read-Write Reflective Learning	Memento-Skills (Mar 2026)	Cross-session propagation; autonomous skill adjustment
Failure-driven skill discovery	EvoSkill (Mar 2026)	Proactive gap detection; health validation
Memory-augmented MDP	Memento (2025)	Case-based reasoning for skill selection from deployment experience
Self-evolving agent framework	Self-Evolving Agents Survey (Jul 2025)	Comprehensive framework: environment, experience, self evolution
RL from deployment feedback	OpenClaw-RL (Mar 2026)	forge_rewards tool provides RL-compatible reward signals
DAG-based pipeline composition	AgentSkillOS (Mar 2026)	Co-activation detection for future DAG orchestration
Multi-agent skill sharing	AgentSkillOS (Mar 2026)	Capability tree structure for multi-agent coordination
Skill persistence as memory	Memento-Skills (Mar 2026)	Skills persist across sessions as evolving procedural memory
Milestone-based skill accumulation	SAGE (Dec 2025)	Sequential Rollout — distillation at 500/2000/5000 activation milestones
Summarize–Reflect–Locate–Revise	K2-Agent (Mar 2026)	SRLR loop for trace distillation and knowledge refinement
On-policy skill preservation	SDFT (Jan 2026)	Self-distillation preserves prior capabilities during evolution
Self-verification before library add	Voyager (May 2023)	Novel capture validates first-time successes before queuing
Autonomous experiential learning	SEAgent (Aug 2025)	Specialist-to-generalist training; curriculum-based task generation

What AceForge Is

AceForge is a skill engine. It generates, validates, and manages SKILL.md files — permanent, auditable artifacts crystallized from your agent's actual behavior.

It is not a memory system, a prompt optimizer, or an RL trainer. AceForge produces one thing: validated SKILL.md files crystallized from your agent's real operational patterns.

Requirements

OpenClaw 2026.3.22 or later
Node.js 22+
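At least one LLM endpoint: an API key for an OpenAI-compatible or Anthropic-native provider, or a local server such as Ollama, LM Studio, or vLLM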

Traces auto-rotate at 10K lines or 30 days (whichever comes first) with gzip archival. No manual cleanup needed.

Contributing

Contributions are welcome. Please open an issue first to discuss what you'd like to change.

If you're running the test suite:

npx tsx tests/test-validator.ts

All tool blocklists are defined in src/pattern/constants.ts. The test suite enforces zero-drift: if you add a new tool to any blocklist, all source files must import from constants.ts or tests will fail. This is intentional.

License

MIT — see LICENSE

Built by sudokrang · Grounded in peer-reviewed research · Nothing deploys without your approval
