Install
openclaw skills install @harrylabsj/prompt-version-control
Git-like version control for AI prompts, enabling versioning, semantic diffs, A/B tests, metric tracking, rollback, and remote collaboration.
Git-inspired version control system for AI prompts. Track every edit, run A/B tests, measure quality metrics, and roll back instantly: treat your prompts like code.
Input: prompt init [directory] — user runs init command in a project or empty directory.
Output: Creates .prompt/ directory structure:
.prompt/
  config.yaml   # repo settings, LLM provider config, test dataset path
  prompts/      # individual prompt files
  history/      # version history (git-compatible)
  metrics/      # A/B test results
  branches/     # branch references
Logic: Auto-detect if already inside a git repo; if so, integrate .prompt/ as a subdirectory. Generate initial config.yaml with sensible defaults. Prompt user for LLM API key if not in environment.
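The init logic above can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation: the function names, the default config keys, and the `LLM_API_KEY` environment variable are all assumptions made for the example.

```python
import os
from pathlib import Path
from typing import Optional

# Subdirectories taken from the .prompt/ layout described above.
SUBDIRS = ["prompts", "history", "metrics", "branches"]


def find_git_root(start: Path) -> Optional[Path]:
    """Walk upward from `start`; return the first directory that contains .git."""
    for candidate in [start, *start.parents]:
        if (candidate / ".git").exists():
            return candidate
    return None


def init_repo(directory: Path) -> dict:
    """Create the .prompt/ skeleton and report what was detected."""
    root = directory / ".prompt"
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)

    config = root / "config.yaml"
    if not config.exists():
        # Hypothetical defaults; the real config.yaml template may differ.
        config.write_text("provider: null\ntest_dataset: null\n")

    return {
        "inside_git_repo": find_git_root(directory.resolve()) is not None,
        # Hypothetical variable name; the caller would ask for a key if True.
        "needs_api_key": "LLM_API_KEY" not in os.environ,
    }
```

Calling `init_repo(Path("./my-prompts"))` twice is harmless in this sketch, because existing directories and an existing config.yaml are left untouched.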
Input: prompt add <name> [--template <type>] [--description "..."]
Output: Creates prompts/<name>.yaml with metadata and initial version v0.1.0.
name: customer-support-classifier
version: 0.1.0
description: Classify customer inquiries into 5 categories
model: gpt-4
temperature: 0.3
system: |
  You are a customer support classifier...
user_template: |
  {{query}}
variables:
  - name: query
    type: string
    required: true
test_cases:
  - input: "Where is my order #12345?"
    expected_category: "order_status"
metrics:
  quality_score: null
  avg_latency_ms: null
  avg_tokens: null
Logic: Templates include chat, classifier, generator, extractor, custom. Auto-extract variables from {{...}} patterns.
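Variable auto-extraction from {{...}} patterns could look something like the sketch below. The regular expression and the `extract_variables` name are assumptions for illustration; the actual script may accept a different identifier syntax.

```python
import re

# Matches {{name}} and tolerates whitespace inside the braces, e.g. {{ query }}.
VARIABLE_PATTERN = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def extract_variables(*texts: str) -> list:
    """Return variable names in order of first appearance, without duplicates."""
    seen = {}
    for text in texts:
        for name in VARIABLE_PATTERN.findall(text):
            seen.setdefault(name, None)  # dict keeps insertion order
    return list(seen)
```

Passing both the system text and the user_template lets one call collect every variable a prompt file needs.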
Input: User edits prompts/<name>.yaml directly or via prompt edit <name>.
Output: On save, the version number is auto-incremented (major, minor, or patch).
Logic: LLM-assisted semantic diff determines bump magnitude. User can override: prompt edit <name> --bump major.
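Once a bump magnitude has been chosen, whether by the semantic diff or by a `--bump` override, applying it is plain semantic-version arithmetic. A minimal sketch, assuming three-part versions with an optional leading `v` (the `bump_version` name is hypothetical):

```python
def bump_version(version: str, level: str) -> str:
    """Return the next version after applying a major, minor, or patch bump."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level}")
```

Lower components reset to zero on a higher-level bump, which is the usual semantic-versioning convention.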
Input: prompt test <name> — runs the current version against the previous version.
Action: Runs both versions against the test_cases in the prompt YAML.
Output: A/B comparison table.
| Test Case | Metric | v0.1.2 | v0.1.3 | Δ |
|---|---|---|---|---|
| order_status | Quality (1-10) | 8.2 | 9.1 | +0.9 ⬆ |
| order_status | Latency (ms) | 1240 | 1180 | -60 ⬇ |
| order_status | Tokens | 340 | 312 | -28 ⬇ |
| refund_request | Quality | 7.5 | 7.3 | -0.2 ⬇ |
| ... | ... | ... | ... | ... |
| Overall | Quality | 7.9 | 8.0 | +0.1 |
Logic: Quality scoring uses LLM-as-judge with predefined rubrics. A statistical significance check (p < 0.05) runs when there are ≥20 test cases. Degradations are flagged in red.
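The document does not say which statistical test is used. As one possible illustration, the sketch below applies an exact two-sided sign test to paired per-test-case quality scores and skips the check below 20 cases. The choice of test and the function names are assumptions, not the skill's confirmed method.

```python
from math import comb


def sign_test_p_value(old_scores, new_scores) -> float:
    """Two-sided exact sign test on paired scores; ties are dropped."""
    diffs = [new - old for old, new in zip(old_scores, new_scores) if new != old]
    n = len(diffs)
    if n == 0:
        return 1.0
    wins = sum(d > 0 for d in diffs)
    k = min(wins, n - wins)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare_versions(old_scores, new_scores, min_cases=20, alpha=0.05) -> dict:
    """Return the mean quality delta and, when enough cases exist, significance."""
    delta = sum(new_scores) / len(new_scores) - sum(old_scores) / len(old_scores)
    significant = None  # None means the check was skipped (too few cases)
    if len(old_scores) >= min_cases:
        significant = sign_test_p_value(old_scores, new_scores) < alpha
    return {"delta": delta, "significant": significant}
```

A sign test only looks at the direction of each change, so it is conservative; a paired t-test or a permutation test would also fit the description.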
Input: prompt diff <name> v0.1.2 v0.1.3
Output: Semantic-aware diff highlighting:
--- customer-support-classifier v0.1.2
+++ customer-support-classifier v0.1.3
@@ system @@
- You are a helpful customer support classifier.
+ You are an expert customer support triage agent with 10 years of experience.
@@ variables @@
+ added: priority_level (enum: low, medium, high, urgent)
Summary: Added urgency classification dimension and elevated persona specificity.
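The textual part of such a diff can be approximated per field with Python's standard `difflib`. The sketch below is only an illustration: it does not produce the LLM-generated summary line, and the `field_diff` name and the dict-based prompt representation are assumptions.

```python
import difflib


def field_diff(name, old, new, old_version, new_version) -> str:
    """Build a per-field diff of two prompt dicts in the style shown above."""
    lines = [f"--- {name} {old_version}", f"+++ {name} {new_version}"]
    for field in sorted(set(old) | set(new)):
        before = str(old.get(field, "")).splitlines()
        after = str(new.get(field, "")).splitlines()
        if before == after:
            continue  # unchanged fields are omitted
        lines.append(f"@@ {field} @@")
        for entry in difflib.ndiff(before, after):
            # Keep removed and added lines; drop context and "? " hint lines.
            if entry.startswith(("- ", "+ ")):
                lines.append(entry)
    return "\n".join(lines)
```

Fields are visited in alphabetical order here, which differs from the file order but keeps the output stable between runs.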
Input: prompt log <name> [--limit N]
Output: Git-log-style history with metrics overlay.
v0.3.0 (2026-06-15) Alice Added priority classification, bumped to gpt-4o
v0.2.1 (2026-06-12) Bob Fixed edge case: empty query → graceful fallback
v0.2.0 (2026-06-10) Alice Added examples for refund flow
v0.1.0 (2026-06-01) Alice Initial prompt
---
Quality trend: ████▌ 7.2 → 7.9 → 8.4 → 9.1
Input: prompt rollback <name> --to v0.2.1
Output: Restores v0.2.1 as current working version, creates a new commit marking the rollback.
Logic: Rollback is itself a versioned action (bumps patch). History is never destroyed. Use prompt rollback <name> --undo to return to the pre-rollback state.
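The idea that a rollback adds history rather than rewriting it can be sketched with an append-only list. The `PromptHistory` class and its fields are hypothetical, and the sketch assumes the patch bump is applied to the latest version.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PromptHistory:
    # Each entry is (version, content, note); entries are only ever appended.
    entries: List[Tuple[str, str, str]] = field(default_factory=list)

    def commit(self, version: str, content: str, note: str) -> None:
        self.entries.append((version, content, note))

    def rollback(self, target: str) -> str:
        """Re-commit the content of `target` as a new patch version."""
        matches = [content for version, content, _ in self.entries if version == target]
        if not matches:
            raise KeyError(f"version {target} not found")
        latest = self.entries[-1][0]
        major, minor, patch = (int(part) for part in latest.split("."))
        new_version = f"{major}.{minor}.{patch + 1}"
        self.commit(new_version, matches[-1], f"Rollback to {target}")
        return new_version
```

Because nothing is deleted, undoing a rollback in this model is just another rollback to the version that was current before it.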
Input: prompt push [--remote origin] / prompt pull
Output: Syncs .prompt/ to/from configured remote (GitHub/GitLab).
Logic: Standard git push/pull under the hood. Merge conflicts are surfaced with structured markers for manual resolution. prompt merge --tool opens an interactive merge UI.
Input: prompt report <name> [--from v0.1.0] [--format markdown|html]
Output: Full version history report with quality/cost trends and top-impact changes.
User: prompt init ./my-prompts && prompt add email-generator --template generator --description "Generate marketing emails"
Expected Output: Repository created, first prompt registered at v0.1.0.
User: prompt test email-generator
Expected Output: Side-by-side comparison of current vs previous version across all test cases, with overall quality delta.
User: prompt rollback email-generator --to v1.2.0
Expected Output: v1.2.0 restored as working version, commit logged. "Rolled back from v1.3.1 to v1.2.0 (quality dropped 12% in v1.3.0)".
User: prompt diff email-generator v1.2.0 v1.3.0
Expected Output: Semantic diff with text changes, structural changes, and LLM-generated summary of what changed.
User: prompt push (after editing prompts locally) then prompt pull (on teammate's machine)
Expected Output: Remote sync with conflict markers if both edited same prompt.
User: prompt report email-generator --from v0.1.0 --format markdown
Expected Output: Complete iteration history with quality/cost trends and top-impact changes.
Scenario: Developer building a customer-facing chatbot, iterating the system prompt daily.
Input: Series of prompt edit sessions over 2 weeks, with periodic prompt test runs.
Steps:
1. prompt init → repo created
2. prompt add chatbot --template chat → v0.1.0
3. prompt test chatbot → discovers v0.2.1 had best quality (9.2)
4. prompt rollback chatbot --to v0.2.1 → restores best version
5. prompt report chatbot → shows quality journey: "V-shaped recovery after rollback"
Output: 14 versions tracked, best version identified, recovery path documented.

Scenario: 3-person AI team managing 50+ prompts for a product.
Input: Multiple team members editing prompts, pushing/pulling.
Steps:
1. prompt init + prompt add pricing-prompt → pushes to GitHub
2. prompt pull → gets pricing-prompt v0.1.0
3. prompt push from both → merge conflict detected
4. prompt merge --tool → interactive resolution showing both versions side-by-side
5. prompt test pricing-prompt → validates merged version
Output: Conflict resolved, merged version tested, team workflow established.

Scenario: Production chatbot quality suddenly drops after latest prompt deploy.
Input: Alert from monitoring: user satisfaction down 15%.
Steps:
1. prompt log chatbot --limit 3 → identifies v2.4.0 as latest deploy
2. prompt diff chatbot v2.3.0 v2.4.0 → shows new "be more concise" instruction caused incomplete answers
3. prompt rollback chatbot --to v2.3.0 → instant revert
4. prompt report chatbot --from v2.3.0 → documents the incident
Output: Rollback completed in <1 minute. Incident documented for team review.

prompt init ./my-prompts && prompt add hello --template chat
prompt log hello to see your first version
prompt diff hello v0.1.0 v0.2.0 to see your changes tracked instantly

| Condition | Behavior |
|---|---|
| Prompt file deleted manually | Detect in next prompt log, offer recovery from .prompt/history/ |
| Concurrent edits (team) | Merge conflict on push; structured markers for resolution |
| Empty test_cases | Warn; A/B test requires ≥1 test case, proceed with manual review mode |
| LLM API key missing | Test commands fail gracefully; editing/log/diff still work |
| Large repository (>500 prompts) | Pagination on prompt log --all; recommend splitting into sub-repos |
| Git remote not configured | prompt push asks the user to set a remote URL |
| Model change (gpt-4 → gpt-4o) | Auto-detected as minor bump; flag in diff as "model change" |
| Binary/incompatible changes | Warn if output schema changes; recommend major version bump |
| Error Code | Scenario | Handling |
|---|---|---|
| E-NOT-INIT | Command run outside a prompt repo | "No prompt repo found. Run prompt init first." |
| E-PROMPT-NOT-FOUND | Referenced prompt name doesn't exist | Show similar prompt names (Levenshtein distance) |
| E-VERSION-NOT-FOUND | Referenced version doesn't exist | Show available versions for that prompt |
| E-MERGE-CONFLICT | Push/pull conflict detected | Show conflicting sections, offer prompt merge --tool |
| E-API-FAIL | LLM API call fails during test | Skip failed test case, report in results, don't block remaining |
| E-TEST-INSUFFICIENT | A/B test with <10 test cases | Show results but flag low confidence |
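
The E-PROMPT-NOT-FOUND handling mentions Levenshtein distance. A small sketch of how near-miss names might be suggested is shown here; the `suggest` helper and its `max_distance` threshold of 3 are assumptions, not documented behaviour.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def suggest(name: str, known_names, max_distance: int = 3) -> list:
    """Return known prompt names within `max_distance` edits, closest first."""
    scored = sorted((levenshtein(name, known), known) for known in known_names)
    return [known for distance, known in scored if distance <= max_distance]
```

For example, a mistyped `email-generater` is one substitution away from `email-generator`, so that name would be offered first.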
.prompt/config.yaml or git history
.gitignore patterns

| File | Purpose |
|---|---|
| SKILL.md | Full design document (this file) |
| skill.json | Skill metadata with script/schema references |
| scripts/prompt-vc.sh | Main CLI script; implements all workflow steps |
| schemas/input.schema.json | JSON Schema for prompt YAML files |
| schemas/output.schema.json | JSON Schema for test results / diff / log output |
| references/config.yaml | Default .prompt/config.yaml template |
# Initialise repository
./scripts/prompt-vc.sh init ./my-prompts
# Add a prompt with a template
./scripts/prompt-vc.sh add email-generator --template generator
# Edit (opens $EDITOR)
./scripts/prompt-vc.sh edit email-generator
# Diff two versions
./scripts/prompt-vc.sh diff email-generator v0.1.0 v0.2.0
# View version history
./scripts/prompt-vc.sh log email-generator
# Run A/B test
./scripts/prompt-vc.sh test email-generator
# Rollback
./scripts/prompt-vc.sh rollback email-generator --to v0.1.0
# Generate report
./scripts/prompt-vc.sh report email-generator
.gitignore integration)
vi; set EDITOR env var to customise)
All test output is simulated offline (no LLM API calls). The A/B test engine generates deterministic metrics based on prompt length to validate the CLI workflow without requiring API keys.
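
To make the idea of deterministic, length-based metrics concrete, here is a purely hypothetical sketch. The formulas and constants are invented for illustration and are not taken from scripts/prompt-vc.sh; only the principle (same prompt text in, same numbers out, no API call) comes from the description above.

```python
def simulated_metrics(prompt_text: str) -> dict:
    """Derive stand-in metrics from prompt length alone (no randomness, no network)."""
    length = len(prompt_text)
    return {
        # Hypothetical formulas: longer prompts score higher, capped at 10.
        "quality_score": round(min(10.0, 5.0 + length / 100), 1),
        "avg_latency_ms": 500 + 2 * length,
        "avg_tokens": max(1, length // 4),
    }
```

Because the output depends only on the input text, two runs over the same pair of versions always yield the same comparison table, which is what makes the CLI workflow testable offline.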