Prompt Version Control


Git-like version control for AI prompts enabling versioning, semantic diffs, A/B tests, metric tracking, rollback, and remote collaboration.

Install

openclaw skills install @harrylabsj/prompt-version-control

Prompt Version Control (Prompt Version Controller)

Git-inspired version control system for AI prompts. Track every edit, run A/B tests, measure quality metrics, and rollback instantly — treat your prompts like code.

Core Capabilities

  • Version tracking: Every prompt change auto-commits with semantic versioning (major.minor.patch)
  • A/B testing: Run v1 vs v2 side-by-side, measure response quality, latency, and token cost
  • Diff engine: Compare prompt versions with semantic-aware diffs (not just text)
  • Rollback: Instantly revert to any previous version with full history preservation
  • Remote sync: Push/pull prompts to GitHub/GitLab for team collaboration
  • Metrics dashboard: Track improvement/degradation trends across versions
  • Conflict resolution: Merge divergent prompt branches with structured conflict markers

Workflow (9 Steps)

Step 1: Initialize Prompt Repository

Input: prompt init [directory], run in a project directory or an empty directory.

Output: Creates the .prompt/ directory structure:

.prompt/
  config.yaml      # repo settings, LLM provider config, test dataset path
  prompts/         # individual prompt files
  history/         # version history (git-compatible)
  metrics/         # A/B test results
  branches/        # branch references

Logic: Auto-detect if already inside a git repo; if so, integrate .prompt/ as a subdirectory. Generate initial config.yaml with sensible defaults. Prompt user for LLM API key if not in environment.
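The initialization logic can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: the function names `init_repo` and `inside_git_repo` are hypothetical, and the config file is written as a placeholder comment because the real defaults are not specified here.

```python
from pathlib import Path

# Subdirectories taken from the .prompt/ layout described in Step 1.
SUBDIRS = ("prompts", "history", "metrics", "branches")


def init_repo(directory: str = ".") -> Path:
    """Create the .prompt/ skeleton; calling it again on an existing repo is harmless."""
    root = Path(directory) / ".prompt"
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    config = root / "config.yaml"
    if not config.exists():
        # Placeholder only: the real default settings are an open detail.
        config.write_text("# repo settings, LLM provider config, test dataset path\n")
    return root


def inside_git_repo(directory: str = ".") -> bool:
    """Assumed detection rule: look for a .git entry here or in any parent directory."""
    start = Path(directory).resolve()
    return any((p / ".git").exists() for p in (start, *start.parents))
```

Creating directories with `exist_ok=True` keeps the command idempotent, which matters when `prompt init` is run inside a project that already has a `.prompt/` folder.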

Step 2: Register a New Prompt

Input: prompt add <name> [--template <type>] [--description "..."]

Output: Creates prompts/<name>.yaml with metadata and initial version v0.1.0.

name: customer-support-classifier
version: 0.1.0
description: Classify customer inquiries into 5 categories
model: gpt-4
temperature: 0.3
system: |
  You are a customer support classifier...
user_template: |
  {{query}}
variables:
  - name: query
    type: string
    required: true
test_cases:
  - input: "Where is my order #12345?"
    expected_category: "order_status"
metrics:
  quality_score: null
  avg_latency_ms: null
  avg_tokens: null

Logic: Templates include chat, classifier, generator, extractor, custom. Auto-extract variables from {{...}} patterns.
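Variable auto-extraction from {{...}} patterns could look roughly like the sketch below. The exact pattern the skill accepts is not documented, so the regular expression (identifier-like names, optional inner whitespace) and the helper names `extract_variables` and `render` are assumptions made for illustration.

```python
import re

# Assumed syntax: {{name}} with optional spaces inside the braces.
VAR_PATTERN = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def extract_variables(template: str) -> list[str]:
    """Return variable names in first-seen order, without duplicates."""
    seen: dict[str, None] = {}
    for name in VAR_PATTERN.findall(template):
        seen.setdefault(name, None)
    return list(seen)


def render(template: str, values: dict[str, str]) -> str:
    """Fill in a template, refusing to run if any variable has no value."""
    missing = [name for name in extract_variables(template) if name not in values]
    if missing:
        raise KeyError(f"missing required variables: {missing}")
    return VAR_PATTERN.sub(lambda match: values[match.group(1)], template)
```

For the user_template shown above, `extract_variables("{{query}}")` yields a single name, `query`, which matches the one entry under variables.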

Step 3: Edit and Auto-Version

Input: User edits prompts/<name>.yaml directly or via prompt edit <name>.

Output: On save, auto-increments version:

  • Patch bump (0.1.0 → 0.1.1): wording changes, examples, minor parameter tweaks
  • Minor bump (0.1.0 → 0.2.0): new variables, restructured prompt, changed model
  • Major bump (0.1.0 → 1.0.0): fundamentally different approach, breaking output format change

Logic: LLM-assisted semantic diff determines bump magnitude. User can override: prompt edit <name> --bump major.
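The version arithmetic itself is simple and can be sketched as follows. `bump_version` reproduces the three bump rules listed above; `suggest_bump` is a hypothetical rule-based fallback (model or variable changes count as minor, everything else as patch) and is not a description of the LLM-assisted classifier.

```python
def bump_version(version: str, level: str) -> str:
    """Apply a semantic-version bump; a leading 'v' is accepted and dropped."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level}")


def suggest_bump(old: dict, new: dict) -> str:
    """Assumed fallback heuristic: structural changes are minor, the rest patch.

    A major bump is never suggested here because judging a "fundamentally
    different approach" needs the semantic diff or an explicit --bump override.
    """
    old_vars = {var["name"] for var in old.get("variables", [])}
    new_vars = {var["name"] for var in new.get("variables", [])}
    if old.get("model") != new.get("model") or old_vars != new_vars:
        return "minor"
    return "patch"
```

Resetting the lower components on a minor or major bump is what makes 0.1.0 become 1.0.0 rather than 1.1.0.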

Step 4: Run A/B Test

Input: prompt test <name> (runs the current version against the previous version).

Action:

  1. Load test cases from test_cases in the prompt YAML
  2. Send each test case to both prompt versions
  3. Collect responses and compute metrics

Output: A/B comparison table.

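| Test Case | Metric | v0.1.2 | v0.1.3 | Δ |
| --- | --- | --- | --- | --- |
| order_status | Quality (1-10) | 8.2 | 9.1 | +0.9 ⬆ |
| order_status | Latency (ms) | 1240 | 1180 | -60 ⬇ |
| order_status | Tokens | 340 | 312 | -28 ⬇ |
| refund_request | Quality | 7.5 | 7.3 | -0.2 ⬇ |
| ... | ... | ... | ... | ... |
| Overall | Quality | 7.9 | 8.0 | +0.1 |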

Logic: Quality scoring uses LLM-as-judge with predefined rubrics. Statistical significance check (p < 0.05) when ≥20 test cases. Flag degradation in red.
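A rough sketch of the comparison logic follows. The document does not say which statistical test is used, so the paired sign-flip permutation test here is a stand-in chosen because it needs only the standard library. The metric keys (`latency_ms`, `tokens`) and all function names are likewise assumptions.

```python
import random
from statistics import mean

# Metrics where a smaller number is the better result (assumed key names).
LOWER_IS_BETTER = {"latency_ms", "tokens"}


def compare(metric: str, old: float, new: float) -> dict:
    """Build one row of the comparison table, marking degradations."""
    delta = round(new - old, 2)
    improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
    status = "unchanged" if delta == 0 else ("improved" if improved else "degraded")
    return {"metric": metric, "old": old, "new": new, "delta": delta, "status": status}


def permutation_p_value(old_scores, new_scores, rounds=10_000, seed=0) -> float:
    """Two-sided paired test: randomly flip the sign of each per-case difference."""
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)  # fixed seed keeps repeated runs comparable
    hits = 0
    for _ in range(rounds):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)


def is_significant(old_scores, new_scores, min_cases=20, alpha=0.05):
    """Return None below the 20-case threshold, otherwise whether p < alpha."""
    if len(old_scores) < min_cases:
        return None
    return permutation_p_value(old_scores, new_scores) < alpha
```

Returning `None` for small test sets keeps "not tested" distinct from "tested and not significant", which is the distinction the low-confidence flag relies on.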

Step 5: Diff Two Versions

Input: prompt diff <name> v0.1.2 v0.1.3

Output: Semantic-aware diff highlighting:

  • Text changes: Standard line diff with context
  • Structural changes: Added/removed variables, parameter changes
  • Intent changes: LLM-summarized description of what changed and why it matters

--- customer-support-classifier v0.1.2
+++ customer-support-classifier v0.1.3
@@ system @@
- You are a helpful customer support classifier.
+ You are an expert customer support triage agent with 10 years of experience.

@@ variables @@
+ added: priority_level (enum: low, medium, high, urgent)

Summary: Added urgency classification dimension and elevated persona specificity.
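The text and structural layers of such a diff can be approximated with Python's standard `difflib`, as in the sketch below. The intent summary is left out because it depends on an LLM call. The function names and the shape of the variable records are assumptions, and the output format of `difflib.unified_diff` differs slightly from the sample above (it has no per-section `@@ system @@` headers).

```python
import difflib


def text_diff(old: str, new: str, name: str, old_ver: str, new_ver: str) -> list[str]:
    """Line-based unified diff of one prompt field between two versions."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=f"{name} {old_ver}", tofile=f"{name} {new_ver}", lineterm="",
    ))


def variable_changes(old_vars: list[dict], new_vars: list[dict]) -> dict:
    """Structural diff: which variables were added, removed, or redefined."""
    old_by_name = {var["name"]: var for var in old_vars}
    new_by_name = {var["name"]: var for var in new_vars}
    shared = old_by_name.keys() & new_by_name.keys()
    return {
        "added": sorted(new_by_name.keys() - old_by_name.keys()),
        "removed": sorted(old_by_name.keys() - new_by_name.keys()),
        "changed": sorted(n for n in shared if old_by_name[n] != new_by_name[n]),
    }
```

Keeping the structural diff separate from the text diff means a renamed or retyped variable is reported explicitly instead of being buried in changed YAML lines.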

Step 6: View Version History

Input: prompt log <name> [--limit N]

Output: Git-log-style history with metrics overlay.

v0.3.0 (2026-06-15)  Alice  Added priority classification, bumped to gpt-4o
v0.2.1 (2026-06-12)  Bob    Fixed edge case: empty query → graceful fallback
v0.2.0 (2026-06-10)  Alice  Added examples for refund flow
v0.1.0 (2026-06-01)  Alice  Initial prompt
---
Quality trend: ████▌ 7.2 → 7.9 → 8.4 → 9.1

Step 7: Rollback

Input: prompt rollback <name> --to v0.2.1

Output: Restores v0.2.1 as the current working version and creates a new commit marking the rollback.

Logic: Rollback is itself a versioned action (bumps patch). History is never destroyed. Run prompt rollback <name> --undo to return to the pre-rollback state.
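The append-only behavior can be illustrated with a small in-memory model. This is a sketch under assumptions: history is represented as a list of dictionaries with `version` and `content` keys, which is a simplification of whatever .prompt/history/ actually stores.

```python
def rollback(history: list[dict], target: str) -> dict:
    """Append a new entry that restores `target`; existing entries are never removed."""
    target = target.lstrip("v")
    by_version = {entry["version"]: entry for entry in history}
    if target not in by_version:
        raise KeyError(f"E-VERSION-NOT-FOUND: {target}; available: {sorted(by_version)}")
    current = history[-1]
    major, minor, patch = (int(part) for part in current["version"].split("."))
    entry = {
        # The rollback is a versioned action of its own, so it bumps the patch number.
        "version": f"{major}.{minor}.{patch + 1}",
        "content": by_version[target]["content"],
        "message": f"Rolled back from v{current['version']} to v{target}",
    }
    history.append(entry)
    return entry
```

Because the rollback is recorded as a new entry, undoing it is just another restore of the entry that preceded it.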

Step 8: Remote Sync

Input: prompt push [--remote origin] / prompt pull

Output: Syncs .prompt/ to/from the configured remote (GitHub/GitLab).

Logic: Standard git push/pull under the hood. Merge conflicts are surfaced with structured markers for manual resolution. prompt merge --tool opens an interactive merge UI.

Step 9: Generate Iteration Report

Input: prompt report <name> [--from v0.1.0] [--format markdown|html]

Output: Full version history report with:

  • Version timeline (Mermaid chart)
  • Quality score trend (sparkline)
  • Token cost trend
  • Top 3 most impactful changes (by quality delta)
  • Regression alerts
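Three of the report elements above (sparkline, top changes by quality delta, regression alerts) reduce to short list computations. The sketch below assumes history is available as (version, quality score) pairs in chronological order; the function names are illustrative only.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values: list[float]) -> str:
    """Map each value onto one of eight block characters, scaled to the min/max."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero for a flat series
    return "".join(BLOCKS[round((v - low) / span * (len(BLOCKS) - 1))] for v in values)


def quality_deltas(history: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Quality change introduced by each version relative to its predecessor."""
    return [(version, round(score - prev_score, 2))
            for (_, prev_score), (version, score) in zip(history, history[1:])]


def top_changes(history: list[tuple[str, float]], n: int = 3) -> list[tuple[str, float]]:
    """The n versions with the largest absolute quality delta."""
    return sorted(quality_deltas(history), key=lambda item: abs(item[1]), reverse=True)[:n]


def regressions(history: list[tuple[str, float]]) -> list[str]:
    """Versions whose quality score dropped compared with the previous version."""
    return [version for version, delta in quality_deltas(history) if delta < 0]
```

Ranking by absolute delta means a large regression ranks alongside large improvements, so the "most impactful" list also shows the changes that hurt.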

Sample Prompts

Prompt 1: Initialize and First Prompt

User: prompt init ./my-prompts && prompt add email-generator --template generator --description "Generate marketing emails"

Expected Output: Repository created, first prompt registered at v0.1.0.

Prompt 2: A/B Test

User: prompt test email-generator

Expected Output: Side-by-side comparison of current vs previous version across all test cases, with overall quality delta.

Prompt 3: Rollback After Degradation

User: prompt rollback email-generator --to v1.2.0

Expected Output: v1.2.0 restored as working version, commit logged. "Rolled back from v1.3.1 to v1.2.0 (quality dropped 12% in v1.3.0)".

Prompt 4: Diff Understanding

User: prompt diff email-generator v1.2.0 v1.3.0

Expected Output: Semantic diff with text changes, structural changes, and LLM-generated summary of what changed.

Prompt 5: Team Collaboration

User: prompt push (after editing prompts locally), then prompt pull (on a teammate's machine)

Expected Output: Remote sync, with conflict markers if both edited the same prompt.

Prompt 6: Full Report

User: prompt report email-generator --from v0.1.0 --format markdown

Expected Output: Complete iteration history with quality/cost trends and top-impact changes.

Real Task Examples

Example 1: Solo Developer Iterating

Scenario: Developer building a customer-facing chatbot, iterating the system prompt daily.

Input: Series of prompt edit sessions over 2 weeks, with periodic prompt test runs.

Steps:

  1. prompt init → repo created
  2. prompt add chatbot --template chat → v0.1.0
  3. Edit 5 times over week 1 → versions 0.1.1 through 0.3.0
  4. prompt test chatbot → discovers v0.2.1 had best quality (9.2)
  5. prompt rollback chatbot --to v0.2.1 → restores best version
  6. Continue iterating from v0.2.1 → v0.4.0 surpasses old best
  7. prompt report chatbot → shows quality journey: "V-shaped recovery after rollback"

Output: 14 versions tracked, best version identified, recovery path documented.

Example 2: Team Prompt Collaboration

Scenario: 3-person AI team managing 50+ prompts for a product.

Input: Multiple team members editing prompts, pushing/pulling.

Steps:

  1. Alice: prompt init + prompt add pricing-prompt → pushes to GitHub
  2. Bob: prompt pull → gets pricing-prompt v0.1.0
  3. Alice edits → v0.2.0, Bob edits → v0.2.0-bob (branch)
  4. prompt push from both → merge conflict detected
  5. prompt merge --tool → interactive resolution showing both versions side-by-side
  6. Resolved → v0.3.0 on main
  7. prompt test pricing-prompt → validates merged version

Output: Conflict resolved, merged version tested, team workflow established.

Example 3: Production Rollback Emergency

Scenario: Production chatbot quality suddenly drops after the latest prompt deploy.

Input: Alert from monitoring: user satisfaction down 15%.

Steps:

  1. prompt log chatbot --limit 3 → identifies v2.4.0 as latest deploy
  2. prompt diff chatbot v2.3.0 v2.4.0 → shows new "be more concise" instruction caused incomplete answers
  3. prompt rollback chatbot --to v2.3.0 → instant revert
  4. Verification: quality metrics return to baseline within minutes
  5. Post-mortem: prompt report chatbot --from v2.3.0 documents the incident

Output: Rollback completed in <1 minute. Incident documented for team review.

🚀 First-Success Path (3 Steps)

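  1. Run prompt init ./my-prompts && prompt add hello --template chat
  2. Edit the prompt, then run prompt log hello to see the new version recorded after v0.1.0
  3. Edit again, then run prompt diff hello v0.1.0 <latest-version>, using the latest version listed by prompt log hello. Wording-only edits bump the patch number, so expect v0.1.x unless the edit was made with --bump minor.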

Boundary Conditions

| Condition | Behavior |
| --- | --- |
| Prompt file deleted manually | Detect in next prompt log, offer recovery from .prompt/history/ |
| Concurrent edits (team) | Merge conflict on push; structured markers for resolution |
| Empty test_cases | Warn; A/B test requires ≥1 test case, proceed with manual review mode |
| LLM API key missing | Test commands fail gracefully; editing/log/diff still work |
| Large repository (>500 prompts) | Pagination on prompt log --all; recommend splitting into sub-repos |
| Git remote not configured | prompt push asks the user to set the remote URL |
| Model change (gpt-4 → gpt-4o) | Auto-detected as minor bump; flag in diff as "model change" |
| Binary/incompatible changes | Warn if output schema changes; recommend major version bump |

Error Handling

| Error Code | Scenario | Handling |
| --- | --- | --- |
| E-NOT-INIT | Command run outside a prompt repo | "No prompt repo found. Run prompt init first." |
| E-PROMPT-NOT-FOUND | Referenced prompt name doesn't exist | Show similar prompt names (Levenshtein distance) |
| E-VERSION-NOT-FOUND | Referenced version doesn't exist | Show available versions for that prompt |
| E-MERGE-CONFLICT | Push/pull conflict detected | Show conflicting sections, offer prompt merge --tool |
| E-API-FAIL | LLM API call fails during test | Skip failed test case, report in results, don't block remaining |
| E-TEST-INSUFFICIENT | A/B test with <10 test cases | Show results but flag low confidence |
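The "similar prompt names" suggestion for E-PROMPT-NOT-FOUND can be sketched with a standard dynamic-programming Levenshtein distance. The cutoff of 3 edits and the name `suggest_names` are assumptions; the document does not state a threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute, free on a match
            ))
        prev = curr
    return prev[-1]


def suggest_names(unknown: str, known: list[str], max_distance: int = 3) -> list[str]:
    """Known prompt names within max_distance edits, closest first."""
    scored = sorted((levenshtein(unknown, name), name) for name in known)
    return [name for distance, name in scored if distance <= max_distance]
```

For example, a mistyped `email-genrator` is one edit away from `email-generator`, so that name would be offered as the suggestion.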

Security Requirements

  • API key storage: Store in environment variables or OS keychain only; never in .prompt/config.yaml or git history
  • Prompt content privacy: Prompt files may contain proprietary business logic; respect .gitignore patterns
  • Team access control: Remote sync via standard git permissions; no additional auth layer
  • Production data safety: Test cases should use synthetic or anonymized data; never real user data in version control
  • Audit trail: All version changes are immutable and attributed; no history rewriting
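The API key rule can be enforced with two small checks, sketched below. The environment variable name `PROMPT_VC_API_KEY` is hypothetical (no name is specified here), and `config_leaks_key` is a deliberately crude guard that only looks for uncommented config lines assigning a value to something with "key" in its name.

```python
import os

# Hypothetical variable name; substitute whatever the LLM provider expects.
API_KEY_ENV = "PROMPT_VC_API_KEY"


def load_api_key(env=os.environ):
    """Read the key from the environment only; None lets test commands fail gracefully."""
    key = env.get(API_KEY_ENV, "").strip()
    return key or None


def config_leaks_key(config_text: str) -> bool:
    """Flag config lines that appear to store a key value in plain text."""
    for line in config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # comments may mention keys without storing one
        name, separator, value = line.partition(":")
        if separator and "key" in name.lower() and value.strip():
            return True
    return False
```

A check like `config_leaks_key` could run before a commit or push so that a key pasted into .prompt/config.yaml never reaches git history.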

Implementation

Project Structure

| File | Purpose |
| --- | --- |
| SKILL.md | Full design document (this file) |
| skill.json | Skill metadata with script/schema references |
| scripts/prompt-vc.sh | Main CLI script, implements all workflow steps |
| schemas/input.schema.json | JSON Schema for prompt YAML files |
| schemas/output.schema.json | JSON Schema for test results / diff / log output |
| references/config.yaml | Default .prompt/config.yaml template |

CLI Usage

# Initialise repository
./scripts/prompt-vc.sh init ./my-prompts

# Add a prompt with a template
./scripts/prompt-vc.sh add email-generator --template generator

# Edit (opens $EDITOR)
./scripts/prompt-vc.sh edit email-generator

# Diff two versions
./scripts/prompt-vc.sh diff email-generator v0.1.0 v0.2.0

# View version history
./scripts/prompt-vc.sh log email-generator

# Run A/B test
./scripts/prompt-vc.sh test email-generator

# Rollback
./scripts/prompt-vc.sh rollback email-generator --to v0.1.0

# Generate report
./scripts/prompt-vc.sh report email-generator

Dependencies

  • bash 4+ recommended (macOS ships bash 3.2 as the system default; install a newer bash via Homebrew, or use the system bash if the script runs under it)
  • diff (standard Unix utility)
  • python3 (optional — used for YAML parsing in test/report)
  • git (optional — auto-detected for .gitignore integration)
  • $EDITOR (defaults to vi; set EDITOR env var to customise)

All test output is simulated offline (no LLM API calls). The A/B test engine generates deterministic metrics based on prompt length to validate the CLI workflow without requiring API keys.
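To make "deterministic metrics based on prompt length" concrete, here is one possible shape for such a simulator. The formulas are invented for illustration and are not taken from scripts/prompt-vc.sh. The only property carried over from the description is that the output is a pure function of the prompt's length.

```python
def simulated_metrics(prompt_text: str) -> dict:
    """Hypothetical offline stand-in for an A/B run: same length, same metrics."""
    length = len(prompt_text)
    return {
        # Made-up formulas; each value depends only on the character count.
        "quality_score": round(5.0 + (length % 50) / 10, 1),
        "avg_latency_ms": 500 + 2 * length,
        "avg_tokens": max(1, length // 4),
    }
```

Because no randomness or network call is involved, two runs over the same prompt files produce identical tables. That makes such a simulator suitable for validating the CLI workflow, but its numbers say nothing about real prompt quality.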