Install
openclaw skills install prompt-evaluationEvaluate and benchmark AI prompts for quality, consistency, and performance. Triggers: prompt evaluation, prompt testing, prompt quality, prompt benchmark, p...
openclaw skills install prompt-evaluationEvaluate and benchmark AI prompts for quality, consistency, and performance. Score, compare, and optimize your prompts systematically.
A prompt evaluation framework that helps agents measure prompt quality across multiple dimensions: clarity, specificity, robustness, cost-efficiency, and output consistency. Compare prompt variants and find the optimal version.
node evaluate.js score --prompt "Summarize the article" --dimensions clarity,specificity,robustness
node evaluate.js score --prompt-file ./prompts/ --output scores.json
Scores prompts on clarity (0-10), specificity (0-10), robustness (0-10), and cost-efficiency (0-10).
node evaluate.js compare --prompt-a "Summarize" --prompt-b "Write a 3-bullet summary" --trials 50
node evaluate.js compare --config ab-test-config.json
Run statistical A/B tests between prompt variants with significance analysis.
node evaluate.js consistency --prompt "Translate to French" --runs 100 --variance-threshold 0.15
node evaluate.js consistency --temperature 0.7 --top-p 0.9
Measures output consistency across multiple runs to find the most stable prompts.
node evaluate.js regression --baseline v1.0 --current v1.1 --test-suite golden-set.jsonl
node evaluate.js regression --fail-on-degradation 5%
Detects quality regressions between prompt versions using golden test sets.
node evaluate.js cost --prompt "Long prompt..." --model gpt-4 --estimate-tokens
node evaluate.js cost --compare-prompts --output cost-report.csv
Estimates token usage and costs for different prompt variants and models.
{
"evaluation": {
"dimensions": ["clarity", "specificity", "robustness", "cost"],
"scoringModel": "gpt-4",
"abTest": {
"trials": 50,
"significanceLevel": 0.05
},
"consistency": {
"runs": 100,
"varianceThreshold": 0.15
},
"regression": {
"degradationThreshold": "5%",
"goldenSet": "./golden-set.jsonl"
}
}
}