Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

prompt-engineer-toolkit

Analyzes and rewrites prompts for better AI output, creates reusable prompt templates for marketing use cases (ad copy, email campaigns, social media), and s...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
0 · 164 · 2 current installs · 2 all-time installs
by Alireza Rezvani (@alirezarezvani)
Security Scan
VirusTotal
Suspicious
OpenClaw
Benign (high confidence)
Purpose & Capability
The name/description (prompt evaluation, templates, versioning) align with the included files and scripts. The two Python scripts implement A/B testing and local prompt versioning as claimed. No unrelated environment variables, binaries, or external services are required by default.
Instruction Scope
SKILL.md instructs the agent/operator to run local scripts and optionally provide a runner command template. The prompt_tester script will execute any user-supplied --runner-cmd via subprocess.run (after shlex.split). That behavior is expected for a toolkit that can invoke an external LLM CLI, but it means the skill will run arbitrary external commands if a runner_cmd is provided. The instructions themselves do not command reading unrelated system files or exfiltrating data, but misuse of runner_cmd could cause that.
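
As an illustration of the pattern described above (a sketch, not the skill's actual code), a runner template of this kind is typically expanded and then executed:

```python
import shlex
import subprocess

def run_external(runner_cmd: str, prompt: str, user_input: str) -> str:
    """Expand a runner template and execute it. A sketch of the
    shlex.split + subprocess.run pattern, not the toolkit's exact code."""
    # Placeholder substitution happens before splitting, so anything in
    # runner_cmd (including extra shell words) becomes part of the command.
    filled = runner_cmd.format(prompt=prompt, input=user_input)
    result = subprocess.run(shlex.split(filled), capture_output=True, text=True)
    return result.stdout

# With a harmless runner, the template just echoes its arguments:
out = run_external("echo {prompt} {input}", "hello", "world")
```

This is why a runner_cmd should be treated as trusted input: whatever command string reaches it will be run.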
Install Mechanism
No install spec — instruction-only with included scripts. Nothing is downloaded or written by an installer; scripts are plain Python and run locally.
Credentials
The skill requires no credentials or environment variables. It stores versions in a local JSONL file by default (.prompt_versions.jsonl). No unexpected secret-access patterns are present in the code or SKILL.md.
Persistence & Privilege
The skill's always flag is false. The only persistent artifact is a local JSONL store for prompt versions (the default path is configurable). The skill does not modify other skills or system-wide agent settings.
Assessment
This skill appears to do what it says, but review before use:

  1. Inspect the scripts yourself — they run locally and are readable.
  2. Be careful when supplying --runner-cmd: the script will execute whatever command you provide, so avoid embedding secrets or using networked commands you don't trust.
  3. The versioner writes a local JSONL file (.prompt_versions.jsonl by default); ensure it is stored in a safe directory and that sensitive prompt content (PII, secrets) is handled appropriately.
  4. If you plan to let an autonomous agent call this skill, restrict or validate any runner_cmd the agent could construct to avoid accidental execution of arbitrary system/network commands.
  5. If you want additional assurance, run the tools in an isolated environment or container, and/or have an administrator review the code before deployment.


Current version: v1.0.0
latest: vk974were8prf5dqa4eb39a15zh82pdwx

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

SKILL.md

Prompt Engineer Toolkit

Overview

Use this skill to move prompts from ad-hoc drafts to production assets with repeatable testing, versioning, and regression safety. It emphasizes measurable quality over intuition. Apply it when:

  • launching a new LLM feature that needs reliable outputs
  • prompt quality degrades after model or instruction changes
  • multiple team members edit prompts and need history/diffs
  • you need evidence-based prompt choice for production rollout
  • you want consistent prompt governance across environments

Core Capabilities

  • A/B prompt evaluation against structured test cases
  • Quantitative scoring for adherence, relevance, and safety checks
  • Prompt version tracking with immutable history and changelog
  • Prompt diffs to review behavior-impacting edits
  • Reusable prompt templates and selection guidance
  • Regression-friendly workflows for model/prompt updates

Key Workflows

1. Run Prompt A/B Test

Prepare JSON test cases and run:

python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \
  --format text

Input can also come from stdin or an --input JSON payload.
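
A minimal testcases.json can be generated like this. The list-of-objects shape is an assumption; the field names come from the Evaluation Design reference below, but check prompt_tester.py --help for the exact format it accepts:

```python
import json

# Hypothetical test suite; the exact file shape accepted by
# scripts/prompt_tester.py may differ -- verify with --help first.
cases = [
    {
        "input": "Customer asks for a refund on a damaged item",
        "expected_contains": ["refund", "apologize"],
        "forbidden_contains": ["guarantee", "legal advice"],
        "expected_regex": r"^(Category|Intent):",
    },
]

# Write the suite to the file name used in the command above.
with open("testcases.json", "w") as f:
    json.dump(cases, f, indent=2)
```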

2. Choose Winner With Evidence

The tester scores outputs per case and aggregates:

  • expected content coverage
  • forbidden content violations
  • regex/format compliance
  • output length sanity

Use the higher-scoring prompt as the candidate baseline, then run the regression suite.

3. Version Prompts

# Add version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/support_v3.txt \
  --author alice

# Diff versions
python3 scripts/prompt_versioner.py diff --name support_classifier --from-version 2 --to-version 3

# Changelog
python3 scripts/prompt_versioner.py changelog --name support_classifier

4. Regression Loop

  1. Store baseline version.
  2. Propose prompt edits.
  3. Re-run A/B test.
  4. Promote only if score and safety constraints improve.

Script Interfaces

  • python3 scripts/prompt_tester.py --help
    • Reads prompts/cases from stdin or --input
    • Optional external runner command
    • Emits text or JSON metrics
  • python3 scripts/prompt_versioner.py --help
    • Manages prompt history (add, list, diff, changelog)
    • Stores metadata and content snapshots locally

Pitfalls, Best Practices & Review Checklist

Avoid these mistakes:

  1. Picking prompts from single-case outputs — use a realistic, edge-case-rich test suite.
  2. Changing prompt and model simultaneously — always isolate variables.
  3. Missing must_not_contain (forbidden-content) checks in evaluation criteria.
  4. Editing prompts without version metadata, author, or change rationale.
  5. Skipping semantic diffs before deploying a new prompt version.
  6. Optimizing one benchmark while harming edge cases — track the full suite.
  7. Model swap without rerunning the baseline A/B suite.

Before promoting any prompt, confirm:

  • Task intent is explicit and unambiguous.
  • Output schema/format is explicit.
  • Safety and exclusion constraints are explicit.
  • No contradictory instructions.
  • No unnecessary verbosity tokens.
  • A/B score improves and violation count stays at zero.

References

Evaluation Design

Each test case should define:

  • input: realistic production-like input
  • expected_contains: required markers/content
  • forbidden_contains: disallowed phrases or unsafe content
  • expected_regex: required structural patterns

This enables deterministic grading across prompt variants.
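
Deterministic grading over these four fields can be sketched as follows. This is a simplified scorer built from the field list above, not the toolkit's actual implementation:

```python
import re

def grade(output: str, case: dict) -> float:
    """Score one output against a test case: the fraction of expected
    markers present, zeroed out by any forbidden phrase or a regex
    mismatch. A simplified sketch, not scripts/prompt_tester.py's scoring."""
    if any(bad in output for bad in case.get("forbidden_contains", [])):
        return 0.0
    pattern = case.get("expected_regex")
    if pattern and not re.search(pattern, output):
        return 0.0
    expected = case.get("expected_contains", [])
    if not expected:
        return 1.0
    return sum(marker in output for marker in expected) / len(expected)

case = {
    "expected_contains": ["refund", "apologize"],
    "forbidden_contains": ["guarantee"],
    "expected_regex": r"^Category:",
}
```

Because every check is a plain substring or regex match, the same output always receives the same score across prompt variants.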

Versioning Policy

  • Use semantic prompt identifiers per feature (support_classifier, ad_copy_shortform).
  • Record author + change note for every revision.
  • Never overwrite historical versions.
  • Diff before promoting a new prompt to production.
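
Appending one JSON object per line keeps history immutable by construction. A record in the JSONL store might look like this (the field names are illustrative assumptions, not the versioner's documented schema):

```python
import json

# Hypothetical record shape for .prompt_versions.jsonl -- one JSON
# object per line, appended rather than rewritten so old versions survive.
record = {
    "name": "support_classifier",
    "version": 3,
    "author": "alice",
    "note": "Tighten refusal wording",
    "prompt": "Classify the support ticket into one of the given categories.",
}
line = json.dumps(record)

with open(".prompt_versions.jsonl", "a") as f:
    f.write(line + "\n")
```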

Rollout Strategy

  1. Create baseline prompt version.
  2. Propose candidate prompt.
  3. Run A/B suite against same cases.
  4. Promote only if winner improves average and keeps violation count at zero.
  5. Track post-release feedback and feed new failure cases back into test suite.

Files

7 total
