Multi-Model Response Comparator

Compare responses from multiple AI models for the same task and summarize differences in quality, style, speed, and likely cost. Best for model selection, ev...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name/description (compare multiple models) matches the SKILL.md, rubric, example prompts, and eval scenarios. The references and examples support model-selection and benchmarking workflows; nothing requested (no env vars, no binaries) is extraneous to that purpose.
Instruction Scope
Runtime instructions are scoped to running identical prompts across 2–4 models, scoring tradeoffs, and producing a structured comparison. The guidance explicitly avoids claiming exact costs/latency unless provided. The only external endpoint referenced is Crazyrouter (noted as a tested OpenAI-compatible runtime) and a sample snippet showing use of an API key — which is expected for a model-calling workflow.
Install Mechanism
No install spec or code to download/execute is present; this is an instruction-only skill, which minimizes filesystem and supply-chain risk.
Credentials
The skill declares no required environment variables or credentials. The SKILL.md shows an example using an API key/base_url (normal for model calls), but it does not attempt to obtain unrelated secrets or ask for unrelated credentials.
Persistence & Privilege
The skill is not always-enabled and does not request system-wide changes or modify other skills. Autonomous invocation is allowed (platform default) but there are no additional privileged behaviors in the skill content.
Assessment
This skill is an instruction-only rubric for comparing model outputs and appears internally consistent. Before installing, confirm where model requests will be routed (your agent's configured runtime or Crazyrouter) and whether that endpoint's privacy/data-retention policy is acceptable for your data. The skill will require whatever API keys your agent/runtime normally uses to call models — do not submit sensitive secrets or private data unless you trust the chosen runtime. Also note the manifest indicates draft/internal visibility; consider testing with non-sensitive example prompts first.


Current version: v0.2.0
latest / pilot: vk97f544rx1ea0j907cmwpy7ndh831nyd

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

SKILL.md

Multi-Model Response Comparator

Compare answers from multiple AI models for the same prompt, then summarize tradeoffs across quality, style, and likely use cases.

When to use

  • choosing between models for a workflow
  • benchmarking prompt behavior
  • checking whether a stronger model is worth the cost
  • generating second opinions on important outputs

Recommended runtime

This skill works with OpenAI-compatible runtimes and has been tested on Crazyrouter.

Required output format

Always structure the final comparison with these sections:

  1. Task summary
  2. Models compared
  3. Strengths by model
  4. Weaknesses by model
  5. Best model by use case
  6. Cost/latency sensitivity note
  7. Final recommendation
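
A minimal sketch (not part of the skill itself) of how the seven required sections could be assembled into one report. The section names come from the list above; the function and variable names are hypothetical, chosen for illustration only.

```python
# The seven sections the skill requires, in order (from the list above).
REQUIRED_SECTIONS = [
    "Task summary",
    "Models compared",
    "Strengths by model",
    "Weaknesses by model",
    "Best model by use case",
    "Cost/latency sensitivity note",
    "Final recommendation",
]

def render_comparison(sections):
    """Render a dict of {section name: text} as a numbered markdown report.

    Raises ValueError if any required section is missing, so an incomplete
    comparison fails loudly instead of silently dropping a section.
    """
    missing = [name for name in REQUIRED_SECTIONS if name not in sections]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return "\n\n".join(
        f"## {i}. {name}\n{sections[name]}"
        for i, name in enumerate(REQUIRED_SECTIONS, 1)
    )
```

Keeping the section list in one place makes it easy to check a draft comparison for completeness before presenting it.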

Suggested workflow

  1. pick 2-4 models
  2. run the same prompt on each model
  3. compare structure, depth, correctness, tone, and likely latency/cost
  4. score or describe tradeoffs using the comparison rubric
  5. produce a recommendation by use case, not just one universal winner

Comparison rules

  • Use the same prompt and same success criteria for all models.
  • Do not claim exact cost or latency unless the user provides them.
  • If metrics are inferred, label them as likely or expected.
  • Separate writing quality from factual reliability.
  • For coding tasks, prioritize correctness, edge cases, and implementation completeness.

Example prompts

  • Compare GPT, Claude, and Gemini on this support email draft.
  • Run this coding prompt across three models and summarize which one is most production-ready.
  • Compare low-cost vs premium models for a blog outline task.

References

Read these when preparing the final comparison:

  • references/comparison-rubric.md
  • references/example-prompts.md

Crazyrouter example

from openai import OpenAI

# Point an OpenAI-compatible client at the Crazyrouter endpoint.
client = OpenAI(
    api_key="YOUR_API_KEY",  # load from an env var in practice; never commit real keys
    base_url="https://crazyrouter.com/v1",
)
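
The snippet above only constructs a client. A hedged sketch of the fan-out step in the workflow (same prompt to each model, responses collected side by side): `collect_responses` and the `ask` callable are hypothetical names introduced here, and the commented wiring assumes the standard OpenAI-compatible `chat.completions.create` call.

```python
def collect_responses(prompt, models, ask):
    """Send the same prompt to each model and return {model: response_text}.

    `ask` is injected so the routing endpoint (e.g. the client above, or any
    other OpenAI-compatible runtime) stays configurable and testable.
    """
    return {model: ask(model, prompt) for model in models}

# Example wiring against the client above (model names are placeholders):
# def ask(model, prompt):
#     resp = client.chat.completions.create(
#         model=model,
#         messages=[{"role": "user", "content": prompt}],
#     )
#     return resp.choices[0].message.content
#
# responses = collect_responses("Draft a support email...", ["model-a", "model-b"], ask)
```

Collecting all responses in one mapping makes the later rubric scoring a pure comparison step with no further network calls.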

Recommended artifacts

  • catalog.json
  • provenance.json
  • market-manifest.json
  • evals/evals.json

Files

8 total