ModelShow

Suspicious

Audited by ClawScan on May 10, 2026.

Overview

ModelShow is mostly aligned with its stated multi-model comparison purpose, but its workflow exposes the judge to the de-anonymization map and uses unsafe shell-style JSON piping with model-generated text.

Review this skill before installing if you need strict blind evaluation or handle sensitive prompts. If you use it, limit the configured models, avoid secrets or private files, inspect where results are saved, and prefer a version that separates judging from de-anonymization and avoids shell interpolation of generated text.

Findings (5)

Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

What this means

Users may trust the rankings as unbiased double-blind results even though the judge can see model identities during the judging task.

Why it was flagged

The skill claims the judge evaluates blindly, but the judge sub-agent is instructed in the same task to use the anonymization map, which reveals the model behind each placeholder.

Skill content
ModelShow provides ... double-blind evaluation ... uses an independent judge model to rank responses purely on merit. ... After writing your judgment, run this command: ... "anonymization_map": {anonymization_map}
Recommendation

Separate judging from de-anonymization so the judge model never receives the anonymization map; perform de-anonymization only after the judge output is returned.
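The recommended separation can be sketched as follows; the function and placeholder names here are illustrative, not taken from the skill:

```python
def judge_then_deanonymize(judge_fn, anonymized_responses, anonymization_map):
    """Run the judge on placeholder-labeled responses only; apply the
    anonymization map strictly after the judge output is returned."""
    # The judge callable never receives the map, so it cannot see
    # which model produced which response.
    judgment = judge_fn(anonymized_responses)
    # De-anonymize the finished judgment for the final report.
    for placeholder, model_name in anonymization_map.items():
        judgment = judgment.replace(placeholder, model_name)
    return judgment
```

With this shape, the blind-evaluation claim holds: only the post-judgment report contains real model names.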

Concern (Medium Confidence)
ASI05: Unexpected Code Execution
What this means

A crafted prompt or model response could cause the agent to run malformed or unintended shell commands if it follows this pattern literally.

Why it was flagged

The workflow shows shell commands that inline model responses and judge text into single-quoted JSON piped to Python; model or judge text containing quotes or shell metacharacters could break the command or be interpreted unexpectedly.

Skill content
echo '{ ... "responses": {model: response_dict}, ... }' | python3 {baseDir}/judge_pipeline.py ... "judge_output": "[YOUR JUDGMENT TEXT HERE]"
Recommendation

Avoid shell interpolation for generated text. Pass JSON via a safely written temporary file, structured tool input, or Python subprocess stdin with proper serialization.
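A minimal sketch of the stdin approach: `json.dumps` serializes the payload and `subprocess.run` passes it directly to the script, so the shell never parses the generated text. The inline reader stands in for `judge_pipeline.py`, whose contents are not shown in the skill:

```python
import json
import subprocess
import sys

payload = {
    "responses": {"Model A": 'text with "quotes", $VARS, and `backticks`'},
    "judge_output": "[YOUR JUDGMENT TEXT HERE]",
}

# Stand-in for judge_pipeline.py: read JSON from stdin, echo one field.
reader = "import json, sys; print(json.load(sys.stdin)['judge_output'])"

# json.dumps handles all quoting; stdin bypasses shell interpretation.
proc = subprocess.run(
    [sys.executable, "-c", reader],
    input=json.dumps(payload),
    capture_output=True, text=True, check=True,
)
print(proc.stdout.strip())
```

The same payload would break the skill's `echo '{...}'` pattern as soon as a response contained a single quote.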

What this means

Any sensitive prompt content or fetched context may be shared with several configured models and the judge model.

Why it was flagged

The skill intentionally sends the user prompt to multiple model agents and then sends collected responses to a judge agent.

Skill content
Spawn Parallel Model Agents ... For each model in config.models ... Task: {config.systemPrompt} {extracted user prompt} ... Spawn Judge+Deanon Sub-Agent
Recommendation

Use this skill only with prompts and referenced files you are comfortable sending to all configured models; reduce the model list for sensitive work.

What this means

Prompts, responses, judge commentary, and metadata may remain on disk after the comparison, including sensitive content if the prompt contained it.

Why it was flagged

The configuration persists full model responses and metadata to a local output directory.

Skill content
"outputDir": "~/.openclaw/workspace/modelshow-private", "includeResponseText": true, "includeMetadata": true
Recommendation

Review the output directory, avoid submitting secrets, and delete saved results when they are no longer needed.
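A small helper along these lines can support that cleanup; the function is a sketch, not part of the skill, and the configured path is the one shown in the finding:

```python
import shutil
from pathlib import Path

def purge_results(out_dir: Path) -> list[str]:
    """List the persisted result files for review, then delete the
    output directory once its contents are no longer needed."""
    out_dir = out_dir.expanduser()
    if not out_dir.exists():
        return []
    saved = sorted(p.name for p in out_dir.iterdir())
    shutil.rmtree(out_dir)
    return saved
```

For this skill the call would be `purge_results(Path("~/.openclaw/workspace/modelshow-private"))`.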

What this means

The skill reads local agent configuration, though the code shown uses only model alias information and does not extract credentials.

Why it was flagged

The result saver reads local OpenClaw configuration to resolve model aliases to full model names.

Skill content
config_path = Path.home() / '.openclaw' / 'openclaw.json' ... models = data.get('agents', {}).get('defaults', {}).get('models', {})
Recommendation

Confirm that saved reports only include model names you are comfortable recording; avoid adding credential material to model alias fields.
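One way to keep reports limited to alias-to-name strings is to filter the config read; this is a defensive sketch around the snippet shown above, not the skill's own code:

```python
import json
from pathlib import Path

def resolve_model_aliases(config_path: Path) -> dict[str, str]:
    """Read only the model alias map from the agent config and keep
    plain string alias -> name pairs, so no other config fields
    (including anything credential-like) reach the saved report."""
    data = json.loads(config_path.read_text())
    models = data.get('agents', {}).get('defaults', {}).get('models', {})
    return {alias: name for alias, name in models.items()
            if isinstance(name, str)}
```

Anything structured or non-string in the alias map is silently dropped rather than copied into the report.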