ModelShow
Verdict: Suspicious. Audited by ClawScan on May 10, 2026.
Overview
ModelShow is mostly aligned with its stated multi-model comparison purpose, but its workflow exposes the judge to the de-anonymization map and uses unsafe shell-style JSON piping with model-generated text.
Review this skill before installing if you need strict blind evaluation or handle sensitive prompts. If you use it, limit the configured models, avoid secrets or private files, inspect where results are saved, and prefer a version that separates judging from de-anonymization and avoids shell interpolation of generated text.
Findings (5)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
Users may trust the rankings as unbiased double-blind results even though the judge can see model identities during the judging task.
The skill claims the judge evaluates blindly, but the judge sub-agent is instructed in the same task to use the anonymization map, which reveals the model behind each placeholder.
ModelShow provides ... double-blind evaluation ... uses an independent judge model to rank responses purely on merit. ... After writing your judgment, run this command: ... "anonymization_map": {anonymization_map}
Separate judging from de-anonymization so the judge model never receives the anonymization map; perform de-anonymization only after the judge output is returned.
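The recommended separation can be sketched as follows. This is a minimal illustration, not the skill's code: `judge_fn` and the placeholder names are hypothetical stand-ins for the judge sub-agent and the anonymized response labels.

```python
def judge_then_deanonymize(judge_fn, anonymized_responses, anonymization_map):
    """Run the judge on placeholder-labeled responses only, then map
    placeholders back to real model names after judging completes.

    judge_fn: callable that receives ONLY placeholder-keyed responses
    anonymized_responses: e.g. {"Model A": "...", "Model B": "..."}
    anonymization_map: e.g. {"Model A": "alpha-1", "Model B": "beta-2"}
    """
    # The judge never sees the map: only placeholder keys are passed in.
    ranking = judge_fn(anonymized_responses)  # e.g. ["Model B", "Model A"]
    # De-anonymize strictly after the judge output is returned.
    return [anonymization_map[p] for p in ranking]

# Toy demonstration with a stand-in judge that ranks by response length:
responses = {"Model A": "short", "Model B": "a much longer answer"}
mapping = {"Model A": "alpha-1", "Model B": "beta-2"}
result = judge_then_deanonymize(
    lambda r: sorted(r, key=lambda k: -len(r[k])), responses, mapping
)
```

The key property is structural: the map is simply never in scope for the judge call, so no instruction in the judging task can leak it.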
A crafted prompt or model response could cause the agent to run malformed or unintended shell commands if it follows this pattern literally.
The workflow shows shell commands that inline model responses and judge text into single-quoted JSON piped to Python; model or judge text containing quotes or shell metacharacters could break the command or be interpreted unexpectedly.
echo '{ ... "responses": {model: response_dict}, ... }' | python3 {baseDir}/judge_pipeline.py ... "judge_output": "[YOUR JUDGMENT TEXT HERE]"
Avoid shell interpolation for generated text. Pass JSON via a safely written temporary file, structured tool input, or Python subprocess stdin with proper serialization.
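A sketch of the stdin-based alternative: the payload is serialized with `json.dumps` and written to the child process's stdin, so no shell ever parses the generated text. The inline `-c` script here is a stand-in for the skill's `judge_pipeline.py`, which we do not reproduce.

```python
import json
import subprocess
import sys

def run_pipeline(payload: dict) -> str:
    """Send model-generated text to a pipeline script via stdin,
    serialized with json.dumps, so quotes and shell metacharacters
    in the text can never be interpreted by a shell."""
    # No shell=True and no string interpolation: the payload travels
    # as bytes on stdin, not as part of a command line.
    proc = subprocess.run(
        [sys.executable, "-c",
         "import json,sys; print(json.load(sys.stdin)['judge_output'])"],
        input=json.dumps(payload),
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.strip()

# Text full of quotes and metacharacters round-trips intact:
hostile = "it's \"fine\"; $(rm -rf /) `echo hi` && exit"
out = run_pipeline({"judge_output": hostile})
```

Because the argument list is passed as a Python list and the data rides on stdin, `$(...)`, backticks, and quote characters are inert bytes rather than shell syntax.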
Any sensitive prompt content or fetched context may be shared with several configured models and the judge model.
The skill intentionally sends the user prompt to multiple model agents and then sends collected responses to a judge agent.
Spawn Parallel Model Agents ... For each model in config.models ... Task: {config.systemPrompt} {extracted user prompt} ... Spawn Judge+Deanon Sub-Agent
Use this skill only with prompts and referenced files you are comfortable sending to all configured models; reduce the model list for sensitive work.
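One way to reduce exposure is to trim `config.models` before a sensitive run. This is a sketch under assumptions: the flat `{"models": [...]}` shape and the `local-small` alias are hypothetical, not taken from the skill's actual config schema.

```python
# Hypothetical allowlist of model aliases acceptable for sensitive prompts.
SENSITIVE_ALLOWLIST = {"local-small"}

def restrict_models(config: dict) -> dict:
    """Return a copy of the config whose model list keeps only
    allowlisted aliases, so the prompt fans out to fewer models."""
    trimmed = dict(config)
    trimmed["models"] = [m for m in config.get("models", [])
                         if m in SENSITIVE_ALLOWLIST]
    return trimmed

cfg = {"models": ["local-small", "remote-a", "remote-b"]}
restricted = restrict_models(cfg)
```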
Prompts, responses, judge commentary, and metadata may remain on disk after the comparison, including sensitive content if the prompt contained it.
The configuration persists full model responses and metadata to a local output directory.
"outputDir": "~/.openclaw/workspace/modelshow-private", "includeResponseText": true, "includeMetadata": true
Review the output directory, avoid submitting secrets, and delete saved results when they are no longer needed.
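Cleanup can be scripted once results are no longer needed. A minimal sketch, assuming the `outputDir` shown in the config above; the function name is ours, not the skill's.

```python
import shutil
from pathlib import Path

def purge_results(output_dir: str = "~/.openclaw/workspace/modelshow-private") -> bool:
    """Delete the saved comparison artifacts (prompts, responses, judge
    commentary, metadata). Default path is the configured outputDir.
    Returns True when the directory no longer exists."""
    target = Path(output_dir).expanduser()
    if target.is_dir():
        shutil.rmtree(target)
    return not target.exists()
```

`shutil.rmtree` removes the whole tree; review the directory's contents first if anything there might still be wanted.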
The skill accesses local agent configuration, though the shown code only uses model alias information and does not show credential extraction.
The result saver reads local OpenClaw configuration to resolve model aliases to full model names.
config_path = Path.home() / '.openclaw' / 'openclaw.json' ... models = data.get('agents', {}).get('defaults', {}).get('models', {})
Confirm that saved reports only include model names you are comfortable recording; avoid adding credential material to model alias fields.
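The check can be partly automated. A heuristic sketch, not the skill's code: it reads the same `agents.defaults.models` path the result saver uses and flags alias values that look like credential material. The `SUSPECT_MARKERS` list is an assumption; tune it to your own naming.

```python
import json

# Hypothetical substrings that suggest a credential was stored as an alias value.
SUSPECT_MARKERS = ("key", "token", "secret", "sk-")

def suspicious_aliases(config_text: str) -> list:
    """Return model aliases whose resolved value looks like a credential
    rather than a plain model name."""
    data = json.loads(config_text)
    models = data.get("agents", {}).get("defaults", {}).get("models", {})
    return [alias for alias, name in models.items()
            if any(m in str(name).lower() for m in SUSPECT_MARKERS)]

# Toy config: one normal alias, one alias value that is really a secret.
sample = json.dumps({"agents": {"defaults": {"models": {
    "fast": "vendor/model-small",
    "oops": "sk-live-abc123",  # credential mistakenly stored as an alias value
}}}})
flagged = suspicious_aliases(sample)
```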
