Auto Arena

Advisory. Audited by static analysis on Apr 30, 2026.

Overview

No suspicious patterns detected.

Informational findings (4)

Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

Finding 1: Unpinned external dependencies

What this means

Installing unverified or incorrect packages could execute code outside the reviewed artifact set.

Why it was flagged

The skill relies on external, unpinned Python packages installed manually by the user. This is expected for the skill's purpose, but users should verify package provenance before installing.

Skill content
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib
Recommendation

Install only from trusted package sources, review the package/project, and consider pinning versions in a controlled environment.
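One way to follow this in practice is to record the versions you have vetted and check them before each run. Below is a minimal Python sketch using importlib.metadata (Python 3.8+); the pinned versions are placeholders, not versions verified for py-openjudge or matplotlib:

# check_pins.py: confirm installed packages match the versions you vetted
from importlib.metadata import version, PackageNotFoundError

# placeholder pins; substitute the versions you have actually reviewed
EXPECTED = {
    "py-openjudge": "0.1.0",
    "matplotlib": "3.8.4",
}

for name, pinned in EXPECTED.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed")
        continue
    if installed == pinned:
        print(f"{name} {installed}: OK")
    else:
        print(f"{name} {installed}: MISMATCH (expected {pinned})")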

Finding 2: Provider API credentials

What this means

Provider keys may grant account access and incur usage charges when the benchmark runs.

Why it was flagged

The skill requires provider API credentials to call target and judge endpoints. This is purpose-aligned, but the registry metadata does not declare any required credentials or environment variables.

Skill content
API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc.
Recommendation

Use least-privilege or dedicated evaluation keys where possible, set provider spending limits, and avoid pasting real keys directly into shared config files.
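As an illustration of keeping keys out of config files, the short Python sketch below reads the required variables from the environment and aborts before any provider call if one is missing, reporting variable names only and never key values. The names come from the skill content above; adjust the list to the providers you actually configure:

# fail fast if credentials are missing; report names only, never values
import os
import sys

REQUIRED_VARS = ["OPENAI_API_KEY", "DASHSCOPE_API_KEY"]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing credentials: {', '.join(missing)}")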

Finding 3: Benchmark data sent to external providers

What this means

Task descriptions, prompts, generated queries, system prompts, and model responses may be exposed to the configured model providers or agent endpoints.

Why it was flagged

The core workflow sends generated test queries to multiple user-configured endpoints and uses a judge endpoint to compare responses. This external data flow is disclosed and central to the purpose.

Skill content
Collect responses — query all target endpoints concurrently
Recommendation

Use test endpoints, avoid sensitive or confidential data in benchmark tasks, and confirm each provider's data handling policy.
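To make the data flow concrete, here is a minimal, self-contained Python sketch of the concurrent fan-out step. The endpoint names are hypothetical and query_endpoint is a stub standing in for a real provider call; in a real run, the prompt leaves your machine at that point:

import asyncio

ENDPOINTS = ["model-a", "model-b"]  # hypothetical target endpoints

async def query_endpoint(endpoint: str, prompt: str) -> str:
    # stub for a real provider call; this is where benchmark content is sent out
    await asyncio.sleep(0.1)
    return f"{endpoint}: response to {prompt!r}"

async def collect_responses(prompt: str) -> list[str]:
    # gather() issues all requests concurrently and returns results in order
    return await asyncio.gather(*(query_endpoint(e, prompt) for e in ENDPOINTS))

print(asyncio.run(collect_responses("sample benchmark task")))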

Finding 4: Persisted evaluation artifacts

What this means

Saved evaluation data may remain on disk and be reused in future runs, affecting rankings or exposing benchmark content to anyone with access to the output directory.

Why it was flagged

The skill persists and reuses evaluation artifacts such as queries, responses, and rubrics for checkpoint/resume and judge reruns. This is expected, but stored artifacts can contain sensitive content and influence later results.

Skill content
Re-run only pairwise evaluation with new judge model
# (keeps queries, responses, and rubrics)
Recommendation

Store outputs in a private directory, delete old checkpoints when no longer needed, and use `--fresh` when prior generated data should not influence a new evaluation.
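One way to act on this recommendation is a small cleanup step once results have been exported. The Python sketch below removes persisted run artifacts; the outputs/ location and run_* naming are assumptions for illustration, not the skill's documented layout:

# remove old run checkpoints so stale queries/responses cannot leak or
# influence a later evaluation
import shutil
from pathlib import Path

OUTPUT_DIR = Path("outputs")  # hypothetical output directory

if OUTPUT_DIR.exists():
    for checkpoint in sorted(OUTPUT_DIR.glob("run_*")):
        if checkpoint.is_dir():
            print(f"removing {checkpoint}")
            shutil.rmtree(checkpoint)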