Auto Arena
Advisory · Audited by static analysis on Apr 30, 2026.
Overview
No suspicious patterns detected.
Findings (0)
This is an artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
Installing unverified or wrong packages could run code outside the reviewed artifact set.
The skill relies on external, unpinned Python packages installed manually by the user. This is expected for the skill's purpose, but users should verify package provenance before installing.
```bash
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib
```
Install only from trusted package sources, review the package/project, and consider pinning versions in a controlled environment.
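As a minimal sketch of that recommendation, assuming pip is the installer and treating the version numbers below as placeholders for releases you have actually reviewed:

```bash
# Pin both packages to reviewed versions (the version numbers here are placeholders)
pip install "py-openjudge==1.2.3" "matplotlib==3.9.0"

# Or record the pins (with hashes) in a requirements file and enforce them at install time
pip install --require-hashes -r requirements.txt
```

Pinning also makes later reruns reproducible and makes any dependency change visible before it reaches the evaluation environment.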
Provider keys may grant account access and incur usage charges when the benchmark runs.
The skill requires provider API credentials to call target and judge endpoints. This is purpose-aligned, but the registry metadata lists no required credentials or environment variables.
API keys: required. Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc.
Use least-privilege or dedicated evaluation keys where possible, set provider spending limits, and avoid pasting real keys directly into shared config files.
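One way to follow that advice, sketched below, is to keep dedicated evaluation keys in a local file that is never committed or shared; the file name is illustrative.

```bash
# Keep evaluation-only provider keys in a local, git-ignored env file (name is illustrative)
chmod 600 ./eval.env                  # readable only by the current user
set -a; source ./eval.env; set +a     # exports OPENAI_API_KEY, DASHSCOPE_API_KEY, ...
```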
Task descriptions, prompts, generated queries, system prompts, and model responses may be exposed to the configured model providers or agent endpoints.
The core workflow sends generated test queries to multiple user-configured endpoints and uses a judge endpoint to compare responses. This external data flow is disclosed and central to the purpose.
Collect responses — query all target endpoints concurrently
Use test endpoints, avoid sensitive or confidential data in benchmark tasks, and confirm each provider's data handling policy.
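A quick pre-run check, sketched below, can catch obviously sensitive material before task files are sent to any provider; the `tasks/` directory and the patterns are illustrative, not part of the skill.

```bash
# Flag likely secrets or credentials in benchmark task files before running the evaluation
grep -rniE "api[_-]?key|secret|password|BEGIN [A-Z ]*PRIVATE KEY" tasks/ \
  && echo "Review the matches above before starting the benchmark"
```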
Saved evaluation data may remain on disk and be reused in future runs, affecting rankings or exposing benchmark content to anyone with access to the output directory.
The skill persists and reuses evaluation artifacts such as queries, responses, and rubrics for checkpoint/resume and judge reruns. This is expected, but stored artifacts can contain sensitive content and influence later results.
```bash
# Re-run only pairwise evaluation with new judge model
# (keeps queries, responses, and rubrics)
```
Store outputs in a private directory, delete old checkpoints when no longer needed, and use `--fresh` when prior generated data should not influence a new evaluation.
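A sketch of that cleanup, assuming the skill writes checkpoints under an output directory; the directory name and file pattern below are illustrative.

```bash
# Restrict the output directory to the current user and prune stale checkpoints
chmod 700 ./arena_output                                    # directory name is illustrative
find ./arena_output -name "checkpoint*" -mtime +30 -delete  # adjust the pattern and age to your runs
```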
