Auto Arena

Advisory. Audited by static analysis on Apr 30, 2026.

Overview

No suspicious patterns detected.

Informational findings (4)

Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

Finding 1: Unpinned external dependencies

What this means

Installing unverified or incorrect packages could execute code outside the reviewed artifact set.

Why it was flagged

The skill relies on external, unpinned Python packages installed manually by the user. This is expected for the skill's purpose, but users should verify package provenance before installing.

Skill content
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib
Recommendation

Install only from trusted package sources, review the package/project, and consider pinning versions in a controlled environment.
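One way to follow this in practice is to record the versions you have vetted and check them before each run. Below is a minimal Python sketch using importlib.metadata (Python 3.8+); the pinned versions are placeholders, not versions verified for py-openjudge or matplotlib:

# check_pins.py: confirm installed packages match the versions you vetted
from importlib.metadata import version, PackageNotFoundError

# placeholder pins; substitute the versions you have actually reviewed
EXPECTED = {
    "py-openjudge": "0.1.0",
    "matplotlib": "3.8.4",
}

for name, pinned in EXPECTED.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed")
        continue
    if installed == pinned:
        print(f"{name} {installed}: OK")
    else:
        print(f"{name} {installed}: MISMATCH (expected {pinned})")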

Finding 2: Provider API credentials

What this means

Provider keys may grant account access and incur usage charges when the benchmark runs.

Why it was flagged

The skill requires provider API credentials to call target and judge endpoints. This is purpose-aligned, but the registry metadata does not declare any required credentials or environment variables.

Skill content
API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc.
Recommendation

Use least-privilege or dedicated evaluation keys where possible, set provider spending limits, and avoid pasting real keys directly into shared config files.
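As an illustration of keeping keys out of config files, the short Python sketch below reads the required variables from the environment and aborts before any provider call if one is missing, reporting variable names only and never key values. The names come from the skill content above; adjust the list to the providers you actually configure:

# fail fast if credentials are missing; report names only, never values
import os
import sys

REQUIRED_VARS = ["OPENAI_API_KEY", "DASHSCOPE_API_KEY"]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing credentials: {', '.join(missing)}")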

Finding 3: Benchmark data sent to external providers

What this means

Task descriptions, prompts, generated queries, system prompts, and model responses may be exposed to the configured model providers or agent endpoints.

Why it was flagged

The core workflow sends generated test queries to multiple user-configured endpoints and uses a judge endpoint to compare responses. This external data flow is disclosed and central to the purpose.

Skill content
Collect responses — query all target endpoints concurrently
Recommendation

Use test endpoints, avoid sensitive or confidential data in benchmark tasks, and confirm each provider's data handling policy.
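To make the data flow concrete, here is a minimal, self-contained Python sketch of the concurrent fan-out step. The endpoint names are hypothetical and query_endpoint is a stub standing in for a real provider call; in a real run, the prompt leaves your machine at that point:

import asyncio

ENDPOINTS = ["model-a", "model-b"]  # hypothetical target endpoints

async def query_endpoint(endpoint: str, prompt: str) -> str:
    # stub for a real provider call; this is where benchmark content is sent out
    await asyncio.sleep(0.1)
    return f"{endpoint}: response to {prompt!r}"

async def collect_responses(prompt: str) -> list[str]:
    # gather() issues all requests concurrently and returns results in order
    return await asyncio.gather(*(query_endpoint(e, prompt) for e in ENDPOINTS))

print(asyncio.run(collect_responses("sample benchmark task")))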

Finding 4: Persisted evaluation artifacts

What this means

Saved evaluation data may remain on disk and be reused in future runs, affecting rankings or exposing benchmark content to anyone with access to the output directory.

Why it was flagged

The skill persists and reuses evaluation artifacts such as queries, responses, and rubrics for checkpoint/resume and judge reruns. This is expected, but stored artifacts can contain sensitive content and influence later results.

Skill content
Re-run only pairwise evaluation with new judge model
# (keeps queries, responses, and rubrics)
Recommendation

Store outputs in a private directory, delete old checkpoints when no longer needed, and use `--fresh` when prior generated data should not influence a new evaluation.
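One way to act on this recommendation is a small cleanup step once results have been exported. The Python sketch below removes persisted run artifacts; the outputs/ location and run_* naming are assumptions for illustration, not the skill's documented layout:

# remove old run checkpoints so stale queries/responses cannot leak or
# influence a later evaluation
import shutil
from pathlib import Path

OUTPUT_DIR = Path("outputs")  # hypothetical output directory

if OUTPUT_DIR.exists():
    for checkpoint in sorted(OUTPUT_DIR.glob("run_*")):
        if checkpoint.is_dir():
            print(f"removing {checkpoint}")
            shutil.rmtree(checkpoint)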