Auto Arena
Pass. Audited by VirusTotal on May 11, 2026.
Overview
Type: OpenClaw Skill
Name: auto-arena
Version: 1.0.0

The `auto-arena` skill is a legitimate tool designed for benchmarking and comparing AI models using the OpenJudge framework. The SKILL.md file provides clear instructions for query generation, model evaluation, and report generation, aligning with its stated purpose. There are no signs of data exfiltration, malicious code execution, or prompt injection; the use of API keys and environment variables is standard for interacting with LLM providers such as OpenAI and DashScope.
Findings (0)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
Installing unverified or wrong packages could run code outside the reviewed artifact set.
The skill relies on external, unpinned Python packages installed manually by the user. This is expected for the skill's purpose, but users should verify package provenance before installing.
```shell
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib
```
Install only from trusted package sources, review the package/project, and consider pinning versions in a controlled environment.
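One way to act on this advice is to pin the reviewed versions in a requirements file rather than installing ad hoc. A minimal sketch follows; the version numbers are placeholders, not verified releases, and should be replaced with the versions you actually vetted.

```shell
# Record reviewed, pinned versions in a requirements file.
# Version numbers below are placeholders; substitute the releases you vetted.
cat > requirements.txt <<'EOF'
py-openjudge==1.0.0
matplotlib==3.9.0
EOF

# Then install only from the pinned file, in a controlled environment:
# pip install -r requirements.txt
```

Pinning keeps the installed code identical across machines and makes any later version bump an explicit, reviewable change.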
Provider keys may grant account access and incur usage charges when the benchmark runs.
The skill requires provider API credentials to call target and judge endpoints. This is purpose-aligned, but the registry metadata lists no required credentials or environment variables.
| Credential | Required | Details |
| --- | --- | --- |
| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |
Use least-privilege or dedicated evaluation keys where possible, set provider spending limits, and avoid pasting real keys directly into shared config files.
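A small pre-flight check can fail fast when evaluation keys are missing, without ever printing key values. This is a sketch in bash; the function name is ours, and the variable names come from the report.

```shell
# Sketch (bash): verify each named credential is set and non-empty.
# Reports only the variable name, never the key value.
check_keys() {
  local k missing=0
  for k in "$@"; do
    if [ -z "${!k:-}" ]; then
      echo "error: $k is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: abort a benchmark run early if credentials are absent.
# check_keys OPENAI_API_KEY DASHSCOPE_API_KEY || exit 1
```

Failing before any endpoint is contacted avoids half-completed runs that still incur usage charges.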
Task descriptions, prompts, generated queries, system prompts, and model responses may be exposed to the configured model providers or agent endpoints.
The core workflow sends generated test queries to multiple user-configured endpoints and uses a judge endpoint to compare responses. This external data flow is disclosed and central to the purpose.
```
Collect responses — query all target endpoints concurrently
```
Use test endpoints, avoid sensitive or confidential data in benchmark tasks, and confirm each provider's data handling policy.
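As a lightweight guard, benchmark task files can be scanned for obvious secret-like strings before a run ever contacts a provider. The patterns below are illustrative, not exhaustive, and the function name is hypothetical.

```shell
# Return nonzero (and print the offending lines) if any task file contains
# something that looks like an API key or a password assignment.
scan_tasks() {
  if grep -nE 'sk-[A-Za-z0-9]{20,}|[Pp]assword[[:space:]]*[:=]' "$@"; then
    return 1   # secret-like content found: do not send these tasks
  fi
  return 0
}

# Example: scan_tasks tasks/*.txt || echo "review flagged tasks before running"
```

A scan like this cannot catch every leak, but it stops the most common accidental ones before they reach a third party.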
Saved evaluation data may remain on disk and be reused in future runs, affecting rankings or exposing benchmark content to anyone with access to the output directory.
The skill persists and reuses evaluation artifacts such as queries, responses, and rubrics for checkpoint/resume and judge reruns. This is expected, but stored artifacts can contain sensitive content and influence later results.
```
# Re-run only pairwise evaluation with new judge model
# (keeps queries, responses, and rubrics)
```
Store outputs in a private directory, delete old checkpoints when no longer needed, and use `--fresh` when prior generated data should not influence a new evaluation.
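These mitigations can be scripted: keep artifacts in an owner-only directory and prune stale checkpoints on a schedule. The directory name, file pattern, and retention window below are illustrative assumptions, not the skill's actual layout.

```shell
# Create an owner-only output directory for evaluation artifacts.
setup_output_dir() {
  mkdir -p "$1" && chmod 700 "$1"
}

# Delete checkpoint files older than N days (default 30) under a directory.
# The *.json pattern is an assumption about the artifact format.
prune_checkpoints() {
  find "$1" -type f -name '*.json' -mtime "+${2:-30}" -delete
}

# Example:
#   setup_output_dir ./arena-output
#   prune_checkpoints ./arena-output 30
```

Restricting the directory to `0700` keeps other local users from reading benchmark content, and regular pruning limits how much stale data can leak or silently influence a resumed run.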
