AgentBench
v1.0.0 · Benchmark your OpenClaw agent across 40 real-world tasks. Tests file creation, research, data analysis, multi-step workflows, memory, error handling, and too...
⭐ 1 · 580 · 1 current · 1 all-time
by @exe215
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
OpenClaw
Suspicious · medium confidence
Purpose & Capability
The skill is described as an agent benchmark and contains 40 tasks and setup scripts consistent with that goal. However, several included setup scripts call git (git init, git config, git commit), yet the declared required-binaries list does not include git. That is an inconsistency: the provided setup scripts will fail or behave unexpectedly if git is not installed. README and SKILL.md also reference uploading results to https://www.agentbench.app and 'HMAC-signed' results, but the repository as provided includes no obvious signing key or uploader, which implies external infrastructure or missing code.
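The mismatch is easy to catch before anything runs. A minimal pre-flight sketch (the binary list here is assembled from this scan's findings, not declared by the skill itself):

```shell
#!/bin/sh
# Report which of the given binaries are missing from PATH.
# The list checked below adds git, which the setup scripts call but the
# skill's declared requirements (jq, bash, python3) omit.
check_bins() {
  missing=""
  for bin in "$@"; do
    command -v "$bin" >/dev/null 2>&1 || missing="$missing $bin"
  done
  echo "$missing"
}

missing=$(check_bins jq bash python3 git)
if [ -z "$missing" ]; then
  echo "all required binaries present"
else
  echo "missing:$missing" >&2
fi
```

Running this before the benchmark surfaces the undeclared git dependency immediately instead of mid-task.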
Instruction Scope
SKILL.md instructs the agent to discover tasks in this skill, create /tmp workspaces, run bundled setup.sh scripts, and then 'execute the task yourself' using tools including read/write/edit/exec/web_search/web_fetch. While most tasks operate on local inputs created by setup.sh, allowing arbitrary web_search/web_fetch and open-ended tool use broadens the agent's actions beyond purely local processing and could lead to network activity or data exfiltration. The SKILL.md attempts to constrain work to the workspace directory, but setup.sh scripts and the runtime instructions run code from the skill (shell and Python) on the host — review those scripts before running.
Install Mechanism
There is no external install spec or binary download; the skill is instruction + code files bundled in the skill directory. That reduces supply-chain risk because nothing is fetched from an arbitrary URL at install time. However, the skill includes and runs local shell and Python setup scripts (extracting and executing files contained in the skill), which still executes code on the host when invoked.
Credentials
The skill declares no required environment variables or secrets. The code writes metrics/events to /tmp/agentbench-{run_id} and uses an optional AGENTBENCH_RUN_ID env var; this is proportionate for benchmarking. No credentials or unrelated environment access are requested.
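The run-id convention described above can be exercised by hand; a sketch, assuming the /tmp/agentbench-{run_id} layout from the scan (the events.jsonl file name is illustrative, not confirmed by the bundle):

```shell
#!/bin/sh
# AGENTBENCH_RUN_ID is optional; fall back to a fixed id for this demo.
RUN_ID="${AGENTBENCH_RUN_ID:-demo-run}"
METRICS_DIR="/tmp/agentbench-${RUN_ID}"
mkdir -p "$METRICS_DIR"

# Hypothetical event record, to show where writes land.
printf '{"event":"task_start","task":"001"}\n' >> "$METRICS_DIR/events.jsonl"

# After a run, inspect what was written before deciding to upload anything.
ls "$METRICS_DIR"
```

Everything stays under /tmp, which matches the proportionate, credential-free footprint noted above.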
Persistence & Privilege
Flags: always:false, and model invocation is allowed (normal). The skill writes results and metrics under /tmp/agentbench-{run_id} and initializes git repositories inside created workspaces; it does not request permanent system-wide presence or changes to other skills' configs. No 'always:true' or other elevated privileges are requested.
What to consider before installing
This skill is plausibly what it says — a local benchmarking suite that runs 40 tasks using bundled setup scripts — but take these precautions before running it:
- Review and run in isolation: The skill executes the included setup.sh and Python scripts on your machine. Inspect those scripts (they are present under tasks/...) and run the benchmark only in a sandboxed environment (container, VM, or isolated development machine) to avoid unintended side effects.
- Ensure required tools are present: the skill declares jq, bash, and python3, but many setup scripts call git. Install git, or expect the scripts that call it to fail. Also check for other utilities (awk/sed/printf) your environment might need.
- Network activity: SKILL.md allows the agent to use web_search/web_fetch and README references submitting results to https://www.agentbench.app. If you want to avoid network calls, restrict the agent's outbound network access or verify whether you (or the agent runtime) will actually perform uploads. There is no signing key included for the claimed HMAC-signed results — the signing likely happens server-side or is missing from the bundle.
- Data writes: The skill creates /tmp/agentbench-* workspaces and may initialize git repositories and write files. Confirm you are comfortable with those writes and that /tmp is acceptable.
- If you lack time to audit: prefer running a single task in a disposable environment first (e.g., /benchmark --task <id>) instead of the full suite.
Given these mismatches (missing 'git' in the declared requirements, open-ended web operations, and code executed locally), proceed only after the above checks; if you need high assurance, run it in an isolated VM or container and examine the setup scripts for anything beyond the expected file and repo setup.

Like a lobster shell, security has layers — review code before you run it.
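One quick way to do that review: grep the bundled scripts for commands with network, VCS, or destructive side effects before anything executes. A rough audit sketch (the pattern list is illustrative, not exhaustive, and the ./agentbench/tasks path is a hypothetical example):

```shell
#!/bin/sh
# Print file:line matches for commands worth eyeballing in bundled
# scripts, so a human can review them before any setup.sh runs.
audit_scripts() {
  grep -RnE 'curl|wget|git |rm -rf|ssh|nc |chmod \+x' "$1" 2>/dev/null
}

# Example usage against an unpacked skill directory:
# audit_scripts ./agentbench/tasks
```

A hit is not proof of malice (the git calls here are expected repo setup), but anything beyond file and repo setup deserves a closer look.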
latest: vk977nay267dahf202nnk9pqkwn81mkgc
Runtime requirements
📊 Clawdis
Bins: jq, bash, python3
