BenchClaw OpenClaw Benchmark

Security checks across malware telemetry and agentic risk

Overview

BenchClaw appears to be a real benchmark tool, but it needs review because it handles OpenClaw session metadata, mutates shared OpenClaw session state, and uploads benchmark data by default unless local-only mode is selected.

Install only after reviewing the disclosure and running it in a context where benchmark prompts may safely exercise your OpenClaw agent. Use local-only mode by setting upload_to_server=false if you do not want leaderboard submission; note that question fetching still uses the network. Avoid running it in sessions or workspaces containing sensitive data, and treat temp logs/reports as sensitive because they may include session routing metadata and stdout/stderr snippets.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands

Findings (19)

Intent-Code Divergence

Low

Confidence: 84%
Finding: The documentation says the dependencies only include cryptography and psutil, while the declared package list also includes requests and the skill performs network fetch/upload behavior. Understating dependencies can mislead operators about network-capable components being installed and used, which weakens informed consent and security review. In this skill's context, that matters because it fetches remote question sets and may upload benchmark data by default.

Context-Inappropriate Capability

Medium

Confidence: 96%
Finding: The skill can delete and recreate OpenClaw agents automatically when it decides the workspace is stale or unknown. In a benchmarking context, this mutates persistent user configuration and can destroy prior agent setup, creating integrity and availability risk if triggered on the wrong agent ID or with attacker-influenced parameters.

Context-Inappropriate Capability

Medium

Confidence: 97%
Finding: The cleanup functions delete stored session transcripts and session indexes from the user's ~/.openclaw state. In a benchmark tool this is risky because transcripts may contain audit history, debugging evidence, or valuable usage records, and the deletion is broad enough to cause data loss if invoked on the wrong agent or prefix.

Context-Inappropriate Capability

Medium

Confidence: 92%
Finding: The code reads channel/target metadata from caller-controlled state and uses it to send outbound messages via the OpenClaw CLI. That enables the benchmarking skill to contact arbitrary recipients or external channels, which exceeds pure local scoring behavior and can be abused for spam, unsolicited contact, or exfiltration of benchmark results/status to attacker-chosen destinations.
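
One way to narrow this capability is to check every outbound destination against an operator-approved allowlist before any message is sent. The sketch below is illustrative only: `ALLOWED_DESTINATIONS`, `check_destination`, and the example channel and target values are hypothetical names, not part of BenchClaw or the OpenClaw CLI.

```python
from typing import FrozenSet, Tuple

# Hypothetical allowlist of (channel, target) pairs approved by the operator.
# These values are placeholders, not real OpenClaw channels or recipients.
ALLOWED_DESTINATIONS: FrozenSet[Tuple[str, str]] = frozenset(
    {("local", "benchmark-operator")}
)


def check_destination(
    channel: str,
    target: str,
    allowed: FrozenSet[Tuple[str, str]] = ALLOWED_DESTINATIONS,
) -> None:
    """Raise PermissionError unless (channel, target) is explicitly approved."""
    if (channel, target) not in allowed:
        raise PermissionError(
            f"destination {channel!r}/{target!r} is not on the allowlist"
        )
```

A caller would run `check_destination(channel, target)` immediately before invoking whatever sends the message, so that caller-controlled state cannot redirect output to an unapproved recipient.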

Description-Behavior Mismatch

High

Confidence: 98%
Finding: The script logs the OpenClaw session ID, session key, channel, and target directly to the logger. Session keys and related identifiers are sensitive authentication or routing material; exposing them in logs can enable session hijacking, impersonation, or unauthorized access by anyone who can read the log file or generated artifacts.
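
A common mitigation is to log a short, non-reversible digest in place of each sensitive value, which still lets operators correlate log lines. The following is a minimal sketch under assumptions: the logger name, `redact`, and `log_session_context` are hypothetical and do not reflect the skill's actual code.

```python
import hashlib
import logging

logger = logging.getLogger("benchclaw.example")  # hypothetical logger name


def redact(value: str, keep: int = 8) -> str:
    """Return a short SHA-256 prefix so log lines correlate without exposing the value."""
    if not value:
        return "<empty>"
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"<redacted sha256:{digest[:keep]}>"


def log_session_context(
    session_id: str, session_key: str, channel: str, target: str
) -> str:
    """Log session routing metadata with every field redacted; return the logged text."""
    message = (
        f"session={redact(session_id)} key={redact(session_key)} "
        f"channel={redact(channel)} target={redact(target)}"
    )
    logger.info(message)
    return message
```

A truncated hash of a low-entropy value can still be guessed by brute force, so a keyed digest (HMAC) or omitting the field entirely is the safer choice for session keys.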

Description-Behavior Mismatch

Medium

Confidence: 98%
Finding: The skill deletes session transcript files and lock files and rewrites the OpenClaw session store, which goes beyond the advertised benchmarking/reporting purpose. Because this mutation runs automatically in the __main__ path, it can destroy user session history or interfere with other sessions without explicit consent, making the benchmark skill operationally dangerous.

Description-Behavior Mismatch

Medium

Confidence: 95%
Finding: The code actively sends channel and agent messages into user sessions, including relayed prompts that instruct another agent how to respond. That behavior is not disclosed by the skill description and creates an unexpected cross-session side effect, which can spam users, leak benchmark activity into live conversations, or manipulate agent behavior outside the benchmark task itself.

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: The skill reads the global OpenClaw session store and resolves delivery context such as channel and target from prior session metadata. For a benchmarking tool, this is unnecessary privilege expansion that enables inspection of unrelated session routing data and supports later cross-session messaging behavior.

Intent-Code Divergence

Low

Confidence: 90%
Finding: The code claims progress must go to the current session, but if environment hints are absent it falls back to the most recently updated session. In a multi-user or multi-session environment, that can misroute benchmark notifications or final agent prompts into the wrong conversation, causing information leakage and unintended interaction with another user's session.
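
A fail-closed alternative is to refuse to route anything when the current session cannot be identified, instead of guessing from recency. The sketch below assumes a hypothetical environment variable, `HYPOTHETICAL_SESSION_ID`, and a hypothetical in-memory mapping of known sessions; neither is taken from the skill or from OpenClaw.

```python
from typing import Mapping


class SessionResolutionError(RuntimeError):
    """Raised when the current session cannot be identified unambiguously."""


def resolve_session_id(
    env: Mapping[str, str], known_sessions: Mapping[str, dict]
) -> str:
    """Return the session named by the environment hint, or raise instead of guessing."""
    # HYPOTHETICAL_SESSION_ID is a made-up variable name used for illustration.
    session_id = env.get("HYPOTHETICAL_SESSION_ID", "").strip()
    if not session_id:
        raise SessionResolutionError(
            "no session hint in environment; refusing to fall back"
        )
    if session_id not in known_sessions:
        raise SessionResolutionError(
            "session hint does not match any known session"
        )
    return session_id
```

Raising here surfaces the missing hint to the operator, whereas a "most recently updated" fallback silently picks a session that may belong to someone else.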

Context-Inappropriate Capability

Medium

Confidence: 85%
Finding: The function collects environment fingerprinting data including virtualization type, cloud provider hints, CPU count, RAM size, and DMI vendor/product information, then states it is used for reporting to a leaderboard. In a benchmarking skill, some hardware info is relevant, but provider and host-type fingerprinting go beyond the minimum needed and can expose sensitive infrastructure metadata that aids profiling or targeting if transmitted or logged.

Context-Inappropriate Capability

Medium

Confidence: 98%
Finding: The metric extractor joins user-controlled target_file directly with workspace_dir and only checks os.path.isfile() before reading. An attacker who can influence benchmark/question configuration can supply path traversal such as ../../etc/passwd to read files outside the workspace, causing unintended local file disclosure during evaluation.

Context-Inappropriate Capability

Medium

Confidence: 96%
Finding: If agent_output.json is absent, _extract_reply_content falls back to reading any file at os.path.join(workspace_dir, target_path) with no path normalization or containment check. A crafted target_path can escape the workspace and expose arbitrary readable files to the verification logic.
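
Both path findings above come down to a missing containment check after joining a caller-supplied path onto the workspace directory. A minimal sketch of such a check follows; `resolve_inside_workspace` is a hypothetical helper, not a function from the skill.

```python
import os


def resolve_inside_workspace(workspace_dir: str, relative_path: str) -> str:
    """Resolve relative_path under workspace_dir, rejecting anything that escapes it."""
    root = os.path.realpath(workspace_dir)
    # realpath collapses ".." segments and follows symlinks before the comparison.
    candidate = os.path.realpath(os.path.join(root, relative_path))
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError(f"path escapes workspace: {relative_path!r}")
    return candidate
```

The check covers `..` traversal, absolute paths (which `os.path.join` would otherwise accept as-is), and symlinks that point outside the workspace. The existing `os.path.isfile()` test would then be applied to the returned path, not the raw join.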

Missing User Warnings

Medium

Confidence: 88%
Finding: The README directly instructs users to install and run the benchmark without a clear up-front warning that it may access local files, read or modify configuration, inspect hardware information, and submit benchmark-related data over the network. In a skill ecosystem, users often treat quick-start steps as safe defaults, so omitted disclosure can lead to uninformed execution of a tool with broader system and data access than the brief install/run commands suggest.

Vague Triggers

Medium

Confidence: 88%
Finding: The Chinese trigger phrases include common language like '跑分' and '跑个分', which are broad enough to be invoked outside a clearly benchmark-specific context. Because this skill can install dependencies, execute scripts, fetch remote content, and potentially upload results, accidental activation could lead to unexpected resource consumption, network activity, and file writes. The benchmark context increases risk because the action is long-running and expensive rather than a harmless lookup.

Vague Triggers

Medium

Confidence: 90%
Finding: The English triggers like 'Benchmark', 'Scoring', and 'Performance Metrics' are overly generic and could match ordinary discussion rather than a request to execute this skill. In this skill, activation is more dangerous than usual because it may launch long-running benchmark jobs, incur substantial token/API cost, retrieve remote tasks, and upload summarized outputs by default. That makes ambiguous invocation materially risky.

Missing User Warnings

Medium

Confidence: 97%
Finding: These log lines expose sensitive session and targeting metadata in plaintext without warning or redaction. Even if the values are not immediately exploitable as credentials, they reveal private user/session context and increase the blast radius of log access or accidental artifact sharing.

Missing User Warnings

Medium

Confidence: 97%
Finding: The cleanup routine removes session files and rewrites the session store with no user warning, confirmation, or transactional safety. Silent destructive actions against persistent session data are dangerous because they can erase history, disrupt active sessions, and make recovery difficult if the write partially fails or targets the wrong entries.
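
The partial-write and no-confirmation concerns can be addressed together by requiring an explicit opt-in, keeping a backup, and replacing the store file atomically. The sketch below assumes the session store is a single JSON file; the function name, the `confirmed` flag, and the `.bak` suffix are hypothetical choices, not the skill's behavior.

```python
import json
import os
import shutil
import tempfile


def rewrite_store_atomically(
    store_path: str, new_store: dict, confirmed: bool = False
) -> str:
    """Back up store_path, then replace it atomically. Returns the backup path."""
    if not confirmed:
        raise PermissionError(
            "refusing to modify the session store without explicit confirmation"
        )
    backup_path = store_path + ".bak"
    if os.path.exists(store_path):
        shutil.copy2(store_path, backup_path)  # keep a recoverable copy
    directory = os.path.dirname(os.path.abspath(store_path))
    # Write to a temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(new_store, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, store_path)  # atomic swap of old and new content
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return backup_path
```

Because the swap happens in a single `os.replace` call, a crash mid-write leaves the original store intact instead of a truncated file.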

Missing User Warnings

Medium

Confidence: 80%
Finding: The code performs system inspection and invokes systemd-detect-virt without any nearby user-facing disclosure, consent flow, or warning that host metadata will be gathered. In an agent benchmark context this is more concerning because the skill advertises evaluation and reporting features, so undisclosed environment collection may surprise users and leak infrastructure details beyond expected local scoring behavior.

SSD-3

Medium

Confidence: 95%
Finding: Logging session identifiers plus channel/target metadata creates a durable local record of sensitive operational data. In the context of a benchmark tool that also generates reports and local artifacts, this increases the chance that private session-routing details are retained, copied, or exposed beyond the active run.

VirusTotal

66/66 vendors flagged this skill as clean.
