BotMark Skill

Security checks across malware telemetry and agentic risk

Overview

BotMark appears to be a real benchmark skill, but it should be reviewed before installation because it stores secrets locally and accepts server-provided runner/tool updates that can change local behavior.

Install only if you are comfortable with BotMark contacting botmark.cc, running a local Python engine, spawning up to three benchmark workers, saving a BotMark API key locally, and accepting server-provided runner/tool updates. Prefer a scoped test key, review or disable the self-update and dotfile credential paths if possible, and remove the stored key when you stop using the skill.

SkillSpector

By NVIDIA

Vulnerability Patterns

Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration

Findings (53)

Description-Behavior Mismatch

Medium

Confidence: 95%
Finding: The skill instructs the agent to persistently store API credentials in a local file and source them later. Even with chmod 600, storing secrets in workspace files increases exposure to accidental disclosure, backup leakage, other tools reading the file, or misuse by future sessions.

Description-Behavior Mismatch

Medium

Confidence: 83%
Finding: The documentation claims only two HTTP round trips and local encrypted scoring, but later sections show multiple remote calls for package startup, feedback, status checks, and skill download. Misrepresenting the amount of remote interaction undermines informed consent and can cause users to share data under false assumptions about locality and exposure.

Description-Behavior Mismatch

Medium

Confidence: 95%
Finding: The script exposes a question-export capability that dumps the assessment prompts for offline inspection. In a benchmarking skill, this weakens test integrity and enables precomputation, answer leakage, and unauthorized redistribution of evaluation content, especially because the feature exists in the same runner used for assessment orchestration.

Context-Inappropriate Capability

Medium

Confidence: 95%
Finding: The offline question-dumping path is not necessary for normal benchmark execution and creates an unnecessary content-exfiltration surface. Attackers or curious users can use it to extract proprietary or sensitive benchmark material, undermining fairness and allowing targeted optimization against the test.

Description-Behavior Mismatch

Medium

Confidence: 94%
Finding: The setup script performs persistent installation and credential/configuration changes that go beyond the narrow user-facing description of 'running a benchmark and generating a report.' While installers commonly write files, this broader behavior increases trust requirements and attack surface because the script modifies OpenClaw state and stores secrets locally.

Context-Inappropriate Capability

High

Confidence: 99%
Finding: The script downloads skill code from a remote server at install time and writes executable content locally without integrity verification, pinning, or signature checks. This creates a supply-chain risk: compromise of the server, CDN, TLS termination, or API response could result in arbitrary code being installed and later executed by the host environment.
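
One common mitigation for this class of finding is to pin a digest of the reviewed runner and refuse any download that does not match it. The following is a minimal sketch of that idea, not BotMark's actual installer; the function names and the sample byte strings are hypothetical and stand in for whatever content the install script would fetch.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned(data: bytes, pinned_sha256: str) -> bool:
    """Accept downloaded content only if it matches a digest pinned at review time."""
    # compare_digest performs a constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_hex(data), pinned_sha256.strip().lower())


if __name__ == "__main__":
    # Hypothetical stand-ins: the bytes that were reviewed, and a later altered download.
    reviewed = b"print('runner v1')\n"
    pinned = sha256_hex(reviewed)  # digest recorded when the code was reviewed
    tampered = b"print('runner v1'); import os\n"

    print(verify_pinned(reviewed, pinned))  # True
    print(verify_pinned(tampered, pinned))  # False
```

A pinned digest only helps if it is stored separately from the download channel; a digest served by the same server as the code would not address the compromise scenarios listed in the finding.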

Context-Inappropriate Capability

Medium

Confidence: 92%
Finding: The script reads API keys from multiple locations and persists them into openclaw.json and a fallback .botmark_env file. Storing credentials in plaintext local files broadens exposure if the host is shared, backed up insecurely, or later read by other tools or skills.
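
Users who install the skill anyway may want to check how the stored key file is protected and remove it when they are done. The sketch below assumes a POSIX host and takes the file path as a parameter, because the report names the fallback file (.botmark_env) but not its directory; all function names are hypothetical.

```python
import os
import stat
from pathlib import Path


def is_private_file(path: Path) -> bool:
    """Return True if the file grants no permissions to group or others (POSIX mode bits)."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def tighten_permissions(path: Path) -> None:
    """Restrict the file to owner read/write only (equivalent to chmod 600)."""
    os.chmod(path, 0o600)


def remove_stored_key(path: Path) -> bool:
    """Delete the credential file if present; return True if something was removed."""
    try:
        path.unlink()
        return True
    except FileNotFoundError:
        return False
```

Owner-only permissions reduce exposure to other local users but do not protect against backups or other tools running as the same user, which is the residual risk the finding describes.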

Context-Inappropriate Capability

Medium

Confidence: 93%
Finding: The skill description presents a benchmarking capability, but the embedded instructions direct broad behaviors including writing executable scripts, running local code, managing files, sourcing env files, downloading code via curl, and deleting directories. This mismatch creates an unexpected execution surface that can be abused or can surprise users and hosts with actions far beyond what the description discloses.

Description-Behavior Mismatch

High

Confidence: 98%
Finding: The manifest presents this as a simple API-based benchmark skill, but the embedded instructions direct the agent to write files, execute a downloaded Python runner, orchestrate subprocess-like parallel tasks, and even fetch code via curl if missing. That expands the trust boundary from API calls to arbitrary local code execution and filesystem mutation, which can be abused for code execution or persistence on the host under the cover of a benign benchmark workflow.

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: The instructions authorize deleting old local skill directories and mutating local tool definitions based on server responses, which exceeds the expected scope of a benchmarking skill. This creates a remote update channel that can alter future tool behavior or remove files on disk, enabling persistence, tampering, or destructive actions if the update mechanism is compromised or abused.

Context-Inappropriate Capability

Medium

Confidence: 97%
Finding: The skill directs deletion of local skill directories and legacy files unrelated to the immediate evaluation transaction. That gives a benchmarking workflow filesystem-modifying authority beyond its need and can destroy unrelated data, break other skills, or be repurposed through path confusion and unsafe cleanup logic. The risk is elevated because these deletions are framed as routine maintenance, making destructive actions more likely to be executed without scrutiny.

Context-Inappropriate Capability

High

Confidence: 97%
Finding: The skill directs deletion of local skill directories and legacy files unrelated to the immediate evaluation transaction. That gives a benchmarking workflow filesystem-modifying authority beyond its need and can destroy unrelated data, break other skills, or be repurposed through path confusion and unsafe cleanup logic. The risk is elevated because these deletions are framed as routine maintenance, making destructive actions more likely to be executed without scrutiny.
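
The path-confusion risk mentioned in this finding is usually addressed by resolving every deletion target and confirming it sits strictly inside a single allowed directory before anything is removed. The following is a sketch of such a dry-run check under that assumption; `is_within` and `plan_cleanup` are hypothetical names, and the code deliberately deletes nothing.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def is_within(root: Path, candidate: Path) -> bool:
    """Return True only if candidate resolves to a location strictly inside root."""
    # resolve() collapses ".." segments and follows symlinks, so traversal
    # tricks are judged against the real target. The root itself is excluded.
    return root.resolve() in candidate.resolve().parents


def plan_cleanup(root: Path, requested: Iterable[Path]) -> Tuple[List[Path], List[Path]]:
    """Split requested deletions into (allowed, rejected) without touching the disk."""
    allowed: List[Path] = []
    rejected: List[Path] = []
    for path in requested:
        (allowed if is_within(root, path) else rejected).append(path)
    return allowed, rejected
```

Returning a plan instead of deleting immediately gives the user a chance to review the rejected entries, which addresses the concern that routine-looking cleanup is executed without scrutiny.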

Context-Inappropriate Capability

Medium

Confidence: 92%
Finding: The skill goes far beyond invoking a remote evaluation API: it instructs the agent to write local files, run Python/shell commands, download or cache a runner, and spawn sub-agents/sessions. That materially expands the skill's privilege footprint and creates code-execution and persistence paths that are unnecessary for a typical API skill, increasing the risk of arbitrary code execution, unsafe file writes, and abuse of local resources if the remote service or returned content is compromised.

Description-Behavior Mismatch

High

Confidence: 96%
Finding: The prompt authorizes downloading and replacing executable engine code and tool definitions from a remote service at runtime, including inline upgrades and fallback downloads. This creates a remote code and capability update channel that can change the skill's behavior after review, enabling supply-chain compromise or scope expansion beyond the declared benchmarking API usage.

Context-Inappropriate Capability

Medium

Confidence: 95%
Finding: The skill explicitly asks for the user's BotMark API key, stores it in a persistent local file, and reuses it across sessions. Persisting user secrets locally increases the blast radius of compromise, can expose credentials to other skills/processes, and exceeds what is necessary for a simple one-off benchmark invocation.

Context-Inappropriate Capability

Medium

Confidence: 88%
Finding: The prompt mandates spawning parallel sub-agents and orchestrating local Python CLI workflows, which materially expands execution capabilities beyond the manifest's simple API-based evaluation description. Hidden local orchestration increases attack surface, makes behavior harder to audit, and can be abused for unintended resource consumption or secondary actions.

Intent-Code Divergence

Medium

Confidence: 86%
Finding: The documentation claims only two HTTP round trips and local black-box scoring, but later instructions introduce additional API calls, feedback submission, status checks, remote downloads, and inline upgrades. This mismatch is dangerous because it misleads reviewers and users about the skill's true network behavior and trust boundaries.

Context-Inappropriate Capability

Medium

Confidence: 97%
Finding: The skill instructs the agent to source a local env file and persist the owner's API key to disk, expanding credential handling beyond the stated env-var-only setup. Persisting secrets in skill-local files increases the chance of accidental disclosure, reuse across contexts, and unsafe handling by other tools or skills on the same host.

Context-Inappropriate Capability

Medium

Confidence: 92%
Finding: The prompt authorizes the skill to replace local tool definitions and persist upgraded versions based on server responses. This creates a remote self-modifying behavior path that can alter future agent capabilities outside the original manifest scope, increasing supply-chain and persistence risk if the server or response is compromised.
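
A host that cannot disable server-driven updates could at least surface what a proposed tool definition would change before it is persisted. The sketch below assumes tool definitions are flat dictionaries with string keys, which this report does not confirm; the function names are hypothetical.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel so an absent key differs from a key set to None


def changed_keys(local: Dict[str, Any], proposed: Dict[str, Any]) -> List[str]:
    """List every top-level key that is added, removed, or altered by the update."""
    keys = set(local) | set(proposed)
    return sorted(
        k for k in keys if local.get(k, _MISSING) != proposed.get(k, _MISSING)
    )


def requires_review(local: Dict[str, Any], proposed: Dict[str, Any]) -> bool:
    """Return True when the server-provided definition differs from the reviewed one."""
    return bool(changed_keys(local, proposed))
```

Under these assumptions, an update that returns an empty list from `changed_keys` could be applied silently, while any other update would be held for explicit approval.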

Vague Triggers

Medium

Confidence: 91%
Finding: The documented trigger phrases include broad, natural-language commands such as 'Test yourself' and 'Evaluate your capabilities', which can overlap with normal user requests outside an intentional benchmark flow. In an agent environment, this can cause unintended skill activation, leading the bot to start a multi-step external evaluation workflow and potentially make network calls without the user's clear, specific intent.
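
A narrower activation rule would require the user to name the skill explicitly instead of matching generic phrases. The sketch below illustrates that approach with a hypothetical pattern; it is not the skill's actual trigger logic, and the accepted wording ("run" or "start" followed by "BotMark") is an assumption chosen for the example.

```python
import re

# Hypothetical explicit trigger: an action verb followed by the skill's name.
EXPLICIT_TRIGGER = re.compile(r"\b(run|start)\s+(the\s+)?botmark\b", re.IGNORECASE)


def should_activate(message: str) -> bool:
    """Activate only when the message explicitly asks to run or start BotMark."""
    return EXPLICIT_TRIGGER.search(message) is not None


if __name__ == "__main__":
    print(should_activate("Test yourself"))                     # False
    print(should_activate("Evaluate your capabilities"))        # False
    print(should_activate("Please run the BotMark benchmark"))  # True
```

With a rule like this, the broad phrases quoted in the finding no longer start the external evaluation workflow on their own.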

Vague Triggers

Medium

Confidence: 89%
Finding: The trigger phrases are broad and include common terms like evaluation, score, check, or benchmark, which can cause the skill to activate in contexts the user did not intend. Because activation leads to external API use and potential credential workflows, accidental triggering expands the chance of unintended data transmission or secret handling.

Vague Triggers

Medium

Confidence: 86%
Finding: The skill instructs the agent to proactively suggest benchmarking after upgrades or ability-related questions without a clear consent boundary. Proactive prompting can steer users into a remote workflow involving profiling data and API credentials even when they did not ask to start that process.

VirusTotal

64/64 vendors flagged this skill as clean.
