Smartness Eval Open Source

Security checks across malware telemetry and agentic risk

Overview

This skill is an OpenClaw evaluation tool that discloses what it does, but users should review its local test execution and optional LLM-judge privacy settings before running it.

Install only if you want a tool that evaluates an OpenClaw workspace by reading local runtime state and running local benchmark scripts. Review config/task-suite.json before use, avoid --llm-judge on sensitive workspaces unless you accept sending summary evidence to the configured provider, and redact generated reports before sharing them publicly.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Behavioral AST: exec() Call, eval() Call, Dynamic Import
  • Taint Tracking: Direct Taint Flow, Variable-Mediated Taint Flow, Credential Exfiltration Chain
  • MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Findings (7)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
# ═══════════════════════════════════════════════════════════════════════════

def run_cmd(command: list[str], timeout: int = 120) -> subprocess.CompletedProcess:
    return subprocess.run(command, cwd=str(WORKSPACE),
                          capture_output=True, text=True, timeout=timeout)
Confidence
87% confidence
Finding
return subprocess.run(command, cwd=str(WORKSPACE), capture_output=True, text=True, timeout=timeout)
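
One possible way to narrow this finding is to check the program name against an allowlist before calling subprocess.run. The sketch below is illustrative only: run_cmd_checked, ALLOWED_PROGRAMS, and the WORKSPACE value are hypothetical and do not come from the skill's source.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical values: the real workspace path and the set of permitted
# programs are not shown in the scan excerpt.
WORKSPACE = Path.cwd()
ALLOWED_PROGRAMS = {Path(sys.executable).name, "python", "python3"}


def run_cmd_checked(command: list[str], timeout: int = 120) -> subprocess.CompletedProcess:
    """Run a command only if its program name is on the allowlist."""
    if not command:
        raise ValueError("empty command")
    program = Path(command[0]).name
    if program not in ALLOWED_PROGRAMS:
        raise PermissionError(f"program not allowlisted: {program}")
    # Passing a list with the default shell=False avoids shell interpolation.
    return subprocess.run(command, cwd=str(WORKSPACE),
                          capture_output=True, text=True, timeout=timeout)
```

With this wrapper, a benchmark script launched through the current interpreter still runs, while an unexpected program such as curl is rejected before any process starts.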

Tainted flow: 'req' from os.environ.get (line 517, credential/environment) → urllib.request.urlopen (network output)

Critical
Category
Data Flow
Content
headers={'Content-Type': 'application/json', 'Authorization': f'Bearer {api_key}'},
    )
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            body = json.loads(resp.read())
        text = body['choices'][0]['message']['content'].strip()
        if text.startswith('```'):
Confidence
96% confidence
Finding
with urllib.request.urlopen(req, timeout=15) as resp:
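
A credential-to-network flow like this is usually less risky when the destination is validated before the API key is attached. The sketch below assumes a hypothetical ALLOWED_HOSTS set and a hypothetical check_endpoint helper; the skill's actual endpoint configuration is not visible in the excerpt.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; the provider host used by the skill is not
# shown in the scan excerpt.
ALLOWED_HOSTS = {"api.example.com"}


def check_endpoint(url: str) -> str:
    """Return the URL unchanged if it is HTTPS and targets an allowed host."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"refusing non-HTTPS endpoint: {url}")
    if parts.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"endpoint host not allowlisted: {parts.hostname}")
    return url
```

Calling check_endpoint before building the request means the Bearer token is only ever sent over TLS to a host the user has explicitly approved.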

Lp3

Medium
Category
MCP Least Privilege
Confidence
93% confidence
Finding
The skill declares no permissions, yet its own documentation describes capabilities to read files, write reports, execute subprocess commands, access environment variables for API keys, and optionally make outbound network requests. This mismatch weakens least-privilege enforcement and can mislead users or hosting systems about the skill’s actual authority, increasing the chance of unsafe execution or policy bypass.

Missing User Warnings

Medium
Confidence
88% confidence
Finding
The README explicitly describes collection and processing of multiple local state sources, including logs, alerts, orchestrator data, and interaction samples, but does not clearly warn that these files may contain sensitive operational or user-derived data. In an evaluation skill, this context increases risk because users may run it against production workspaces and unintentionally expose private content in generated reports, history files, or downstream sharing workflows.
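
As a rough illustration of the redaction step this finding calls for, generated report text could be passed through a small pattern-based filter before it is saved or shared. The patterns and the redact helper below are assumptions for illustration, not part of the skill, and a real workspace would likely need a longer list.

```python
import re

# Illustrative patterns only: bearer tokens, email addresses, and the
# user name segment of home-directory paths.
REDACTION_PATTERNS = [
    (re.compile(r"Bearer\s+[A-Za-z0-9._\-]+"), "Bearer [REDACTED]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[REDACTED_EMAIL]"),
    (re.compile(r"(/home|/Users)/[^/\s]+"), r"\1/[REDACTED_USER]"),
]


def redact(text: str) -> str:
    """Apply each redaction pattern in order and return the cleaned text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Pattern-based redaction only catches what it has been told to look for, so a manual read of the report is still worthwhile before publishing it.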

Missing User Warnings

Medium
Confidence
93% confidence
Finding
The public sharing guidance tells users to include environment and evaluation details, but it does not warn that those details can reveal sensitive information such as internal versioning, host/OS characteristics, workspace identifiers, operational weaknesses, or benchmark artifacts. Because this file is explicitly a showcase and sharing guide, users are likely to copy these recommendations directly into public posts, increasing the chance of unintentional information disclosure.

Missing User Warnings

Medium
Confidence
95% confidence
Finding
The X/Twitter and GitHub discussion templates encourage publishing scores, dimensions, and evidence metrics without any caution about exposing benchmark data, system behavior, or performance weaknesses. In practice, this can leak operational signals such as latency, weak dimensions, or evaluation provenance that attackers or competitors could use for targeting, fingerprinting, or social engineering.

Missing User Warnings

Medium
Confidence
91% confidence
Finding
The LLM judge path transmits dimension scores and evidence collected from local state to a third-party service without any in-band warning, consent prompt, or redaction step. In this skill context, the evidence is derived from internal logs, error data, and knowledge stores, so sending it off-box can leak operationally sensitive information.
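
An in-band consent prompt of the kind this finding describes could look roughly like the sketch below. The function name confirm_llm_judge, its parameters, and the 500-character preview limit are hypothetical; the skill's actual LLM-judge code path is not shown in the report.

```python
def confirm_llm_judge(evidence_summary: str, provider: str, ask=input) -> bool:
    """Show what would be sent off-box and require an explicit 'yes'."""
    print(f"About to send {len(evidence_summary)} characters of evidence to {provider}.")
    # Preview a bounded slice so the user can judge sensitivity.
    print(evidence_summary[:500])
    answer = ask("Send this to the provider? Type 'yes' to continue: ")
    return answer.strip().lower() == "yes"
```

The ask parameter defaults to input, so interactive runs prompt on the terminal while automated runs can inject a function that always declines.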

VirusTotal

65/65 vendors flagged this skill as clean.
