AB Test Eval

Security checks across malware telemetry and agentic risk

Overview

This is a legitimate evaluation helper, but it asks for broad testing and command-execution authority that users should review carefully before installing.

Install only if you want an agent to set up evaluation workspaces and, after approval, potentially run scripts, hooks, cron payloads, or integration tests. Use the dry-run preview first, review generated eval cases and planned commands before approving, and avoid using it on projects where test commands could modify production data or expose secrets.

SkillSpector

By NVIDIA

Vulnerability Patterns

Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access

Findings (3)

Context-Inappropriate Capability

Medium

Confidence: 90%
Finding: The cron 'dry-run' guidance explicitly allows running the cron payload in an isolated subagent or exec dry-run context, which can still result in real command execution. For a testing/evaluation skill, that blurs the safety boundary and could execute side-effecting payloads under the guise of validation, especially if the cron job invokes scripts or external commands.
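
To illustrate the boundary this finding describes, here is a minimal sketch of a dry run that only reports what a cron payload would invoke and never executes it. The function name `plan_cron_payload` and the shape of the returned plan are hypothetical; this is not the skill's actual implementation, only one way a side-effect-free preview could look.

```python
import shlex


def plan_cron_payload(command: str, dry_run: bool = True) -> dict:
    """Describe a cron payload without running it (hypothetical helper)."""
    # Parse the command line into arguments; nothing is executed here.
    argv = shlex.split(command)
    plan = {"argv": argv, "executed": False}
    if dry_run:
        # Dry-run mode returns the plan only, so no side effects can occur.
        return plan
    # Assumption: real execution belongs behind a separate, explicit approval path.
    raise PermissionError("execution is not available from the dry-run helper")


print(plan_cron_payload("backup.sh --target /tmp/eval"))
```

The design choice is that the preview path contains no call capable of starting a process, so "dry-run" cannot silently become real execution.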

Scope Creep

High

Confidence: 95%
Finding: The skill documentation authorizes exec/subagent-driven command execution in cron dry-run mode, but the manifest declares only mkdir and cp as required binaries. This mismatch hides the true operational capability of the skill and can bypass user or platform expectations about what commands may be invoked during evaluation, increasing the risk of unauthorized execution.
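
The mismatch described in this finding can be checked mechanically. The sketch below compares the binaries named in a list of planned commands against the set the manifest declares (`mkdir` and `cp`, per the finding). The function name `undeclared_binaries` and the sample commands are hypothetical, and the check looks only at the first token of each command, so it would miss binaries reached through pipes, `env` wrappers, or shell scripts.

```python
import shlex
from pathlib import PurePath

# Binaries the manifest declares, according to the finding.
DECLARED_BINARIES = {"mkdir", "cp"}


def undeclared_binaries(planned_commands, declared=DECLARED_BINARIES):
    """Return the sorted names of binaries used but not declared (hypothetical helper)."""
    missing = set()
    for command in planned_commands:
        argv = shlex.split(command)
        if not argv:
            continue
        # Reduce a path such as /bin/bash to its final component, bash.
        binary = PurePath(argv[0]).name
        if binary not in declared:
            missing.add(binary)
    return sorted(missing)


# Illustrative commands only; they are not taken from the skill.
planned = ["mkdir -p evals/run1", "cp a.json b.json", "/bin/bash run_cron.sh"]
print(undeclared_binaries(planned))
```

A non-empty result means the planned commands exceed what the manifest declares and should be reviewed before approval.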

Vague Triggers

High

Confidence: 92%
Finding: The trigger text says to use this skill whenever the user mentions testing, benchmarking, comparing, or evaluating any skill, script, hook, or cron job, even without explicitly asking for A/B testing. That scope is extremely broad and can cause unintended invocation on many requests, which is particularly risky here because the skill can spawn subagents, write files, and in some modes execute or simulate operational components.
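
The difference between a broad and a narrow trigger can be sketched as follows. Both patterns are assumptions made for illustration: the skill's real trigger is natural-language text interpreted by the agent, not a regular expression. The broad pattern approximates "any mention of testing, benchmarking, comparing, or evaluating"; the narrow one requires an explicit reference to A/B testing.

```python
import re

# Hypothetical approximation of the broad trigger described in the finding.
BROAD_KEYWORDS = re.compile(
    r"\b(test\w*|benchmark\w*|compar\w*|evaluat\w*)\b", re.IGNORECASE
)
# Hypothetical narrow trigger: fires only on an explicit A/B testing request.
EXPLICIT_AB = re.compile(
    r"\ba\s*/\s*b\s+test(ing|s)?\b|\bab[- ]test(ing|s)?\b", re.IGNORECASE
)


def should_trigger(message: str, explicit_only: bool = True) -> bool:
    """Decide whether a user message would invoke the skill (hypothetical helper)."""
    pattern = EXPLICIT_AB if explicit_only else BROAD_KEYWORDS
    return bool(pattern.search(message))


message = "Can you run the unit tests?"
print(should_trigger(message, explicit_only=False))  # broad trigger
print(should_trigger(message, explicit_only=True))   # narrow trigger
```

Under these assumptions, an ordinary request to run unit tests matches the broad pattern but not the narrow one, which is the unintended invocation the finding warns about.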

VirusTotal

58/58 vendors flagged this skill as clean.
