Security audit

agiza_agents

Security checks across malware telemetry and agentic risk

Overview

This appears to be a real engineering skill bundle, but it grants agents broad command execution, autonomous code mutation, and local testing authority with some under-disclosed risks.

Install only if you are comfortable reviewing and controlling a powerful engineering automation bundle. Run it in disposable or sandboxed repositories, avoid exposing secrets or production credentials, review every eval/build/test command before execution, do not run the skill-tester against untrusted skills on your normal host, and treat generated database migrations as drafts requiring expert review before production use.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Behavioral AST: exec() Call, eval() Call, Dynamic Import
  • MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Findings (84)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
def run_eval_in_worktree(worktree_path, eval_cmd):
    """Run evaluation command in a worktree and return stdout."""
    try:
        result = subprocess.run(
            eval_cmd, shell=True, capture_output=True, text=True,
            cwd=worktree_path, timeout=120
        )
Confidence
99% confidence
Finding
result = subprocess.run( eval_cmd, shell=True, capture_output=True, text=True, cwd=worktree_path, timeout=120 )
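
One way to narrow this kind of finding, sketched here under assumptions rather than taken from the skill, is to tokenize the command and run it without a shell, so metacharacters such as ; or | reach the program as literal arguments. The name run_eval_argv is hypothetical, and commands that genuinely depend on shell features (pipes, redirects, globbing) would not work unchanged.

```python
import shlex
import subprocess


def run_eval_argv(eval_cmd, worktree_path, timeout=120):
    """Run an evaluation command without a shell and return its stdout.

    Hypothetical variant of run_eval_in_worktree: the command string is
    tokenized with shlex and executed directly, so no shell parses it.
    """
    argv = shlex.split(eval_cmd)
    if not argv:
        raise ValueError("empty evaluation command")
    result = subprocess.run(
        argv, capture_output=True, text=True,
        cwd=worktree_path, timeout=timeout, check=False,
    )
    return result.stdout
```

This does not make an untrusted command safe to run. It only removes the shell as an extra interpretation layer, so review of the command itself is still needed.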

subprocess module call

Medium
Category
Dangerous Code Execution
Content
# Build if needed
if "BUILD_CMD" in dir() or "BUILD_CMD" in globals():
    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True)
    if result.returncode != 0:
        print(f"Build failed: {result.stderr.decode()[:200]}", file=sys.stderr)
        sys.exit(1)
Confidence
97% confidence
Finding
result = subprocess.run(BUILD_CMD, shell=True, capture_output=True)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
if "DOCKER_IMAGE" in dir() or "DOCKER_IMAGE" in globals():
    if "DOCKER_BUILD_CMD" in dir():
        subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True)
    result = subprocess.run(
        f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'",
        shell=True, capture_output=True, text=True
    )
Confidence
96% confidence
Finding
result = subprocess.run( f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'", shell=True, capture_output=True, text=True )
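
The interpolated image name is the part of this call that a shell would reinterpret. A minimal sketch of the same inspection built as an argument list follows. The helper name docker_size_argv and the leading-dash check are assumptions added for illustration, not part of the audited code.

```python
def docker_size_argv(image):
    """Build an argv list for `docker image inspect` with no shell involved."""
    if not image or image.startswith("-"):
        # A leading dash would be parsed by the docker CLI as an option.
        raise ValueError(f"suspicious image reference: {image!r}")
    # As a list element, the Go template needs no shell quoting and no
    # doubled braces, because it is not inside an f-string.
    return ["docker", "image", "inspect", image, "--format", "{{.Size}}"]
```

The resulting list can be passed to subprocess.run without shell=True. A value such as "app; rm -rf /" then stays a single (invalid) image argument instead of becoming a second command.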

subprocess module call

Medium
Category
Dangerous Code Execution
Content
# Measure
if "DOCKER_IMAGE" in dir() or "DOCKER_IMAGE" in globals():
    if "DOCKER_BUILD_CMD" in dir():
        subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True)
    result = subprocess.run(
        f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'",
        shell=True, capture_output=True, text=True
Confidence
96% confidence
Finding
subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60)

    t0 = time.perf_counter()
    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600)
    elapsed = time.perf_counter() - t0

    if result.returncode != 0:
Confidence
96% confidence
Finding
result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
for i in range(RUNS):
    # Clean if configured
    if CLEAN_CMD:
        subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60)

    t0 = time.perf_counter()
    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600)
Confidence
95% confidence
Finding
subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
if system == "Linux":
    # Use /usr/bin/time for peak RSS
    result = subprocess.run(
        f"/usr/bin/time -v {COMMAND}",
        shell=True, capture_output=True, text=True, timeout=300
    )
Confidence
95% confidence
Finding
result = subprocess.run( f"/usr/bin/time -v {COMMAND}", shell=True, capture_output=True, text=True, timeout=300 )

subprocess module call

Medium
Category
Dangerous Code Execution
Content
elif system == "Darwin":
    # macOS: use /usr/bin/time -l
    result = subprocess.run(
        f"/usr/bin/time -l {COMMAND}",
        shell=True, capture_output=True, text=True, timeout=300
    )
Confidence
95% confidence
Finding
result = subprocess.run( f"/usr/bin/time -l {COMMAND}", shell=True, capture_output=True, text=True, timeout=300 )

subprocess module call

Medium
Category
Dangerous Code Execution
Content
TEST_CMD = "pytest tests/ --tb=no -q"  # Test command
# --- END CONFIG ---

result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300)
output = result.stdout + "\n" + result.stderr

# Try to parse pytest output: "X passed, Y failed, Z errors"
Confidence
97% confidence
Finding
result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
t0 = time.time()
    try:
        with open(log_file, "w") as lf:
            result = subprocess.run(
                eval_cmd, shell=True,
                stdout=lf, stderr=subprocess.STDOUT,
                cwd=str(project_root),
Confidence
98% confidence
Finding
result = subprocess.run( eval_cmd, shell=True, stdout=lf, stderr=subprocess.STDOUT, cwd=str(project_root), timeout=hard_limi

subprocess module call

Medium
Category
Dangerous Code Execution
Content
def run_cmd(cmd, cwd=None, timeout=None):
    """Run shell command, return (returncode, stdout, stderr)."""
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True,
        cwd=cwd, timeout=timeout
    )
Confidence
98% confidence
Finding
result = subprocess.run( cmd, shell=True, capture_output=True, text=True, cwd=cwd, timeout=timeout )

subprocess module call

Medium
Category
Dangerous Code Execution
Content
try:
            # Try to run the script with no arguments (should not crash immediately)
            process = subprocess.run(
                [sys.executable, str(script_path)],
                capture_output=True,
                text=True,
Confidence
96% confidence
Finding
process = subprocess.run( [sys.executable, str(script_path)], capture_output=True, text=True, timeout=self.timeout,

subprocess module call

Medium
Category
Dangerous Code Execution
Content
try:
            # Test --help flag
            process = subprocess.run(
                [sys.executable, str(script_path), '--help'],
                capture_output=True,
                text=True,
Confidence
95% confidence
Finding
process = subprocess.run( [sys.executable, str(script_path), '--help'], capture_output=True, text=True, timeout=self.timeout

subprocess module call

Medium
Category
Dangerous Code Execution
Content
self.log_verbose(f"Testing with sample file: {sample_file.name}")
                
                # Try to run script with the sample file as input
                process = subprocess.run(
                    [sys.executable, str(script_path), str(sample_file)],
                    capture_output=True,
                    text=True,
Confidence
98% confidence
Finding
process = subprocess.run( [sys.executable, str(script_path), str(sample_file)], capture_output=True, text=True,

subprocess module call

Medium
Category
Dangerous Code Execution
Content
# Try running with --json flag if it looks like it supports it
            if '--json' in content:
                try:
                    process = subprocess.run(
                        [sys.executable, str(script_path), '--json', '--help'],
                        capture_output=True,
                        text=True,
Confidence
93% confidence
Finding
process = subprocess.run( [sys.executable, str(script_path), '--json', '--help'], capture_output=True, text=

Lp3

Medium
Category
MCP Least Privilege
Confidence
83% confidence
Finding
The skill advertises and instructs use of components that imply powerful capabilities such as file access, shell usage, network access, and MCP integration, but it does not declare any permissions or capability boundaries in the manifest. This creates a trust and review gap: downstream agents or users may invoke high-risk subskills without clear authorization expectations, increasing the chance of overprivileged execution or unsafe tool use.

Intent-Code Divergence

Medium
Confidence
92% confidence
Finding
The skill gives broad instructions to read config, strategy, results, and git history, then later states the agent must never read or modify files outside the target file and program.md. This contradiction weakens safety boundaries because an autonomous agent may follow the earlier broader instructions and access more repository data than the constraint appears to permit, leading to unintended data exposure or policy bypass.
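
A boundary like "only the target file and program.md" is easier to rely on when it is enforced in code rather than stated in prose. The following is a minimal sketch of such a check. The function name is_path_allowed and the example file name train.py are hypothetical.

```python
from pathlib import Path


def is_path_allowed(candidate, repo_root, allowed_relative):
    """Return True only if candidate resolves to one of the allowed files.

    Paths are resolved first, so "../" segments and absolute paths cannot
    escape the allowlist.
    """
    root = Path(repo_root).resolve()
    allowed = {(root / rel).resolve() for rel in allowed_relative}
    # Joining with an absolute candidate yields the candidate itself,
    # which then fails the membership test unless it is explicitly allowed.
    return (root / candidate).resolve() in allowed
```

For example, with an allowlist of ["train.py", "program.md"], the path "sub/../train.py" is accepted while "../secrets.env" and "/etc/passwd" are rejected.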

Context-Inappropriate Capability

Medium
Confidence
97% confidence
Finding
The setup process executes an arbitrary user-supplied evaluation command during initialization. Because this occurs automatically as part of setup, a malicious or mistaken command can run code, alter files, access secrets available to the user, or perform network actions before the experiment even starts.
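
One possible mitigation is to show the exact command and require explicit approval before setup runs it. The sketch below assumes a hypothetical helper run_if_approved. It illustrates the gating idea and is not how the audited setup script behaves.

```python
import shlex
import subprocess


def run_if_approved(argv, ask=input, timeout=120):
    """Show the command and run it only after an explicit 'y' answer.

    `ask` is injectable so the prompt can be replaced in tests or by a
    non-interactive policy check.
    """
    answer = ask(f"Run {shlex.join(argv)!r}? [y/N] ")
    if answer.strip().lower() != "y":
        return None  # Declined: nothing was executed.
    return subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=False
    )
```

Defaulting to "no" means a stray Enter key or an empty automated answer does not execute anything.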

Intent-Code Divergence

Medium
Confidence
97% confidence
Finding
The file-level documentation promises rollback scripts for all changes, but the implementation does not provide a real rollback for dropped tables. This mismatch is security-relevant because operators may trust the generated plan during production schema changes and discover too late that destructive steps are not recoverable, increasing risk of irreversible data loss during incidents or failed deployments.

Intent-Code Divergence

High
Confidence
98% confidence
Finding
The zero-downtime claim is contradicted by the generated workflow for column modifications, which runs a full-table UPDATE followed directly by DROP COLUMN and RENAME COLUMN. In real deployments these operations can lock large tables, break running application versions, and cause outages. Because the tool still labels the strategy zero-downtime, operators may become dangerously overconfident.
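
For contrast, a column change is commonly staged as an expand/contract sequence: add a new column, backfill in small batches, switch the application, and drop the old column only once nothing uses it. The sketch below generates such a plan. It assumes PostgreSQL-style SQL and an integer primary key named id, and the function name plan_column_type_change is hypothetical. It is an illustration, not the audited tool's output.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_TYPE = re.compile(r"^[A-Za-z][A-Za-z0-9_ (),]*$")


def plan_column_type_change(table, column, new_type, batch_size=1000):
    """Return (phase, statement) pairs for an expand/contract column change."""
    for name in (table, column):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if not _TYPE.match(new_type):
        raise ValueError(f"unsafe type: {new_type!r}")
    new_col = f"{column}_new"
    return [
        ("expand", f"ALTER TABLE {table} ADD COLUMN {new_col} {new_type};"),
        # Repeat this statement until it updates zero rows, so no single
        # transaction rewrites the whole table.
        ("backfill",
         f"UPDATE {table} SET {new_col} = CAST({column} AS {new_type}) "
         f"WHERE id IN (SELECT id FROM {table} "
         f"WHERE {new_col} IS NULL AND {column} IS NOT NULL "
         f"LIMIT {int(batch_size)});"),
        ("switch", f"-- deploy application code that reads and writes {new_col}"),
        # Only after every running application version has stopped using
        # the old column.
        ("contract", f"ALTER TABLE {table} DROP COLUMN {column};"),
    ]
```

Even this staged form can take brief locks, and writes that arrive during the backfill need separate handling (for example dual writes from the application), so a generated plan of this kind is still a draft for expert review.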

Intent-Code Divergence

Medium
Confidence
99% confidence
Finding
The drop-table rollback path returns only a placeholder comment instead of executable recovery SQL, despite surrounding claims that rollback support exists. This makes destructive migrations materially unsafe because a failed or mistaken deployment can leave operators without an actual recovery path after the table is removed.
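
A generator can avoid promising a rollback it cannot deliver by either emitting a reversible step or refusing outright. The sketch below shows both options. The names soft_drop_table, rollback_for_hard_drop, and IrreversibleStepError are hypothetical, and the rename-based "soft drop" is one common technique, not the audited tool's behavior.

```python
class IrreversibleStepError(Exception):
    """Raised when no executable rollback exists for a migration step."""


def soft_drop_table(table, suffix="_dropped"):
    """Return (forward_sql, rollback_sql) that park the table instead of dropping it."""
    parked = f"{table}{suffix}"
    for name in (table, parked):
        if not (name.isascii() and name.isidentifier()):
            raise ValueError(f"unsafe identifier: {name!r}")
    forward = f"ALTER TABLE {table} RENAME TO {parked};"
    rollback = f"ALTER TABLE {parked} RENAME TO {table};"
    return forward, rollback


def rollback_for_hard_drop(table):
    """Fail loudly rather than return a placeholder comment as a 'rollback'."""
    raise IrreversibleStepError(
        f"DROP TABLE {table} cannot be rolled back without a backup"
    )
```

The parked table can be dropped for real in a later, separately reviewed migration once the change has been verified.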

Context-Inappropriate Capability

Medium
Confidence
98% confidence
Finding
The sample dataset includes protected demographic attributes such as gender and ethnicity, along with pedigree-like fields such as university tier, in interview records for a hiring-related skill. Even as sample data, this normalizes collection and downstream use of sensitive attributes in evaluation pipelines, increasing the risk of discriminatory decision-making, biased model behavior, or accidental use in ranking and recommendations.
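
If such records must be processed at all, one partial safeguard is to strip the sensitive fields before they reach any scoring or ranking step. The sketch below assumes the field names gender, ethnicity, and university_tier. These keys are guesses based on the finding's wording, not confirmed column names from the sample dataset.

```python
# Hypothetical field names inferred from the finding, not from the dataset.
PROTECTED_FIELDS = frozenset({"gender", "ethnicity", "university_tier"})


def strip_protected(record, protected=PROTECTED_FIELDS):
    """Return a copy of the record without protected or pedigree-like fields."""
    return {key: value for key, value in record.items() if key not in protected}
```

Removing columns does not remove proxies for them (names, locations, and similar fields can still correlate), so this reduces accidental use rather than guaranteeing fairness.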

Intent-Code Divergence

Medium
Confidence
92% confidence
Finding
The documentation presents conflicting safety guarantees: it describes testing as both 'sandboxed' and 'static analysis only', yet elsewhere describes runtime execution of target scripts. For a tool that validates untrusted skills, this can cause operators to underestimate execution risk and run the tester in insufficiently isolated environments, increasing the chance of arbitrary code execution during validation.
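
When a tester does execute target scripts, it can at least limit what the child process inherits. The sketch below runs a script with a minimal environment, a throwaway working directory, and Python's isolated mode. The function name run_script_reduced_exposure is hypothetical, and a POSIX host is assumed. This reduces exposure of secrets held in environment variables, but it is not a sandbox: the script can still read files and use the network.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script_reduced_exposure(script_path, args=(), timeout=10):
    """Run a Python script with a stripped environment in a scratch directory."""
    script = Path(script_path).resolve()
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            # -I: isolated mode, ignores PYTHON* variables and user site-packages.
            [sys.executable, "-I", str(script), *args],
            capture_output=True, text=True, timeout=timeout,
            cwd=scratch,
            # Replace the inherited environment so tokens and keys are not passed on.
            env={"PATH": "/usr/bin:/bin"},
            check=False,
        )
```

Isolated mode also keeps the script's own directory off sys.path, so scripts that import sibling modules may fail under it. Real isolation for untrusted skills would still call for a container, VM, or disposable host.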

Intent-Code Divergence

Medium
Confidence
94% confidence
Finding
The file presents itself as a training example of bad patterns, but it still contains live HTTP calls to real Stripe, Square, and PayPal endpoints. In a sample-code or demo context, this mismatch is dangerous because a developer may run it assuming it is inert, causing unintended transmission of payment data or accidental interaction with production payment infrastructure.

Intent-Code Divergence

Medium
Confidence
97% confidence
Finding
The refund function explicitly states it does not process refunds, yet it returns success and "Refund initiated." This can create security and integrity issues in payment operations: downstream systems or operators may believe funds were returned when they were not, leading to customer harm, reconciliation failures, and abuse opportunities.
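
A stub that cannot perform the operation should report exactly that. The following sketch shows an honest placeholder. The function name refund_stub, its parameters, and the result fields are hypothetical and do not reproduce the audited file's signature.

```python
def refund_stub(payment_id, amount_cents):
    """Placeholder refund that reports failure instead of claiming success."""
    return {
        "success": False,
        "status": "not_processed",
        "payment_id": payment_id,
        "amount_cents": amount_cents,
        "message": "Refunds are not implemented; no funds were returned.",
    }
```

Callers that branch on the success flag then treat the refund as outstanding, which keeps reconciliation consistent with what actually happened.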

VirusTotal

66/66 vendors flagged this skill as clean.
