Gigo Lobster Taster

Security checks across malware telemetry and agentic risk

Overview

This is a coherent benchmark skill, but it needs Review because it combines default cloud upload, local code execution, package installation, and automatic loading of workspace secrets with limited permission scoping.

Install only if you are comfortable running a benchmark that may execute local tests and package commands, create caches/workdirs, read OpenClaw-related profile or secrets files, and upload detailed task results to the GIGO API by default. Prefer a local/offline mode where available if you do not want cloud submission, and avoid placing unrelated secrets in workspace-level secrets.env files before running.

SkillSpector

By NVIDIA

Vulnerability Patterns

Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Behavioral AST: exec() Call, eval() Call, Dynamic Import
Taint Tracking: Direct Taint Flow, Variable-Mediated Taint Flow, Credential Exfiltration Chain
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection

Findings (67)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: break # execute try: proc = subprocess.run( cmd, shell=True, cwd=str(self.workdir), capture_output=True, timeout=timeout, text=True, )
Confidence: 96%
Finding: proc = subprocess.run( cmd, shell=True, cwd=str(self.workdir), capture_output=True, timeout=timeout, text=True, )
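The `shell=True` pattern flagged above can be contrasted with an argument-list call. The following is a minimal sketch, not the skill's code: `run_without_shell` and the sample command are hypothetical, and it assumes the command can be tokenized with POSIX shell rules.

```python
import shlex
import subprocess
import sys


def run_without_shell(cmd: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run a command string as an argument list so no shell interprets it."""
    argv = shlex.split(cmd)  # tokenize only; no shell process is started
    return subprocess.run(argv, capture_output=True, timeout=timeout, text=True)


# Hypothetical command string. With shell=True the "; echo injected" tail would
# run as a second command; here it reaches the child as plain arguments.
cmd = f'{shlex.quote(sys.executable)} -c "import sys; print(sys.argv[1:])" ; echo injected'
proc = run_without_shell(cmd)
print(proc.stdout.strip())
```

Passing a list keeps metacharacters such as `;`, `|`, and `$()` inert, which matters most when any part of the command string comes from task files or the environment.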

subprocess module call

Medium

Category: Dangerous Code Execution
Content: runner_path = workdir / "_cov_runner.py" runner_path.write_text(runner) try: proc = subprocess.run( [sys.executable, str(runner_path)], cwd=str(workdir), capture_output=True, timeout=40, text=True, )
Confidence: 95%
Finding: proc = subprocess.run( [sys.executable, str(runner_path)], cwd=str(workdir), capture_output=True, timeout=40, text=True, )

eval() call detected

High

Category: Dangerous Code Execution
Content: print("Type a Python expression:") expr = input("> ") result = eval(expr) print("Result:", result)
Confidence: 99%
Finding: result = eval(expr)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: "-r", str(status.requirements_path), ] completed = subprocess.run( command, capture_output=True, text=True,
Confidence: 86%
Finding: completed = subprocess.run( command, capture_output=True, text=True, env={**os.environ, "PIP_USER": "0", "PYTHONNOUSERSITE": "1"}, check=False, )

os.system() or os exec-family call

High

Category: Dangerous Code Execution
Content: profile_argv = None effective_argv = profile_argv if isinstance(profile_argv, list) else sys.argv[1:] argv = [str(runtime_python), str(skill_root / "main.py"), *[str(item) for item in effective_argv]] os.execve(str(runtime_python), argv, env) def ensure_runtime(skill_root: Path, lang: str = "zh") -> RuntimeStatus:
Confidence: 88%
Finding: os.execve(str(runtime_python), argv, env)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: started = time.time() try: completed = subprocess.run( command, shell=True, cwd=str(workdir),
Confidence: 97%
Finding: completed = subprocess.run( command, shell=True, cwd=str(workdir), env=env, capture_output=True,

Tainted flow: 'expr' from input (line 4, user input) → eval (code execution)

Critical

Category: Data Flow
Content: print("Type a Python expression:") expr = input("> ") result = eval(expr) print("Result:", result)
Confidence: 100%
Finding: result = eval(expr)
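For comparison with the tainted `input` to `eval` flow above, the sketch below accepts only Python literals through the standard-library `ast.literal_eval`. The `evaluate_literal` helper is hypothetical and is not part of the scanned bundle; it covers literals only, not arithmetic or function calls.

```python
import ast


def evaluate_literal(expr: str):
    """Parse a Python literal; reject anything that would require execution."""
    try:
        return ast.literal_eval(expr)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"not a plain literal: {expr!r}") from exc


print(evaluate_literal("[1, 2, 3]"))

try:
    evaluate_literal("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", err)
```

A literal parser is much narrower than `eval`, so it only fits cases where the expected input is data (numbers, strings, lists, dicts) rather than code.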

Tainted flow: 'command' from os.environ.get (line 280, credential/environment) → subprocess.run (code execution)

Medium

Category: Data Flow
Content: started = time.time() try: completed = subprocess.run( command, shell=True, cwd=str(workdir),
Confidence: 99%
Finding: completed = subprocess.run( command, shell=True, cwd=str(workdir), env=env, capture_output=True,

Lp3

Medium

Category: MCP Least Privilege
Confidence: 89%
Finding: The skill advertises and instructs execution of a wrapper that can access environment variables, read/write files, invoke shell commands, and use the network, yet it declares no permissions or user-facing consent boundaries. That mismatch prevents informed consent and weakens policy enforcement, especially because the same document also describes default cloud upload and leaderboard publication.

Tp4

High

Category: MCP Tool Poisoning
Confidence: 86%
Finding: The description frames the skill as a simple 'lobster tasting' workflow, but the documented behavior appears much broader: benchmarking, bootstrapping, environment loading, backend API interaction, artifact generation, and session management. This kind of description-behavior mismatch can mislead users into authorizing a much more invasive workflow than they reasonably expect.

Description-Behavior Mismatch

High

Confidence: 99%
Finding: The manifest metadata advertises a narrow lobster-tasting/evaluation skill with cloud upload and leaderboard behavior, but the actual bundle defines a broad benchmark harness containing 50 unrelated coding, shell, planning, writing, and safety tasks. This is a strong capability mismatch that can mislead reviewers and users about what the skill is allowed to do, masking far more powerful behaviors than the declared purpose suggests.

Context-Inappropriate Capability

High

Confidence: 98%
Finding: The bundle exposes many capabilities unrelated to the stated skill purpose, including code modification, shell execution, networked package installation, prompt-injection exercises, translation, and business writing. In context, this unjustified breadth materially increases attack surface and enables abuse under the cover of an innocuous skill identity.

Description-Behavior Mismatch

High

Confidence: 98%
Finding: The task behavior is materially unrelated to the skill’s declared purpose of lobster tasting, cloud upload, personal results pages, and leaderboard entry. This kind of capability/intent mismatch is dangerous because it can conceal unexpected actions from users and reviewers, making it easier to smuggle in arbitrary development or execution steps under an unrelated skill description.

Context-Inappropriate Capability

High

Confidence: 97%
Finding: The skill instructs installing an npm package and executing a local Node.js script even though those privileges are not justified by the advertised lobster-tasting use case. Unnecessary package installation and code execution expand the attack surface, can alter the environment, and may enable supply-chain or arbitrary-code risks if a dependency or script is changed later.

Context-Inappropriate Capability

High

Confidence: 98%
Finding: The prompt explicitly instructs the agent to run a local script from the working directory, which is a dangerous capability because repository-local files are untrusted input and may execute arbitrary code. In the context of a lobster tasting/evaluation skill, executing `dangerous.py` is unrelated to the stated purpose, making this especially suspicious and increasing the likelihood of arbitrary code execution or environment compromise.

Description-Behavior Mismatch

High

Confidence: 97%
Finding: The file's behavior does not match the declared lobster-tasting functionality and instead implements an interactive Python evaluator. This mismatch is dangerous because it suggests hidden or deceptive capability, increasing the likelihood that arbitrary code execution was intentionally embedded under an unrelated skill description.

Context-Inappropriate Capability

Critical

Confidence: 100%
Finding: Arbitrary code execution via eval() is not necessary for a lobster tasting or evaluation workflow, so its presence is highly suspicious and materially increases risk. In this skill context, the unjustified use of eval makes exploitation more dangerous because users and reviewers would not expect interpreter behavior from the advertised functionality.

Intent-Code Divergence

Medium

Confidence: 92%
Finding: The docstring openly states that the tool evaluates user input as Python expressions, which contradicts the manifest's stated purpose. While the docstring itself is not the exploit, this discrepancy is a strong indicator of deceptive implementation and supports the conclusion that the dangerous behavior is intentional or at least knowingly unrelated.

Context-Inappropriate Capability

Medium

Confidence: 96%
Finding: The helper enumerates and loads external `secrets.env` files from workspace-related locations that are not necessary for a lobster-tasting skill's stated purpose. This grants the skill ambient access to credentials and secrets from the broader execution environment, increasing the chance of unauthorized use, leakage, or downstream exfiltration by the imported runtime.
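One way to narrow the ambient access described above is to load only an explicit allowlist of keys from a single named file. This is a sketch under stated assumptions: `load_allowed_secrets`, the key name `BENCHMARK_API_KEY`, and the simple `KEY=VALUE` file format are all hypothetical and do not reflect the skill's actual loader.

```python
from pathlib import Path
from typing import Dict, Set


def load_allowed_secrets(path: Path, allowed: Set[str]) -> Dict[str, str]:
    """Return only allowlisted KEY=VALUE pairs from a dotenv-style file."""
    loaded: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in allowed:  # every other key in the file is ignored
            loaded[key] = value.strip()
    return loaded
```

For example, `load_allowed_secrets(Path("secrets.env"), {"BENCHMARK_API_KEY"})` would return that one key if present and leave unrelated credentials in the same file unread by the caller.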

Context-Inappropriate Capability

Medium

Confidence: 92%
Finding: This block creates a virtual environment and installs packages dynamically, which materially exceeds the expected behavior of a tasting/evaluation skill. The capability enables network-backed code acquisition and execution in the local user context, making supply-chain abuse or unintended package execution possible.

Context-Inappropriate Capability

Medium

Confidence: 90%
Finding: Re-entering the program with a different interpreter is an unnecessary capability escalation for the skill's advertised function. In context, it makes the runtime harder to audit and can be combined with the bootstrap path to execute code in a newly prepared environment outside normal expectations.

Intent-Code Divergence

Low

Confidence: 84%
Finding: The user-facing messages describe limited certificate/report preparation, but the code installs a broader set of packages including pytest and pytest-json-report. This mismatch reduces transparency and may mislead users or reviewers about the actual capability being introduced, which is especially concerning in code that already self-provisions dependencies.

Context-Inappropriate Capability

Medium

Confidence: 96%
Finding: This code uploads detailed per-task responses, status, errors, timing, token usage, and identifiers to a remote API. That is materially broader than a simple score upload and can expose user prompts, model outputs, and operational metadata, creating privacy and data-minimization risks if users did not explicitly consent to full submission telemetry.
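The data-minimization concern above can be illustrated by filtering each per-task record down to an allowlist of fields before anything is serialized for upload. The field names (`task_id`, `score`, `status`, `response`, `tokens`) and the `minimize_result` helper are assumptions for illustration only; the real payload schema is not shown in this report.

```python
from typing import Any, Dict, Iterable


def minimize_result(
    result: Dict[str, Any],
    keep: Iterable[str] = ("task_id", "score", "status"),
) -> Dict[str, Any]:
    """Keep only allowlisted fields; raw responses and metadata are dropped."""
    return {key: result[key] for key in keep if key in result}


# Hypothetical per-task record as a benchmark runner might hold it in memory.
full_record = {
    "task_id": "t01",
    "score": 0.8,
    "status": "passed",
    "response": "full model output ...",
    "tokens": 1234,
}
print(minimize_result(full_record))
```

An allowlist fails closed: a field added to the record later is not uploaded until someone deliberately adds it to `keep`.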

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: In v2 mode, the skill sends a full run report built from scores, raw results, config, and upload mode to the server, which likely expands data collection beyond leaderboard publication or final evaluation. Because the manifest describes tasting/evaluating and publishing results, this broader reporting path increases the risk of undisclosed collection of submission content and execution metadata.

Context-Inappropriate Capability

Medium

Confidence: 96%
Finding: The parser searches far beyond the current repository, including environment-defined roots, cwd ancestors, home-directory locations, and sibling workspaces, then reads the first matching SOUL.md/IDENTITY.md it finds. This can unintentionally ingest unrelated personal or workspace data and, in this skill's context, is more dangerous because the skill description explicitly says results are uploaded to the cloud and used for personal result pages/leaderboards, creating a plausible path for cross-project data exposure.
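By contrast with the wide search described above, a lookup can be confined to one explicitly supplied root. The sketch below is hypothetical (`find_profile` is not the skill's parser) and assumes Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_profile(
    root: Path,
    names: Sequence[str] = ("SOUL.md", "IDENTITY.md"),
) -> Optional[Path]:
    """Return the first profile file located directly inside root, else None."""
    root = root.resolve()
    for name in names:
        candidate = (root / name).resolve()  # follows symlinks
        # Reject files whose real location falls outside the root.
        if candidate.is_file() and candidate.is_relative_to(root):
            return candidate
    return None
```

Because the function never walks to parent directories, the home directory, or sibling workspaces, a profile belonging to an unrelated project cannot be picked up by accident.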

VirusTotal

65/65 vendors flagged this skill as clean.
