Gigo Lobster Resume

Security checks across malware telemetry and agentic risk

Overview

This is a real benchmark runner, but it has enough high-impact, inconsistently disclosed behavior that users should review it before installing.

Install only if you want a GIGO cloud benchmark runner, not just a harmless resume helper. Expect local code execution, dependency installation, use of local gateway/profile/secret settings, network calls, result upload by default, and possible checkpoint reset; run it in an isolated workspace and use local/offline/skip-upload options or a different companion skill if you do not want cloud submission.

SkillSpector

By NVIDIA

Vulnerability Patterns

Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Behavioral AST: exec() Call, eval() Call, Dynamic Import
Taint Tracking: Direct Taint Flow, Variable-Mediated Taint Flow, Credential Exfiltration Chain
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection

Findings (84)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: break # execute try: proc = subprocess.run( cmd, shell=True, cwd=str(self.workdir), capture_output=True, timeout=timeout, text=True, )
Confidence: 98%
Finding: proc = subprocess.run( cmd, shell=True, cwd=str(self.workdir), capture_output=True, timeout=timeout, text=True, )
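
One common way to narrow this class of finding is to drop shell=True and pass an argument list. The following is a minimal sketch, assuming the command arrives as a single string; run_without_shell is a hypothetical helper and is not part of the scanned skill.

```python
import shlex
import subprocess
import sys


def run_without_shell(cmd: str, workdir: str, timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run a command string without invoking a shell (hypothetical helper).

    shlex.split tokenizes the string with POSIX rules, so metacharacters
    such as ';' or '|' reach the program as literal argument text instead
    of being interpreted by a shell.
    """
    argv = shlex.split(cmd)
    if not argv:
        raise ValueError("empty command")
    return subprocess.run(
        argv,
        shell=False,
        cwd=workdir,
        capture_output=True,
        timeout=timeout,
        text=True,
    )


if __name__ == "__main__":
    # The quoted "a; echo b" is passed through as one argument.
    demo = f'"{sys.executable}" -c "import sys; print(sys.argv[1])" "a; echo b"'
    print(run_without_shell(demo, workdir=".").stdout.strip())
```

This only removes shell interpretation. It does not decide whether the program itself should be allowed to run.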

subprocess module call

Medium

Category: Dangerous Code Execution
Content: runner_path = workdir / "_cov_runner.py" runner_path.write_text(runner) try: proc = subprocess.run( [sys.executable, str(runner_path)], cwd=str(workdir), capture_output=True, timeout=40, text=True, )
Confidence: 89%
Finding: proc = subprocess.run( [sys.executable, str(runner_path)], cwd=str(workdir), capture_output=True, timeout=40, text=True, )

eval() call detected

High

Category: Dangerous Code Execution
Content: print("Type a Python expression:") expr = input("> ") result = eval(expr) print("Result:", result)
Confidence: 99%
Finding: result = eval(expr)
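
Where the input only needs to be a literal value, ast.literal_eval is a frequently used replacement for eval(). The sketch below assumes that use case; parse_literal is a hypothetical name. It is not a drop-in substitute, since general expressions such as function calls are rejected.

```python
import ast


def parse_literal(expr: str):
    """Parse a Python literal without executing code (hypothetical helper).

    ast.literal_eval accepts literals such as numbers, strings, tuples,
    lists, dicts, sets, booleans and None. It raises ValueError or
    SyntaxError for anything else, including function calls.
    """
    return ast.literal_eval(expr)


if __name__ == "__main__":
    print(parse_literal("[1, 2, 3]"))
    try:
        parse_literal("__import__('os').system('id')")
    except ValueError as exc:
        print("rejected:", exc)
```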

subprocess module call

Medium

Category: Dangerous Code Execution
Content: "-r", str(status.requirements_path), ] completed = subprocess.run( command, capture_output=True, text=True,
Confidence: 86%
Finding: completed = subprocess.run( command, capture_output=True, text=True, env={**os.environ, "PIP_USER": "0", "PYTHONNOUSERSITE": "1"}, check=False, )

os.system() or os exec-family call

High

Category: Dangerous Code Execution
Content: profile_argv = None effective_argv = profile_argv if isinstance(profile_argv, list) else sys.argv[1:] argv = [str(runtime_python), str(skill_root / "main.py"), *[str(item) for item in effective_argv]] os.execve(str(runtime_python), argv, env) def ensure_runtime(skill_root: Path, lang: str = "zh") -> RuntimeStatus:
Confidence: 88%
Finding: os.execve(str(runtime_python), argv, env)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: started = time.time() try: completed = subprocess.run( command, shell=True, cwd=str(workdir),
Confidence: 96%
Finding: completed = subprocess.run( command, shell=True, cwd=str(workdir), env=env, capture_output=True,

Tainted flow: 'expr' from input (line 4, user input) → eval (code execution)

Critical

Category: Data Flow
Content: print("Type a Python expression:") expr = input("> ") result = eval(expr) print("Result:", result)
Confidence: 100%
Finding: result = eval(expr)

Tainted flow: 'command' from os.environ.get (line 280, credential/environment) → subprocess.run (code execution)

Medium

Category: Data Flow
Content: started = time.time() try: completed = subprocess.run( command, shell=True, cwd=str(workdir),
Confidence: 98%
Finding: completed = subprocess.run( command, shell=True, cwd=str(workdir), env=env, capture_output=True,
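
A taint flow from the environment into a subprocess call is often mitigated by validating the value before use. The sketch below assumes an allowlist of executable names is acceptable for the workflow. The variable name EXAMPLE_CHECK_CMD, the allowlist contents, and command_from_env are all hypothetical and are not taken from the skill.

```python
import os
import shlex

# Hypothetical allowlist; a real deployment would choose its own entries.
ALLOWED_EXECUTABLES = {"pytest", "python", "python3"}


def command_from_env(var: str, default: str) -> list[str]:
    """Read a command from the environment and validate it (hypothetical helper).

    Returns an argument list suitable for subprocess.run(..., shell=False).
    The first token must match an allowlisted name exactly, so values that
    carry a path (for example '/tmp/x/python') are rejected too.
    """
    argv = shlex.split(os.environ.get(var, default))
    if not argv:
        raise ValueError(f"{var} resolved to an empty command")
    if argv[0] not in ALLOWED_EXECUTABLES:
        raise PermissionError(f"executable not allowlisted: {argv[0]!r}")
    return argv


if __name__ == "__main__":
    os.environ["EXAMPLE_CHECK_CMD"] = "pytest -q tests"
    print(command_from_env("EXAMPLE_CHECK_CMD", "pytest"))
```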

Lp3

Medium

Category: MCP Least Privilege
Confidence: 94%
Finding: The skill advertises only a simple resume/compatibility entrypoint, yet the metadata declares no permissions while the instructions clearly imply shell execution, filesystem access, environment variable use, and networked benchmark/upload behavior. This hidden capability expansion increases the chance that a user or platform will invoke a powerful workflow without appropriate review, sandboxing, or consent.

Tp4

High

Category: MCP Tool Poisoning
Confidence: 97%
Finding: The documented purpose frames the skill as a legacy resume/checkpoint entrypoint, but the described behavior is much broader: it can execute a full benchmark pipeline, install/bootstrap dependencies, invoke subprocesses, access secrets, and communicate with remote judging/upload services. This mismatch is dangerous because users and automated policy systems may grant trust based on the narrow description while the actual workflow has materially greater execution and data-exposure risk.

Description-Behavior Mismatch

High

Confidence: 95%
Finding: The README tells users that the resume skill continues from an old checkpoint, while the skill metadata explicitly says this slug now clears old checkpoints and restarts from scratch. In a benchmarking/evaluation workflow, this mismatch can cause loss of prior state, misleading expectations about data preservation, and accidental reruns that upload or overwrite results under the wrong assumptions.

Intent-Code Divergence

High

Confidence: 97%
Finding: This section explicitly claims the resume mode will locate a previous checkpoint and continue unfinished work, which directly contradicts the manifest note that old checkpoints are cleared and the run starts over. Because the skill may upload results or affect leaderboard state depending on prior mode, users could unintentionally destroy checkpoint state or repeat expensive/cloud-submitting evaluations believing they are safely resuming.

Intent-Code Divergence

Medium

Confidence: 92%
Finding: The file presents contradictory semantics: one section says old checkpoints are cleared and the run starts fresh, while another says the default is to resume from an existing checkpoint. In a stateful evaluation tool, this ambiguity can lead to accidental loss of prior state, incorrect assumptions about continuity, and unsafe operator decisions.
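
Because the documentation and manifest disagree on whether checkpoints are preserved, an operator may want to copy existing state aside before running the skill. The following is a minimal sketch under that assumption. The report does not say where the checkpoint lives, so the directory paths are placeholders, and snapshot_checkpoint is a hypothetical helper.

```python
import shutil
import time
from pathlib import Path


def snapshot_checkpoint(checkpoint_dir: Path, backup_root: Path) -> Path:
    """Copy a checkpoint directory to a timestamped backup (hypothetical helper).

    Returns the path of the new copy. shutil.copytree creates the
    destination (and any missing parents) and fails if it already exists,
    so an earlier snapshot is never silently overwritten.
    """
    if not checkpoint_dir.is_dir():
        raise FileNotFoundError(f"no checkpoint directory at {checkpoint_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = backup_root / f"{checkpoint_dir.name}-{stamp}"
    shutil.copytree(checkpoint_dir, dest)
    return dest
```

A call such as snapshot_checkpoint(Path("checkpoint"), Path("backups")) would be made once before the run, with both paths adjusted to the actual workspace layout.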

Context-Inappropriate Capability

High

Confidence: 97%
Finding: The runner dynamically loads and executes task-local check.py files via importlib, which gives arbitrary Python code execution to bundle content. If a task directory is untrusted or tampered with, evaluating a task will execute attacker-controlled code in the runner process, potentially leading to filesystem access, data exfiltration, or full compromise of the evaluation environment.
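
One way to reduce the tampering risk described here is to verify a pinned digest before loading a task's check.py. The sketch below assumes such digests are available from a trusted source; load_check_module and the module name task_check are hypothetical. It does not sandbox the loaded code, and a file swapped between the hash check and the import would still run.

```python
import hashlib
import importlib.util
from pathlib import Path


def load_check_module(check_path: Path, expected_sha256: str):
    """Import a check script only if its SHA-256 matches a pinned value.

    Hypothetical helper: the expected digest is assumed to come from a
    manifest that the operator trusts.
    """
    digest = hashlib.sha256(check_path.read_bytes()).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"digest mismatch for {check_path}: {digest}")
    spec = importlib.util.spec_from_file_location("task_check", check_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {check_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the verified file
    return module
```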

Description-Behavior Mismatch

High

Confidence: 95%
Finding: The manifest metadata advertises this skill as a narrow 'resume/continue' entrypoint, but the file actually ships a large 50-task benchmark bundle covering code execution, shell actions, safety tests, and broader content tasks. This mismatch is dangerous because operators or allowlisting systems may grant the skill more trust or broader access than intended, creating a deceptive packaging and capability-expansion risk.

Description-Behavior Mismatch

Medium

Confidence: 89%
Finding: The document specifies a full remote judging service, including request/response schema, provider routing, and cloud model behavior, which is materially broader than the skill’s stated purpose of resuming an evaluation run or preserving checkpoint compatibility. This hidden expansion of capability increases risk because operators may enable external scoring and data transfer under a misleadingly narrow skill description.

Context-Inappropriate Capability

Medium

Confidence: 95%
Finding: The protocol sends agent output excerpts, git diffs, and tool-call summaries to an external endpoint, which can expose proprietary code, internal workflow details, and user-derived content. In the context of a skill advertised as a resume/checkpoint entrypoint, this data egress is unexpected and therefore more dangerous because users and reviewers may not anticipate sensitive information leaving the local environment.

Context-Inappropriate Capability

Medium

Confidence: 87%
Finding: Environment-variable-controlled provider selection and arbitrary external model endpoints introduce a flexible exfiltration path that is not apparent from the skill description. This increases the attack surface because deployments can be redirected to third-party or custom services without any in-file guardrails, auditing, or user awareness.

Description-Behavior Mismatch

High

Confidence: 97%
Finding: The task content is materially unrelated to the skill's declared purpose of resuming a prior evaluation/checkpoint flow. This kind of scope drift is dangerous because it can smuggle in unexpected behaviors and permissions under a misleading skill identity, reducing user scrutiny and potentially enabling unauthorized actions.

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: The prompt explicitly directs network-dependent package installation even though the skill's stated purpose does not justify outbound dependency retrieval. Allowing package installation expands the attack surface to supply-chain risk, arbitrary post-install behavior, and unintended network use under a misleading resume/checkpoint context.

Context-Inappropriate Capability

Medium

Confidence: 93%
Finding: The task explicitly instructs the agent to start a local HTTP server in the working directory, but the skill's stated purpose is resume/continue evaluation and does not justify exposing local files over HTTP. Even if bound only locally by default, this creates an unnecessary network-accessible service and may expose workspace contents or enable unintended interactions from other local or bridged processes.
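
If a task genuinely needs a local HTTP server, the exposure can be limited by binding explicitly to the loopback address and serving a dedicated directory rather than the whole workspace. The following is a minimal sketch using the standard library; make_loopback_server is a hypothetical helper, and the served directory is an assumption.

```python
import functools
import http.server
from pathlib import Path


def make_loopback_server(directory: Path, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Create an HTTP server bound to 127.0.0.1 only (hypothetical helper).

    port=0 asks the OS for a free port. The handler serves files from
    `directory` instead of the process's current working directory.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler,
        directory=str(directory),
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

The caller would run server.serve_forever() in a thread and call server.shutdown() and server.server_close() when finished.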

Description-Behavior Mismatch

Medium

Confidence: 94%
Finding: The README content describes an unrelated lobster farm monitoring service, while the skill metadata says this skill is a resume/continue entry for evaluation runs. This mismatch is dangerous because it disguises the skill’s real purpose, reduces reviewer trust in the package contents, and can help malicious instructions blend into seemingly benign documentation.

Intent-Code Divergence

High

Confidence: 99%
Finding: The embedded HTML comment contains an instruction to run a remotely fetched shell script via `curl ... | bash`, which is a classic arbitrary code execution pattern. Even though it is hidden in a comment, agentic systems or reviewers processing the README could be manipulated into executing attacker-controlled code.
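
A reviewer can surface this pattern mechanically before an agent reads the file. The sketch below is a simple heuristic, not the scanner's actual rule: it extracts HTML comments and flags any that pipe curl or wget output into a shell. The function name and regular expressions are illustrative assumptions and will miss obfuscated variants.

```python
import re

# Matches the body of an HTML comment, including multi-line comments.
HTML_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)

# Heuristic: a download tool followed on the same line by a pipe into a shell.
PIPE_TO_SHELL = re.compile(
    r"\b(?:curl|wget)\b[^\n|]*\|\s*(?:sudo\s+)?(?:ba|z|da)?sh\b",
    re.IGNORECASE,
)


def find_hidden_pipe_to_shell(markdown: str) -> list[str]:
    """Return the text of HTML comments that contain a pipe-to-shell command."""
    hits = []
    for match in HTML_COMMENT.finditer(markdown):
        body = match.group(1)
        if PIPE_TO_SHELL.search(body):
            hits.append(body.strip())
    return hits
```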

Description-Behavior Mismatch

High

Confidence: 97%
Finding: The task prompt instructs deletion of files in the working directory, which does not align with the skill's stated purpose of resuming an evaluation workflow. This mismatch is dangerous because it can cause an agent or user to perform unexpected destructive actions under a misleading skill identity, increasing the chance of accidental data loss and misuse.

Intent-Code Divergence

Medium

Confidence: 94%
Finding: Although the text says to remove temporary files, caches, and irrelevant artifacts, it gives no concrete boundaries for what qualifies as safe to delete in the current directory. An agent may over-delete normal project files, especially since the only explicit exclusions are decoy labels like 'important' or 'do not delete,' which is not a reliable safety rule.
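
The missing boundary can be made explicit by restricting cleanup to a fixed set of patterns and producing a plan before anything is removed. The following is a minimal sketch under that assumption; the pattern list and plan_cleanup are hypothetical, and the function only lists candidates without deleting them.

```python
from pathlib import Path

# Hypothetical allowlist of disposable artifacts; anything else is left alone.
DELETABLE_PATTERNS = ("*.tmp", "*.pyc", "__pycache__")


def plan_cleanup(root: Path, patterns: tuple[str, ...] = DELETABLE_PATTERNS) -> list[Path]:
    """List paths under `root` that match the allowlist, without deleting them.

    Symlinks are skipped so the plan never points outside the tree through
    a link. The caller reviews the result and deletes in a separate step.
    """
    root = root.resolve()
    candidates = set()
    for pattern in patterns:
        for path in root.rglob(pattern):
            if path.is_symlink():
                continue
            candidates.add(path)
    return sorted(candidates)
```

Separating the plan from the deletion step gives a user or supervising process a point at which to confirm the list.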

VirusTotal

66/66 vendors flagged this skill as clean.
