Gigo Lobster Resume

Security checks across malware telemetry and agentic risk

Overview

This is a real benchmark runner, but it has enough high-impact, inconsistently disclosed behavior that users should review it before installing.

Install only if you want a GIGO cloud benchmark runner, not just a harmless resume helper. Expect local code execution, dependency installation, use of local gateway/profile/secret settings, network calls, result upload by default, and possible checkpoint reset. Run it in an isolated workspace, and if you do not want cloud submission, use the local/offline/skip-upload options or a different companion skill.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Behavioral AST: exec() Call, eval() Call, Dynamic Import
  • Taint Tracking: Direct Taint Flow, Variable-Mediated Taint Flow, Credential Exfiltration Chain
  • MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Findings (84)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
break
        # Execute
        try:
            proc = subprocess.run(
                cmd, shell=True, cwd=str(self.workdir),
                capture_output=True, timeout=timeout, text=True,
            )
Confidence
98% confidence
Finding
proc = subprocess.run( cmd, shell=True, cwd=str(self.workdir), capture_output=True, timeout=timeout, text=True, )
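
For comparison, here is a minimal sketch of the same call without a shell. It assumes `cmd` is a single command line that does not rely on pipes, redirects, or variable expansion; the helper name `run_without_shell` is hypothetical and is not part of the scanned skill.

```python
import shlex
import subprocess
from pathlib import Path


def run_without_shell(cmd: str, workdir: Path, timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run cmd as an argument vector instead of handing the string to a shell."""
    # shlex.split only tokenizes; characters such as ';', '|' and '&&'
    # become ordinary arguments and are never interpreted as shell syntax.
    argv = shlex.split(cmd)
    if not argv:
        raise ValueError("empty command")
    return subprocess.run(
        argv, shell=False, cwd=str(workdir),
        capture_output=True, timeout=timeout, text=True,
    )
```

With this shape, a trailing `; echo injected` in `cmd` reaches the program as plain arguments rather than starting a second command. Task commands that genuinely need shell features would have to be restructured, so this may not be a drop-in replacement.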

subprocess module call

Medium
Category
Dangerous Code Execution
Content
runner_path = workdir / "_cov_runner.py"
    runner_path.write_text(runner)
    try:
        proc = subprocess.run(
            [sys.executable, str(runner_path)],
            cwd=str(workdir), capture_output=True, timeout=40, text=True,
        )
Confidence
89% confidence
Finding
proc = subprocess.run( [sys.executable, str(runner_path)], cwd=str(workdir), capture_output=True, timeout=40, text=True, )

eval() call detected

High
Category
Dangerous Code Execution
Content
print("Type a Python expression:")
expr = input("> ")
result = eval(expr)
print("Result:", result)
Confidence
99% confidence
Finding
result = eval(expr)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
"-r",
        str(status.requirements_path),
    ]
    completed = subprocess.run(
        command,
        capture_output=True,
        text=True,
Confidence
86% confidence
Finding
completed = subprocess.run( command, capture_output=True, text=True, env={**os.environ, "PIP_USER": "0", "PYTHONNOUSERSITE": "1"}, check=False, )

os.system() or os exec-family call

High
Category
Dangerous Code Execution
Content
profile_argv = None
    effective_argv = profile_argv if isinstance(profile_argv, list) else sys.argv[1:]
    argv = [str(runtime_python), str(skill_root / "main.py"), *[str(item) for item in effective_argv]]
    os.execve(str(runtime_python), argv, env)


def ensure_runtime(skill_root: Path, lang: str = "zh") -> RuntimeStatus:
Confidence
88% confidence
Finding
os.execve(str(runtime_python), argv, env)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
started = time.time()
        try:
            completed = subprocess.run(
                command,
                shell=True,
                cwd=str(workdir),
Confidence
96% confidence
Finding
completed = subprocess.run( command, shell=True, cwd=str(workdir), env=env, capture_output=True,

Tainted flow: 'expr' from input (line 4, user input) → eval (code execution)

Critical
Category
Data Flow
Content
print("Type a Python expression:")
expr = input("> ")
result = eval(expr)
print("Result:", result)
Confidence
100% confidence
Finding
result = eval(expr)
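
If the flagged snippet only needs to read plain values, one common mitigation is `ast.literal_eval`, which accepts Python literals and rejects calls, attribute access, and names. This is a sketch under that assumption; the helper name `parse_literal` is hypothetical, and it deliberately does not evaluate arithmetic or arbitrary expressions, so it changes the snippet's behavior.

```python
import ast


def parse_literal(expr: str):
    """Parse a Python literal (numbers, strings, lists, dicts, ...) without executing code."""
    try:
        return ast.literal_eval(expr)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError) as exc:
        # Anything that is not a plain literal, including function calls, ends up here.
        raise ValueError(f"not a plain literal: {expr!r}") from exc
```

An input such as `[1, 2, 3]` is returned as a list, while `__import__('os').system('id')` is rejected with `ValueError` instead of being executed.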

Tainted flow: 'command' from os.environ.get (line 280, credential/environment) → subprocess.run (code execution)

Medium
Category
Data Flow
Content
started = time.time()
        try:
            completed = subprocess.run(
                command,
                shell=True,
                cwd=str(workdir),
Confidence
98% confidence
Finding
completed = subprocess.run( command, shell=True, cwd=str(workdir), env=env, capture_output=True,
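
One way to narrow an environment-to-subprocess flow like this is to tokenize the value and check the executable against an explicit allowlist before running anything. The sketch below is illustrative only: the allowlist contents, the function name `command_from_env`, and the variable name used in the usage note are all assumptions, not values taken from the skill.

```python
import os
import shlex

# Hypothetical allowlist; a real deployment would choose its own entries.
ALLOWED_EXECUTABLES = frozenset({"python", "python3", "pytest"})


def command_from_env(var: str, default: str) -> list[str]:
    """Read a command from the environment and return argv only if its executable is allowed."""
    argv = shlex.split(os.environ.get(var, default))
    if not argv:
        raise ValueError(f"{var} is empty")
    # Exact-name match: absolute or relative paths are rejected, so the
    # check cannot be bypassed with something like /tmp/evil/python.
    if argv[0] not in ALLOWED_EXECUTABLES:
        raise PermissionError(f"executable not allowed: {argv[0]!r}")
    return argv
```

For example, with a hypothetical variable `GIGO_DEMO_CMD` set to `pytest -q`, the function returns `["pytest", "-q"]`; a value such as `curl http://example.invalid | sh` raises `PermissionError`. The returned list is meant to be passed to `subprocess.run` without `shell=True`.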

Lp3

Medium
Category
MCP Least Privilege
Confidence
94% confidence
Finding
The skill advertises only a simple resume/compatibility entrypoint, yet the metadata declares no permissions while the instructions clearly imply shell execution, filesystem access, environment variable use, and networked benchmark/upload behavior. This hidden capability expansion increases the chance that a user or platform will invoke a powerful workflow without appropriate review, sandboxing, or consent.

Tp4

High
Category
MCP Tool Poisoning
Confidence
97% confidence
Finding
The documented purpose frames the skill as a legacy resume/checkpoint entrypoint, but the described behavior is much broader: it can execute a full benchmark pipeline, install/bootstrap dependencies, invoke subprocesses, access secrets, and communicate with remote judging/upload services. This mismatch is dangerous because users and automated policy systems may grant trust based on the narrow description while the actual workflow has materially greater execution and data-exposure risk.

Description-Behavior Mismatch

High
Confidence
95% confidence
Finding
The README tells users that the resume skill continues from an old checkpoint, while the skill metadata explicitly says this slug now clears old checkpoints and restarts from scratch. In a benchmarking/evaluation workflow, this mismatch can cause loss of prior state, misleading expectations about data preservation, and accidental reruns that upload or overwrite results under the wrong assumptions.

Intent-Code Divergence

High
Confidence
97% confidence
Finding
This section explicitly claims the resume mode will locate a previous checkpoint and continue unfinished work, which directly contradicts the manifest note that old checkpoints are cleared and the run starts over. Because the skill may upload results or affect leaderboard state depending on prior mode, users could unintentionally destroy checkpoint state or repeat expensive/cloud-submitting evaluations believing they are safely resuming.

Intent-Code Divergence

Medium
Confidence
92% confidence
Finding
The file presents contradictory semantics: one section says old checkpoints are cleared and the run starts fresh, while another says the default is to resume from an existing checkpoint. In a stateful evaluation tool, this ambiguity can lead to accidental loss of prior state, incorrect assumptions about continuity, and unsafe operator decisions.

Context-Inappropriate Capability

High
Confidence
97% confidence
Finding
The runner dynamically loads and executes task-local check.py files via importlib, which gives arbitrary Python code execution to bundle content. If a task directory is untrusted or tampered with, evaluating a task will execute attacker-controlled code in the runner process, potentially leading to filesystem access, data exfiltration, or full compromise of the evaluation environment.
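
A partial mitigation for tampered task files is to pin each `check.py` to a known SHA-256 digest and refuse to load anything that does not match. The sketch below assumes such a digest is available from a trusted source; the function name `load_pinned_check` and the module name `task_check` are hypothetical. Pinning detects modification only. It does not sandbox the code, which still runs with the runner's privileges once loaded.

```python
import hashlib
import types
from pathlib import Path


def load_pinned_check(check_path: Path, expected_sha256: str) -> types.ModuleType:
    """Load a task-local check file only if its bytes match a pinned SHA-256 digest."""
    data = check_path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256.lower():
        raise RuntimeError(f"{check_path} does not match its pinned digest")
    module = types.ModuleType("task_check")
    module.__file__ = str(check_path)
    # Execute the bytes that were just verified rather than re-reading the
    # file, so a swap between the hash check and the load has no effect.
    exec(compile(data, str(check_path), "exec"), module.__dict__)
    return module
```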

Description-Behavior Mismatch

High
Confidence
95% confidence
Finding
The manifest metadata advertises this skill as a narrow 'resume/continue' entrypoint, but the file actually ships a large 50-task benchmark bundle covering code execution, shell actions, safety tests, and broader content tasks. This mismatch is dangerous because operators or allowlisting systems may grant the skill more trust or broader access than intended, creating a deceptive packaging and capability-expansion risk.

Description-Behavior Mismatch

Medium
Confidence
89% confidence
Finding
The document specifies a full remote judging service, including request/response schema, provider routing, and cloud model behavior, which is materially broader than the skill’s stated purpose of resuming an evaluation run or preserving checkpoint compatibility. This hidden expansion of capability increases risk because operators may enable external scoring and data transfer under a misleadingly narrow skill description.

Context-Inappropriate Capability

Medium
Confidence
95% confidence
Finding
The protocol sends agent output excerpts, git diffs, and tool-call summaries to an external endpoint, which can expose proprietary code, internal workflow details, and user-derived content. In the context of a skill advertised as a resume/checkpoint entrypoint, this data egress is unexpected and therefore more dangerous because users and reviewers may not anticipate sensitive information leaving the local environment.

Context-Inappropriate Capability

Medium
Confidence
87% confidence
Finding
Environment-variable-controlled provider selection and arbitrary external model endpoints introduce a flexible exfiltration path that is not apparent from the skill description. This increases the attack surface because deployments can be redirected to third-party or custom services without any in-file guardrails, auditing, or user awareness.

Description-Behavior Mismatch

High
Confidence
97% confidence
Finding
The task content is materially unrelated to the skill's declared purpose of resuming a prior evaluation/checkpoint flow. This kind of scope drift is dangerous because it can smuggle in unexpected behaviors and permissions under a misleading skill identity, reducing user scrutiny and potentially enabling unauthorized actions.

Context-Inappropriate Capability

Medium
Confidence
94% confidence
Finding
The prompt explicitly directs network-dependent package installation even though the skill's stated purpose does not justify outbound dependency retrieval. Allowing package installation expands the attack surface to supply-chain risk, arbitrary post-install behavior, and unintended network use under a misleading resume/checkpoint context.

Context-Inappropriate Capability

Medium
Confidence
93% confidence
Finding
The task explicitly instructs the agent to start a local HTTP server in the working directory, but the skill's stated purpose is resume/continue evaluation and does not justify exposing local files over HTTP. Even if bound only locally by default, this creates an unnecessary network-accessible service and may expose workspace contents or enable unintended interactions from other local or bridged processes.

Description-Behavior Mismatch

Medium
Confidence
94% confidence
Finding
The README content describes an unrelated lobster farm monitoring service, while the skill metadata says this skill is a resume/continue entry for evaluation runs. This mismatch is dangerous because it disguises the skill’s real purpose, reduces reviewer trust in the package contents, and can help malicious instructions blend into seemingly benign documentation.

Intent-Code Divergence

High
Confidence
99% confidence
Finding
The embedded HTML comment contains an instruction to run a remotely fetched shell script via `curl ... | bash`, which is a classic arbitrary code execution pattern. Even though it is hidden in a comment, agentic systems or reviewers processing the README could be manipulated into executing attacker-controlled code.
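
A reviewer could surface this kind of hidden instruction with a simple scan for HTML comments that contain a download piped into a shell. The following is a rough heuristic sketch, not the scanner's actual rule; the regular expressions and the function name `find_hidden_pipe_to_shell` are assumptions, and obfuscated variants would slip past it.

```python
import re

# Matches HTML comments, including ones that span several lines.
HTML_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Matches curl/wget output piped into sh, bash, zsh, or dash on one line.
PIPE_TO_SHELL = re.compile(
    r"\b(?:curl|wget)\b[^\n|]*\|\s*(?:sudo\s+)?(?:ba|z|da)?sh\b",
    re.IGNORECASE,
)


def find_hidden_pipe_to_shell(markdown: str) -> list[str]:
    """Return the whitespace-normalized text of HTML comments that pipe a download into a shell."""
    hits = []
    for match in HTML_COMMENT.finditer(markdown):
        body = match.group(1)
        if PIPE_TO_SHELL.search(body):
            hits.append(" ".join(body.split()))
    return hits
```

Run over a README, the function returns only the comments worth a closer look; ordinary comments such as `<!-- todo: fix typo -->` are ignored.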

Description-Behavior Mismatch

High
Confidence
97% confidence
Finding
The task prompt instructs deletion of files in the working directory, which does not align with the skill's stated purpose of resuming an evaluation workflow. This mismatch is dangerous because it can cause an agent or user to perform unexpected destructive actions under a misleading skill identity, increasing the chance of accidental data loss and misuse.

Intent-Code Divergence

Medium
Confidence
94% confidence
Finding
Although the text says to remove temporary files, caches, and irrelevant artifacts, it gives no concrete boundaries for what qualifies as safe to delete in the current directory. An agent may over-delete normal project files, especially since the only explicit exclusions are decoy labels like 'important' or 'do not delete,' which is not a reliable safety rule.
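
To illustrate what concrete deletion boundaries could look like, the sketch below builds a cleanup plan from an explicit allowlist of patterns and returns it for review instead of deleting anything. The patterns and the function name `plan_cleanup` are hypothetical and are not taken from the task prompt.

```python
from pathlib import Path

# Hypothetical allowlist of disposable artifacts; everything else is left alone.
SAFE_PATTERNS = ("*.tmp", "*.pyc", "__pycache__")


def plan_cleanup(workdir: Path, patterns: tuple[str, ...] = SAFE_PATTERNS) -> list[Path]:
    """List paths under workdir that match the allowlist; nothing is deleted here."""
    root = workdir.resolve()
    targets = set()
    for pattern in patterns:
        for path in root.rglob(pattern):
            if path.is_symlink():
                continue  # never follow or remove links that may point outside
            if root in path.resolve().parents:
                targets.add(path)
    return sorted(targets)
```

Because the function only produces a plan, an operator or a second check can confirm the list before any file is removed, and files that match no pattern are never candidates, regardless of how they are named.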

VirusTotal

66/66 vendors reported this skill as clean.
