Data Cleaner

Security checks across malware telemetry and agentic risk

Overview

This data-cleaning skill appears legitimate, but it can send dataset-derived content to AI providers and Feishu despite claiming processing is local.

Review before installing. Use only with data you are authorized to share, and treat AI classification, AI field identification, Feishu output, and Pro report generation as potential external disclosure paths. Prefer local-only runs with no AI key and no Feishu modules available unless the developer adds clear opt-in prompts, redaction, and accurate privacy documentation.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
  • Taint Tracking: Direct Taint Flow, Variable-Mediated Taint Flow, Credential Exfiltration Chain
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Findings (25)

Tainted flow: 'req' from os.environ.get (line 367, credential/environment) → urllib.request.urlopen (network output)

Critical
Category
Data Flow
Content
},
                method="POST",
            )
            with urllib.request.urlopen(req, timeout=20) as resp:
                raw = json.loads(resp.read().decode("utf-8"))
            content = raw["choices"][0]["message"]["content"]
Confidence
98% confidence
Finding
with urllib.request.urlopen(req, timeout=20) as resp:

Tainted flow: 'req' from os.environ.get (line 440, credential/environment) → urllib.request.urlopen (network output)

Critical
Category
Data Flow
Content
},
                method="POST",
            )
            with urllib.request.urlopen(req, timeout=15) as resp:
                raw = json.loads(resp.read().decode("utf-8"))
            content = raw["choices"][0]["message"]["content"]
Confidence
97% confidence
Finding
with urllib.request.urlopen(req, timeout=15) as resp:

Tainted flow: 'req' from os.environ.get (line 224, credential/environment) → urllib.request.urlopen (network output)

Critical
Category
Data Flow
Content
},
            data=b"{}",
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            data = json.loads(resp.read().decode("utf-8"))
            if data.get("valid", False):
                result = {"valid": True, "tier": _prefix_to_tier(api_key)}
Confidence
95% confidence
Finding
with urllib.request.urlopen(req, timeout=10) as resp:
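
One way a reviewer might narrow the three flows above is to validate the destination before any request carrying an environment-derived credential is opened. The following is a minimal sketch under stated assumptions, not the skill's actual code: `ALLOWED_HOSTS`, the host `api.example.com`, and `check_destination` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; the real provider hosts are not shown in the report.
ALLOWED_HOSTS = {"api.example.com"}


def check_destination(url):
    """Return url unchanged if it is HTTPS and targets an allowlisted host.

    Raises ValueError otherwise, so the caller never reaches urlopen().
    """
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"refusing non-HTTPS destination: {url}")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"destination host not allowlisted: {parsed.hostname}")
    return url
```

A caller would pass the URL through `check_destination` before building the `urllib.request.Request`. This does not remove the flow; it only restricts where the credential can go.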

Intent-Code Divergence

High
Confidence
98% confidence
Finding
The README explicitly states that all processing happens locally, yet the same document describes optional AI features using external providers (MiniMax/DeepSeek) and Feishu output/sharing. This is a material transparency and data-handling issue because users may expose sensitive datasets under a false assumption that no third-party transmission occurs.

Description-Behavior Mismatch

Medium
Confidence
89% confidence
Finding
The skill is marketed as a local data-cleaning utility, but later documentation includes AI-based identification/classification and Feishu-native publishing, which can involve external processing. That mismatch can mislead users into providing confidential CRM or spreadsheet data without understanding the actual exposure surface.

Description-Behavior Mismatch

Medium
Confidence
91% confidence
Finding
This module is presented as a data classification/tagging component, but it also performs network-based AI inference. That expands the module's effective capability beyond local rule evaluation and can surprise integrators who may run it on sensitive customer data under the assumption that processing is local-only.

Context-Inappropriate Capability

High
Confidence
99% confidence
Finding
The function serializes the first 20 rows of the DataFrame and sends them to a third-party AI API. In a data-cleaning/classification skill, this is especially dangerous because input datasets often contain PII, financial data, or business-sensitive records, and the code neither minimizes nor anonymizes the rows, nor does it explicitly disclose the transmission.
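
To illustrate what minimization and anonymization could look like here, the sketch below trims the sample and masks obvious identifiers before anything leaves the process. It assumes the rows are available as a list of dicts (for example from pandas `df.head(n).to_dict("records")`). The regular expressions and the names `redact_value` and `minimized_sample` are hypothetical, and the patterns are deliberately crude: they would not catch names, addresses, or short identifiers.

```python
import re

# Hypothetical patterns; real data would need locale-specific rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DIGIT_RUN_RE = re.compile(r"\d{6,}")  # long digit runs: phones, IDs, account numbers


def redact_value(value):
    """Mask e-mail addresses and long digit runs in a single cell value."""
    text = str(value)
    text = EMAIL_RE.sub("<email>", text)
    text = DIGIT_RUN_RE.sub("<number>", text)
    return text


def minimized_sample(rows, max_rows=3):
    """Return at most max_rows rows, with every value redacted."""
    return [
        {col: redact_value(val) for col, val in row.items()}
        for row in rows[:max_rows]
    ]
```

For example, a row `{"name": "Ann", "email": "ann@example.com", "phone": "13800138000"}` would be sent as `{"name": "Ann", "email": "<email>", "phone": "<number>"}`. The name survives, which shows why pattern-based redaction alone is not sufficient.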

Description-Behavior Mismatch

Medium
Confidence
91% confidence
Finding
The orchestrator performs remote Feishu exports and document publication in addition to local cleaning, which introduces outbound data transfer behavior not clearly reflected in the stated purpose or top-level usage description. This is dangerous because a caller may invoke a 'data cleaner' expecting local-only processing while sensitive dataset content and metadata are transmitted to third-party services.

Context-Inappropriate Capability

Medium
Confidence
88% confidence
Finding
Bitable and Feishu document publishing are significant external side effects that are not obviously necessary for a multi-source data cleaning skill. In a data-processing context, hidden or weakly disclosed publication features increase the risk of accidental exfiltration of cleaned records, reports, and associated identifiers to external platforms.

Context-Inappropriate Capability

Medium
Confidence
92% confidence
Finding
The Feishu document export sends report content to an external cloud service but does not enforce the same subscription or capability gate used for Bitable export. In a data-cleaning skill, reports may contain sensitive source-derived data or quality findings, so exposing a remote publication path without equivalent controls increases the chance of unintended external disclosure.
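
A shared gate applied to both export paths would close the gap described above. The sketch below is an assumption about how such a gate could be structured, not a description of the skill's code: `requires_capability`, `ExportNotAllowed`, the capability name `"feishu_export"`, and both export functions are hypothetical stand-ins.

```python
import functools


class ExportNotAllowed(Exception):
    """Raised when the caller lacks the capability for a remote export."""


def requires_capability(name):
    """Decorator: reject the call unless `name` is in the caller's capabilities."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(capabilities, *args, **kwargs):
            if name not in capabilities:
                raise ExportNotAllowed(f"{func.__name__} requires '{name}'")
            return func(capabilities, *args, **kwargs)
        return wrapper
    return decorator


@requires_capability("feishu_export")
def export_bitable(capabilities, records):
    # Placeholder for the real upload; returns a summary instead of sending data.
    return f"bitable:{len(records)}"


@requires_capability("feishu_export")
def export_doc(capabilities, report_markdown):
    # Same gate as export_bitable, so neither path can bypass the check.
    return f"doc:{len(report_markdown)}"
```

Putting the check in one decorator means a new export path has to opt out of the gate deliberately instead of forgetting to opt in.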

Description-Behavior Mismatch

Medium
Confidence
88% confidence
Finding
A module described as tier-limit enforcement also performs outbound network-based token verification, which is a hidden capability beyond its stated role. This violates least surprise and least privilege, making review and deployment riskier because a seemingly local policy module can contact external infrastructure and influence trust decisions.

Context-Inappropriate Capability

Medium
Confidence
93% confidence
Finding
The code reads a credential from the environment and sends it to a remote endpoint, a behavior not clearly justified by the stated purpose of the component. This creates a real data-flow risk because sensitive secrets may be disclosed to an external service without clear user awareness or architectural necessity.

Vague Triggers

Medium
Confidence
84% confidence
Finding
The listed trigger phrases are generic terms like 'data cleaning' and 'Excel cleaning' that are likely to match ordinary user requests broadly. In an agent environment, overly broad triggers increase the chance of unintended invocation on sensitive business data, causing accidental processing, sharing, or external transmission.

Missing User Warnings

High
Confidence
97% confidence
Finding
The README promotes AI classification/tagging and references external API keys, but does not warn users that uploaded or parsed data may be transmitted to third-party model providers. For a data-cleaning skill likely used on customer, CRM, or financial records, this omission creates significant privacy, compliance, and confidentiality risk.

Vague Triggers

Medium
Confidence
84% confidence
Finding
The trigger words are broad phrases like 'clean data', 'merge data', and 'Excel cleaning' that can match many ordinary user requests outside the intended scope. Over-broad activation increases the chance the skill is invoked on sensitive datasets unintentionally, especially given the described AI processing and Feishu export features.

Vague Triggers

Medium
Confidence
80% confidence
Finding
The Feishu trigger section repeats generic activation phrases without defining when the skill should or should not engage. In context, this is more dangerous because the skill handles potentially sensitive business records and supports external write-back to Feishu, so ambiguous activation can cause accidental handling or disclosure of confidential data.

Missing User Warnings

Medium
Confidence
92% confidence
Finding
The skill encourages users to upload messy datasets and highlights AI field identification plus Feishu Bitable/Doc output, but it provides no warning that uploaded records may contain personal, financial, or business-sensitive data that could be sent to third-party AI services or external collaboration platforms. Because the listed use cases include CRM, bank statements, and rosters, the absence of a privacy and data-handling warning materially increases the risk of unauthorized disclosure.

Missing User Warnings

High
Confidence
98% confidence
Finding
The code sends sample dataset rows to an external service without any user-facing warning, consent check, or inline privacy guard. In the context of a multi-source data cleaner, users are likely to process raw business/customer data, so silent external transmission materially increases confidentiality and compliance risk.
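
A consent check of the kind this finding says is missing could be as small as an explicit opt-in flag that defaults to local-only behavior. The sketch below is illustrative only: the environment variable `DATA_CLEANER_ALLOW_EXTERNAL` and the functions `external_send_allowed` and `classify` are hypothetical, and `send_to_provider` stands in for whatever call actually contacts the AI provider.

```python
import os


def external_send_allowed(env=None, flag="DATA_CLEANER_ALLOW_EXTERNAL"):
    """Return True only if the operator explicitly opted in via the flag."""
    env = os.environ if env is None else env
    return env.get(flag, "").strip().lower() in {"1", "true", "yes"}


def classify(rows, send_to_provider, env=None):
    """Call send_to_provider only after opt-in; otherwise stay local."""
    if not external_send_allowed(env):
        # No data leaves the process; the caller can fall back to local rules.
        return {"mode": "local", "labels": None}
    return {"mode": "external", "labels": send_to_provider(rows)}
```

With the flag unset, `classify` never invokes the provider callable, so the default matches the local-only behavior the README promises.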

Missing User Warnings

Medium
Confidence
98% confidence
Finding
The AI fallback packages uncertain column names and sample values and sends them to external model providers, but the function interface and comments do not clearly warn the caller that dataset contents may leave the local environment. In a data-cleaning skill, columns may include PII such as names, phone numbers, addresses, IDs, or bank accounts, making this context especially sensitive.

Missing User Warnings

Medium
Confidence
94% confidence
Finding
The Bitable export path transmits dataset-derived content along with Feishu identifiers such as open_id and folder_token, yet there is no visible user-facing disclosure or confirmation in this file before transfer occurs. In a data cleaner, this makes accidental privacy and compliance violations more likely because users may not realize personally identifiable or sensitive business data is leaving the local environment.

Missing User Warnings

Medium
Confidence
93% confidence
Finding
The generated quality report is uploaded to Feishu as a document without a visible notice at the call site, despite reports potentially containing source names, inferred field types, data quality issues, and summary statistics about the dataset. This is dangerous because even derived reports can expose sensitive operational or personal information when sent to an external service unexpectedly.

Missing User Warnings

Medium
Confidence
95% confidence
Finding
The Bitable export path uploads dataframe contents to Feishu, including all converted records, with no warning, confirmation, or indication in this module that data is being transferred to a third-party service. For a multi-source data cleaner, exported records may include PII or business data, so silent remote transmission creates a real confidentiality risk if users expect only local processing.

Missing User Warnings

Medium
Confidence
94% confidence
Finding
The Feishu document export sends report_markdown to an external service without any visible warning or consent mechanism in this file. Quality reports can embed sample values, error summaries, or sensitive operational details, so uploading them off-platform without explicit notice can leak confidential information.

Missing User Warnings

Medium
Confidence
95% confidence
Finding
The report includes raw sample values from dataset columns and renders them directly into Markdown output. If the source data contains personal, confidential, or regulated information, this can leak sensitive content to report viewers or downstream systems without masking, consent checks, or disclosure, making the reporting layer itself a data exposure vector.
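
Masking sample values before they are rendered is one way to address this. The sketch below assumes the report shows a column name with a few sample values per row of a Markdown table; `mask_sample` and `render_samples_row` are hypothetical helpers, and keeping the first two characters is an arbitrary choice that still reveals a prefix.

```python
def mask_sample(value, keep=2):
    """Keep the first `keep` characters and replace the rest with asterisks."""
    text = str(value)
    if len(text) <= keep:
        # Short values are masked entirely so nothing is revealed.
        return "*" * len(text)
    return text[:keep] + "*" * (len(text) - keep)


def render_samples_row(column, samples):
    """Render one Markdown table row with masked sample values."""
    masked = ", ".join(mask_sample(s) for s in samples)
    return f"| {column} | {masked} |"
```

A reader can still see that a column holds, say, 11-character values starting with `13`, which is usually enough for a quality report without exposing the full value.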

Missing User Warnings

Medium
Confidence
90% confidence
Finding
The file silently sends an API key to a remote verification service with no user-facing disclosure, warning, or consent mechanism. Even if the transmission is intended for licensing, undisclosed secret transmission is dangerous because operators may unknowingly expose production credentials to an external party.

VirusTotal

No VirusTotal findings
