Dataset Finder

Security checks across malware telemetry and agentic risk

Overview

This skill appears to perform normal dataset search, download, preview, and documentation tasks, with ordinary privacy and credential-care risks for that purpose.

Install in a virtual environment, keep Kaggle and Hugging Face tokens out of source control and shared logs, download into a dedicated project directory, and review preview/data-card files before sharing because they may include raw sample rows from the dataset.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
Rogue Agent: Self-Modification, Session Persistence
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection

Findings (12)

Lp3

Medium

Category: MCP Least Privilege
Confidence: 88%
Finding: The skill clearly instructs users to perform network access and write files locally, but the metadata does not declare permissions or prominently warn about those capabilities. This can mislead users and policy systems about the skill's operational scope, reducing informed consent and weakening guardrails around filesystem and network actions.

Description-Behavior Mismatch

Medium

Confidence: 84%
Finding: The documentation advertises additional capabilities such as batch download, conversion, splitting, and merging that go beyond the narrower stated purpose of search/download/explore/preview/data cards. Expanding operational scope without clearly declaring it increases the chance that users invoke higher-impact file-processing actions they did not expect, especially those that transform or create new local artifacts.

Missing User Warnings

Medium

Confidence: 91%
Finding: The skill repeatedly recommends downloads, local dataset management, output generation, and inventory creation without a clear warning that these actions write files to disk and may create, overwrite, or persist artifacts. In a dataset-oriented skill this is contextually expected, but the absence of explicit disclosure still creates operational risk and undermines safe user consent.
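One way to reduce the overwrite risk described in this finding is to make file creation fail when the target already exists. The sketch below is illustrative only: `write_new_file` is a hypothetical helper, not part of the skill's scripts, and it assumes the caller already holds the downloaded bytes in memory.

```python
from pathlib import Path


def write_new_file(path: Path, data: bytes) -> Path:
    """Write data to path, refusing to overwrite an existing file.

    Hypothetical helper for illustration; not taken from the skill.
    """
    # Create the parent directory so the write itself is the only step
    # that can fail because of an existing artifact.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "xb" is exclusive creation: it raises FileExistsError
    # instead of silently truncating a file that is already there.
    with open(path, "xb") as handle:
        handle.write(data)
    return path
```

A caller can catch `FileExistsError` and ask the user before replacing the file, which turns a silent overwrite into an explicit consent step.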

Missing User Warnings

Medium

Confidence: 94%
Finding: The documentation instructs users to store Kaggle credentials and use Hugging Face tokens but does not include guidance on secure handling, least privilege, redaction, or avoiding exposure in logs and generated artifacts. Credential misuse or accidental disclosure could enable unauthorized access to third-party accounts and data resources.
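The redaction gap in this finding could be narrowed with a logging filter that scrubs known secret values before they reach a log file. This is a minimal sketch under stated assumptions: it assumes the tokens are supplied through the `KAGGLE_KEY` and `HF_TOKEN` environment variables (the skill may store them elsewhere), and `RedactSecrets` and `secrets_from_env` are hypothetical names.

```python
import logging
import os

# Assumption: credentials are exposed through these environment variables.
SECRET_ENV_VARS = ("KAGGLE_KEY", "HF_TOKEN")


def secrets_from_env(names=SECRET_ENV_VARS):
    """Collect the non-empty secret values currently set in the environment."""
    return [os.environ[name] for name in names if os.environ.get(name)]


class RedactSecrets(logging.Filter):
    """Replace known secret values in log messages with a placeholder."""

    def __init__(self, secrets):
        super().__init__()
        # Skip empty strings so that replace() never matches everywhere.
        self._secrets = [secret for secret in secrets if secret]

    def filter(self, record):
        # Render the message first so secrets passed as arguments are caught.
        message = record.getMessage()
        for secret in self._secrets:
            message = message.replace(secret, "[REDACTED]")
        record.msg = message
        record.args = None
        return True  # Keep the record; only its text was changed.
```

Attaching the filter with `handler.addFilter(RedactSecrets(secrets_from_env()))` covers records passing through that handler. It only masks exact token values, so it complements rather than replaces keeping tokens out of scripts and shared logs.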

Missing User Warnings

Medium

Confidence: 90%
Finding: The data card examples include sample rows and schema details but do not warn that automatically generated documentation may expose personal, confidential, or regulated data present in the source dataset. This is particularly risky because dataset documentation is often shared broadly, turning local data inspection into unintentional data leakage.

Missing User Warnings

Medium

Confidence: 81%
Finding: The credential setup instructions tell users to create and store Kaggle and Hugging Face credentials but do not clearly warn that these are sensitive secrets that must not be committed to source control, shared, or embedded in scripts. In a skill specifically built around authenticated downloads from external services, this omission increases the chance of token leakage and unauthorized account use.

Missing User Warnings

Medium

Confidence: 90%
Finding: The preview and data-card features intentionally read local datasets and include sample rows in terminal output and generated markdown. If users point the tool at files containing PII, secrets, or regulated data, those records can be exposed to logs, saved artifacts, or downstream sharing without any masking, warning, or consent step. In the context of a dataset-finder skill, this is more dangerous because previewing and documentation are core expected actions, so disclosure can happen during normal use rather than only in edge cases.
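The missing masking step could look roughly like the following, applied to each sample row before it is printed or written into a data card. This is a sketch, not the skill's behavior: the column names in `SENSITIVE_COLUMNS` and the `mask_row` helper are hypothetical, and the email pattern is a deliberately simple heuristic that will miss many forms of PII.

```python
import re

# Simple heuristic for email-like strings; not a full address validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical column names treated as sensitive regardless of content.
SENSITIVE_COLUMNS = {"email", "phone", "ssn", "password", "token"}


def mask_row(row: dict) -> dict:
    """Return a copy of a sample row with likely sensitive values masked."""
    masked = {}
    for column, value in row.items():
        if column.lower() in SENSITIVE_COLUMNS:
            # Mask the whole value when the column name itself is sensitive.
            masked[column] = "[MASKED]"
        elif isinstance(value, str):
            # Otherwise mask only email-like substrings inside free text.
            masked[column] = EMAIL_PATTERN.sub("[MASKED]", value)
        else:
            masked[column] = value
    return masked
```

For example, `mask_row({"Email": "a@b.com", "age": 30})` yields `{"Email": "[MASKED]", "age": 30}`. Pattern-based masking is best treated as a safety net; reviewing generated previews before sharing is still needed.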

Missing User Warnings

Medium

Confidence: 86%
Finding: The script sends search queries, dataset identifiers, and authenticated requests to third-party services including Kaggle, Hugging Face, and UCI without any explicit disclosure to the user that this information leaves the local environment. While this is functionally required for a dataset discovery tool, it still creates a privacy and compliance risk if queries or dataset names are sensitive, proprietary, or tied to internal projects. The skill context makes this somewhat less surprising but not less real, because users may assume only dataset content is remote, not their search terms or access patterns.

Session Persistence

Medium

Category: Rogue Agent
Content: Organize datasets for team use.
```bash
# Create organized structure
mkdir -p datasets/{kaggle,huggingface,uci,custom}
# Download datasets with metadata
```
Confidence: 80%
Finding: Create organized structure mkdir -p datasets/{kaggle,huggingface,uci,custom} # Download datasets with metadata python scripts/dataset.py kaggle download "dataset1" --output-dir datasets/kaggle/ pytho

Known Vulnerable Dependency: requests (10 advisories): CVE-2014-1830 (Exposure of Sensitive Information to an Unauthorized Actor in Requests); CVE-2024-47081 (Requests vulnerable to .netrc credentials leak via malicious URLs); CVE-2024-35195 (Requests `Session` object does not verify requests after making first request wi) +7 more

High

Category: Supply Chain
Confidence: 93%
Finding: requests

Known Vulnerable Dependency: lxml (10 advisories): CVE-2021-43818 (lxml's HTML Cleaner allows crafted and SVG embedded scripts to pass through); CVE-2014-3146 (lxml Cross-site Scripting Via Control Characters); CVE-2021-28957 (lxml vulnerable to Cross-Site Scripting) +7 more

High

Category: Supply Chain
Confidence: 87%
Finding: lxml

Known Vulnerable Dependency: pyarrow (8 advisories): CVE-2023-47248 (PyArrow: Arbitrary code execution when loading a malicious data file); CVE-2019-12408 (Missing Initialization of Resource in Apache Arrow); CVE-2019-12410 (Missing Initialization of Resource in Apache Arrow) +5 more

Critical

Category: Supply Chain
Confidence: 96%
Finding: pyarrow
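Because these dependency findings hinge on which versions are actually installed, a quick local check against a version floor can help triage them. The sketch below is an assumption-laden illustration: the floors in `MINIMUM_VERSIONS` reflect my reading of the fixed-in releases for CVE-2023-47248 (pyarrow) and CVE-2024-47081 (requests) and should be verified against the advisories, they do not cover the other listed CVEs, and the helper names are hypothetical.

```python
from importlib import metadata

# Assumed floors; verify each against the advisory text before relying on it.
MINIMUM_VERSIONS = {
    "pyarrow": "14.0.1",   # assumed fix release for CVE-2023-47248
    "requests": "2.32.4",  # assumed fix release for CVE-2024-47081
}


def parse_version(text: str) -> tuple:
    """Parse leading numeric release parts, e.g. '14.0.1' -> (14, 0, 1).

    Simplification: pre-release and local suffixes are ignored.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def below_minimum(installed: str, minimum: str) -> bool:
    """Return True when the installed version sorts before the floor."""
    return parse_version(installed) < parse_version(minimum)


def outdated_packages(minimums=MINIMUM_VERSIONS) -> dict:
    """Map each installed package below its floor to (installed, minimum)."""
    findings = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # Not installed in this environment, nothing to flag.
        if below_minimum(installed, minimum):
            findings[name] = (installed, minimum)
    return findings
```

Running `outdated_packages()` inside the skill's virtual environment returns an empty dict when every listed package meets its floor. Pinning those floors in the requirements file would address the Unpinned Dependencies pattern more durably than a runtime check.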

VirusTotal

64/64 vendors flagged this skill as clean.
