DataPulse

Security checks across malware telemetry and agentic risk

Overview

DataPulse is mostly a real content-collection skill, but it includes under-disclosed local secret access, authenticated stealth browsing, and configurable code-execution/update paths that warrant manual review before installation.

Install only if you are comfortable giving this skill broad network, local file, browser-session, and optional subprocess authority. Before use, review and lock down environment variables, avoid setting arbitrary backend command/callable variables unless you fully trust the target, avoid storing sensitive cookies or session files unless needed, and treat self-update as a privileged manual maintenance action.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
Behavioral AST: exec() Call, eval() Call, Dynamic Import
Taint Tracking: Direct Taint Flow, Variable-Mediated Taint Flow, Credential Exfiltration Chain

Findings (52)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: request_payload = self._build_request(url, source_type_hint, timeout_seconds) try: completed = subprocess.run( command, input=json.dumps(request_payload, ensure_ascii=True), capture_output=True,
Confidence: 93%
Finding: completed = subprocess.run( command, input=json.dumps(request_payload, ensure_ascii=True), capture_output=True, text=True,

Dynamic attribute access via getattr()

Low

Category: Dangerous Code Execution
Content: if not module_name or not attr_name: raise ValueError(f"Invalid factuality backend callable path: {path}") module = importlib.import_module(module_name) return getattr(module, attr_name) def _call_factuality_backend(
Confidence: 96%
Finding: return getattr(module, attr_name)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: timeout_seconds = _factuality_backend_timeout_seconds() started = time.perf_counter() try: completed = subprocess.run( command, input=json.dumps(request_payload, ensure_ascii=True), capture_output=True,
Confidence: 94%
Finding: completed = subprocess.run( command, input=json.dumps(request_payload, ensure_ascii=True), capture_output=True, text=True, timeout=t

Tainted flow: 'command' from os.getenv (line 81, credential/environment) → subprocess.run (code execution)

Medium

Category: Data Flow
Content: request_payload = self._build_request(url, source_type_hint, timeout_seconds) try: completed = subprocess.run( command, input=json.dumps(request_payload, ensure_ascii=True), capture_output=True,
Confidence: 97%
Finding: completed = subprocess.run( command, input=json.dumps(request_payload, ensure_ascii=True), capture_output=True, text=True,

Tainted flow: 'req' from os.getenv (line 106, credential/environment) → urllib.request.urlopen (network output)

Critical

Category: Data Flow
Content: for _ in range(2): try: req = urllib.request.Request(api_url, headers={"User-Agent": "Mozilla/5.0"}) with urllib.request.urlopen(req, timeout=20) as resp: data = json.loads(resp.read().decode()) if data.get("code") != 200: return ParseResult.failure(original_url, f"FxTwitter code {data.get('code')}: {data.get('message', 'unknown')}")
Confidence: 84%
Finding: with urllib.request.urlopen(req, timeout=20) as resp:

Lp3

Medium

Category: MCP Least Privilege
Confidence: 94%
Finding: The skill advertises substantial capabilities including environment variable access, filesystem read/write, network access, and subprocess execution, but does not declare any permissions boundary or capability manifest. This creates a trust and review gap: users and orchestrators may invoke the skill without understanding it can persist data, contact external services, and run shell-adjacent operations, increasing the chance of unintended data exposure or unsafe execution.

Tp4

High

Category: MCP Tool Poisoning
Confidence: 91%
Finding: The documented purpose frames the skill as content collection and discovery, but the behavior disclosure reveals materially broader functionality: local server hosting, session capture/storage, alert delivery to arbitrary endpoints, subprocess-driven sidecars, and self-update behavior. That mismatch is dangerous because operators may approve the skill for low-risk retrieval use while unknowingly introducing persistence, network egress, and code-execution-adjacent behaviors that expand the attack surface.

Description-Behavior Mismatch

Medium

Confidence: 95%
Finding: The CLI includes a self-update feature that modifies the local Python environment by installing code from a remote GitHub repository. For a skill advertised as content collection, search, and workflow tooling, this expands its authority into environment modification and supply-chain trust, increasing the blast radius if the upstream repo, network path, or release process is compromised.

Context-Inappropriate Capability

High

Confidence: 98%
Finding: This code executes package-management commands via pip against a GitHub URL, allowing remote code to be installed into the local interpreter environment. That is a high-risk supply-chain capability unrelated to the core data-collection purpose, and compromise of the repository or release tag could lead to arbitrary code execution during installation.
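
One way to narrow this supply-chain exposure is to let the update path install only a specific, reviewed commit rather than a moving branch or tag. The sketch below is an illustration under stated assumptions, not the skill's actual code: `EXAMPLE_REPO_URL` and `build_pinned_update_command` are hypothetical names, and the only external behavior relied on is pip's `git+<url>@<ref>` VCS install syntax.

```python
import re
import sys
from typing import List

# Hypothetical repository URL, used only for illustration.
EXAMPLE_REPO_URL = "https://github.com/example-org/example-skill.git"

# A full Git commit SHA-1: exactly 40 lowercase hex characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def build_pinned_update_command(commit: str, repo_url: str = EXAMPLE_REPO_URL) -> List[str]:
    """Return a pip argv that installs exactly one reviewed commit.

    Branch names and tags are rejected because they can be moved to
    point at different code after review.
    """
    if not _FULL_SHA.fullmatch(commit):
        raise ValueError("update target must be a full 40-character commit SHA")
    # The argv is only built here; running it is left to the caller.
    return [sys.executable, "-m", "pip", "install", f"git+{repo_url}@{commit}"]
```

Pinning to a commit does not remove the need to trust the reviewed code, but it does mean a later compromise of a branch or release tag cannot silently change what gets installed.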

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: This code intentionally simulates human behavior through mouse movement, randomized scrolling, timing jitter, and related anti-detection measures. In a content collection skill, these features go beyond ordinary automation and can be used to evade bot detection and access sites that are trying to restrict automated scraping, increasing legal, policy, and abuse risk.

Intent-Code Divergence

Medium

Confidence: 88%
Finding: The docstring describes the module as an optional fallback collector, but the implementation includes stealth behavior such as webdriver masking and browser fingerprint shaping. This mismatch reduces transparency and can hide higher-risk functionality from reviewers or operators, making misuse more likely in an agent context.

Context-Inappropriate Capability

Medium

Confidence: 91%
Finding: The collector accepts a local storage_state file and loads it into the Playwright browser context, allowing the automation to operate with persisted authenticated sessions. In an agent skill, this creates a meaningful risk of unauthorized access to user accounts, unintended collection of private data, or use of privileged sessions without strong consent and boundary checks.

Context-Inappropriate Capability

Medium

Confidence: 95%
Finding: The helper is not limited to environment variables: it automatically reads secrets from a hard-coded Obsidian/iCloud markdown note and can follow per-secret Source references to additional local files. In an agent skill focused on content collection and search, this broad local secret harvesting behavior creates unnecessary access to unrelated credentials and expands the blast radius if the skill is invoked in a user environment.

Intent-Code Divergence

Medium

Confidence: 88%
Finding: The module presents itself as handling environment-backed credentials, but the implementation silently loads from local markdown notes and referenced files as well. This mismatch is security-relevant because it defeats operator expectations, making it easier for the skill to access sensitive local data without informed consent or proper review.

Context-Inappropriate Capability

High

Confidence: 98%
Finding: The story aggregation path supports loading and invoking an arbitrary backend callable from environment configuration, effectively granting code execution inside the process. In an agent skill context, this is more dangerous because skills often run in automation pipelines where environment variables, plugin packaging, or deployment config may be easier to tamper with than source code.

Context-Inappropriate Capability

High

Confidence: 98%
Finding: This code can spawn any subprocess command specified through environment configuration and feed it structured evidence data. This is a genuine vulnerability: it provides a direct arbitrary-command execution path and a data exfiltration path, which is especially risky in an intelligence/content aggregation skill that may handle sensitive source material and analyst workflows.

Context-Inappropriate Capability

High

Confidence: 99%
Finding: The code resolves a backend callable from `DATAPULSE_GROUNDING_BACKEND_CALLABLE`, imports the named module, and invokes the resolved attribute with no allowlist or trust validation. Anyone able to influence the process environment can cause arbitrary in-process code execution, which is especially dangerous because it runs with the application's privileges and has access to item content and runtime state.
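
The missing allowlist described in this finding could look roughly like the sketch below. It is a minimal illustration, not the skill's implementation: the `module:attribute` path format, the `ALLOWED_BACKEND_CALLABLES` set, and the function name `resolve_backend_callable` are all assumptions, and `json:dumps` is only a stand-in for a reviewed backend entry point.

```python
import importlib
from typing import Any, Callable, FrozenSet

# Hypothetical allowlist of reviewed "module:attribute" paths.
# "json:dumps" is a stand-in so the example is importable anywhere.
ALLOWED_BACKEND_CALLABLES: FrozenSet[str] = frozenset({"json:dumps"})


def resolve_backend_callable(
    path: str, allowed: FrozenSet[str] = ALLOWED_BACKEND_CALLABLES
) -> Callable[..., Any]:
    """Resolve a configured callable only if its exact path is allowlisted."""
    # Check the allowlist BEFORE importing, since importing a module
    # already executes its top-level code.
    if path not in allowed:
        raise PermissionError(f"backend callable not allowlisted: {path!r}")
    module_name, _, attr_name = path.partition(":")
    if not module_name or not attr_name:
        raise ValueError(f"invalid backend callable path: {path!r}")
    target = getattr(importlib.import_module(module_name), attr_name)
    if not callable(target):
        raise TypeError(f"backend target is not callable: {path!r}")
    return target
```

The ordering matters: rejecting unknown paths before `importlib.import_module` runs keeps an attacker-chosen module from executing at import time.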

Context-Inappropriate Capability

High

Confidence: 99%
Finding: The code reads `DATAPULSE_GROUNDING_BACKEND_CMD`, parses it, and executes the resulting command via `subprocess.run()`. Although it avoids `shell=True`, this still permits arbitrary program execution when the environment is attacker-controlled or insufficiently isolated, creating a clear command-execution boundary violation in a content-processing workflow.
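
A possible mitigation is to validate the parsed command against a fixed set of executables before it ever reaches `subprocess.run()`. The following is a sketch under assumptions: the executable path in `ALLOWED_EXECUTABLES` and the helper name `parse_backend_command` are hypothetical, and the report does not say how the skill itself parses the variable.

```python
import shlex
from typing import FrozenSet, List

# Hypothetical allowlist of absolute paths to reviewed sidecar programs.
ALLOWED_EXECUTABLES: FrozenSet[str] = frozenset({"/usr/local/bin/grounding-sidecar"})


def parse_backend_command(
    raw: str, allowed: FrozenSet[str] = ALLOWED_EXECUTABLES
) -> List[str]:
    """Split a configured command string and reject unknown executables."""
    argv = shlex.split(raw)
    if not argv:
        raise ValueError("backend command is empty")
    # Compare the exact executable path; a bare name such as "python"
    # would be resolved through PATH, which the environment also controls.
    if argv[0] not in allowed:
        raise PermissionError(f"backend executable not allowlisted: {argv[0]!r}")
    return argv
```

The returned list can then be passed to `subprocess.run()` as before. Arguments are still caller-controlled in this sketch, so a stricter version might also validate them.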

Missing User Warnings

Medium

Confidence: 93%
Finding: The self-update path performs environment-changing installation immediately when the flag is passed, without an additional confirmation step or dry-run preview. In agent or scripted contexts, this makes accidental or coerced environment modification easier and reduces operator awareness before a privileged action occurs.
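
The confirmation step and dry-run preview this finding asks for can be sketched as a small gate that runs before any installation. Everything here is illustrative: the function name `should_run_update` and the `dry_run` and `assume_yes` parameters are hypothetical and do not correspond to flags the skill is known to have.

```python
from typing import Callable


def should_run_update(
    dry_run: bool,
    assume_yes: bool,
    ask: Callable[[str], str] = input,
) -> bool:
    """Decide whether a privileged self-update may proceed."""
    if dry_run:
        # Preview mode never modifies the environment.
        return False
    if assume_yes:
        # Explicit opt-in for scripted use; the caller accepts the risk.
        return True
    answer = ask("This will install code from a remote repository. Continue? [y/N] ")
    # Anything other than an explicit yes is treated as a refusal.
    return answer.strip().lower() in {"y", "yes"}
```

Passing `ask` as a parameter keeps the gate testable and lets an agent runtime route the question to a human instead of answering it automatically.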

Missing User Warnings

Medium

Confidence: 89%
Finding: The browser automation navigates to arbitrary URLs and may do so with an authenticated storage state, yet this file provides no user-facing disclosure, approval gate, or domain restriction. In an agent workflow, that can silently trigger requests as the user, fetch protected content, and interact with sensitive sites without clear awareness or control.
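
A domain restriction of the kind this finding says is absent might be approximated as follows. This is a sketch, not the skill's code: `is_navigation_allowed` and its parameters are hypothetical, and the policy shown (any http/https URL without a session, allowlisted hosts only with one) is just one reasonable choice.

```python
from typing import Iterable
from urllib.parse import urlsplit


def is_navigation_allowed(
    url: str, allowed_hosts: Iterable[str], using_storage_state: bool
) -> bool:
    """Permit authenticated navigation only to explicitly approved hosts."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"}:
        # Blocks file:, data:, javascript: and similar schemes outright.
        return False
    if not using_storage_state:
        return True
    host = (parts.hostname or "").lower()
    # Match the host exactly or as a subdomain; the leading dot prevents
    # "evil-example.com" from matching "example.com".
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)
```

For example, with `allowed_hosts={"example.com"}` and a loaded session, `https://sub.example.com/page` passes while `https://evil-example.com/` does not.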

Natural-Language Policy Violations

Medium

Confidence: 84%
Finding: The code defaults to zh-CN locale and Asia/Shanghai timezone, which is a form of environment spoofing without user opt-in. While lower impact than session reuse, it can misrepresent client identity, alter site behavior, and support stealthy collection patterns when combined with the other anti-detection features in this module.

Missing User Warnings

Medium

Confidence: 93%
Finding: The collector sends the user-supplied target URL to Firecrawl as a fallback by calling a third-party API. Even though only the URL is transmitted, URLs can contain sensitive query parameters, internal resource locations, or customer-specific identifiers, and this transfer is not surfaced here as an explicit consent or policy gate.
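
To illustrate how the sensitive parts of a URL could be withheld from a third-party fallback, the sketch below strips credentials, query string, and fragment before the URL leaves the process. The helper name `redact_url_for_third_party` is hypothetical, and this is an assumption about a possible mitigation rather than something the skill does. Dropping the query can change which page is fetched, so a real policy might keep selected parameters instead.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url_for_third_party(url: str) -> str:
    """Return the URL without userinfo, query string, or fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals need their brackets restored.
        host = f"[{host}]"
    # Rebuild the network location from host and port only, which
    # drops any "user:password@" prefix.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))
```

A call such as `redact_url_for_third_party("https://user:pw@example.com:8443/a/b?token=secret#frag")` yields `https://example.com:8443/a/b`.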

Missing User Warnings

Medium

Confidence: 91%
Finding: The collector accepts a raw `cookie` value and passes it into `JinaReadOptions` for an outbound network fetch, which can cause caller-supplied session cookies or other sensitive web credentials to be transmitted to a third-party retrieval service. In an agent/assistant collection context, this is risky because users may not realize their browser/session cookies are being forwarded off-origin, enabling account/session misuse or unintended disclosure of authenticated content.

Missing User Warnings

Medium

Confidence: 91%
Finding: When DATAPULSE_TWITTER_MEDIA_EXTRACT is enabled, media URLs are sent to Jina for generated alt-text extraction, which discloses third-party content to another external service. In a collection/triage skill, this can create privacy, compliance, and data-handling risks if operators or users are not clearly warned and consent is not obtained.

Missing User Warnings

Medium

Confidence: 95%
Finding: When transcripts are unavailable, the collector downloads audio and uploads it to Groq's external Whisper API, creating a data egress path for potentially sensitive media content. There is no visible consent gate, policy check, or user-facing disclosure in this code, so callers may unintentionally transmit content to a third party.

VirusTotal

66/66 vendors flagged this skill as clean.
