Security audit

Empyrical Risk Metrics

Security checks across malware telemetry and agentic risk

Overview

This skill needs review because it appears to mix a narrow risk-metrics purpose with trading, backtesting, data fetching, and unrelated documentation workflows.

Install only after the publisher clarifies whether this is a risk-metrics skill, a trading/backtesting skill, or a documentation automation skill. Avoid giving it live brokerage credentials or authority to execute trades, fetch broad market data, or write persistent caches unless those actions are explicitly requested and bounded.

SkillSpector

By NVIDIA

Vulnerability Patterns

Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration

Findings (16)

Description-Behavior Mismatch

High

Confidence: 96%
Finding: The skill claims to compute portfolio risk metrics, but its documented pipeline extends into data collection, target selection, and trading execution. This scope mismatch can cause an agent to invoke the skill for actions far beyond analytics, increasing the chance of unauthorized trading or unsafe workflow expansion under a misleading label.

Description-Behavior Mismatch

High

Confidence: 98%
Finding: The top use cases describe Sphinx documentation build and deployment, which are unrelated to investment risk-metric analysis. Such contradictions make the skill's true behavior ambiguous and can misroute an agent into performing filesystem, build, or deployment actions when the user expects financial analytics.

Intent-Code Divergence

Medium

Confidence: 94%
Finding: The interaction prompts ask for market, provider, strategy, backtest dates, and target entities, which are inputs for trading or backtesting rather than pure risk-metric calculation. This misleading interface broadens operational scope and may steer users or agents into data access and strategy execution paths they did not intend to authorize.

Intent-Code Divergence

Medium

Confidence: 95%
Finding: The semantic locks are specific to trading systems, including order sequencing, next-bar execution, entity formats, and signal semantics, which contradict a risk-metrics-only skill. Embedding trading constraints in an analytics skill increases the likelihood that orchestration systems treat it as authorized for trade-related actions.

Description-Behavior Mismatch

High

Confidence: 97%
Finding: The human summary claims the skill can build full ZVT quant workflows, fetch data, and run backtests, which materially exceeds the declared scope of calculating risk metrics. This creates a scope-mismatch vulnerability: an agent may invoke or trust the skill for broader trading and data-access tasks than intended, enabling capability confusion and unsafe delegation.

Context-Inappropriate Capability

Medium

Confidence: 98%
Finding: The documented end-to-end backtesting and trading pipeline capabilities go far beyond passive risk-metric calculation and could cause an agent to use this skill for strategy execution or trading-adjacent decisions. In the context of a supposedly narrow analytics skill, that mismatch is dangerous because it can expand operational authority, increase financial risk, and bypass intended review of trading functionality.

Context-Inappropriate Capability

High

Confidence: 98%
Finding: The documented end-to-end backtesting and trading pipeline capabilities go far beyond passive risk-metric calculation and could cause an agent to use this skill for strategy execution or trading-adjacent decisions. In the context of a supposedly narrow analytics skill, that mismatch is dangerous because it can expand operational authority, increase financial risk, and bypass intended review of trading functionality.

Description-Behavior Mismatch

High

Confidence: 97%
Finding: This lock file defines trading execution semantics, market-specific constraints, recorder behavior, and environment setup that are far outside the stated scope of a skill for computing portfolio risk metrics. In an agent setting, such hidden operational constraints can steer downstream behavior toward order execution or data-environment manipulation, creating scope expansion and increasing the chance of unsafe or unauthorized trading actions.

Description-Behavior Mismatch

High

Confidence: 99%
Finding: This is a true vulnerability because the seed's declared use cases, intent routing, preconditions, and execution flow target Sphinx documentation workflows rather than the advertised empyrical risk-metrics capability. That mismatch can cause the host to invoke the wrong automation path, install or rely on unrelated tooling, and produce outputs or side effects far outside the user's expected scope, violating least surprise and expanding operational risk.

Description-Behavior Mismatch

High

Confidence: 97%
Finding: This is a true vulnerability because the skill advertises broad backtesting and factor-research capabilities despite being described as a narrow risk-metrics calculator. Such capability inflation widens what the agent may attempt to do, increasing the chance of unintended code generation, market-data handling, storage, and trading-related actions in contexts where only offline metric computation was expected.

Context-Inappropriate Capability

Medium

Confidence: 92%
Finding: This is a true vulnerability because external data fetching and local caching introduce network and persistence side effects that are unnecessary for a pure risk-metrics computation skill. Unneeded fetch/cache features increase attack surface, can create privacy or compliance issues, and may surprise users by persisting financial data locally.
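To illustrate why fetching and caching are unnecessary here, the sketch below computes two common risk metrics from a return series the caller already holds, with no network or disk access. It is a minimal illustration using only the standard library, not the skill's code and not the empyrical library's implementation; the function names and the default of 252 periods per year are assumptions.

```python
import math
from typing import Sequence


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough decline of the compounded return curve (<= 0)."""
    wealth = 1.0  # value of 1 unit invested at the start
    peak = 1.0    # highest wealth seen so far
    worst = 0.0   # most negative drawdown seen so far
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = min(worst, wealth / peak - 1.0)
    return worst


def annual_volatility(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Sample standard deviation of returns, scaled by sqrt(periods_per_year)."""
    n = len(returns)
    if n < 2:
        return float("nan")
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return math.sqrt(variance) * math.sqrt(periods_per_year)
```

Both functions take plain in-memory sequences, so a metrics-only skill built this way would have no reason to request network access or write persistent caches.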

Intent-Code Divergence

High

Confidence: 99%
Finding: This is a true vulnerability because the user-facing documentation actively steers users into a different product domain: ZVT-based A-share strategy development and backtesting rather than empyrical risk metrics. Misleading summaries materially increase the chance of unsafe or unauthorized execution, especially when users rely on post-install and human-summary text to understand what a skill will do.

Vague Triggers

Medium

Confidence: 90%
Finding: The execute trigger activates on broad intent matching plus generic action verbs like run, execute, backtest, fetch, or collect. This ambiguous trigger can cause accidental or unauthorized invocation in unrelated contexts, especially because the skill content already mixes analytics, documentation, and trading concepts.

Vague Triggers

Medium

Confidence: 89%
Finding: This is a true vulnerability because the execute trigger is broad enough to match generic action verbs and intent terms, which can cause accidental invocation of the skill in unrelated conversations. In a mis-scoped skill like this one, over-triggering is more dangerous because unintended execution may enter trading, data-fetching, or documentation workflows the user did not request.
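The over-triggering described in the two findings above can be sketched by comparing a verb-based matcher with a narrower one. This is a hypothetical illustration: the skill's actual trigger logic is not shown in this report, and the function names and the list of metric phrases are assumptions. Only the verbs come from the finding.

```python
import re

# Generic action verbs named in the finding.
GENERIC_VERBS = {"run", "execute", "backtest", "fetch", "collect"}

# Hypothetical narrow vocabulary for a metrics-only skill.
METRIC_TERMS = re.compile(
    r"\b(sharpe ratio|sortino ratio|max(?:imum)? drawdown|"
    r"annual(?:ized)? volatility|value at risk)\b",
    re.IGNORECASE,
)


def broad_trigger(message: str) -> bool:
    """Fires whenever any generic action verb appears in the message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    return bool(words & GENERIC_VERBS)


def narrow_trigger(message: str) -> bool:
    """Fires only when a risk-metric phrase appears in the message."""
    return METRIC_TERMS.search(message) is not None
```

Under these assumptions, "Please run the docs build" activates the broad matcher but not the narrow one, while "Compute the Sharpe ratio for these returns" activates only the narrow one.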

Natural-Language Policy Violations

Medium

Confidence: 86%
Finding: This is a true vulnerability because the natural-language guidance imposes an A-share/ZVT framing without clear user opt-in, biasing the assistant toward one market and toolchain. In a skill already suffering from scope drift, this can redirect benign metric-analysis requests into unwanted market-specific workflows and assumptions.

Natural-Language Policy Violations

Medium

Confidence: 87%
Finding: This is a true vulnerability because the human summary defaults users into A-share usage and discourages other markets before eliciting requirements. That creates unsafe hidden assumptions and may distort annualization, benchmark, timezone, and data conventions, which is especially problematic for a skill that should primarily compute generic risk metrics.
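The annualization concern can be made concrete with a small sketch in which the caller must state the number of periods per year instead of inheriting a market-specific default. The function name and signature are hypothetical and are not taken from the skill or from the empyrical library.

```python
import math
from typing import Sequence


def sharpe_ratio(
    returns: Sequence[float],
    *,
    periods_per_year: int,  # keyword-only and required: no hidden market default
    risk_free_per_period: float = 0.0,
) -> float:
    """Annualized Sharpe ratio from per-period returns."""
    n = len(returns)
    if n < 2:
        return float("nan")
    excess = [r - risk_free_per_period for r in returns]
    mean = sum(excess) / n
    variance = sum((x - mean) ** 2 for x in excess) / (n - 1)
    if variance == 0.0:
        return float("nan")
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)
```

With the same return series, passing `periods_per_year=252` versus `periods_per_year=365` scales the result by `sqrt(252 / 365)`, so a silently assumed market calendar changes the reported figure.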

VirusTotal

64/64 vendors flagged this skill as clean.


Static analysis

No suspicious patterns detected.