Benchmark Store

Security checks across malware telemetry and agentic risk

Overview

Benchmark Store is a coherent benchmark and Pareto-reference skill, but users should treat its benchmark files and hidden-test utilities as sensitive evaluation infrastructure.

Install only if you need local benchmark, Pareto, or hidden-test reference tooling. Keep benchmark databases and Pareto state in a scoped directory, back them up before add/delete operations, protect hidden-test files and passwords, and do not expose proposer-facing code to raw hidden-test objects.

SkillSpector

By NVIDIA

Vulnerability Patterns

Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration

Findings (9)

Description-Behavior Mismatch

High

Confidence: 96%
Finding: The file is a broad, executable-style test corpus for generic tool, process, analysis, creation, evaluation, and red-team behaviors, which conflicts with benchmark-store’s declared role as a benchmark/history/Pareto reference store rather than a scoring or general evaluation skill. This scope mismatch is dangerous because downstream agents or orchestration layers may treat these tests as authoritative for this skill and unintentionally enable benchmark-store to perform or influence candidate evaluation outside its approved boundaries.

Description-Behavior Mismatch

High

Confidence: 98%
Finding: The presence of a 'Skill 评估测试' ("Skill evaluation tests") section directly contradicts the manifest statement that benchmark-store must not be used to score candidates. This is dangerous because it creates policy ambiguity: evaluators or routing logic may invoke benchmark-store for ranking or grading, undermining separation of duties and potentially contaminating benchmark integrity.

Description-Behavior Mismatch

Medium

Confidence: 89%
Finding: Including security-scan and red-team tests expands the apparent capability of benchmark-store beyond passive benchmark retrieval into active security evaluation. This can cause unsafe routing and privilege creep, where a skill intended to serve reference data is treated as a security-analysis or adversarial-testing component.

Intent-Code Divergence

Medium

Confidence: 80%
Finding: The footer states the test library was generated by 'skill-evaluator', creating ownership and intent confusion relative to benchmark-store. While not a direct exploit by itself, provenance confusion is dangerous in multi-skill systems because assets may be copied or trusted under the wrong authority, leading to misrouting or accidental cross-skill behavior inheritance.

Context-Inappropriate Capability

Low

Confidence: 87%
Finding: The export function writes benchmark results to any caller-provided filesystem path with no path restriction, overwrite policy, or trust check. In an agent setting, this can enable unintended file creation or clobbering in sensitive locations, especially if the path is influenced by untrusted input.
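One way to narrow this is to confine exports to a single directory and refuse silent overwrites. The following is a minimal sketch under stated assumptions, not the skill's actual export code: `safe_export`, `UnsafeExportPath`, and the `export_root` parameter are hypothetical names introduced here for illustration.

```python
import json
from pathlib import Path


class UnsafeExportPath(ValueError):
    """Raised when an export target resolves outside the allowed directory."""


def safe_export(results: dict, target: str, export_root: Path,
                overwrite: bool = False) -> Path:
    """Write results as JSON, but only inside export_root (hypothetical helper)."""
    root = export_root.resolve()
    # Resolve the joined path so "../" segments and symlinks are normalized
    # before the containment check. An absolute target replaces root entirely
    # in the join, so it is rejected unless it already lies under root.
    dest = (root / target).resolve()
    if not dest.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise UnsafeExportPath(f"{target!r} resolves outside {root}")
    if dest.exists() and not overwrite:
        raise FileExistsError(f"{dest} already exists; pass overwrite=True to replace it")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return dest
```

Making the caller opt in to overwriting means an existing file is only replaced by an explicit choice, and a path influenced by untrusted input cannot leave the scoped directory.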

Description-Behavior Mismatch

Medium

Confidence: 95%
Finding: The suite exposes `get_visible_tests(role)` and `is_visible_to(role)` APIs that can return full `HiddenTest` objects, including encrypted payloads and salts, based only on a mutable string boundary. In a benchmarking context, this broad distribution interface undermines hidden-test isolation and creates a pathway for test-set exfiltration or offline analysis, even if the contents are still encrypted at rest.
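A narrower interface would return opaque references instead of the stored objects, so that ciphertext and salts never leave the suite. The sketch below illustrates the idea only. The `HiddenTest` fields, the `Role` enum, `HiddenTestRef`, and `visible_refs` are all assumptions made for this example and are not taken from the skill's code.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    # Hypothetical role names; an enum avoids comparing free-form strings.
    EVALUATOR = "evaluator"
    PROPOSER = "proposer"


@dataclass(frozen=True)
class HiddenTest:
    """Stand-in for the skill's class; the field names are assumed."""
    test_id: str
    ciphertext: bytes
    salt: bytes


@dataclass(frozen=True)
class HiddenTestRef:
    """Opaque reference that carries no encrypted payload or salt."""
    test_id: str


def visible_refs(tests: list[HiddenTest], role: Role) -> list[HiddenTestRef]:
    """Return identifiers only, and only to the evaluator role."""
    if role is not Role.EVALUATOR:
        return []
    return [HiddenTestRef(t.test_id) for t in tests]
```

With this shape, a caller can learn which tests exist but cannot collect encrypted payloads for offline analysis.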

Context-Inappropriate Capability

High

Confidence: 98%
Finding: Allowing the visibility boundary to be set to `proposer` means the same component can intentionally disclose hidden tests to the role being evaluated against them. Given this skill's stated purpose (baseline comparison and evaluation standards rather than hidden-test distribution), that capability directly increases the risk of test leakage, overfitting, and benchmark contamination.
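One possible mitigation is to validate the boundary against a fixed allow-list at construction time and expose it read-only afterwards. This is a hedged sketch, not the skill's implementation: the class name `HiddenTestSuite`, the `ALLOWED_BOUNDARIES` set, and the `auditor` role are hypothetical.

```python
# Hypothetical allow-list: roles that may ever see hidden tests.
# "proposer" is deliberately absent.
ALLOWED_BOUNDARIES = frozenset({"evaluator", "auditor"})


class HiddenTestSuite:
    """Illustrative suite whose visibility boundary cannot include the proposer."""

    def __init__(self, boundary: str = "evaluator") -> None:
        if boundary not in ALLOWED_BOUNDARIES:
            raise ValueError(f"boundary {boundary!r} is not permitted")
        self._boundary = boundary

    @property
    def boundary(self) -> str:
        # Read-only property: no setter is defined, so assignment
        # raises AttributeError instead of silently widening access.
        return self._boundary

    def is_visible_to(self, role: str) -> bool:
        return role == self._boundary
```

Because the boundary is fixed once the object exists, widening visibility requires constructing a new suite, which is easier to audit than a mutable string field.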

Intent-Code Divergence

Medium

Confidence: 83%
Finding: The documentation claims tests remain encrypted and are only temporarily decrypted during execution, but the class also includes role-based APIs that can hand out stored `HiddenTest` objects directly once the boundary is changed. This mismatch can cause integrators to rely on security properties the implementation does not actually guarantee, increasing the chance of accidental leakage or unsafe embedding into broader systems.

Vague Triggers

Medium

Confidence: 87%
Finding: The test cases use broad natural-language inputs without trigger guards, exclusions, or routing constraints, so generic requests could be interpreted as in-scope for benchmark-store. In an agent environment, this increases the chance that benchmark-store is activated for unrelated tasks, causing overbroad behavior and bypass of intended skill boundaries.
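A trigger guard of the kind this finding says is missing could be as simple as requiring an in-scope keyword and rejecting requests that ask for scoring or security evaluation. The sketch below is purely illustrative: the keyword lists and the function name `should_route_to_benchmark_store` are assumptions, and a real router would likely need more than keyword matching.

```python
import re

# Hypothetical in-scope terms: the request must mention one of these.
REQUIRED = re.compile(r"\b(benchmarks?|baselines?|pareto)\b", re.IGNORECASE)

# Hypothetical exclusions: activities the skill should not be routed to.
EXCLUDED = re.compile(
    r"\b(score|grade|rank|red[- ]team|security scan)\b", re.IGNORECASE
)


def should_route_to_benchmark_store(request: str) -> bool:
    """Accept only requests that are in scope and hit no exclusion."""
    return bool(REQUIRED.search(request)) and not EXCLUDED.search(request)
```

For example, a request to show the Pareto front for recent runs would pass the guard, while a request to score a candidate against the benchmark would be rejected by the exclusion list.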

VirusTotal

67/67 vendors flagged this skill as clean.
