Security audit

Rag Evaluator

Security checks across malware telemetry and agentic risk

Overview

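This appears to be a local RAG logging tool that saves user-entered data on disk. Its description is inconsistent with its documented behavior (the metadata advertises a Python SDK for agent observability, while the tool is a Bash CLI), but there is no evidence of exfiltration, credential use, or destructive behavior.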

Install only if you want a local Bash-based RAG experiment logger. Do not enter API keys, credentials, private customer data, or sensitive prompts unless you are comfortable with them being saved under ~/.local/share/rag-evaluator and included in exports. Verify how the rag-evaluator command is installed or linked, because the package does not provide a clear install mechanism.
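One way to check how the rag-evaluator command is installed or linked is to resolve it on PATH and follow any symlink. The sketch below is an illustration, not part of the package: the helper name describe_command is hypothetical, and it assumes the command is exposed on PATH under the name rag-evaluator.

```python
import os
import shutil
from typing import Optional


def describe_command(name: str = "rag-evaluator") -> Optional[dict]:
    """Report where a command resolves on PATH and what it points to.

    Hypothetical helper for illustration; returns None when the
    command is not found on PATH.
    """
    found = shutil.which(name)
    if found is None:
        return None
    real = os.path.realpath(found)  # follow symlinks to the actual file
    return {
        "path": found,
        "real_path": real,
        "is_symlink": os.path.islink(found),
        # A world-writable target could be replaced by any local user.
        "world_writable": bool(os.stat(real).st_mode & 0o002),
    }


if __name__ == "__main__":
    info = describe_command()
    print(info if info else "rag-evaluator not found on PATH")
```

If the resolved target lives outside the directory where the skill was unpacked, or is world-writable, that is worth reviewing before running it.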

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access

Findings (9)

Tp4

High

Category: MCP Tool Poisoning
Confidence: 95%
Finding: The skill metadata advertises a Python SDK for agent observability, but the documented behavior is a Bash CLI that persistently logs arbitrary command arguments to local files, supports search, export, and history inspection, and targets RAG evaluation specifically. This mismatch can mislead users and downstream systems into invoking the skill with sensitive prompts, tokens, or experiment data under the assumption it is a benign SDK wrapper, causing unintended local data retention and disclosure.

Description-Behavior Mismatch

Medium

Confidence: 88%
Finding: The manifest and heading position the skill as a Python SDK for observability, while the body documents a local Bash logging utility for RAG evaluation. In agent ecosystems, this kind of interface/behavior mismatch increases the chance that users or orchestrators supply sensitive operational data to a tool whose real function is local persistence and later export, creating avoidable confidentiality risk.

Description-Behavior Mismatch

Medium

Confidence: 85%
Finding: The declared scope is broad agent observability, but the actual documented functionality is much narrower: a local RAG pipeline evaluator/logger. While this is less severe than direct code execution, scope inflation can still cause inappropriate deployment decisions and over-trust, especially if a caller assumes standard observability controls rather than simple file-backed logging.

Description-Behavior Mismatch

High

Confidence: 95%
Finding: The script's behavior is materially inconsistent with the declared skill purpose: instead of a Python observability/evaluation SDK, it implements a generic Bash-based local data capture and journaling tool. This mismatch is dangerous because users may trust and invoke it under a benign SDK context while it persistently collects arbitrary free-form inputs and stores them on disk, increasing the likelihood of unintended sensitive data capture.

Context-Inappropriate Capability

Medium

Confidence: 86%
Finding: The export, search, status, and aggregation features create a broad local data collection surface that is not clearly necessary for an observability SDK helper. Even without network exfiltration, generic accumulation and re-packaging of arbitrary inputs can expose prompts, secrets, or internal data entered during normal use and make later disclosure easier.

Missing User Warnings

Medium

Confidence: 97%
Finding: The documentation says commands log inputs and store them persistently, but it does not prominently warn users that arbitrary command arguments will be written to local files and can later be searched or exported. In the context of RAG evaluation, users commonly pass prompts, retrieved snippets, model outputs, and cost data that may contain secrets, proprietary text, or personal data, so insufficient warning materially increases data leakage risk.

Missing User Warnings

Medium

Confidence: 92%
Finding: Arbitrary user input is written verbatim to persistent log files with no clear disclosure at entry time. In agent and evaluation workflows, users often paste prompts, API responses, tokens, configuration details, or proprietary text, so silent persistence materially increases the risk of local sensitive-data exposure.
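Since the tool itself stores input verbatim, a caller who wants to reduce this risk could scrub arguments before passing them in. The sketch below is a minimal illustration of that idea, not something the skill provides: the function name redact and the three patterns are assumptions, and the patterns are far from a complete secret detector (they will miss many secrets and can produce false positives).

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive detector).
SECRET_PATTERNS = [
    # Strings shaped like "sk-..." API keys.
    re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"),
    # HTTP bearer tokens.
    re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]{16,}"),
    # key=value or key: value pairs with a sensitive-looking key name.
    re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\s*[=:]\s*\S+"),
]


def redact(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace secret-like substrings before the text is stored anywhere."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    print(redact("retrieval run with api_key=abc123 and top_k=5"))
```

Pattern-based scrubbing is a mitigation only; the safer default remains not entering credentials or private data into the tool at all.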

Missing User Warnings

Medium

Confidence: 90%
Finding: The export function consolidates all previously stored logs into a single file without warning that potentially sensitive historical entries will be recopied. Consolidation increases exposure because data that was previously dispersed across multiple files becomes easier to share, inspect, leak, or ingest into other tooling.
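Before running an export, it can help to see what is already stored and would be consolidated. The sketch below lists the files under the data directory named in the overview (~/.local/share/rag-evaluator) with their sizes and whether group or other users can read them. The function name inventory is hypothetical, and the assumption that the tool keeps its logs as regular files under that directory is not confirmed by the audit beyond the path itself.

```python
import stat
from pathlib import Path

# Data directory named in the audit overview; the layout inside it is assumed.
DATA_DIR = Path.home() / ".local" / "share" / "rag-evaluator"


def inventory(data_dir: Path = DATA_DIR) -> list:
    """List regular files under data_dir with size and read exposure."""
    rows = []
    if not data_dir.is_dir():
        return rows  # nothing stored yet, or the tool uses another location
    for path in sorted(data_dir.rglob("*")):
        if not path.is_file():
            continue
        st = path.stat()
        rows.append({
            "file": str(path.relative_to(data_dir)),
            "bytes": st.st_size,
            # True when group or other users have read permission.
            "readable_by_others": bool(
                st.st_mode & (stat.S_IRGRP | stat.S_IROTH)
            ),
        })
    return rows


if __name__ == "__main__":
    for row in inventory():
        print(row)
```

Files that are large or readable by other users are the ones to review first, since an export would copy their contents into a single, easier-to-share file.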

Ssd 3

Medium

Confidence: 88%
Finding: The design persistently records user-provided inputs and then makes them broadly re-exposable through recent, status, search, and export workflows. In the context of an agent skill, this is more dangerous because users may enter operational prompts, internal documents, or credentials while expecting an evaluation tool rather than a journaling database.

VirusTotal

46/46 vendors reported this skill as clean.


Static analysis

No suspicious patterns detected.