Corpus Builder

Security checks across malware telemetry and agentic risk

Overview

This skill does what it says: it builds a local text corpus and can optionally send text to DashScope for AI annotation, with some operational risks users should manage.

Install only if you are comfortable with the skill creating local corpus files that may contain your original text and embeddings. Unset DASHSCOPE_API_KEY to keep annotation offline; if you use LLM mode, assume text chunks are sent to DashScope. Prefer a secret manager or temporary environment variable over storing API keys in ~/.bashrc, and verify any rm -rf path before running cleanup commands.
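The offline-by-default behavior described above can be sketched as follows. This is a minimal illustration, assuming the skill chooses LLM annotation purely from whether DASHSCOPE_API_KEY is set; the function name `annotation_mode` is hypothetical and not taken from the skill's code.

```python
import os
from typing import Mapping, Optional


def annotation_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "llm" only when a non-empty DASHSCOPE_API_KEY is present."""
    env = os.environ if env is None else env
    # Treat a missing, empty, or whitespace-only key as "no key".
    key = env.get("DASHSCOPE_API_KEY", "").strip()
    return "llm" if key else "offline"


if __name__ == "__main__":
    print(f"annotation mode: {annotation_mode()}")
```

Checking the mode before each run makes it explicit whether text chunks could leave the machine.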

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
Tool Misuse: Tool Parameter Abuse, Chaining Abuse, Unsafe Defaults
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access

Findings (14)

Missing User Warnings

Medium

Confidence: 89%
Finding: The README instructs users to persist an API key in ~/.bashrc, which can increase the chance of credential exposure through shell history, dotfile backups, shared accounts, or accidental publication of startup files. While this is common operational guidance rather than an exploit, documenting secret persistence without any warning or safer alternatives creates avoidable credential-handling risk.

Missing User Warnings

Medium

Confidence: 85%
Finding: The documentation recommends rm -rf on a storage path without clearly warning that the action is destructive and irreversible. If users substitute the collection name incorrectly, run from an unexpected directory, or adapt the command carelessly, it can lead to unintended data loss beyond the vector store.

Missing User Warnings

Medium

Confidence: 95%
Finding: The example explicitly instructs users to run a destructive `rm -rf` against the vector store but does not clearly warn that this will permanently delete indexed data and require a rebuild. In documentation for an agent skill, users may copy-paste commands directly, so omission of a data-loss warning creates a real operational safety issue even if it is not a code-execution exploit.

Missing User Warnings

Medium

Confidence: 90%
Finding: The troubleshooting guide recommends destructive deletion commands (`rm -rf`) to clear caches and remove corpus data, but it does not warn about irreversible data loss, scope the paths carefully, or suggest verification before execution. In a user-facing operational document, this can lead to accidental deletion of valuable local data if paths are mistyped, expanded unexpectedly, or copied without understanding.
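One way to add the scoping and verification this finding asks for is to route deletions through a small helper instead of a raw `rm -rf`. The sketch below is an assumption-laden illustration, not part of the skill: `safe_remove` is a hypothetical name, and the rule it enforces (the target must be an existing directory strictly inside a known root, with dry run as the default) is one possible policy.

```python
import shutil
import tempfile
from pathlib import Path


def safe_remove(target: Path, root: Path, dry_run: bool = True) -> Path:
    """Delete `target` only if it resolves to a directory strictly inside `root`."""
    root = root.resolve()
    resolved = target.resolve()  # follows symlinks and collapses ".."
    if resolved == root or root not in resolved.parents:
        raise ValueError(f"refusing to delete {resolved}: not inside {root}")
    if not resolved.is_dir():
        raise ValueError(f"refusing to delete {resolved}: not an existing directory")
    if dry_run:
        print(f"[dry run] would delete {resolved}")
    else:
        shutil.rmtree(resolved)
    return resolved


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than real corpus data.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "corpus" / "chroma"
        (root / "test").mkdir(parents=True)
        safe_remove(root / "test", root)                 # dry run: only prints
        safe_remove(root / "test", root, dry_run=False)  # actually deletes
        print((root / "test").exists())
```

Defaulting to a dry run means a mistyped path produces a message to read rather than lost data.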

Missing User Warnings

Medium

Confidence: 87%
Finding: The API key setup guidance tells users to export, persist, and print a secret value without warning about credential exposure risks. Persisting secrets in shell startup files and echoing them to the terminal can leak credentials through shell history, screen sharing, shoulder surfing, backups, or overly broad file permissions.
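A lower-exposure alternative to persisting and echoing the key might look like the sketch below. It assumes the key is read from DASHSCOPE_API_KEY or typed at a prompt for the current session only; `mask_secret` and `load_api_key` are hypothetical helper names, not functions from the skill.

```python
import getpass
import os


def mask_secret(secret: str, visible: int = 4) -> str:
    """Return a display-safe form revealing at most the last `visible` characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    hidden = len(secret) - visible
    return "*" * hidden + secret[hidden:]


def load_api_key() -> str:
    """Read the key from the environment, or prompt for it without echoing."""
    key = os.environ.get("DASHSCOPE_API_KEY", "")
    if not key:
        # getpass does not echo input, so the key stays off the screen.
        key = getpass.getpass("DashScope API key: ")
    return key
```

To confirm that a key is loaded, printing `mask_secret(load_api_key())` shows only its tail instead of the full value.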

Missing User Warnings

Medium

Confidence: 96%
Finding: The annotator sends user-supplied text content to a third-party LLM service without any consent check, sensitivity screening, or clear warning at the transmission path. If users process proprietary, personal, or regulated text, this can cause unintended external disclosure and compliance violations even though the endpoint is legitimate.
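The missing consent check and sensitivity screening could take roughly the following shape. This is only a sketch: the patterns are illustrative and far from complete, and `screen_text` and `may_send` are hypothetical names, not part of the annotator.

```python
import re
from typing import List

# Illustrative patterns only; real screening would need a much broader set.
SENSITIVE_PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "API-key-like token": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}


def screen_text(text: str) -> List[str]:
    """Return the labels of all sensitive patterns found in `text`."""
    return [label for label, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(text)]


def may_send(text: str, user_consented: bool) -> bool:
    """Allow transmission only with explicit consent and no screening hits."""
    return user_consented and not screen_text(text)
```

A caller would check `may_send(chunk, user_consented)` immediately before each request, so the gate sits at the transmission path itself.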

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    # Corpus Builder - Requirements
    # ChromaDB vector database
    chromadb>=0.5.0
    # Embedding model (semantic vectorization)
    sentence-transformers>=2.2.2
Confidence: 91%
Finding: chromadb>=0.5.0

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    chromadb>=0.5.0
    # Embedding model (semantic vectorization)
    sentence-transformers>=2.2.2
    # Config file parsing
    pyyaml>=6.0.1
Confidence: 91%
Finding: sentence-transformers>=2.2.2

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    sentence-transformers>=2.2.2
    # Config file parsing
    pyyaml>=6.0.1
    # Pretty CLI output
    rich>=13.7.0
Confidence: 96%
Finding: pyyaml>=6.0.1

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    pyyaml>=6.0.1
    # Pretty CLI output
    rich>=13.7.0
    # Memory monitoring
    psutil>=5.9.8
Confidence: 87%
Finding: rich>=13.7.0

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    rich>=13.7.0
    # Memory monitoring
    psutil>=5.9.8
    # sqlite3 version compatibility (ChromaDB requires sqlite3 >= 3.35.0)
    # If the system sqlite3 version is too old, install this package as a replacement
Confidence: 95%
Finding: psutil>=5.9.8

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    # sqlite3 version compatibility (ChromaDB requires sqlite3 >= 3.35.0)
    # If the system sqlite3 version is too old, install this package as a replacement
    pysqlite3-binary>=0.5.2
    # LLM API calls (AI annotation)
    # Call DashScope Coding through the OpenAI-compatible API
Confidence: 90%
Finding: pysqlite3-binary>=0.5.2

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    # LLM API calls (AI annotation)
    # Call DashScope Coding through the OpenAI-compatible API
    openai>=1.0.0
Confidence: 94%
Finding: openai>=1.0.0
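The seven findings above all flag a `>=` lower bound rather than an exact version. A small checker along these lines could catch that pattern before release. It is a simplified sketch: the `PINNED` pattern handles only plain `name==version` lines (no extras or environment markers), and the `unpinned` helper is hypothetical.

```python
import re
from typing import List

# Requirement specifiers as reported in the findings above.
REQUIREMENTS = """\
chromadb>=0.5.0
sentence-transformers>=2.2.2
pyyaml>=6.0.1
rich>=13.7.0
psutil>=5.9.8
pysqlite3-binary>=0.5.2
openai>=1.0.0
"""

# Matches "name==version" with no wildcard, extras, markers, or extra specifiers.
PINNED = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*==[A-Za-z0-9.!+-]+")


def unpinned(requirements: str) -> List[str]:
    """Return requirement lines that are not pinned to an exact `==` version."""
    flagged = []
    for raw in requirements.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.fullmatch(line):
            flagged.append(line)
    return flagged


if __name__ == "__main__":
    for requirement in unpinned(REQUIREMENTS):
        print(f"unpinned: {requirement}")
```

Which exact versions to pin is a decision for the skill's maintainers; the checker only reports lines that leave the version open.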

Tool Parameter Abuse

High

Category: Tool Misuse
Content:
```bash
# Delete the vector store and rebuild
rm -rf corpus/chroma/{collection_name}
python3 scripts/build_corpus.py --source ~/novels/reference --name test
```
Confidence: 97%
Finding: rm -rf corpus/chroma/
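Because the flagged command interpolates `{collection_name}` directly into a path, validating that parameter first limits what the command can reach. The sketch below assumes a naming rule (letters, digits, underscore, hyphen) that the skill itself does not document; `collection_path` is a hypothetical helper, and it only builds the path without deleting anything.

```python
import re
from pathlib import Path

# Assumed naming rule: starts alphanumeric, then letters, digits, "_" or "-".
VALID_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def collection_path(name: str, root: Path = Path("corpus/chroma")) -> Path:
    """Build the vector store path for `name`, rejecting unsafe values."""
    # fullmatch rejects empty names, path separators, "..", globs,
    # and an unsubstituted "{collection_name}" placeholder.
    if not VALID_NAME.fullmatch(name):
        raise ValueError(f"invalid collection name: {name!r}")
    return root / name
```

For example, `collection_path("test")` yields `corpus/chroma/test`, while an empty name, which would otherwise target the whole `corpus/chroma/` directory, raises `ValueError`.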

VirusTotal

66/66 vendors flagged this skill as clean.
