Corpus Builder

Security checks across malware telemetry and agentic risk

Overview

This skill does what it says: it builds a local text corpus and can optionally send text to DashScope for AI annotation, with some operational risks users should manage.

Install only if you are comfortable with the skill creating local corpus files that may contain your original text and embeddings. Unset DASHSCOPE_API_KEY to keep annotation offline; if you use LLM mode, assume text chunks are sent to DashScope. Prefer a secret manager or temporary environment variable over storing API keys in ~/.bashrc, and verify any rm -rf path before running cleanup commands.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
  • Tool Misuse: Tool Parameter Abuse, Chaining Abuse, Unsafe Defaults
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
  • Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access
Findings (14)

Missing User Warnings

Medium
Confidence
89% confidence
Finding
The README instructs users to persist an API key in ~/.bashrc, which can increase the chance of credential exposure through shell history, dotfile backups, shared accounts, or accidental publication of startup files. While this is common operational guidance rather than an exploit, documenting secret persistence without any warning or safer alternatives creates avoidable credential-handling risk.

Missing User Warnings

Medium
Confidence
85% confidence
Finding
The documentation recommends rm -rf on a storage path without clearly warning that the action is destructive and irreversible. If users substitute the collection name incorrectly, run from an unexpected directory, or adapt the command carelessly, it can lead to unintended data loss beyond the vector store.
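One way to reduce the risk described in this finding is to replace the raw rm -rf with a guarded helper. The following is a minimal sketch, not part of the skill: the function name `remove_collection` and its `dry_run` parameter are hypothetical, and the base directory is whatever path the user passes in (for example the `corpus/chroma` directory shown in the skill's documentation). It refuses names that contain path separators, checks that the resolved target sits directly inside the base directory, and defaults to a dry run.

```python
import shutil
from pathlib import Path


def remove_collection(base_dir: str, collection_name: str, dry_run: bool = True) -> Path:
    """Delete one collection directory under base_dir, refusing anything else.

    Hypothetical helper for illustration; the skill itself ships no such function.
    """
    base = Path(base_dir).resolve()
    # Reject empty names, "." / "..", and names that contain path separators.
    if (
        not collection_name
        or collection_name in {".", ".."}
        or Path(collection_name).name != collection_name
    ):
        raise ValueError(f"unsafe collection name: {collection_name!r}")
    target = (base / collection_name).resolve()
    # After resolving symlinks, the target must still sit directly inside base.
    if target.parent != base:
        raise ValueError(f"{target} is not directly inside {base}")
    if not target.is_dir():
        raise FileNotFoundError(str(target))
    if dry_run:
        # Default behaviour: only report what would be removed.
        print(f"[dry run] would delete {target}")
    else:
        shutil.rmtree(target)
    return target
```

Calling `remove_collection("corpus/chroma", "test")` only prints the resolved path; passing `dry_run=False` performs the deletion once the path has been checked by eye.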

Missing User Warnings

Medium
Confidence
95% confidence
Finding
The example explicitly instructs users to run a destructive `rm -rf` against the vector store but does not clearly warn that this will permanently delete indexed data and require a rebuild. In documentation for an agent skill, users may copy-paste commands directly, so omission of a data-loss warning creates a real operational safety issue even if it is not a code-execution exploit.

Missing User Warnings

Medium
Confidence
90% confidence
Finding
The troubleshooting guide recommends destructive deletion commands (`rm -rf`) to clear caches and remove corpus data, but it does not warn about irreversible data loss, scope the paths carefully, or suggest verification before execution. In a user-facing operational document, this can lead to accidental deletion of valuable local data if paths are mistyped, expanded unexpectedly, or copied without understanding.

Missing User Warnings

Medium
Confidence
87% confidence
Finding
The API key setup guidance tells users to export, persist, and print a secret value without warning about credential exposure risks. Persisting secrets in shell startup files and echoing them to the terminal can leak credentials through shell history, screen sharing, shoulder surfing, backups, or overly broad file permissions.
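A safer pattern than persisting and echoing the key can be sketched as follows. This is an illustration under stated assumptions, not the skill's code: `load_api_key` and `mask_secret` are hypothetical names, and only the `DASHSCOPE_API_KEY` variable name comes from the report. The key is read from the current environment, or prompted for without echo, and is only ever displayed in masked form.

```python
import getpass
import os
import sys
from typing import Optional


def load_api_key(env_var: str = "DASHSCOPE_API_KEY") -> Optional[str]:
    """Return the key from the environment, or prompt for it without echo."""
    key = os.environ.get(env_var)
    if key:
        return key
    # Only prompt when attached to a terminal; otherwise stay offline.
    if sys.stdin.isatty():
        key = getpass.getpass(f"{env_var} (input hidden, Enter to skip): ")
    return key or None


def mask_secret(secret: str, visible: int = 4) -> str:
    """Show only the last `visible` characters, for confirmation messages."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

With this approach the key lives only in the process environment or in memory for the session, and a confirmation message can print `mask_secret(key)` instead of the raw value.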

Missing User Warnings

Medium
Confidence
96% confidence
Finding
The annotator sends user-supplied text content to a third-party LLM service without any consent check, sensitivity screening, or clear warning at the transmission path. If users process proprietary, personal, or regulated text, this can cause unintended external disclosure and compliance violations even though the endpoint is legitimate.
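The missing consent check and sensitivity screening could look roughly like the sketch below. Everything here is hypothetical: the `annotate_with_consent` wrapper, the `send` callable standing in for whatever function performs the DashScope request, and the two screening patterns, which are illustrative and far from a complete detector for personal or regulated data.

```python
import re
from typing import Callable, Iterable, List, Optional

# Illustrative patterns only; real screening would need a broader rule set.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "api_key_like": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}


def find_sensitive(text: str) -> List[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pat in SENSITIVE_PATTERNS.items() if pat.search(text)]


def annotate_with_consent(
    chunks: Iterable[str],
    send: Callable[[str], str],
    consent: bool = False,
) -> List[Optional[str]]:
    """Send chunks to a remote annotator only after explicit consent.

    Chunks that match a sensitive pattern are skipped and recorded as None.
    """
    if not consent:
        raise PermissionError("external annotation requires explicit consent")
    results: List[Optional[str]] = []
    for chunk in chunks:
        if find_sensitive(chunk):
            results.append(None)  # never transmitted
            continue
        results.append(send(chunk))
    return results
```

The design choice is to fail closed: without `consent=True` nothing is transmitted, and flagged chunks stay local so the caller can decide how to handle them.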

Unpinned Dependencies

Low
Category
Supply Chain
Content
# Corpus Builder - Requirements

# ChromaDB vector database
chromadb>=0.5.0

# Embedding model (semantic vectorization)
sentence-transformers>=2.2.2
Confidence
91% confidence
Finding
chromadb>=0.5.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
chromadb>=0.5.0

# Embedding model (semantic vectorization)
sentence-transformers>=2.2.2

# Config file parsing
pyyaml>=6.0.1
Confidence
91% confidence
Finding
sentence-transformers>=2.2.2

Unpinned Dependencies

Low
Category
Supply Chain
Content
sentence-transformers>=2.2.2

# Config file parsing
pyyaml>=6.0.1

# Pretty CLI output
rich>=13.7.0
Confidence
96% confidence
Finding
pyyaml>=6.0.1

Unpinned Dependencies

Low
Category
Supply Chain
Content
pyyaml>=6.0.1

# Pretty CLI output
rich>=13.7.0

# Memory monitoring
psutil>=5.9.8
Confidence
87% confidence
Finding
rich>=13.7.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
rich>=13.7.0

# Memory monitoring
psutil>=5.9.8

# sqlite3 version compatibility (ChromaDB requires sqlite3 >= 3.35.0)
# If the system sqlite3 version is too old, install this package as a replacement
Confidence
95% confidence
Finding
psutil>=5.9.8

Unpinned Dependencies

Low
Category
Supply Chain
Content
# sqlite3 version compatibility (ChromaDB requires sqlite3 >= 3.35.0)
# If the system sqlite3 version is too old, install this package as a replacement
pysqlite3-binary>=0.5.2

# LLM API calls (AI annotation)
# Uses an OpenAI-compatible API to call DashScope Coding
Confidence
90% confidence
Finding
pysqlite3-binary>=0.5.2

Unpinned Dependencies

Low
Category
Supply Chain
Content
# LLM API calls (AI annotation)
# Uses an OpenAI-compatible API to call DashScope Coding
openai>=1.0.0
Confidence
94% confidence
Finding
openai>=1.0.0
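The unpinned-dependency findings above all follow one rule: a requirement that uses a range specifier such as `>=` instead of an exact `==` pin is flagged. A minimal sketch of that check is shown below. It is an assumption about how such a scan might work, not SkillSpector's actual implementation, and `unpinned_requirements` is a hypothetical name. It deliberately ignores pip options, environment markers, and wildcard pins such as `==1.*`.

```python
import re
from typing import List

# Package name at the start of a requirement line.
REQ_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def unpinned_requirements(text: str) -> List[str]:
    """Return package names whose requirement line has no exact '==' pin."""
    flagged: List[str] = []
    for raw in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blanks and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        match = REQ_NAME.match(line)
        if match is None:
            continue
        if "==" not in line:
            flagged.append(match.group(0))
    return flagged
```

Run against the requirements file quoted in these findings, every listed package would be flagged, since each uses `>=`. Pinning exact versions, ideally with hashes in a lock file, is the usual remedy.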

Tool Parameter Abuse

High
Category
Tool Misuse
Content
```bash
# Delete the vector store and rebuild
rm -rf corpus/chroma/{collection_name}
python3 scripts/build_corpus.py --source ~/novels/reference --name test
```
Confidence
97% confidence
Finding
rm -rf corpus/chroma/

VirusTotal

66/66 vendors flagged this skill as clean.
