doc-extract-filter
Security checks across static analysis, malware telemetry, and agentic risk
Overview
The artifacts show a purpose-aligned document extraction tool; the main risk is that it can read and save full local document contents when asked, especially in batch mode.
This skill appears appropriate for extracting and filtering document text. Before installing or using it, make sure you are comfortable sharing the selected files with the agent, avoid broad batch directories, and review or pin dependencies if using it in a controlled environment.
Static analysis
No static analysis findings were reported for this release.
VirusTotal
VirusTotal findings are pending for this skill version.
Risk analysis
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
If you filter a sensitive document, the assistant may still receive the entire document, not only the matching snippets.
Filter mode returns the full extracted document text in addition to filtered matches, so private content may enter the agent context even when the user asked for filtering.
"data": {
    "text": text,
    "filtered_text": filtered_text,
    "matches": filter_result.get("results", [])
Use the skill only on documents you intend to share with the agent, and consider changing filter mode to return only matches if you need stricter privacy.
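One way to apply the stricter option is to leave the full text out of the filter-mode payload unless the caller explicitly asks for it. The sketch below is illustrative only and is not the skill's actual code: `build_filter_response` and its `include_full_text` flag are hypothetical names, and the shape of `filter_result` is assumed from the excerpt above.

```python
from typing import Any, Dict


def build_filter_response(
    text: str,
    filtered_text: str,
    filter_result: Dict[str, Any],
    include_full_text: bool = False,
) -> Dict[str, Any]:
    """Build a filter-mode payload that omits the full document by default.

    Hypothetical helper: the key names mirror the excerpt above, but the
    function itself and the include_full_text flag are assumptions.
    """
    data: Dict[str, Any] = {
        "filtered_text": filtered_text,
        "matches": filter_result.get("results", []),
    }
    if include_full_text:
        # Opt-in only: this puts the whole document into the agent context.
        data["text"] = text
    return {"data": data}
```

With this default, a caller who only asked for filtering receives the matches and the filtered text, and the full document is returned only on explicit request.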
Choosing a broad folder could create extracted-text copies of many local documents.
Batch mode recursively collects supported files from the chosen directory and writes extracted results to JSON files.
for file_path in input_dir.rglob('*'):
    if file_path.is_file() and file_path.suffix.lower() in supported_extensions:
        files_to_process.append(file_path)
...
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(result, f, ensure_ascii=False, indent=2)
Use narrow input directories, choose a controlled output directory, and avoid pointing batch mode at home, workspace, or shared folders unless that is intended.
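The advice about narrow input directories could also be enforced in code before any file is read. The following is a minimal sketch under stated assumptions, not the skill's behavior: `collect_files`, the refusal of the home and root directories, the non-recursive default, and the `max_files` cap are all hypothetical choices made for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def collect_files(
    input_dir: Path,
    supported_extensions: Iterable[str],
    recursive: bool = False,
    max_files: int = 50,
) -> List[Path]:
    """Collect supported files while refusing overly broad directories.

    Hypothetical guard: the refused locations, the non-recursive default,
    and the max_files cap are illustrative assumptions.
    """
    root = input_dir.resolve()
    # Refuse the user's home directory and the filesystem root outright.
    if root in (Path.home().resolve(), Path(root.anchor)):
        raise ValueError(f"refusing to batch-process broad directory: {root}")

    extensions = {ext.lower() for ext in supported_extensions}
    candidates = root.rglob("*") if recursive else root.glob("*")
    files: List[Path] = []
    for path in candidates:
        if path.is_file() and path.suffix.lower() in extensions:
            files.append(path)
            if len(files) > max_files:
                # Stop early instead of silently copying a large tree.
                raise ValueError(
                    f"more than {max_files} files matched; narrow the input directory"
                )
    return sorted(files)
```

Making recursion opt-in and capping the number of matches means a mistakenly broad path fails loudly instead of producing extracted-text copies of every supported document beneath it.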
Future installs may resolve different package versions than the author tested.
Several dependencies are listed without exact version pins, unlike the core pinned packages, which can lead to dependency drift during installation.
python-markdown  # for Markdown file processing
beautifulsoup4  # for extracting text from HTML
...
pytesseract; python_version >= "3.6"
Pin and verify dependency versions before installing in a sensitive environment.
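A quick way to see which entries would drift is to scan the requirements list for lines that lack an exact `==` pin. This is a rough sketch, not a full requirements parser: `find_unpinned` is a hypothetical helper, it only recognizes simple `name==version` entries, and `examplepkg` in the usage below is a made-up package name.

```python
import re
from typing import Iterable, List

# Matches a simple exact pin such as "pkg==1.2.3" or "pkg[extra]==1.2.3".
_PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]*\])?\s*==\s*[^\s;]+")


def find_unpinned(requirement_lines: Iterable[str]) -> List[str]:
    """Return requirement entries that lack an exact '==' version pin."""
    unpinned: List[str] = []
    for raw in requirement_lines:
        # Drop inline comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as "-r other.txt"
        if not _PINNED.match(line):
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    sample = [
        "python-markdown  # for Markdown file processing",
        "beautifulsoup4",
        'pytesseract; python_version >= "3.6"',
        "examplepkg==1.2.3",  # hypothetical pinned entry
    ]
    for entry in find_unpinned(sample):
        print("unpinned:", entry)
```

Pinning covers version drift only. To also verify what gets downloaded, pip's hash-checking mode (`--require-hashes`) can be used alongside exact pins.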
