doc-extract-filter
Analysis
The artifacts show a purpose-aligned document extraction tool; the main risk is that it can read and save full local document contents when asked, especially in batch mode.
Findings (3)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
Checks for instructions or behavior that redirect the agent, misuse tools, execute unexpected code, cascade across systems, exploit user trust, or continue outside the intended task.
for file_path in input_dir.rglob('*'):
    if file_path.is_file() and file_path.suffix.lower() in supported_extensions:
        files_to_process.append(file_path)
...
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(result, f, ensure_ascii=False, indent=2)
Batch mode recursively collects supported files from the chosen directory and writes extracted results to JSON files.
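To make the scope of the recursive collection concrete, the following is a minimal sketch rather than the skill's actual code. The names `SUPPORTED_EXTENSIONS`, `collect_files`, and `demo` are hypothetical, and the extension set is an assumption, since the excerpt does not show the real list. The sketch shows that `rglob('*')` reaches files at any depth below the chosen directory.

```python
import tempfile
from pathlib import Path

# Assumed extension set for illustration; the skill's real list is not shown.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".pdf"}


def collect_files(input_dir: Path, supported_extensions: set[str]) -> list[Path]:
    """Return every supported file under input_dir, including subdirectories."""
    return sorted(
        p for p in input_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in supported_extensions
    )


def demo() -> list[str]:
    """Build a throwaway directory tree and list what batch collection would pick up."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "nested" / "deeper").mkdir(parents=True)
        (root / "top.txt").write_text("a", encoding="utf-8")
        (root / "nested" / "notes.MD").write_text("b", encoding="utf-8")
        (root / "nested" / "deeper" / "private.pdf").write_text("c", encoding="utf-8")
        (root / "nested" / "image.png").write_text("d", encoding="utf-8")
        found = collect_files(root, SUPPORTED_EXTENSIONS)
        return [p.relative_to(root).as_posix() for p in found]


if __name__ == "__main__":
    print(demo())
```

In this sketch the file two levels down is collected along with the top-level one, which is why the directory passed to batch mode should be chosen with care.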
python-markdown  # for Markdown file processing
beautifulsoup4  # for extracting text from HTML
...
pytesseract; python_version >= "3.6"
Unlike the core packages, which are pinned, several dependencies are listed without exact version pins; this can lead to dependency drift during installation.
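A quick way to spot such entries is sketched below. This is an illustrative helper, not part of ClawScan or the skill: `unpinned_requirements` is a hypothetical name, the parsing is deliberately simplified (it is not a full requirements-file parser), and `examplepkg==1.0.0` is a made-up pinned entry used only for contrast.

```python
def unpinned_requirements(lines):
    """Return requirement specs that lack an exact '==' pin.

    Simplified on purpose: inline comments and environment markers are
    stripped, then any spec without '==' is reported.
    """
    unpinned = []
    for raw in lines:
        spec = raw.split("#", 1)[0]            # drop inline comments
        spec = spec.split(";", 1)[0].strip()   # drop environment markers
        if not spec or spec == "...":
            continue                           # skip blanks and elisions
        if "==" not in spec:
            unpinned.append(spec)
    return unpinned


sample = [
    "python-markdown  # for Markdown file processing",
    "beautifulsoup4  # for extracting text from HTML",
    'pytesseract; python_version >= "3.6"',
    "examplepkg==1.0.0",  # hypothetical pinned entry, for contrast
]

if __name__ == "__main__":
    print(unpinned_requirements(sample))
```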
Checks for exposed credentials, poisoned memory or context, unclear communication boundaries, or sensitive data that could leave the user's control.
"data": {
    "text": text,
    "filtered_text": filtered_text,
    "matches": filter_result.get("results", [])
Filter mode returns the full extracted document text in addition to filtered matches, so private content may enter the agent context even when the user asked for filtering.

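One possible way to narrow what filter mode returns is sketched here. It is an assumption about how such a response could be shaped, not the skill's implementation: `build_filter_response` and its `include_full_text` flag are hypothetical names. The field names `text`, `filtered_text`, and `matches` come from the excerpt above.

```python
def build_filter_response(text, filtered_text, matches, include_full_text=False):
    """Build a filter-mode payload that omits the full text unless requested."""
    data = {
        "filtered_text": filtered_text,
        "matches": matches,
    }
    if include_full_text:
        # Opt-in only: the full document enters the payload on explicit request.
        data["text"] = text
    return {"data": data}


if __name__ == "__main__":
    response = build_filter_response(
        text="full private document",
        filtered_text="private",
        matches=["private"],
    )
    print(sorted(response["data"]))
```

With the default arguments, only the filtered text and matches reach the caller, so the full document stays out of the agent context unless the user asks for it.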