Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Ontology Engineer

Extract candidate ontology models from enterprise business systems AND build/maintain personal knowledge graphs from any file system. Use when: ontology extr...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The name/description align with the included scripts and reference docs: the code scans directories, converts Office/PDF files, extracts tables and builds graph outputs. However the registry summary claims "No required binaries / env / install spec", while SKILL.md and scripts explicitly expect python3, optional LibreOffice/Word, and several Python packages (python-docx, PyMuPDF, openpyxl, xlrd, python-pptx, Pillow, pyyaml). That mismatch between declared registry requirements and the runtime instructions is an inconsistency developers should justify.
Instruction Scope
The SKILL.md directs scanning arbitrary directories and reading many file types (.doc/.docx/.pdf/.xlsx/.pptx/.csv/.sql etc.). The document states a mandatory user-scoped confirmation (Step 1.5) before reading content, but that is an operational promise rather than a technical enforcement inside the scripts. The scripts perform conversion and subprocess calls (LibreOffice, COM), filesystem traversal and extraction — high-scope actions that require explicit user oversight. If the interactive confirmation is not enforced by the agent wrapper, the scripts could read large parts of the user's files.
Install Mechanism
Registry metadata indicated 'No install spec' but SKILL.md contains an 'install' block and explicit pip install recommendations. The code relies on third-party Python packages and optional system binaries (libreoffice, Microsoft Word via COM). There is no opaque network download host in the files shown, but installing the listed packages will pull from public PyPI — a normal but non-trivial install step that the registry should advertise. The mismatch (no top-level install spec vs SKILL.md requirements) is inconsistent.
Credentials
The skill does not request environment variables, credentials, or config paths. All declared operations are local file processing and standard system tools. No API keys or remote endpoints are required per SKILL.md.
Persistence & Privilege
The skill is not force-installed (always: false) and does not request to modify other skills or system-wide settings. It writes append-only output files to a user-specified directory per its design. Note: autonomous invocation (default enabled) combined with broad filesystem read capability increases potential impact if the agent runs without manual gatekeeping — pair this with the SKILL.md's Step 1.5 requirement.
What to consider before installing
This skill is broadly consistent with an on-device ontology/graph extractor, but exercise caution before running it against your real data. Key points to consider:
- Metadata mismatch: The registry summary claims no install/requirements, but SKILL.md and scripts expect python3, optional LibreOffice or MS Word, and multiple pip packages. Request or confirm an explicit install spec before proceeding.
- Filesystem scope: The scripts are designed to scan arbitrary directories and extract structured data from many file types. The SKILL.md promises a mandatory interactive 'Step 1.5' confirmation before reading files. Verify that the agent integration actually enforces this checkpoint (i.e., the agent must ask you and not auto-run scans).
- Run safely: First run in a sandbox or container, or point the scripts at a small test folder. Use the provided dry-run options (scan_filesystem.py --dry-run, etc.) to inspect what would be processed. Do not point it at /, your home directory, or backups until you are confident of its behavior.
- Dependency installs: Installing the advertised Python packages pulls code from PyPI. If you plan to install, prefer doing so in an isolated virtualenv/container and review package versions. On Windows, COM automation requires Word and pywin32; on Linux/macOS, LibreOffice is required for legacy .doc/.wps conversion. These are system-level dependencies.
- Network & exfiltration checks: SKILL.md states 'no external API calls', and no remote endpoints are visible in the provided files. Still, before running, scan the code for network libraries (requests, urllib, socket) and monitor outbound connections (netstat) during a test run.
- Review scripts for enforcement of user consent: Ensure the agent wrapper or local runner does not bypass Step 1.5. If the code does not enforce an interactive prompt, require the agent to ask for folder approval before running any extraction commands.
- If in doubt: Ask the publisher/maintainer to clarify the registry/install mismatch and whether the agent enforces the interactive confirmation, and request an explicit install manifest (requirements.txt or setup script) plus a short audit of network behavior. If you must run it on sensitive data, do so only in an isolated environment after these checks.
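
The "scan the code for network libraries" check above can be approximated with a short static pass over the skill's Python files. This is a minimal sketch, not part of the skill: the function names and the `NETWORK_MODULES` list are assumptions made for illustration, and a static import scan will not catch dynamic imports or network tools launched through subprocess calls.

```python
import ast
from pathlib import Path

# Illustrative, non-exhaustive list of modules whose import suggests network access.
NETWORK_MODULES = {"requests", "urllib", "urllib3", "socket", "http",
                   "httpx", "aiohttp", "ftplib", "smtplib"}


def find_network_imports(source: str) -> set[str]:
    """Return top-level names of network-related modules imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # absolute "from x import y" only
        else:
            continue
        for name in names:
            top = name.split(".")[0]
            if top in NETWORK_MODULES:
                found.add(top)
    return found


def audit_directory(root: str) -> dict[str, set[str]]:
    """Map each .py file under `root` to the network modules it imports."""
    report = {}
    for path in Path(root).rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits = find_network_imports(text)
        except SyntaxError:
            continue  # skip files that do not parse as Python
        if hits:
            report[str(path)] = hits
    return report
```

Calling `audit_directory("scripts")` on the unpacked skill would list any file that imports one of these modules. An empty result is a useful signal, not proof; it complements monitoring outbound connections during a test run.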

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.1.1
Tags: bfo, database, enterprise, knowledge-graph, latest, office-documents, ontology, personal-knowledge, sql

SKILL.md

Ontology Engineer

Extract candidate ontology models from existing data. Build and maintain personal knowledge graphs.

Core principle: Make implicit business models in existing data explicit. Don't create from scratch.

Division of labor: Scripts handle mechanical extraction (file scanning, format conversion, table parsing). LLM handles semantic judgment (entity identification, property selection, relationship discovery, naming, cross-source merging).

Security model:

  • No external API calls. The LLM running this skill (Claude, OpenClaw, etc.) IS the semantic engine. No credentials, no network endpoints, no data exfiltration paths.
  • User-scoped scanning. Step 1.5 is a MANDATORY interactive checkpoint — the user reviews and approves every folder before any content is read. Nothing is analyzed without explicit confirmation.
  • Local-only output. All artifacts (graph.jsonl, schema.yaml, review.md) are written to a user-specified local directory. No data leaves the machine.
  • Append-only writes. Scripts only create/append files. No deletion, no modification of existing user files.

When This Skill Adds Value (and When It Doesn't)

Knowledge graphs and ontology extraction are not universally useful. Before starting, assess fit:

| Scenario | Value | Why |
|----------|-------|-----|
| 3+ heterogeneous systems with inconsistent naming for the same concepts | High (Mode A) | Cross-system concept alignment is the core use case |
| Agent product needs factual grounding to reduce hallucination | High (Mode B/C) | Graph becomes the agent's fact base; auto-query before every response |
| 1000+ entities with dense relationships across long time spans | High | Pattern discovery humans can't do manually (churn, cross-sell, capability mapping) |
| Client consulting engagement analyzing their data landscape | High (Mode A) | Core consulting deliverable: "here's what your data assets look like" |
| Small org, <200 entities, info fits in one person's head + Excel | Low (Mode B) | Graph just re-stores what the user already knows; use as PoC/capability validation only |
| Single system, no cross-system integration need | Low (Mode A) | Read the schema directly; ontology layer adds overhead without value |

Rule of thumb: If the user's reaction to the output is "I already knew all this", the graph isn't producing incremental value. Redirect to Mode A (client projects) or Agent integration.

Detailed value scenarios: references/value-scenarios.md

Operating Modes

| Mode | Input | Output | Use When |
|------|-------|--------|----------|
| A: Database Extraction | SQL DDL, data dictionaries, Word/Excel schemas | ontology.json + review.md | Analyzing enterprise business systems |
| B: Filesystem Scanning | Local/cloud directories | graph.jsonl + schema.yaml | Building personal knowledge graph |
| C: External Data | Others' data spaces, shared drives | graph.jsonl (source=external) | Acquiring others' business models |

Mode A: Database Extraction

Three-phase workflow for extracting ontology from structured data sources.

Phase 1: SCAN

Run scripts/scan_directory.py to discover and classify files by priority (P1-P7).

python scripts/scan_directory.py "<dir>" --output scan_result.json --report

Review scan report. Process P1-P2 files first, expand as needed.

Phase 2: EXTRACT

  1. Convert .doc files if needed: scripts/convert_doc.py
  2. Extract tables from Word/Excel: scripts/extract_tables.py
  3. Read extracted data, apply Rules 1-7 (see analysis-rules.md)
  4. For text formats (.sql, .json, .yaml, .md, .csv): read directly

Phase 3: MERGE

  1. Cross-source entity deduplication (Rule 5)
  2. Relationship consolidation
  3. Output: ontology.json + review.md

Detailed rules: references/analysis-rules.md (Rules 1-7)
Quality checks: references/quality-checks.md
Script details: references/script-operations.md
Modeling decisions: references/modeling-decisions.md


Mode B/C: Knowledge Graph

Two-step pipeline for building personal knowledge graphs from file systems.

Step 1: File Indexing (script, no LLM)

python scripts/scan_filesystem.py --root /path --config namespace_rules.yaml --extract-metadata

Creates Document + Project entities in graph.jsonl. Pure mechanical operation.

Key features: Auto namespace inference, duplicate detection, .docx/.pdf metadata extraction, universal noise filtering.

Step 1.5: User Scope Confirmation (MANDATORY interactive step)

After Step 1 completes, present the scan summary to the user and ask for scope confirmation before proceeding to Step 2. The user knows which folders matter most.

Display a table of all discovered projects/namespaces with document counts, then ask:

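Scan complete: found {N} projects with {M} documents in total. Please mark a priority for each folder:
- 🔴 Key (重点): high sampling rate, analyzed first
- ⚪ Normal (普通): default sampling rate
- ⚫ Ignore (忽略): skipped, not analyzed
- Or enter "all" (全部) to skip selection and process every folder with the default strategy

| # | Project | Documents | Formats | Default priority |
|---|---------|-----------|---------|------------------|
| 1 | work/myfiles | 15,617 | .doc .docx .pdf .xlsx | 🔴 Key |
| 2 | work/classified | 1,578 | .doc .pdf .xlsx | ⚪ Normal |
| ... | ... | ... | ... | ... |

Enter adjustments (e.g. "2=ignore, 5=key"), "all" (全部), or "confirm" (确认):

Rules:

  • User can mark any folder as Key (重点), Normal (普通), or Ignore (忽略)
  • User can type "all" (全部) to skip selection and use defaults
  • Key folders get a 2-3x sampling rate; Ignore folders are skipped entirely
  • Default priority is auto-inferred: human work folders = Key, AI-generated = Normal, downloads/cache = Ignore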
  • Never skip this step. Even if obvious, let the user confirm.
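
The reply format in the prompt above ("2=忽略, 5=重点") could be parsed along these lines. This is only a sketch under stated assumptions: `apply_adjustments`, the `LABELS` mapping, and the English aliases are hypothetical and are not taken from the skill's scripts, which (per the scan report) may not enforce this step at all.

```python
# Accepted priority labels; the English aliases are an assumption for illustration.
LABELS = {
    "重点": "key", "key": "key",
    "普通": "normal", "normal": "normal",
    "忽略": "ignore", "ignore": "ignore",
}
# Replies that keep the inferred defaults unchanged.
KEEP_DEFAULTS = {"全部", "确认", "all", "confirm"}


def apply_adjustments(defaults: dict[int, str], reply: str) -> dict[int, str]:
    """Return folder priorities after applying a reply such as '2=忽略, 5=重点'."""
    priorities = dict(defaults)  # never mutate the caller's mapping
    reply = reply.strip()
    if reply.lower() in KEEP_DEFAULTS:
        return priorities
    # Accept full-width punctuation, which is common in Chinese input.
    for part in reply.replace(",", ",").replace("=", "=").split(","):
        part = part.strip()
        if not part:
            continue
        index_text, sep, label = part.partition("=")
        if not sep:
            raise ValueError(f"expected '<number>=<priority>', got {part!r}")
        index = int(index_text.strip())
        label = label.strip().lower()
        if index not in priorities:
            raise ValueError(f"unknown folder number: {index}")
        if label not in LABELS:
            raise ValueError(f"unknown priority: {label!r}")
        priorities[index] = LABELS[label]
    return priorities
```

Raising on malformed input, instead of silently ignoring it, keeps the checkpoint meaningful: the agent has to re-ask rather than proceed with a scope the user did not approve.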

Step 2: Semantic Analysis (LLM, core step)

Five phases: Sampling → Document Reading → Aggregation → Cross-project Alignment → Output.

Key decisions (details in knowledge-graph-workflow.md):

  • Minimum 10% coverage; Key (重点) folders are sampled at 2-3x; Ignore (忽略) folders are skipped
  • Structured lists (Rule 13): Files named 列表 (list), 台账 (ledger), 名单 (roster), 登记表 (registration form), 清单 (inventory) etc. → full extraction (every row = one entity), NOT sampling. See analysis-rules.md Rule 13.
  • Dual-track extraction: Track A (named entities) + Track B (domain terms)
  • Subagents must be general-purpose type (Bash access). Never use Explore type.
  • Format tools: see formats-and-deps.md
  • Relation semantics: Use enriched relation format with direction, cardinality, temporal range. See relation-ontology.md.
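
To make the sampling rule concrete, the per-folder sample size might be computed as below. This is a hedged sketch: the function name, the priority strings, and the choice of 2x as the multiplier are assumptions, the actual logic lives in knowledge-graph-workflow.md, and the Rule 13 full-extraction case is not modeled here.

```python
def sample_size(doc_count: int, priority: str,
                base_percent: int = 10, key_multiplier: int = 2) -> int:
    """Number of documents to sample from a folder.

    priority is one of "key", "normal", "ignore" (hypothetical labels).
    Integer arithmetic avoids floating-point surprises when rounding up.
    """
    if priority == "ignore" or doc_count <= 0:
        return 0
    percent = base_percent * (key_multiplier if priority == "key" else 1)
    percent = min(percent, 100)  # never sample more than the whole folder
    return -(-doc_count * percent // 100)  # ceiling division
```

With the folder sizes from the Step 1.5 example table, a normal folder of 1,578 documents yields 158 samples and a key folder of 15,617 documents yields 3,124 at the assumed 2x multiplier.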

Step 3: Runtime Evolution

Agent enriches the knowledge graph during daily conversations. source.type = "runtime".

When to trigger (passive, no user action needed):

  • User mentions a person by name + role/org → check graph, append if new
  • User discusses a project/event with dates → append Event
  • User makes a strategic decision or key insight → append Note
  • User mentions a new organization/client → append Organization

How to append:

python query_graph.py search "张三"  # Check if entity exists
# If not found, append to graph.jsonl:
echo '{"op":"create","ts":"...","entity":{"id":"per-NNNNN","type":"Person","graph":"core/persons","source":{"type":"runtime","conversation_id":"..."},...}}' >> graph.jsonl

Rules:

  • Only append entities with concrete evidence from the conversation
  • Never overwrite existing entities — only add new ones or note conflicts
  • Use source.type = "runtime" to distinguish from scan-derived entities
  • Keep it lightweight: 1-3 entities per conversation, not a full re-scan
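
As an alternative to the `echo ... >> graph.jsonl` form, the same append can be sketched in Python, which handles JSON escaping and makes the "never overwrite" rule explicit. The helper names and arguments below are hypothetical; the record layout follows the graph.jsonl example in this document, and the existence check is a simple linear scan, not the skill's query_graph.py.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def entity_exists(graph_path: str, entity_id: str) -> bool:
    """Return True if any record in the JSONL file already uses `entity_id`."""
    path = Path(graph_path)
    if not path.exists():
        return False
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line and json.loads(line).get("entity", {}).get("id") == entity_id:
                return True
    return False


def append_runtime_entity(graph_path: str, entity_id: str, entity_type: str,
                          graph: str, properties: dict, conversation_id: str) -> dict:
    """Append one runtime-sourced entity; refuse to reuse an existing id."""
    if entity_exists(graph_path, entity_id):
        raise ValueError(f"entity {entity_id} already exists; not overwriting")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    record = {
        "op": "create",
        "ts": now,
        "entity": {
            "id": entity_id,
            "type": entity_type,
            "graph": graph,
            "source": {"type": "runtime", "conversation_id": conversation_id},
            "properties": properties,
            "created_at": now,
        },
    }
    # Mode "a" only ever adds a line; existing content is left untouched.
    with Path(graph_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

`ensure_ascii=False` keeps non-ASCII names readable in the file, matching the style of the example record in the Output Formats section.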

Full workflow details: references/knowledge-graph-workflow.md
Analysis rules (8-12): references/analysis-rules.md
Format support & deps: references/formats-and-deps.md


Key Principles

  • Model business concepts, not database tables. Table names ≠ object names.
  • Extract then express. Make implicit models explicit, don't create from nothing.
  • Experts judge. Produce candidates; final decisions belong to humans. When in doubt, flag it.
  • Invest in invariants. Stable entities and relationships, not technical details.
  • Handle what exists. Real projects use Word and Excel. Adapt to the data.
  • Scripts extract, LLM analyzes. Mechanical extraction via Python. Semantic judgment via LLM.
  • Coverage over perfection. 60% of files at moderate depth beats 3% at maximum depth.
  • Generic skeleton + domain discovery. 8 core types (BFO-aligned). Domain types discovered by scanning.
  • Single source of truth. All data in one graph.jsonl. Soft partition via graph/labels/source.
  • Relations carry semantics. Direction, cardinality, temporal range, evidence. Not just type + target.
  • Append-only evolution. Never delete entities. Deprecate, reclassify, version.

Ontology Theory References

| Reference | When to Read |
|-----------|--------------|
| modeling-decisions.md | Core type boundaries, entity vs enum, promotion judgment |
| relation-ontology.md | Relation format, core relation catalog, ternary relations |
| ontology-evolution.md | Schema versioning, entity reclassification, conflict resolution |
| constraints-and-inference.md | Type/relation constraints, inference rules, inconsistency detection |
| value-scenarios.md | When this skill adds value and when it doesn't |

Output Formats

graph.jsonl (Mode B/C)

{"op":"create","ts":"2026-01-15T10:00:00Z","entity":{"id":"per-00001","type":"Person","graph":"core/persons","labels":["employee"],"source":{"type":"scan","scan_id":"step2-r1"},"properties":{"name":"张三","roles":["项目经理"],"organizations":["某科技公司"]},"relations":[{"type":"works_at","target_id":"org-00002","direction":"forward","cardinality":"N:1","temporal":{"start":"2019-01","end":null},"confidence":"high"}],"created_at":"2026-01-15T10:00:00Z"}}

Required: id, type, graph, source, created_at. Optional: labels, properties, relations.

Relation fields: type + target_id required. Optional: direction (forward/reverse/bidirectional), cardinality (1:1/1:N/N:1/N:M), temporal ({start, end}), evidence (source entity ID), confidence (high/medium/low). See relation-ontology.md.
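
The required and optional fields listed above lend themselves to a small structural check before a record is appended. The following is a sketch only: `validate_entity` is a hypothetical helper, and it checks just the field names and enumerated values stated in this section, not the fuller constraints described in constraints-and-inference.md.

```python
REQUIRED_ENTITY_FIELDS = ("id", "type", "graph", "source", "created_at")
REQUIRED_RELATION_FIELDS = ("type", "target_id")
# Allowed values for the optional enumerated relation fields.
RELATION_ENUMS = {
    "direction": {"forward", "reverse", "bidirectional"},
    "cardinality": {"1:1", "1:N", "N:1", "N:M"},
    "confidence": {"high", "medium", "low"},
}


def validate_entity(entity: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = [f"missing entity field: {name}"
                for name in REQUIRED_ENTITY_FIELDS if name not in entity]
    for i, relation in enumerate(entity.get("relations", [])):
        for name in REQUIRED_RELATION_FIELDS:
            if name not in relation:
                problems.append(f"relation {i}: missing {name}")
        for name, allowed in RELATION_ENUMS.items():
            if name in relation and relation[name] not in allowed:
                problems.append(f"relation {i}: invalid {name} {relation[name]!r}")
    return problems
```

Returning a list instead of raising on the first error fits the "experts judge" principle: all problems can be surfaced in review.md for a human to resolve.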

schema.yaml (Mode B/C)

meta:
  version: "2.0"
core_types:       # 8 fixed (BFO-aligned): Person, Organization, Project, Task, Document, Event, Note, Goal
domain_types:     # Discovered by Step 2 Track B, grouped by domain
namespaces:       # core/, work/*, personal/*, external/*, uncategorized/*
source_types:     # scan | runtime | manual | email | cloud | chat
relation_schema:  # Relation fields: type, target_id, direction, cardinality, temporal, evidence, confidence
relation_types:   # Core relation catalog grouped by source type pair
constraints:      # type_constraints (required props, enums), relation_constraints, id_pattern
inference_rules:  # Transitive subsidiary, symmetric partner, inverse works_at, etc.
schema_evolution:  # Version format, backward compatibility rules

ontology.json (Mode A)

{
  "meta": {"generated_by": "ontology-engineer", "source_files": [], "domain": "..."},
  "object_types": [{"name": "...", "english": "...", "core_properties": [], "confidence": "high|medium|low"}],
  "link_types": [{"from": "A", "relation": "verb", "to": "B", "cardinality": "1:N", "evidence": "..."}],
  "review_flags": [{"type": "promotion|merge|ambiguity|missing", "item": "...", "question": "..."}]
}
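
Since final decisions belong to human reviewers, one plausible first use of ontology.json is to pull out its `review_flags` grouped by type. This is an illustrative sketch that assumes only the structure shown above; `group_review_flags` is a hypothetical name, not a function shipped with the skill.

```python
from collections import defaultdict


def group_review_flags(ontology: dict) -> dict[str, list[str]]:
    """Group review flags by their type (promotion, merge, ambiguity, missing)."""
    grouped = defaultdict(list)
    for flag in ontology.get("review_flags", []):
        entry = f'{flag.get("item", "?")}: {flag.get("question", "")}'
        grouped[flag.get("type", "unknown")].append(entry)
    return dict(grouped)
```

A caller would typically load the file with `json.load` and print each group as a checklist for the domain expert.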

review.md

  1. Scan summary
  2. Model overview
  3. Object catalog
  4. Relationship map
  5. Review items
  6. Cross-source merges
  7. Data quality notes
  8. Decision log

Scripts

| Script | Mode | Purpose |
|--------|------|---------|
| scripts/scan_filesystem.py | B/C | File indexing, namespace inference, metadata extraction |
| scripts/scan_directory.py | A | File discovery with P1-P7 priority classification |
| scripts/convert_doc.py | A | .doc → .docx conversion |
| scripts/extract_tables.py | A | Table extraction from Word/Excel |

Details: references/script-operations.md


Agent Integration

| Component | Purpose | Status |
|-----------|---------|--------|
| query_graph.py | Search entities by type/name/graph/labels, traverse relations | Done |
| Runtime write | Agent appends new entities during conversation (Step 3) | Done |
| MCP Server | Expose graph as tools: search_entities, get_relations | Planned |
| Prompt injection | Agent auto-queries graph for context before handling tasks | Planned |

Query tool usage:

python query_graph.py stats                    # Overview
python query_graph.py search "keyword"         # Search
python query_graph.py type Person --limit 20   # By type
python query_graph.py get per-00001            # Details
python query_graph.py relations per-00001      # Relations
python query_graph.py domain --limit 30        # Domain terms
python query_graph.py export Person --format csv  # Export

Files

16 total