safety-kb-import

v1.0.0

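Import tool for work-safety regulations and standards. Use it when the user needs to import new regulations or standards into the knowledge base, extract text from PDFs, split clauses, run batch imports, or validate data quality. Trigger phrases: 导入法规、添加标准、入库、导入知识库、补充标准、PDF提取文本、拆分条款、KB导入、safety-review import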


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for cyz9827/safety-kb-import.

Install the skill "safety-kb-import" (cyz9827/safety-kb-import) from ClawHub.
Skill page: https://clawhub.ai/cyz9827/safety-kb-import
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install safety-kb-import

ClawHub CLI


npx clawhub@latest install safety-kb-import
Security Scan
VirusTotal: Benign
OpenClaw: Benign (medium confidence)
Purpose & Capability
The name/description (importing regulations into a safety KB) matches the included script and SKILL.md: the tool extracts text, splits clauses, and inserts/updates records in a local SQLite database (regulations/clauses/std_registry). This is coherent with the stated purpose. Minor mismatch: SKILL.md shows a UNIX-style DB path (~/.openclaw-autoclaw/...), while the script's DEFAULT_DB_PATH uses a Windows-style backslash path; both point to a user-home DB but differ in formatting across OSes.
Instruction Scope
Runtime instructions focus on extracting text (pdfplumber, optional OCR), building a JSON manifest, testing clause-splitting, and importing into the local safety-review DB. These actions fall within the stated import scope. Items to note: the SKILL.md and script reference using other skills or tools (web_fetch, safety-kb-query) for fetch/check operations; the workflow also instructs fallback to non-official sources (PPT/网络来源) when OCR/content is unavailable — this is a procedural choice but should be documented when used. The script also looks for an environment variable KB_PATH as an override for the DB path (not declared in requires.env).
Install Mechanism
There is no packaged install spec (instruction-only with an included script). SKILL.md asks the user to pip-install pdfplumber and optional OCR libraries and to install the Tesseract engine if needed. This is a low-risk pattern, but because dependencies are installed manually, users should install them from trusted sources and be aware of system-level requirements for OCR.
Credentials
The skill declares no required credentials or config paths, which aligns with its local-DB import purpose. However, the script respects an undocumented environment variable KB_PATH to override the DB path. No cloud credentials or unrelated secrets are requested. The use of the user's home directory DB implies the skill will read and write local files — appropriate for an importer but worth noting.
Persistence & Privilege
The skill is not marked 'always', and normal autonomous invocation is allowed. It only targets the specified local knowledge DB for inserts/updates; it does not request system-wide privileges or attempt to modify other skills' configurations. The main privilege is write access to the local SQLite file, which is expected for this tool.
Assessment
This package appears to do what it says (local import of regulations into a safety-review SQLite DB), but take these precautions before using it:

  1. Back up the target database (~/.openclaw-autoclaw/... or the Windows-equivalent path) before running imports; the tool updates and overwrites records.
  2. Be aware that the script reads and writes your local DB and may be pointed at a different DB via the KB_PATH env var (not declared in the SKILL.md); verify that KB_PATH is not set to an unexpected location.
  3. Review the manifest JSON carefully (titles/full_text) before import to avoid importing incorrect or unofficial content; the workflow allows fallback to non-official sources (PPT/网络来源) when official text is unavailable.
  4. Install PDF/OCR dependencies (pdfplumber, pdf2image, pytesseract + system Tesseract) from trusted sources.
  5. If you obtained this skill from an unknown source, consider auditing the full script (scripts/kb_import.py) for any behavior you don't expect, or run it in a sandboxed environment first.

If you want higher assurance, ask the author to declare KB_PATH in requires.env and to provide an install spec or signed release.


MIT-0

Safety KB Import: Work-Safety Regulation and Standard Import Tool

Overview

This skill provides standardized, safe import workflows for adding regulations, standards, and policy documents into the safety-review knowledge base (SQLite). It handles multi-source text extraction, smart clause splitting, conflict detection, and three-table atomic writes (regulations + clauses + std_registry).

Database location: ~/.openclaw-autoclaw/skills/safety-review/db/knowledge.db
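
The Troubleshooting section mentions a KB_PATH environment variable and a DEFAULT_DB_PATH constant. A minimal sketch of how such a lookup could work is shown below; the precedence (KB_PATH first, then the documented default) and the helper name resolve_db_path are assumptions, not the script's confirmed behavior.

```python
import os
from pathlib import Path

# Assumed default, built from the location documented above.
DEFAULT_DB_PATH = (
    Path.home() / ".openclaw-autoclaw" / "skills" / "safety-review" / "db" / "knowledge.db"
)


def resolve_db_path(env=None):
    """Return the database path, preferring KB_PATH when it is set and non-empty.

    `env` defaults to os.environ; passing a plain dict makes the lookup easy to test.
    """
    env = os.environ if env is None else env
    override = env.get("KB_PATH", "").strip()
    return Path(override).expanduser() if override else DEFAULT_DB_PATH


if __name__ == "__main__":
    print(resolve_db_path())
```

Printing the resolved path before an import is a cheap way to confirm that KB_PATH is not pointing at an unexpected file.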

When to Use This Skill

  • User wants to add new standards/regulations to the knowledge base
  • User has PDF files that need text extraction before import
  • User needs to batch-import multiple standards at once
  • User asks about importing courseware-referenced standards that are missing
  • Any write operation on the safety-review database

Trigger phrases (Chinese): 导入法规、入库、添加标准、补充知识库、PDF提取、拆分条款、批量导入

Companion skill: Use safety-kb-query first to check what's already in the database before importing.

Prerequisites

  1. Detect Python command:

    python --version
    
  2. Required packages for PDF extraction:

    pip install pdfplumber
    

    For OCR of scanned PDFs:

    pip install pdf2image pytesseract
    # Also requires Tesseract OCR engine installed on system
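
Before starting, it can help to confirm which of these dependencies are actually present. The following is a small sketch using only the standard library; the function name check_dependencies is hypothetical, and it only tests whether the packages and the tesseract executable can be found, not whether they work.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which PDF/OCR dependencies can be located on this machine."""
    modules = ("pdfplumber", "pdf2image", "pytesseract")
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    # pytesseract is only a wrapper; the Tesseract engine must be on PATH as well.
    report["tesseract"] = shutil.which("tesseract") is not None
    return report


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```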
    

Import Workflow (Complete)

Phase 1: Preparation — Check What's Needed ⭐ Always Do This First

Before importing anything, use safety-kb-query to identify gaps:

python <kb_query_path>/kb_query.py check "GB 16423" "AQ/T 2033" "AQ 2034"

This prevents duplicates and identifies data quality issues.

Phase 2: Text Extraction

Option A: From PDF Files

python scripts/kb_import.py extract-pdf /path/to/document.pdf

Response fields:

  • success: boolean
  • text: extracted full text (empty if scan-only)
  • char_count: number of characters extracted
  • page_count: total pages
  • is_scan_only: true if PDF is image-based (needs OCR)

If is_scan_only is true, the PDF is a scanned/image-based document:

  1. Install Tesseract and run OCR on the PDF.
  2. If OCR is unavailable, fall back to extracting the content from PPT lecture materials or web sources.
  3. Record the source as "PPT整理" (compiled from PPT) or "网络来源" (web source) rather than presenting it as official text.
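
The decision described above can be sketched as a small function over the response fields. This assumes the extract-pdf response has already been parsed into a dict; the function name and the returned labels are illustrative only.

```python
def next_step(response):
    """Pick the follow-up action for a parsed extract-pdf response (a dict)."""
    if not response.get("success"):
        return "fix-extraction-error"
    # A scan-only PDF, or one that yielded no text, needs OCR or a fallback source.
    if response.get("is_scan_only") or not response.get("text", "").strip():
        return "run-ocr-or-use-fallback-source"
    return "build-manifest"


if __name__ == "__main__":
    sample = {"success": True, "text": "", "char_count": 0, "page_count": 70, "is_scan_only": True}
    print(next_step(sample))
```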

Option B: From Web Sources

Use web_fetch to get full text from government websites, wikisource, etc. Common reliable sources:

  • 维基文库 (Wikisource, wikisource.org): full text of laws and policies
  • 政府公报 (government gazette, gov.cn/gongbao): official gazette versions
  • 部委官网 (official ministry websites): original standard publications

Option C: From Existing Documents (.docx, .pptx)

Extract text from these formats using appropriate libraries (python-docx, python-pptx) or the respective skills.

Phase 3: Create Import Manifest

Create a JSON manifest file listing all items to import:

{
  "items": [
    {
      "title": "金属非金属矿山安全规程",
      "document_number": "GB 16423—2020",
      "issuing_authority": "国家市场监督管理总局",
      "authority_level": "national",
      "effective_date": "2021-09-01",
      "status": "current",
      "domains": "矿山安全",
      "category": "国标",
      "full_text": "<complete extracted text here>",
      "source_url": "",
      "page_count": 70,
      "clause_split_pattern": "standard"
    },
    {
      "title": "国务院关于进一步加强企业安全生产工作的通知",
      "document_number": "国发〔2010〕23号",
      "issuing_authority": "国务院",
      "authority_level": "national",
      "effective_date": "2010-07-23",
      "status": "current",
      "domains": "安全生产",
      "category": "政策文件",
      "full_text": "<complete text>",
      "source_url": "https://zh.wikisource.org/...",
      "page_count": 5,
      "clause_split_pattern": "policy"
    }
  ]
}

Manifest Field Reference

| Field | Required | Description |
|---|---|---|
| title | | Full title of the regulation/standard |
| document_number | | Standard number (GB XXXX, AQ/T XXXX, 国发[X]X号) |
| issuing_authority | | Issuing agency (default: "") |
| authority_level | | One of: national, ministerial, local |
| effective_date | | ISO date format YYYY-MM-DD |
| status | | current (default), superseded, draft, repealed |
| domains | | Domain category (e.g., "矿山安全") |
| category | | Type: "国标", "行标", "政策文件", "地方文件" |
| full_text | | Complete text content for clause splitting |
| source_url | | Original source URL for attribution |
| page_count | | Number of pages (for reference) |
| clause_split_pattern | | standard (default), policy, raw_lines |
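
A manifest can be sanity-checked against the field reference before it is imported. The sketch below is one possible pre-check, not part of the tool; the allowed values come from the field reference, while the choice of title, document_number, and full_text as mandatory fields is an assumption made for this example.

```python
from datetime import date

AUTHORITY_LEVELS = {"national", "ministerial", "local"}
STATUSES = {"current", "superseded", "draft", "repealed"}
SPLIT_PATTERNS = {"standard", "policy", "raw_lines"}
# Assumption: these three fields are treated as mandatory in this sketch.
ASSUMED_REQUIRED = ("title", "document_number", "full_text")


def check_item(item):
    """Return a list of problems found in one manifest item (empty if none)."""
    problems = []
    for field in ASSUMED_REQUIRED:
        if not str(item.get(field, "")).strip():
            problems.append(f"missing {field}")
    level = item.get("authority_level")
    if level is not None and level not in AUTHORITY_LEVELS:
        problems.append(f"unknown authority_level: {level}")
    if item.get("status", "current") not in STATUSES:
        problems.append(f"unknown status: {item.get('status')}")
    pattern = item.get("clause_split_pattern", "standard")
    if pattern not in SPLIT_PATTERNS:
        problems.append(f"unknown clause_split_pattern: {pattern}")
    effective = item.get("effective_date")
    if effective:
        try:
            date.fromisoformat(effective)  # expects an ISO date such as 2021-09-01
        except (TypeError, ValueError):
            problems.append("effective_date must be YYYY-MM-DD")
    return problems


def check_manifest(manifest):
    """Map each item's position in the manifest to its problems, skipping clean items."""
    results = {}
    for index, item in enumerate(manifest.get("items", [])):
        problems = check_item(item)
        if problems:
            results[index] = problems
    return results
```

Running check_manifest on the loaded JSON and refusing to import when the result is non-empty keeps obviously malformed items out of the database.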

Clause Splitting Patterns

The tool supports three splitting strategies — choose based on document type:

| Pattern | Best For | How It Works |
|---|---|---|
| standard | GB/AQ national/industry standards | Recognizes chapters (第X章), sections (N.N), sub-sections (N.N.N), appendixes |
| policy | Government notices, State Council documents | Recognizes Chinese numbering (一、二、(一)、1.) |
| raw_lines | Unstructured text, fallback | Splits by non-empty lines |

Test splitting before full import:

python scripts/kb_import.py split-clauses --text "$SAMPLE_TEXT" --pattern standard
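
To make the three strategies concrete, here is a simplified regex-based sketch of heading-driven splitting. It is an illustration of the idea only, not the tool's implementation: the regular expressions, the function name split_clauses, and the returned dict shape are all assumptions, and the real patterns are likely more thorough.

```python
import re

# Illustrative heading patterns; the tool's own rules may differ.
HEADING_PATTERNS = {
    # Chapters (第X章), appendixes (附录 A), and dotted numbers such as 4.1 or 4.1.2.
    "standard": re.compile(r"^(第[一二三四五六七八九十百零\d]+章|附录\s*[A-Z]|\d+(?:\.\d+)*(?=\s))"),
    # Chinese numbering: 一、 then （一） then 1.
    "policy": re.compile(r"^([一二三四五六七八九十百]+、|[（(][一二三四五六七八九十百]+[）)]|\d+[.．])"),
}


def split_clauses(text, pattern="standard"):
    """Split text into clauses; lines without a heading join the previous clause."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if pattern == "raw_lines":
        return [{"number": "", "text": line} for line in lines]
    heading = HEADING_PATTERNS[pattern]
    clauses = []
    for line in lines:
        match = heading.match(line)
        if match or not clauses:
            clauses.append({"number": match.group(1) if match else "", "text": line})
        else:
            clauses[-1]["text"] += "\n" + line
    return clauses


if __name__ == "__main__":
    sample = "第一章 总则\n1.1 为规范矿山安全\n制定本规程\n1.2 适用范围"
    for clause in split_clauses(sample, "standard"):
        print(clause["number"], "|", clause["text"].replace("\n", " "))
```

Because the matching is heuristic, a numeric value at the start of a line can be mistaken for a clause number, which is one reason to test on a sample before a full import.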

Phase 4: Execute Import

python scripts/kb_import.py import --json manifest.json

What happens during import:

  1. For each item in the manifest:

    • Searches existing regulations by document_number
    • If found → UPDATE (overwrite existing data)
    • If not found → INSERT (create new record)
  2. Clause processing:

    • Deletes old clauses (if updating)
    • Re-splits full_text using specified pattern
    • Inserts new clause records linked to regulation ID
  3. std_registry registration (automatic):

    • If document_number starts with GB/AQ → auto-registers in std_registry table
    • Skips if already registered
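
The three steps above can be sketched with the standard sqlite3 module as a single transaction per item. The table names come from this document, but every column name and the demo schema below are assumptions made for illustration; the real knowledge.db schema may differ.

```python
import sqlite3

# Hypothetical minimal schema, used only to make the sketch runnable.
DEMO_SCHEMA = """
CREATE TABLE regulations (id INTEGER PRIMARY KEY, document_number TEXT UNIQUE,
                          title TEXT, full_text TEXT);
CREATE TABLE clauses (id INTEGER PRIMARY KEY, regulation_id INTEGER, text TEXT);
CREATE TABLE std_registry (document_number TEXT PRIMARY KEY);
"""


def upsert_regulation(conn, item, clauses):
    """Write one manifest item and its clauses; returns (regulation_id, status)."""
    doc_no = item["document_number"]
    with conn:  # commits on success, rolls back if an exception escapes
        row = conn.execute(
            "SELECT id FROM regulations WHERE document_number = ?", (doc_no,)
        ).fetchone()
        if row:
            reg_id, status = row[0], "updated"
            conn.execute(
                "UPDATE regulations SET title = ?, full_text = ? WHERE id = ?",
                (item["title"], item["full_text"], reg_id),
            )
            # Old clauses are removed before the re-split text is inserted.
            conn.execute("DELETE FROM clauses WHERE regulation_id = ?", (reg_id,))
        else:
            cursor = conn.execute(
                "INSERT INTO regulations (document_number, title, full_text) VALUES (?, ?, ?)",
                (doc_no, item["title"], item["full_text"]),
            )
            reg_id, status = cursor.lastrowid, "created"
        conn.executemany(
            "INSERT INTO clauses (regulation_id, text) VALUES (?, ?)",
            [(reg_id, text) for text in clauses],
        )
        # GB/AQ numbers are registered once; existing entries are left alone.
        if doc_no.startswith(("GB", "AQ")):
            conn.execute(
                "INSERT OR IGNORE INTO std_registry (document_number) VALUES (?)", (doc_no,)
            )
    return reg_id, status
```

Wrapping all three writes in one transaction means a failure partway through an item leaves none of its rows behind, although already committed items are not undone.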

Output includes per-item status:

  • created — New record inserted
  • updated — Existing record overwritten
  • skipped — (reserved for future skip logic)
  • error — Database error with message

Phase 5: Post-Import Validation

Always validate after importing:

# Validate specific imported records
python scripts/kb_import.py validate <regulation_id>

# Check overall data quality
python <kb_query_path>/kb_query.py conflicts

# Verify it's findable
python <kb_query_path>/kb_query.py search "<document_number>"

Handling Special Cases

Scanned/Image-Based PDFs (No Extractable Text)

When extract-pdf returns "is_scan_only": true:

  1. First choice: Install tesseract and run OCR
  2. Second choice: Find text version from web sources (government sites, wikisource)
  3. Third choice: Extract from related PPT/lecture materials (document as "PPT整理")
  4. Last resort: Skip or note as "待补充官方全文"

Important: When using non-official sources (PPT, web scraping), always note this in the source_url field so data provenance is tracked.

Large Standards (e.g., GB 16423 with 80K+ characters)

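No special handling is needed; the tool processes them normally. Clause counts may be high (2000+). The standard split pattern (clause_split_pattern set to "standard" in the manifest, or --pattern standard with split-clauses) usually gives the best results.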

Batch Imports (10+ Items)

Split the manifest into batches of 5-10 items each and run them sequentially, so that a failure can be traced to a small batch.
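
Splitting a large manifest by hand is error-prone, so a short helper can do it. This is a sketch under the assumption that the manifest has the top-level "items" list shown in Phase 3; the function names and the output file naming are hypothetical.

```python
import json
from pathlib import Path


def split_manifest(manifest, batch_size=5):
    """Return a list of smaller manifests with at most batch_size items each."""
    items = manifest["items"]
    return [{"items": items[i:i + batch_size]} for i in range(0, len(items), batch_size)]


def write_batches(manifest_path, out_dir, batch_size=5):
    """Write each batch to its own JSON file and return the file paths in order."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, batch in enumerate(split_manifest(manifest, batch_size), start=1):
        path = out_dir / f"manifest_batch_{number:02d}.json"
        # ensure_ascii=False keeps Chinese titles readable in the output files.
        path.write_text(json.dumps(batch, ensure_ascii=False, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

Each resulting file can then be passed to the import command one at a time.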

Conflict: Existing Record Has Wrong Data

The tool will overwrite any existing record matching the document_number. Before overwriting:

  1. Use safety-kb-query info <id> to check current data
  2. If the current data looks correct (for example, it belongs to a different standard that shares a similar number), abort and investigate
  3. The conflicts command in safety-kb-query can help identify mismatched records proactively
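
Document numbers appear in several spellings in this document (for example "GB 16423—2020", "GB 16423-2020", and "国发〔2010〕23号" versus "国发[2010]23号"), which makes accidental mismatches or overwrites easier. A possible way to compare numbers before deciding whether two records are the same is sketched below; whether the tool normalizes numbers at all is not stated, so this helper is purely an assumption for manual checks.

```python
import re

# Map dash and bracket variants to one ASCII form (written as escapes for clarity).
_VARIANTS = str.maketrans({
    "\u2014": "-",  # em dash
    "\u2013": "-",  # en dash
    "\uff0d": "-",  # fullwidth hyphen-minus
    "\u3014": "[",  # left tortoise shell bracket
    "\u3015": "]",  # right tortoise shell bracket
    "\uff3b": "[",  # fullwidth left square bracket
    "\uff3d": "]",  # fullwidth right square bracket
})


def normalize_doc_number(raw):
    """Return a comparison key: unified dashes/brackets, no whitespace, upper case."""
    return re.sub(r"\s+", "", raw.translate(_VARIANTS)).upper()


def same_document(a, b):
    """True when two document numbers differ only in spacing or punctuation style."""
    return normalize_doc_number(a) == normalize_doc_number(b)
```

The normalized key is meant for comparison only; the original spelling should still be stored in the manifest.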

Complete Example: Importing Courseware-Referenced Standards

This is the canonical workflow when a user says "the standards referenced in my training material aren't in the database":

Step 1: Extract references from user's document
        → List: [GB 16423-2020, AQ/T 2033-2023, AQ 2034, 国发[2010]23号]

Step 2: Gap analysis
        $ python kb_query.py check GB16423 AQ2033 AQ2034 "国发[2010]23号"
        → Found: 1, Missing: 3, Issues: 1 (ID:94 has wrong data)

Step 3: Extract text for missing items
        $ python kb_import.py extract-pdf GB16423-2020.pdf
        → { success: true, text: "...", char_count: 80357 }

Step 4: Create manifest.json with all items

Step 5: Execute import
        $ python kb_import.py import --json manifest.json
        → { imported: 3, updated: 2, skipped: 0 }

Step 6: Validate
        $ python kb_import.py validate 94
        → { is_valid: true, issues: [] }

Step 7: Verify
        $ python kb_query.py check GB16423 AQ2033 AQ2034 "国发[2010]23号"
        → All found ✓

Relationship with Other Skills

| Skill | Role |
|---|---|
| safety-kb-query | Query/read operations; must use BEFORE import for gap detection |
| safety-kb-import (this one) | Import/write operations into the database |
| pdf | Advanced PDF handling (merge, split, watermark); use for complex PDF prep work |
| standard-update-courseware | Update courseware after standards change; uses both query & import |

Known Limitations

  • No rollback: Import commits immediately. Validate before importing bulk data.
  • OCR dependency: Scanned PDF handling requires external tesseract installation.
  • Clause granularity: Split patterns are heuristic-based; review output for edge cases.
  • Single-user: No locking mechanism for concurrent access.
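
Since there is no rollback, taking a copy of the database before each import is a reasonable safeguard. The sketch below uses the standard library's sqlite3 backup API; the function name and the timestamped file naming are assumptions, and the database path must be supplied by the caller.

```python
import sqlite3
from datetime import datetime
from pathlib import Path


def backup_database(db_path, backup_dir=None):
    """Copy a SQLite database to a timestamped file and return the new path."""
    db_path = Path(db_path)
    if not db_path.is_file():
        raise FileNotFoundError(f"Database not found: {db_path}")
    backup_dir = Path(backup_dir) if backup_dir else db_path.parent
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{db_path.stem}.{stamp}.bak.db"
    source = sqlite3.connect(db_path)
    dest = sqlite3.connect(target)
    try:
        # Connection.backup copies the whole database page by page.
        source.backup(dest)
    finally:
        dest.close()
        source.close()
    return target
```

Restoring is then a matter of replacing knowledge.db with the chosen backup file while no import is running.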

Troubleshooting

| Error | Cause | Solution |
|---|---|---|
| Database not found | Wrong path | Set KB_PATH env var or update DEFAULT_DB_PATH |
| no such column: X | Schema changed | Run schema command to verify columns |
| UNIQUE constraint failed | Duplicate insert attempt | Tool should handle updates; check manifest has unique doc numbers |
| clause_count: 0 after import | Text empty or pattern mismatched | Try different clause_split_pattern; verify full_text field isn't empty |
| Garbled Chinese in output | Encoding issue | Ensure script runs with UTF-8 locale; Windows: chcp 65001 |

Version History

  • 1.0.0 (2026-04-25): Initial release with import, extract-pdf, split-clauses, validate, schema commands
