hybrid-smart-fill

v1.0.0

This skill provides hybrid retrieval (BM25 semantic search + TF-IDF vector similarity) for intelligent template auto-filling. Use when users need to batch fi...

by maodou13 (@deweienweide)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for deweienweide/hybrid-smart-fill.

Install the skill "hybrid-smart-fill" (deweienweide/hybrid-smart-fill) from ClawHub.
Skill page: https://clawhub.ai/deweienweide/hybrid-smart-fill
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

openclaw skills install hybrid-smart-fill

ClawHub CLI

npx clawhub@latest install hybrid-smart-fill

Security Scan

VirusTotal: Benign
OpenClaw: Benign (medium confidence)

Purpose & Capability
Name/description (hybrid retrieval + template filling) match the included code and docs. The Python modules implement BM25/TF-IDF hybrid retrieval and template fill logic; required inputs (knowledge-base JSON, templates) align with the stated purpose.
Instruction Scope
SKILL.md instructs running the bundled scripts which is expected, but smart_filler.py contains hard-coded absolute Windows paths (kb_path, template_dir, output_dir) that will be used if the script is executed without editing. Running the script as-is could attempt to read those local paths; the code also performs broad regex extraction/replacement (including a hard-coded 'XX' → '国寿安保基金' replacement) which is domain-specific. There are no instructions to read unrelated system files or external endpoints, but the hard-coded paths are a moderate risk if left unchanged.
Install Mechanism
No install spec; instruction-only plus included Python scripts. No downloaded archives, no external installers, and no package pulls in the skill metadata. Scripts have minimal third-party dependency hints (python-docx, openpyxl) but those must be installed by the user.
Credentials
The skill requests no environment variables or credentials. The code reads only files (knowledge base and template files); there are no hidden env var usages or secrets requests.
Persistence & Privilege
Flags show no always:true and no special privileges. The skill does not modify other skills or global agent configuration and has no automatic installation hooks.
Assessment
This skill appears to do what it claims (local knowledge-base → Word/Excel template filling) and does not request credentials or contact outside endpoints. Before running:

  1. Inspect and edit scripts/smart_filler.py to set kb_path, template_dir, and output_dir to directories you control (the shipped script has hard-coded Windows paths).
  2. Review the regex patterns and the hard-coded placeholder replacement ('XX'→'国寿安保基金') to ensure they are appropriate for your data.
  3. Run in a sandbox or test directory first to verify behavior.
  4. Install required Python packages (python-docx, openpyxl) in a virtual environment.
  5. Ensure your knowledge_base.json does not contain sensitive secrets you don't want processed or written into output files.

If you want higher assurance, ask the author to remove hard-coded paths and make configuration explicit (command-line args or config file) or provide a small sanitized example KB and templates to test with.

latest: vk97aady613b61k5fg49khz2gr58357hc
148 downloads · 1 star · 1 version · Updated 1mo ago
v1.0.0 · MIT-0

Hybrid Smart Fill Skill

This skill enables intelligent template filling using hybrid retrieval that combines BM25 keyword ranking with TF-IDF vector similarity. It automatically matches template fields against knowledge base data and fills Word documents (.docx) and Excel spreadsheets (.xlsx).

When to Use This Skill

Use this skill when:

  1. Batch Template Filling: Users need to fill multiple Word or Excel templates with data from a knowledge base
  2. High Precision Required: Simple keyword matching is insufficient; semantic understanding is needed for accurate field matching
  3. Knowledge Base Available: A structured knowledge base (JSON format) containing fields and values is available
  4. Complex Field Names: Template fields require semantic matching (e.g., "法人代表" matches "法定代表人"; both mean "legal representative")
  5. Placeholder Replacement: Templates contain placeholders like "XX基金" ("XX Fund") that need to be replaced with actual company names

Common trigger phrases:

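  • "填充模板" (fill template), "批量填充" (batch fill), "智能填充" (smart fill)
  • "使用知识库" (use knowledge base), "匹配字段" (match fields)
  • "向量检索" (vector retrieval), "语义检索" (semantic retrieval), "BM25", "TF-IDF"
  • "自动填写Word/Excel模板" (auto-fill Word/Excel templates)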

Core Concepts

Hybrid Retrieval System

This skill uses a hybrid retrieval approach combining two algorithms:

  1. BM25 (Best Matching 25): Statistical ranking function based on term frequency and document frequency

    • Accounts for document length normalization
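    • Down-weights terms that appear in many documents (via IDF)
    • Per-term score, summed over the query terms: IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × doc_length / avgdl)), where k1 controls term-frequency saturation, b controls length normalization, and avgdl is the average document length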
  2. TF-IDF (Term Frequency-Inverse Document Frequency): Vector similarity search

    • Converts text to vector space
    • Calculates cosine similarity between query and documents
    • Matches on overlapping terms rather than requiring an exact key match (lexical similarity, not learned semantics; see Limitations)
  3. Hybrid Score: Weighted fusion of both results

    • Formula: final_score = 0.5 × BM25_score + 0.5 × TF-IDF_score
    • Balances exact-term ranking (BM25) with softer vector similarity (TF-IDF); the weights are adjustable (see Custom Retrieval Weights)
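
The two scoring functions listed above can be sketched in a few lines. This is a minimal illustration, not the code shipped in scripts/vector_kb.py: the function names (tokenize, idf, bm25_scores, tfidf_cosine), the k1=1.5 and b=0.75 defaults, and the smoothed IDF variant are all assumptions. Character-level tokenization follows the Limitations section. The corpus is assumed to be non-empty.

```python
import math
from collections import Counter

def tokenize(text):
    """Character-level tokens; whitespace is dropped (assumed tokenizer)."""
    return [ch for ch in text if not ch.isspace()]

def idf(term, docs):
    """Smoothed IDF, always positive: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    df = sum(1 for d in docs if term in d)
    return math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """BM25 score of the query against every document in the corpus."""
    docs = [tokenize(d) for d in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            f = tf[term]
            norm = f + k1 * (1 - b + b * len(d) / avgdl)
            score += idf(term, docs) * f * (k1 + 1) / norm
        scores.append(score)
    return scores

def tfidf_cosine(query, corpus):
    """Cosine similarity between TF-IDF vectors of the query and each document."""
    docs = [tokenize(d) for d in corpus]

    def vec(tokens):
        return {t: c * idf(t, docs) for t, c in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(tokenize(query))
    return [cosine(q, vec(d)) for d in docs]
```

With the corpus ["法定代表人", "注册资本", "联系电话"], the query "法人代表" shares characters only with the first entry, so only that entry receives a non-zero score from either function.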

Matching Strategy

The system uses a multi-level matching strategy:

  1. Exact Match: Field name exactly matches knowledge base key
  2. Containment Match: Field name contains or is contained in knowledge base key
  3. Keyword Match: Multi-keyword combination matching
  4. Special Handling: Auto-replacement of placeholders (e.g., "XX基金" → "国寿安保基金"; per the security scan, the replacement value is hard-coded in the shipped script)
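
The first three levels can be read as a fallback chain, tried in order. The sketch below is one hedged interpretation, not the shipped logic: match_field and the keyword_table argument are hypothetical names, the "all keywords of a group must appear in the field" rule is an assumption about what multi-keyword matching means, and the sample values are made up.

```python
def match_field(field, kb, keyword_table=None):
    """Return (kb_key, level) for the first level that matches, else (None, None)."""
    if not field:
        return None, None
    # Level 1: exact match on the knowledge base key.
    if field in kb:
        return field, "exact"
    # Level 2: containment in either direction.
    for key in kb:
        if field in key or key in field:
            return key, "containment"
    # Level 3: keyword groups; a group matches when all of its keywords occur in the field.
    for key, groups in (keyword_table or {}).items():
        if key in kb and any(all(word in field for word in group) for group in groups):
            return key, "keyword"
    return None, None

# Made-up sample data for illustration only.
kb = {"法定代表人": "张三", "联系电话": "010-12345678"}
keyword_table = {"法定代表人": [("法人",), ("法", "代表")]}

print(match_field("联系电话", kb))                    # exact
print(match_field("公司联系电话", kb))                # containment
print(match_field("法人代表", kb, keyword_table))     # keyword
```

Placeholder replacement (level 4) is a separate text substitution step and is left out of this sketch.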

How to Use This Skill

Step 1: Prepare Knowledge Base

Ensure the knowledge base is a JSON file with the following structure:

{
  "filename.xlsx": {
    "filename": "filename.xlsx",
    "type": "xlsx",
    "content": "=== Sheet: SheetName ===\nA1[Header1] | A2[Value1] | ..."
  },
  "filename.docx": {
    "filename": "filename.docx",
    "type": "docx",
    "content": {
      "paragraphs": ["text content..."],
      "tables": [...]
    }
  }
}

Supported formats in JSON:

  • xlsx: Text-based Excel format with A1[Value] | B2[Value] pattern
  • docx: Dictionary or list format containing paragraphs and table data
  • doc: Plain text format
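
Before running the filler, it can help to sanity-check the knowledge base against the structure described above. The following is a minimal sketch; check_kb and its messages are hypothetical and not part of the skill, and the type rules simply mirror the three formats listed (xlsx and doc as strings, docx as a dict or list).

```python
ALLOWED_TYPES = {"xlsx", "docx", "doc"}
REQUIRED_KEYS = {"filename", "type", "content"}

def check_kb(kb):
    """Return a list of problems found; an empty list means the structure looks fine."""
    problems = []
    for name, entry in kb.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry is not an object")
            continue
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
            continue
        kind, content = entry["type"], entry["content"]
        if kind not in ALLOWED_TYPES:
            problems.append(f"{name}: unsupported type {kind!r}")
        elif kind in ("xlsx", "doc") and not isinstance(content, str):
            problems.append(f"{name}: {kind} content should be a string")
        elif kind == "docx" and not isinstance(content, (dict, list)):
            problems.append(f"{name}: docx content should be a dict or list")
    return problems
```

A typical call would pass the result of json.load on the knowledge base file and print any returned problems before starting a fill run.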

Step 2: Run the Smart Filler

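Execute the main filling script. The shipped scripts/smart_filler.py contains hard-coded absolute Windows paths for kb_path, template_dir, and output_dir (noted in the Security Scan), so set them to your own directories first: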

python scripts/smart_filler.py

The script will:

  1. Load and parse the knowledge base JSON
  2. Extract structured data (89+ typical fields)
  3. Build hybrid retrieval index
  4. Process all template files in the template directory
  5. Fill matched fields and replace placeholders
  6. Save filled files to output directory

Step 3: Review Results

The system generates:

  • Filled templates in the output directory (marked with the "已填写" ("filled in") suffix)
  • Fill log showing all field matches and replacements
  • Statistics: Total fields filled, success rate, XX基金 replacement count

Bundled Scripts

scripts/vector_kb.py

Purpose: Core hybrid retrieval engine implementation

Key Classes:

  • BM25Retriever: BM25 ranking algorithm implementation
  • TFIDFRetriever: TF-IDF vector search implementation
  • HybridRetriever: Fusion of both retrieval methods
  • VectorKnowledgeBase: Knowledge base management and indexing

Usage Example:

from vector_kb import VectorKnowledgeBase

# Initialize and load knowledge base
kb = VectorKnowledgeBase()
kb.load_knowledge_base('knowledge_base.json').build_index()

# Search for values
results = kb.search('法人代表', top_k=5)
for result in results:
    print(f"Score: {result['score']}, Value: {result['document']}")

scripts/smart_filler.py

Purpose: Main template filling orchestration

Key Classes:

  • TextExcelParser: Parses text-based Excel content
  • SmartFillSystem: Orchestrates the entire filling process

Usage Example:

from smart_filler import SmartFillSystem

# Configure paths
system = SmartFillSystem(
    kb_path='knowledge_base.json',
    template_dir='templates/',
    output_dir='filled/'
)

# Initialize and process
system.load_kb()
system.process_all()

Configuration:

  • kb_path: Path to knowledge base JSON file
  • template_dir: Directory containing template files
  • output_dir: Directory for filled output files

Reference Documentation

Knowledge Base Format Requirements

Excel Content Format (text-based):

=== Sheet: SheetName ===
A1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]
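
As an illustration of how this text format can be read back into cells, here is a minimal sketch. It is not the TextExcelParser shipped with the skill: parse_sheets and header_value_pairs are hypothetical names, and pairing row 1 (header) with row 2 (value) in the same column is an assumption taken from the example line above.

```python
import re

SHEET_RE = re.compile(r"^=== Sheet: (.+?)(?: ===)?\s*$")
CELL_RE = re.compile(r"([A-Z]+\d+)\[([^\]]*)\]")

def parse_sheets(content):
    """Map sheet name -> {cell reference: value} for the text-based Excel format."""
    sheets = {}
    current = None
    for line in content.splitlines():
        header = SHEET_RE.match(line.strip())
        if header:
            current = sheets.setdefault(header.group(1), {})
            continue
        if current is not None:
            for ref, value in CELL_RE.findall(line):
                current[ref] = value
    return sheets

def header_value_pairs(cells):
    """Pair each row-1 header with the row-2 value in the same column (assumed layout)."""
    pairs = {}
    for ref, header in cells.items():
        col, row = re.match(r"([A-Z]+)(\d+)", ref).groups()
        if row == "1" and f"{col}2" in cells:
            pairs[header] = cells[f"{col}2"]
    return pairs

text = "=== Sheet: SheetName ===\nA1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]"
cells = parse_sheets(text)["SheetName"]
print(header_value_pairs(cells))  # {'Header1': 'Value1', 'Header2': 'Value2'}
```

The sheet marker pattern accepts the line with or without the trailing "===".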

Document Content Format (field extraction):

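  • Use regex patterns to extract: 字段名[：:\s]*值 (field name, then an optional full-width or ASCII colon and whitespace, then the value)
  • Supported fields: 法人代表 (legal representative), 联系电话 (contact phone), 地址 (address), 注册资本 (registered capital), 统一社会信用代码 (unified social credit code), etc.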
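
A minimal sketch of this kind of extraction follows. extract_fields is a hypothetical helper, not the shipped function, and the separator class (full-width colon, ASCII colon, whitespace) is an assumption about what the pattern intends. The value is taken to run to the end of the line, and the sample text is made up.

```python
import re

def extract_fields(text, field_names):
    """Return {field name: value} for every field name found in the text."""
    found = {}
    for name in field_names:
        # Field name, optional colon/whitespace separator, then the rest of the line.
        match = re.search(re.escape(name) + r"[：:\s]*([^\n\r]+)", text)
        if match:
            found[name] = match.group(1).strip()
    return found

sample = "法人代表：张三\n联系电话: 010-12345678\n地址 北京市"
print(extract_fields(sample, ["法人代表", "联系电话", "地址", "注册资本"]))
```

Because \s also matches line breaks, a field name followed by an empty value would pick up the next line; a stricter separator class such as [：: \t] avoids that if it matters for your data.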

Year-based Data:

  • Automatic organization by year (e.g., "2024年总资产", "total assets for 2024")
  • Cleaned headers (year removed) for better matching
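
The year handling described above might look like the sketch below. organize_by_year is a hypothetical name, and the rule used here (a four-digit year starting with 19 or 20 followed by 年 is stripped from the header and used as the grouping key) is an assumption, not the shipped behavior. The sample values are made up.

```python
import re
from collections import defaultdict

YEAR_RE = re.compile(r"((?:19|20)\d{2})年")

def organize_by_year(fields):
    """Group {header: value} by year; headers without a year go under None."""
    by_year = defaultdict(dict)
    for header, value in fields.items():
        match = YEAR_RE.search(header)
        if match:
            cleaned = YEAR_RE.sub("", header, count=1).strip()
            by_year[match.group(1)][cleaned] = value
        else:
            by_year[None][header] = value
    return dict(by_year)

fields = {"2024年总资产": "100", "2023年总资产": "90", "地址": "北京市"}
print(organize_by_year(fields))
```

With the cleaned header "总资产", a template field that omits the year can still be matched, and the year key decides which value to use.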

Performance Characteristics

Based on real-world testing:

Metric | Value
Knowledge Base Fields | 89+
Files Processed | 5+
Total Fields Filled | 388+
Fields Per File (Average) | 77.6
XX基金 Replacement Rate | 100%
Precision Improvement | 50%+ over keyword matching
Efficiency Gain | 90%+ over manual filling

Common Issues and Solutions

Issue: Low Match Rate

Cause: Knowledge base content format incompatible

Solution: Ensure Excel content uses A1[Value] format; check JSON structure

Issue: Wrong Value Filled

Cause: Field name ambiguity

Solution: Adjust hybrid retrieval weights; use more specific field names in templates

Issue: Encoding Errors

Cause: Non-UTF-8 characters in knowledge base

Solution: Ensure knowledge base JSON is UTF-8 encoded; use sys.stdout.reconfigure(encoding='utf-8') in scripts
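
A small sketch of both halves of that advice, using only the standard library. load_kb_utf8 and use_utf8_stdout are hypothetical helper names. Reading with "utf-8-sig" is an added assumption that tolerates a byte-order mark, which some Windows editors write.

```python
import json
import sys

def load_kb_utf8(path):
    """Load the knowledge base as UTF-8, accepting files with or without a BOM."""
    with open(path, encoding="utf-8-sig") as f:
        return json.load(f)

def use_utf8_stdout():
    """Switch console output to UTF-8 so Chinese field names print correctly."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")
```

Calling use_utf8_stdout() once at the top of a script is usually enough; the hasattr guard skips the call when stdout has been replaced by an object without reconfigure.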

Advanced Usage

Custom Retrieval Weights

Modify the hybrid retrieval weight balance in HybridRetriever:

# Default: BM25 0.5, TF-IDF 0.5
# Change to emphasize semantic matching:
self.bm25_weight = 0.3
self.tfidf_weight = 0.7
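
To see how the weights change a ranking, here is a minimal fusion sketch. It is an illustration under stated assumptions, not the HybridRetriever implementation: min_max and hybrid_scores are hypothetical names, and min-max normalization of both score lists before weighting is an assumption (raw BM25 scores are unbounded, while cosine similarity lies in [0, 1]). The input scores are made up.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25, tfidf, bm25_weight=0.5, tfidf_weight=0.5):
    """Weighted sum of normalized BM25 and TF-IDF scores, one value per document."""
    if len(bm25) != len(tfidf):
        raise ValueError("score lists must have the same length")
    return [
        bm25_weight * b + tfidf_weight * t
        for b, t in zip(min_max(bm25), min_max(tfidf))
    ]

bm25 = [4.0, 2.0, 0.0]    # made-up raw BM25 scores for three documents
tfidf = [0.2, 0.8, 0.0]   # made-up cosine similarities for the same documents

equal = hybrid_scores(bm25, tfidf)
bm25_heavy = hybrid_scores(bm25, tfidf, bm25_weight=0.7, tfidf_weight=0.3)
print(equal.index(max(equal)), bm25_heavy.index(max(bm25_heavy)))
```

In this made-up case the second document ranks first with equal weights, and the first document overtakes it once BM25 is weighted at 0.7.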

Custom Field Extraction

Extend TextExcelParser._extract_from_text() to support additional patterns:

patterns = {
    'new_field': r'新字段[：:\s]*([^\n\r]+)',  # 新字段 = "new field"; accepts a full-width or ASCII colon
    # Add more patterns...
}

Batch Processing

Process multiple knowledge bases:

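from pathlib import Path
from smart_filler import SmartFillSystem

kb_files = ['kb1.json', 'kb2.json', 'kb3.json']
for kb_file in kb_files:
    # Use the file stem so the output directory is filled_kb1/, not filled_kb1.json/
    out_dir = f'filled_{Path(kb_file).stem}/'
    system = SmartFillSystem(kb_file, 'templates/', out_dir)
    system.load_kb()
    system.process_all()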

Limitations

  1. No Machine Learning Embeddings: Uses TF-IDF (not BERT/Transformer embeddings) for lightweight deployment
  2. Chinese Tokenization: Simple character-based tokenization (not jieba)
  3. Excel Format: Requires text-based format; binary Excel files need pre-processing
  4. Context Awareness: Limited cell-to-cell context understanding

Future Enhancements

Potential improvements for future versions:

  1. Deep Learning Embeddings: Integrate sentence-transformers for true semantic vectors
  2. Cross-Modal Fusion: Combine table structure information with text matching
  3. Adaptive Weighting: Learn optimal BM25/TF-IDF weights from user feedback
  4. Domain Adaptation: Build domain-specific vocabularies for finance, legal, etc.

References

For deeper understanding:

  • BM25 Algorithm: Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond
  • TF-IDF: Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval
  • Hybrid Retrieval: Combining multiple evidence sources in search systems
