Master Data Matching

Production-ready Master Data Intelligent Matching System. Use when: matching vendor/customer/employee records, deduplicating master data, resolving OCR-extra...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name, description, SKILL.md, domain schemas, matching-config, and index.js together implement an entity-resolution pipeline (exact + semantic matching, OCR→schema mapping, HITL, active learning). No environment variables or binaries are required, which is proportionate for a self-contained JS library.
Instruction Scope
SKILL.md instructs importing the local index.js and calling library functions (getSupportedDomains, runMatchingPipeline, processHumanDecision, etc.). It does not instruct reading unrelated system files or environment variables, nor does it direct data to external endpoints in the provided docs.
Install Mechanism
No install specification is present (instruction-only behavior with bundled code). The package.json is local and there are no external downloads or install scripts; risk from install mechanism is minimal.
Credentials
The skill requests no credentials or environment variables, which is appropriate. Note: it operates on sensitive PII-like fields (tax_id, bank_account, id_number). The config enables active learning persistence (persistenceFile: '.mdm-learning-stats.json'), so the skill will write learning state to disk — consider where that file will be stored and who can access it.
Persistence & Privilege
always:false and normal agent invocation are used. Active learning persistence will create a local file (.mdm-learning-stats.json) and the code may write/read that file; this is expected for an active-learning tool but you should confirm file path/permissions and that it won't overwrite other files.
Assessment
This skill appears to do what it claims: an offline JS library for entity matching with HITL and active learning. Before installing or using it in production:

  1. Review the full index.js for any network calls, dynamic eval/exec, or telemetry. The provided excerpt shows none, but the remainder of the file should be reviewed as well.
  2. Remember that it processes sensitive PII (tax IDs, bank accounts, ID numbers). Run it in a controlled environment and ensure data protection controls are in place.
  3. Note that it persists learning state to .mdm-learning-stats.json in the working directory. Consider configuring the storage location and permissions.
  4. Because the package author and homepage are unknown, prefer to run it in a sandbox and test it on synthetic data. If you plan to use it on production data, perform a line-by-line code review, or have security/engineering review the repository for hidden network exfiltration or unsafe file operations.


Current version: v1.0.0

SKILL.md

Master Data Intelligent Matching System

Overview

A production-ready skill for intelligent entity resolution across business domains. It combines exact-match and vector-semantic retrieval, OCR field mapping with confidence coloring, and human-in-the-loop verification with active learning.

Usage

import mdm from './index.js';

// 1. Get supported domains
mdm.getSupportedDomains(); // ['procurement', 'finance', 'sales', 'hr']

// 2. Build OCR-to-schema mapping with confidence colors
const mapping = mdm.buildOcrSchemaMapping(ocrFields, 'procurement');

// 3. Run full matching pipeline
const result = mdm.runMatchingPipeline(ocrEntity, 'procurement', dbRecords);

// 4. Format result as summary
console.log(mdm.formatMatchingSummary(result));

Key Features

Business Domain Isolation

Four isolated schemas:

  • procurement — vendor records (vendor_name, vendor_code, tax_id, contact, etc.)
  • finance — company records (company_name, registration_number, fiscal_year_end, etc.)
  • sales — customer records (customer_name, customer_code, industry, credit_limit, etc.)
  • hr — employee records (employee_name, employee_id, id_number, department, etc.)

OCR Field to Schema Visual Line Mapping

buildOcrSchemaMapping(ocrFields, domain) maps raw OCR field names to schema fields with confidence colors:

| Color | Score | Meaning |
|---|---|---|
| 🟢 green | ≥ 0.92 | High confidence mapping |
| 🟡 yellow | 0.70 to < 0.92 | Medium confidence mapping |
| 🔴 red | < 0.70 | Low confidence / unmapped |
| 🔵 blue | db-only | Database field, no OCR data |
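
The color bands can be read as a simple threshold lookup. The sketch below is only an illustration of that reading: `confidenceColor` is a hypothetical helper, not a function exported by index.js, and treating a missing score as the blue (db-only) case is an assumption.

```javascript
// Hypothetical helper, not part of the documented API.
// Thresholds are taken from the confidence color table.
const GREEN_THRESHOLD = 0.92;
const YELLOW_THRESHOLD = 0.70;

function confidenceColor(score) {
  // Assumption: a schema field with no OCR counterpart carries no score.
  if (score === null || score === undefined) return 'blue';
  if (score >= GREEN_THRESHOLD) return 'green';
  if (score >= YELLOW_THRESHOLD) return 'yellow';
  return 'red';
}

console.log(confidenceColor(0.95)); // green
console.log(confidenceColor(0.92)); // green
console.log(confidenceColor(0.80)); // yellow
console.log(confidenceColor(0.40)); // red
console.log(confidenceColor(null)); // blue
```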

Dual-Path Entity Retrieval

dualPathEntityRetrieval(entity, domain, dbRecords) runs two parallel paths:

  1. Exact Match (threshold 0.92) — ALL critical fields must match exactly
  2. Vector Semantic (threshold 0.70) — weighted similarity across all fields

Results include needsHumanReview: true if confidence < 0.92 or no match found.
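
To make the two paths concrete, here is a minimal sketch of how exact and weighted-similarity matching could be combined. It is not the code in index.js: the critical-field list, the field weights, the token-overlap similarity (standing in for a real vector similarity), and the rule that an exact match gets confidence 1 are all assumptions, and the paths run one after the other rather than in parallel.

```javascript
// Illustrative sketch only. CRITICAL_FIELDS, FIELD_WEIGHTS and tokenSimilarity
// are assumptions, not values or functions taken from index.js.
const CRITICAL_FIELDS = ['vendor_code', 'tax_id'];
const FIELD_WEIGHTS = { vendor_name: 0.5, vendor_code: 0.2, tax_id: 0.3 };
const EXACT_THRESHOLD = 0.92;    // from the text above
const SEMANTIC_THRESHOLD = 0.70; // from the text above

// Token-level Jaccard similarity, standing in for a real vector similarity.
function tokenSimilarity(a, b) {
  const tokens = (v) =>
    new Set(String(v ?? '').toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  const left = tokens(a);
  const right = tokens(b);
  if (left.size === 0 || right.size === 0) return 0;
  let shared = 0;
  for (const t of left) if (right.has(t)) shared += 1;
  return shared / (left.size + right.size - shared);
}

// Path 1: every critical field must be present and strictly equal.
function isExactMatch(entity, record) {
  return CRITICAL_FIELDS.every(
    (field) => entity[field] !== undefined && entity[field] === record[field]
  );
}

// Path 2: weighted sum of per-field similarities (the weights sum to 1).
function semanticScore(entity, record) {
  return Object.entries(FIELD_WEIGHTS).reduce(
    (sum, [field, weight]) =>
      sum + weight * tokenSimilarity(entity[field], record[field]),
    0
  );
}

function retrieveBestMatch(entity, records) {
  let best = null;
  for (const record of records) {
    const exact = isExactMatch(entity, record);
    // Assumption: an exact match is given full confidence.
    const confidence = exact ? 1 : semanticScore(entity, record);
    if (confidence < SEMANTIC_THRESHOLD) continue; // too weak to be a candidate
    if (best === null || confidence > best.confidence) {
      best = { record, confidence, path: exact ? 'exact' : 'semantic' };
    }
  }
  const needsHumanReview = best === null || best.confidence < EXACT_THRESHOLD;
  return { best, needsHumanReview };
}

const sampleRecords = [
  { id: 'rec_001', vendor_name: 'Acme Corporation Ltd', vendor_code: 'V-5001', tax_id: '91110000123456789X' },
];
const nearMatch = retrieveBestMatch(
  { vendor_name: 'Acme Corporation Ltd', vendor_code: 'V-5002', tax_id: '91110000123456789X' },
  sampleRecords
);
console.log(nearMatch.best.path, nearMatch.needsHumanReview); // semantic true
```

In this sketch a one-character difference in vendor_code drops the record out of the exact path, so it is returned through the semantic path and flagged for review.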

Field Value Verification

verifyFieldValues(ocrEntity, dbRecord, domain) returns 4-state verification per field:

| State | Meaning |
|---|---|
| match | OCR and DB values agree |
| mismatch | Values differ (requires human resolution) |
| new_info | Field only in OCR (new information) |
| db_only | Field only in DB (not in OCR document) |
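
The four states follow from two questions per field: is a value present on each side, and if both are, do they agree? The sketch below shows that classification under stated assumptions. `classifyFields` is a hypothetical name, strict equality is used for comparison, empty strings count as missing, and fields absent on both sides are skipped; the real verifyFieldValues may normalize or compare values differently.

```javascript
// Hypothetical sketch of 4-state field verification; not the index.js code.
function classifyFields(ocrEntity, dbRecord, schemaFields) {
  // Assumption: undefined, null and '' all mean "no value".
  const hasValue = (v) => v !== undefined && v !== null && v !== '';
  const states = {};
  for (const field of schemaFields) {
    const inOcr = hasValue(ocrEntity[field]);
    const inDb = hasValue(dbRecord[field]);
    if (inOcr && inDb) {
      states[field] = ocrEntity[field] === dbRecord[field] ? 'match' : 'mismatch';
    } else if (inOcr) {
      states[field] = 'new_info';
    } else if (inDb) {
      states[field] = 'db_only';
    }
    // Fields missing on both sides get no state in this sketch.
  }
  return states;
}

const sampleOcr = {
  vendor_name: 'Acme Corporation Ltd',
  email: 'john.smith@acme.com',
  contact_person: 'John Smith',
};
const sampleDb = {
  vendor_name: 'Acme Corporation Ltd',
  email: 'j.smith@acme.com',
  phone: '+86-10-12345678',
};
const sampleStates = classifyFields(sampleOcr, sampleDb, [
  'vendor_name', 'email', 'contact_person', 'phone', 'address',
]);
// sampleStates is:
// { vendor_name: 'match', email: 'mismatch', contact_person: 'new_info', phone: 'db_only' }
console.log(sampleStates);
```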

Human-in-the-Loop

Every pipeline result generates a hitlRequest with:

  • Mismatched fields highlighted
  • New info fields listed
  • Available review actions: confirm_match, reject_match, create_new, update_fields

Use processHumanDecision(decision, state) to process human feedback and generate learning payloads.
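
As a rough picture of what a review request might carry, the sketch below derives one from per-field states and checks a reviewer's decision against the four listed actions. The property names (`mismatchedFields`, `newInfoFields`, `availableActions`) and both helper functions are hypothetical; the actual hitlRequest shape should be checked in index.js.

```javascript
// Hypothetical shape for a review request; property names are assumptions.
const REVIEW_ACTIONS = ['confirm_match', 'reject_match', 'create_new', 'update_fields'];

function buildReviewRequest(fieldStates) {
  const fieldsIn = (state) =>
    Object.keys(fieldStates).filter((field) => fieldStates[field] === state);
  return {
    mismatchedFields: fieldsIn('mismatch'),
    newInfoFields: fieldsIn('new_info'),
    availableActions: [...REVIEW_ACTIONS],
  };
}

// Reject decisions whose action is not one of the documented review actions.
function isValidDecision(decision) {
  return decision != null && REVIEW_ACTIONS.includes(decision.action);
}

const reviewRequest = buildReviewRequest({
  vendor_name: 'match',
  email: 'mismatch',
  contact_person: 'new_info',
  phone: 'db_only',
});
console.log(reviewRequest.mismatchedFields); // [ 'email' ]
console.log(isValidDecision({ action: 'confirm_match' })); // true
```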

Active Learning

updateActiveLearning(payloads, stats) tracks:

  • Per-domain confirmation/rejection/new-record rates
  • Per-field error rates
  • Auto-adjusts thresholds when field error rate > 30%
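
The threshold adjustment is described only by its trigger (field error rate above 30%), so the sketch below fills in the rest with labeled assumptions: it raises a per-field threshold by a fixed step, caps it at 0.99, and starts from 0.92. The function name, the stats shape, the step size, the cap, and the direction of the adjustment are all hypothetical and may differ from updateActiveLearning.

```javascript
// Hypothetical sketch of per-field error tracking; not the index.js code.
const ERROR_RATE_LIMIT = 0.30;  // from the list above
const DEFAULT_THRESHOLD = 0.92; // assumption: start from the exact-match threshold
const THRESHOLD_STEP = 0.02;    // assumption
const MAX_THRESHOLD = 0.99;     // assumption

// outcomes maps field name -> true if the reviewer found an error in that field.
function recordFieldOutcomes(stats, outcomes) {
  const next = { ...stats };
  for (const [field, wasError] of Object.entries(outcomes)) {
    const prev = next[field] ?? { total: 0, errors: 0, threshold: DEFAULT_THRESHOLD };
    const total = prev.total + 1;
    const errors = prev.errors + (wasError ? 1 : 0);
    const errorRate = errors / total;
    // Assumption: a noisy field becomes stricter, never looser.
    const threshold = errorRate > ERROR_RATE_LIMIT
      ? Math.min(MAX_THRESHOLD, prev.threshold + THRESHOLD_STEP)
      : prev.threshold;
    next[field] = { total, errors, errorRate, threshold };
  }
  return next;
}

let learningStats = {};
learningStats = recordFieldOutcomes(learningStats, { email: true, tax_id: false });
console.log(learningStats.email.errorRate, learningStats.email.threshold > 0.92); // 1 true
console.log(learningStats.tax_id.errorRate, learningStats.tax_id.threshold);     // 0 0.92
```

A production version would likely wait for a minimum number of reviews before adjusting, since a single early error yields a 100% error rate in this sketch.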

Example

import mdm from './index.js';

// Sample OCR entity from a vendor invoice
const ocrVendor = {
  vendor_name: 'Acme Corporation Ltd',
  vendor_code: 'V-5001',
  tax_id: '91110000123456789X',
  contact_person: 'John Smith',
  email: 'john.smith@acme.com',
};

// Existing database records
const dbRecords = [
  {
    id: 'rec_001',
    vendor_name: 'Acme Corporation Ltd',
    vendor_code: 'V-5001',
    tax_id: '91110000123456789X',
    contact_person: 'John Smith',
    email: 'j.smith@acme.com',  // slight email mismatch
    phone: '+86-10-12345678',
    address: 'Beijing Chaoyang District',
    bank_account: '6222021234567890',
  },
];

// Run pipeline
const result = mdm.runMatchingPipeline(ocrVendor, 'procurement', dbRecords);
console.log(mdm.formatMatchingSummary(result));

// Process human decision
const decision = { action: 'confirm_match', notes: 'Email mismatch acceptable' };
const { status, learningPayload } = mdm.processHumanDecision(decision, {
  domain: 'procurement',
  ocrEntity: ocrVendor,
  matchResult: result.matchResult,
});

// Update active learning
const newStats = mdm.updateActiveLearning([learningPayload], {});

API Reference

| Function | Description |
|---|---|
| getSupportedDomains() | List all supported business domains |
| getDomainSchema(domain) | Get field schema for a domain |
| buildOcrSchemaMapping(ocr, dom) | Map OCR fields to schema with confidence |
| dualPathEntityRetrieval(...) | Run exact + semantic matching |
| verifyFieldValues(...) | 4-state field verification |
| runMatchingPipeline(...) | Full orchestration pipeline |
| generateHitlReviewRequest(...) | Build human review request payload |
| processHumanDecision(...) | Handle human feedback |
| updateActiveLearning(...) | Update learning stats from decisions |
| formatMatchingSummary(...) | Human-readable result summary |
