Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Information Extraction

v1.0.0

Extract structured information from unstructured text through a semi-automatic pipeline. Supports entity extraction, relation extraction, attribute extraction...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for quqxui/information-extraction.

Prompt Preview: Install & Setup
Install the skill "Information Extraction" (quqxui/information-extraction) from ClawHub.
Skill page: https://clawhub.ai/quqxui/information-extraction
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install information-extraction

ClawHub CLI


npx clawhub@latest install information-extraction
Security Scan

VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
Name, description, and included scripts align with an information-extraction pipeline: extract.py, normalize.py, and export_triples.py implement extraction, normalization, and export. The heuristics are simple and consistent with a scaffold rather than a full production extractor. However, the pipeline's data contract is inconsistent: extract.py does not include a top-level "relations" key in its output even though normalization expects one, which will cause relations to be lost when following the documented workflow.
Instruction Scope
SKILL.md instructs running extract.py -> normalize.py -> export_triples.py, but extract.py's JSON output omits a 'relations' field (it returns triples, entities, attributes, events, ambiguities). normalize.py expects data.get('relations', []) and will therefore receive an empty list — so relations discovered by extract.py will not be preserved through normalization. Also, the usage examples reference a path (skills/information-extraction/scripts/...) while the repository layout shows scripts/..., which may cause confusion depending on installation layout. Aside from these mismatches, the instructions do not attempt to read unrelated system files, environment variables, or contact external endpoints.
Install Mechanism
This is an instruction-only skill with included Python scripts and no install spec. Nothing is downloaded from external URLs and no packages are installed by the skill itself, so filesystem and network risks from installation are minimal. The scripts use only the standard library.
Credentials
No environment variables, credentials, or config paths are requested. The scripts operate on local input text and local files only; there is no network or secret access.
Persistence & Privilege
The skill does not request always:true and does not modify system or other skills' configuration. It is user-invocable and may be invoked autonomously by the agent (platform default), which is expected for skills. There is no evidence of persistent privilege escalation.
What to consider before installing
This skill appears to implement a simple semi-automatic IE pipeline and contains only local Python scripts (no network calls or secrets). However, check the following before using it on important data:

  • Bug to fix: run the extractor once and open the produced JSON. If you do not see a top-level "relations" key, relations discovered by the extractor will be lost by normalize.py. Either add 'relations' to the extractor's output or modify normalize.py to read relations from the extractor output.
  • Path note: the SKILL.md usage examples use 'skills/information-extraction/scripts/...' while the files live under 'scripts/...'; ensure the runtime path matches where the skill is installed.
  • Quality caution: the scripts use simple regex heuristics and low default confidences; expect false positives and negatives. Always manually review outputs (the documentation already recommends this).
  • Safety: there is no network or secret access in the code, so the immediate exfiltration risk is low. Still, run the code on non-sensitive sample data first and inspect outputs.

If you plan to integrate this into automated pipelines, patch the relations omission and consider improving the extraction logic and confidence handling before processing high-stakes documents.
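The relations patch described above can be sketched as a small post-processing step between extract.py and normalize.py. The assumption that the extractor's triples can stand in for relation records is mine, not something the scripts confirm; adapt the mapping to whatever extract.py actually emits.

```python
import json

def ensure_relations(path: str) -> dict:
    """Load the extractor's JSON output and, if the top-level
    'relations' key is missing, mirror the 'triples' records into it
    so normalize.py does not silently receive an empty list."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if "relations" not in data:
        # Assumption: each triple already carries subject/predicate/object,
        # so it can serve as a relation record for normalization.
        data["relations"] = list(data.get("triples", []))
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
    return data
```

Run this on out.json after extraction and before normalization; if the upstream bug is fixed, it is a no-op.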

Like a lobster shell, security has layers — review code before you run it.

Latest: vk977yr19pzj6m6tesjc8cq50x583dcxe
133 downloads · 0 stars · 1 version
Updated 1 mo ago
v1.0.0 · MIT-0

Information Extraction

Extract entity, relation, attribute, and event information from text into a normalized intermediate structure, then export triples in JSON, JSONL, or TSV.

Core workflow

  1. Define extraction scope and output granularity.
  2. Segment input text into sentences and paragraphs.
  3. Extract entities with evidence.
  4. Extract relations, attributes, and events.
  5. Normalize aliases, predicates, and duplicated records.
  6. Export triples. Default output is JSON.
  7. Review ambiguities before treating output as final.
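The workflow above can be sketched as a toy driver. Every heuristic here is a hypothetical stand-in for the skill's real scripts — it only illustrates how the steps fit together and what shape the result takes.

```python
import re

def run_pipeline(text: str) -> dict:
    """Minimal sketch of the core workflow: segment, extract entities
    with evidence, and return a document matching the output contract.
    Relation/attribute/event extraction and normalization are elided."""
    # Step 2: segment input into sentences (naive split on punctuation).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Step 3: extract entities with evidence (toy heuristic: capitalized tokens).
    entities = [
        {"id": f"ent_{i:03d}", "mention": m, "evidence": s, "confidence": 0.5}
        for i, (m, s) in enumerate(
            ((m, s) for s in sentences for m in re.findall(r"\b[A-Z][\w-]+\b", s)),
            start=1,
        )
    ]
    # Steps 4-7 would fill the remaining keys; here they stay empty.
    return {
        "triples": [],
        "entities": entities,
        "attributes": [],
        "events": [],
        "ambiguities": [],
    }
```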

Input scope

Prefer this skill for:

  • Plain text strings
  • Markdown text
  • Text copied from webpages, notes, reports, transcripts, or documents

If the user provides a file in another format, convert it to text first, then use this skill.

Output contract

Default output should contain:

{
  "triples": [],
  "entities": [],
  "attributes": [],
  "events": [],
  "ambiguities": []
}
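A quick way to check a produced file against this contract — a minimal sketch, assuming only that the five keys must be present and list-valued:

```python
REQUIRED_KEYS = {"triples", "entities", "attributes", "events", "ambiguities"}

def validate_contract(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the document
    satisfies the default output contract."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - doc.keys())]
    problems += [
        f"key is not a list: {k}"
        for k in sorted(REQUIRED_KEYS & doc.keys())
        if not isinstance(doc[k], list)
    ]
    return problems
```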

Supported export formats:

  • JSON (default)
  • JSONL
  • TSV
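The three formats can be sketched as a single serializer. This illustrates what each format looks like, not export_triples.py's actual code; the assumed triple shape follows the Relation record below (subject/predicate/object).

```python
import json

def serialize_triples(triples: list, fmt: str = "json") -> str:
    """Serialize triples as JSON (default), JSONL, or TSV."""
    if fmt == "json":
        # One pretty-printed array.
        return json.dumps(triples, indent=2, ensure_ascii=False)
    if fmt == "jsonl":
        # One JSON object per line.
        return "\n".join(json.dumps(t, ensure_ascii=False) for t in triples)
    if fmt == "tsv":
        # Tab-separated with a header row; extra fields are dropped.
        header = "subject\tpredicate\tobject"
        rows = ["\t".join((t["subject"], t["predicate"], t["object"])) for t in triples]
        return "\n".join([header] + rows)
    raise ValueError(f"unsupported format: {fmt}")
```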

Extraction principles

  • Extract explicit facts before inference.
  • Preserve evidence spans for important records.
  • Prefer controlled predicates from references/relation-taxonomy.md.
  • Keep attributes and events separate internally, even when final output is triples.
  • Do not flatten complex events too early.
  • Normalize before exporting.
  • Record unresolved ambiguity instead of pretending certainty.
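The "normalize aliases, predicates, and duplicated records" principle can be sketched as below. The alias table is supplied by the caller, and the record shape follows the Relation example in this document; this is not normalize.py's real logic.

```python
def normalize_relations(relations: list, aliases: dict) -> list:
    """Map subject/object aliases to canonical ids, then drop exact
    duplicate (subject, predicate, object) records, keeping the first."""
    seen = set()
    out = []
    for rel in relations:
        canon = {
            **rel,
            "subject": aliases.get(rel["subject"], rel["subject"]),
            "object": aliases.get(rel["object"], rel["object"]),
        }
        key = (canon["subject"], canon["predicate"], canon["object"])
        if key not in seen:
            seen.add(key)
            out.append(canon)
    return out
```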

Minimal internal schema

Use these record shapes during extraction.

Entity

{
  "id": "ent_001",
  "mention": "OpenAI",
  "canonical_name": "OpenAI",
  "type": "Organization",
  "evidence": "OpenAI published the GPT-4 Technical Report.",
  "confidence": 0.95
}

Relation

{
  "subject": "ent_001",
  "predicate": "published",
  "object": "ent_002",
  "evidence": "OpenAI published the GPT-4 Technical Report.",
  "confidence": 0.93
}

Attribute

{
  "entity_id": "ent_002",
  "attribute": "year",
  "value": "2023",
  "evidence": "The report was released in 2023.",
  "confidence": 0.87
}

Event

{
  "id": "ev_001",
  "type": "Publication",
  "trigger": "published",
  "participants": {
    "agent": "ent_001",
    "object": "ent_002"
  },
  "time": "2023",
  "location": null,
  "evidence": "OpenAI published the GPT-4 Technical Report in 2023.",
  "confidence": 0.92
}
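The four record shapes above can be mirrored as typed containers during extraction. A sketch using Python dataclasses — field names are taken from the examples, while the types and defaults are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Entity:
    id: str
    mention: str
    canonical_name: str
    type: str
    evidence: str
    confidence: float

@dataclass
class Relation:
    subject: str      # entity id
    predicate: str
    object: str       # entity id
    evidence: str
    confidence: float

@dataclass
class Attribute:
    entity_id: str
    attribute: str
    value: str
    evidence: str
    confidence: float

@dataclass
class Event:
    id: str
    type: str
    trigger: str
    participants: dict = field(default_factory=dict)  # role -> entity id
    time: Optional[str] = None
    location: Optional[str] = None
    evidence: str = ""
    confidence: float = 0.0
```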

How to use references

  • Read references/pipeline.md for the end-to-end procedure.
  • Read references/schema.md for types and intermediate record structure.
  • Read references/relation-taxonomy.md before inventing new predicates.
  • Read references/triple-mapping.md when exporting final triples.
  • Read references/event-modeling.md when text describes complex events.
  • Read references/quality-checklist.md before final delivery.

Scripts

Extract

python3 skills/information-extraction/scripts/extract.py --text "OpenAI published GPT-4." --output out.json

Or read from stdin:

echo "OpenAI published GPT-4." | python3 skills/information-extraction/scripts/extract.py --stdin --output out.json

Normalize

python3 skills/information-extraction/scripts/normalize.py --input out.json --output normalized.json

Export triples

python3 skills/information-extraction/scripts/export_triples.py --input normalized.json --format json --output triples.json
python3 skills/information-extraction/scripts/export_triples.py --input normalized.json --format jsonl --output triples.jsonl
python3 skills/information-extraction/scripts/export_triples.py --input normalized.json --format tsv --output triples.tsv

Notes on automation

This is a semi-automatic pipeline, not a claim of perfect extraction. The scripts provide scaffolding, normalization, and export. For high-stakes outputs, keep evidence and perform manual review.
