Paper To Table

v1.0.0

Extract structured data from academic papers (PDF/DOCX/TXT) into literature review tables (XLSX/CSV) with fidelity, batch support, and multi-domain handling.

0· 38· 1 versions· 0 current· 0 all-time· Updated 12h ago· MIT-0

Install

openclaw skills install paper-to-table

Paper To Table

Extract structured information from academic papers and populate literature review tables.

Quality Principles

1. Extraction Fidelity (质量约束)

  • NEVER hallucinate information not present in the paper
  • NEVER infer beyond what is explicitly stated
  • Use "N/A" only when a field truly cannot be found after thorough search
  • Distinguish between:
    • Explicitly stated facts → Extract directly
    • Implied but not stated → Mark as "N/A" or note as implied
    • Absent information → "N/A"
  • Confidence scoring: Rate each extraction as High/Medium/Low confidence

2. Field Understanding (字段理解)

  • Understand the semantic meaning of each table header
  • Map paper content to headers based on meaning, not just keyword matching
  • Handle synonyms and domain-specific terminology
  • Recognize implicit information in context

3. Structured Depth (结构化深度)

  • Extract at appropriate granularity
  • Distinguish between study-level, experiment-level, and result-level information
  • Preserve relationships between related fields
  • Handle multi-experiment papers correctly

4. Batch Stability (批处理稳定性)

  • Process papers independently (failure of one doesn't affect others)
  • Log all operations for audit trail
  • Support resume from interruption
  • Validate outputs before writing to table

Workflow

Step 1: Identify Inputs

  • Papers: Single file, multiple files, or folder path → 支持 PDF/DOCX/TXT
  • Table template: XLSX 或 CSV,含表头定义结构
  • Language: 自动检测或用户指定
  • Domain: Psychology / Cognitive Neuroscience / Computer Science / Brain Science / General

Step 2: Read Table Headers

python scripts/read_table.py <table_path>

输出:列名、数据类型约束、领域推断。

Step 3: Extract Paper Content

python scripts/extract_paper.py <paper_path> --structured

自动处理格式:PDF→pdfplumber/PyMuPDF/OCR fallback;DOCX→python-docx;TXT→直接读取。

输出结构化 JSON:包含 full_textsections(abstract/introduction/methods/results/discussion/conclusion)。

Step 4: LLM Extraction (Critical)

原则:只提取论文明确陈述的信息,绝不臆造。

输入:表头 + 论文全文/章节

输出格式(每个字段):

{
  "FieldName": {
    "value": "extracted value 或 N/A",
    "confidence": "HIGH/MEDIUM/LOW",
    "source": "paper location"
  }
}

CRITICAL RULES:

  1. JSON keys 必须与表头完全匹配(大小写敏感)
  2. 缺失信息→"N/A",不做推断
  3. 多值用分号分隔
  4. 保留原文语言
  5. LOW confidence 字段需说明原因

提取优先级:Abstract→Methods→Results→Discussion→补充材料

Step 5: Validate & Write

python scripts/write_table.py <table_path> '<json_data>' --validate

验证内容:JSON格式、键名匹配、无重复条目、数据类型合理。

重复检测:标题相似度>85%视为重复,跳过并警告。

Step 6: Report

报告:处理论文数、新增行数、跳过数(重复/错误)、LOW confidence 字段、输出路径。

Batch Processing

python scripts/batch_process.py <papers_folder> <table_path> [output_folder]
  • 独立处理每篇论文(单篇失败不影响其他)
  • 自动生成日志 batch_log_YYYYMMDD_HHMMSS.json
  • 支持断点续传(从日志恢复进度)

详细字段定义、提取策略 → references/extraction-patterns.md 质量检查清单 → references/quality-checklist.md

Domain Specializations

Psychology / Cognitive Neuroscience / Computer Science / Brain Science

Version tags

latestvk977nyk7je6k6hwybt12dfpp0s85tj01