Paper To Table

v1.0.0

Extract structured data from academic papers (PDF/DOCX/TXT) into literature review tables (XLSX/CSV) with fidelity, batch support, and multi-domain handling.

License: MIT-0

by @2025biophilia-coder

Security Scans

VirusTotal: Benign; ClawScan: Benign; Static analysis: Benign

Install

openclaw skills install paper-to-table

Extract structured information from academic papers and populate literature review tables.

Quality Principles

1. Extraction Fidelity (quality constraints)

NEVER hallucinate information not present in the paper
NEVER infer beyond what is explicitly stated
Use "N/A" only when a field truly cannot be found after thorough search
Distinguish between:
- Explicitly stated facts → Extract directly
- Implied but not stated → Mark as "N/A" or note as implied
- Absent information → "N/A"
Confidence scoring: rate each extraction as HIGH, MEDIUM, or LOW confidence

2. Field Understanding

Understand the semantic meaning of each table header
Map paper content to headers based on meaning, not just keyword matching
Handle synonyms and domain-specific terminology
Recognize implicit information in context, and flag it as implied rather than recording it as a stated fact

3. Structured Depth

Extract at appropriate granularity
Distinguish between study-level, experiment-level, and result-level information
Preserve relationships between related fields
Handle multi-experiment papers correctly

4. Batch Stability

Process papers independently (failure of one doesn't affect others)
Log all operations for audit trail
Support resume from interruption
Validate outputs before writing to table

Workflow

Step 1: Identify Inputs

Papers: a single file, multiple files, or a folder path; PDF, DOCX, and TXT are supported
Table template: XLSX or CSV, with a header row that defines the structure
Language: auto-detected or specified by the user
Domain: Psychology / Cognitive Neuroscience / Computer Science / Brain Science / General

Step 2: Read Table Headers

python scripts/read_table.py <table_path>

Output: column names, data-type constraints, and the inferred domain.
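As a rough illustration of the kind of output this step produces, the sketch below reads the header row of a CSV template with the standard library only. It is not the contents of scripts/read_table.py; the function name read_csv_headers is hypothetical, and an XLSX template would need a spreadsheet library instead.

```python
import csv
import io


def read_csv_headers(csv_text: str) -> list:
    """Return the column names from the first row of a CSV template."""
    reader = csv.reader(io.StringIO(csv_text))
    header_row = next(reader, [])
    # Trim surrounding whitespace but keep the case untouched:
    # later steps match JSON keys against these names case-sensitively.
    return [name.strip() for name in header_row]


# Hypothetical template content, used only for demonstration.
template = "Title,Authors,Year,Sample Size,Method\n"
print(read_csv_headers(template))
```

Keeping the header names byte-for-byte (apart from surrounding whitespace) matters because Step 4 requires the JSON keys to match them exactly.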

Step 3: Extract Paper Content

python scripts/extract_paper.py <paper_path> --structured

Formats are handled automatically: PDF → pdfplumber/PyMuPDF with OCR fallback; DOCX → python-docx; TXT → read directly.

The output is structured JSON containing full_text and sections (abstract/introduction/methods/results/discussion/conclusion).
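The shape of that JSON can be sketched as follows, assuming the paper text has already been extracted to a plain string. The heading rule used here (a line containing only the section name) and the function name split_sections are assumptions made for illustration; the actual script may detect sections differently.

```python
import json
import re

SECTION_NAMES = ["abstract", "introduction", "methods",
                 "results", "discussion", "conclusion"]

# Assumption: a section heading is a line that contains only the section name.
HEADING_RE = re.compile(
    r"^[ \t]*(" + "|".join(SECTION_NAMES) + r")[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(full_text: str) -> dict:
    """Split plain text into named sections at recognised heading lines."""
    matches = list(HEADING_RE.finditer(full_text))
    sections = {}
    for i, match in enumerate(matches):
        body_start = match.end()
        # A section runs until the next heading, or to the end of the text.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(full_text)
        name = match.group(1).lower()
        # If a heading appears twice, keep the first occurrence.
        sections.setdefault(name, full_text[body_start:body_end].strip())
    return {"full_text": full_text, "sections": sections}


sample = "Abstract\nWe study X.\nMethods\nN = 40 adults.\nResults\nAn effect was found.\n"
print(json.dumps(split_sections(sample)["sections"], indent=2))
```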

Step 4: LLM Extraction (Critical)

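Principle: extract only information the paper explicitly states; never fabricate.

Input: table headers + the paper's full text or sections

Output format (per field):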

{
  "FieldName": {
    "value": "extracted value or N/A",
    "confidence": "HIGH/MEDIUM/LOW",
    "source": "paper location"
  }
}

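CRITICAL RULES:

JSON keys must match the table headers exactly (case-sensitive)
Missing information → "N/A"; do not infer
Separate multiple values with semicolons
Preserve the language of the original text
Explain the reason for every LOW confidence field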
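Several of these rules can be checked mechanically before anything is written to the table. The sketch below is one possible check, not the skill's own validator: the names check_extraction and join_values are hypothetical, and joining with a semicolon followed by a space is an assumption about the exact separator.

```python
ALLOWED_CONFIDENCE = {"HIGH", "MEDIUM", "LOW"}


def check_extraction(headers: list, extraction: dict) -> list:
    """Return a list of rule violations; an empty list means the object passes."""
    problems = []
    # Keys must match the table headers exactly, including case.
    for header in headers:
        if header not in extraction:
            problems.append(f"missing key: {header!r}")
    for key in extraction:
        if key not in headers:
            problems.append(f"unexpected key: {key!r}")
    for key, field in extraction.items():
        if not isinstance(field, dict):
            problems.append(f"{key!r}: expected an object with value/confidence/source")
            continue
        if field.get("confidence") not in ALLOWED_CONFIDENCE:
            problems.append(f"{key!r}: confidence must be HIGH, MEDIUM, or LOW")
        value = field.get("value")
        # Missing information is written as "N/A", never left blank.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key!r}: empty value, use 'N/A' instead")
    return problems


def join_values(values: list) -> str:
    """Join multiple values for one field with semicolons (separator assumed)."""
    return "; ".join(v.strip() for v in values if v.strip())
```

A check like this catches only structural problems. Whether a value is actually stated in the paper still has to be judged against the source text.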

Extraction priority: Abstract → Methods → Results → Discussion → supplementary materials
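The search order can be pictured with a small sketch over the sections dictionary from Step 3. The substring check is only a stand-in for the LLM's semantic matching, and both the "supplementary" key and the function name find_first_mention are hypothetical.

```python
from typing import Optional

# Section keys follow Step 3; "supplementary" is a hypothetical extra key.
PRIORITY = ["abstract", "methods", "results", "discussion", "supplementary"]


def find_first_mention(sections: dict, keyword: str) -> Optional[str]:
    """Return the highest-priority section whose text mentions the keyword."""
    needle = keyword.lower()
    for name in PRIORITY:
        if needle in sections.get(name, "").lower():
            return name
    return None


sections = {
    "results": "Participants (N = 40) responded faster.",
    "methods": "We recruited 40 participants from a university pool.",
}
print(find_first_mention(sections, "participants"))
```

The returned section name is the kind of value that could go into the "source" entry of a field.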

Step 5: Validate & Write

python scripts/write_table.py <table_path> '<json_data>' --validate

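Validation checks: JSON format, key-name match, no duplicate entries, and plausible data types.

Duplicate detection: a title similarity above 85% is treated as a duplicate; the entry is skipped with a warning.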
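The similarity metric is not specified here. As one plausible reading, the sketch below uses the ratio from difflib.SequenceMatcher in the standard library on case- and whitespace-normalised titles; the metric, the normalisation, and the function names are all assumptions.

```python
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lower-case the title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalised titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def is_duplicate(new_title: str, existing_titles: list, threshold: float = 0.85) -> bool:
    """True if the new title is more than `threshold` similar to any existing one."""
    return any(title_similarity(new_title, t) > threshold for t in existing_titles)


existing = ["Attention Is All You Need"]
print(is_duplicate("attention is  all you need", existing))
```

A character-level ratio is cheap but can over-match short titles that share a common prefix, so the threshold may need tuning per collection.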

Step 6: Report

Report: number of papers processed, rows added, papers skipped (duplicates/errors), LOW confidence fields, and the output path.
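A report with these items could be assembled from per-paper result records, as in the sketch below. The record layout (the "paper", "status", and "low_confidence_fields" keys, and the status values "added", "duplicate", and "error") is hypothetical and chosen only to make the example concrete.

```python
from collections import Counter


def build_report(results: list, output_path: str) -> dict:
    """Summarise per-paper results into the items listed in the report step."""
    status_counts = Counter(r["status"] for r in results)
    low_fields = [
        (r["paper"], field)
        for r in results
        for field in r.get("low_confidence_fields", [])
    ]
    return {
        "papers_processed": len(results),
        "rows_added": status_counts["added"],
        "skipped_duplicates": status_counts["duplicate"],
        "skipped_errors": status_counts["error"],
        "low_confidence_fields": low_fields,
        "output_path": output_path,
    }


# Hypothetical results, used only for demonstration.
results = [
    {"paper": "a.pdf", "status": "added", "low_confidence_fields": ["Sample Size"]},
    {"paper": "b.pdf", "status": "duplicate"},
    {"paper": "c.pdf", "status": "error"},
]
print(build_report(results, "review_table.xlsx"))
```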

Batch Processing

python scripts/batch_process.py <papers_folder> <table_path> [output_folder]

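Each paper is processed independently (a single failure does not affect the others)
A log file batch_log_YYYYMMDD_HHMMSS.json is generated automatically
Resuming after an interruption is supported (progress is restored from the log)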
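These three behaviours fit a simple loop, sketched below under stated assumptions: the log layout (one entry per paper with a "status" key) and the names new_log_name and run_batch are hypothetical, and process stands in for whatever performs Steps 3 to 5 for one paper. This is not the code in scripts/batch_process.py.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Callable


def new_log_name() -> str:
    """Build a log filename in the batch_log_YYYYMMDD_HHMMSS.json pattern."""
    return datetime.now().strftime("batch_log_%Y%m%d_%H%M%S.json")


def run_batch(papers: list, process: Callable[[str], dict], log_path: Path) -> dict:
    """Process each paper independently, logging after every paper."""
    # Resume: reuse an existing log so finished papers are not repeated.
    log = json.loads(log_path.read_text(encoding="utf-8")) if log_path.exists() else {}
    for paper in papers:
        if log.get(paper, {}).get("status") == "ok":
            continue  # completed in an earlier run
        try:
            log[paper] = {"status": "ok", "result": process(paper)}
        except Exception as exc:
            # Isolate the failure: record it and carry on with the next paper.
            log[paper] = {"status": "error", "error": str(exc)}
        # Write after each paper so an interruption loses at most one entry.
        log_path.write_text(json.dumps(log, ensure_ascii=False, indent=2), encoding="utf-8")
    return log
```

Papers that failed are retried on the next run because only entries with status "ok" are skipped; whether that is desirable depends on why they failed.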

Detailed field definitions and extraction strategies → references/extraction-patterns.md
Quality checklist → references/quality-checklist.md

Domain Specializations

Psychology / Cognitive Neuroscience / Computer Science / Brain Science
