Install
openclaw skills install data-source-verification
Verify numerical data against original papers and maintain traceable provenance for every value in datasets, tables, and plots. Includes citation source mana...
A systematic workflow for verifying that every data point in a research dataset can be traced back to its original source paper, figure, table, or text passage.
Every numerical value must be traceable to a specific location in the original paper. If you cannot find the value in the cited source, it is unverified and must be flagged — never included as confirmed data.
Source PDF → CITATION.md (extracted values) → CSV/data table → LaTeX manuscript
Every link in this chain must be auditable. If someone asks "where did this number come from?", the answer should be: paper X, Table Y, column Z — and we have the PDF to prove it.
init
Create a Citation_Sources/ directory for the project:
Citation_Sources/
  AuthorLastName_Year_Journal_ShortTitle/
    Author_Year_Topic.pdf ← original paper
    Author_Year_Topic_SI.pdf ← supplementary info (if any)
    CITATION.md ← structured metadata + data provenance
Every cited paper gets a CITATION.md file:
# Author et al. Year — Short Description
**Title**: Full title
**Authors**: Author list
**Journal**: Journal Vol, Pages (Year)
**DOI**: 10.xxxx/xxxxx
**Data used**: [exact values extracted, with table/figure reference]
**PDF**: ✅ Confirmed | ❌ NOT DOWNLOADED — [reason]
**Status**: CONFIRMED | ⚠️ NEEDS CONFIRM — [reason]
**Notes**: [any caveats, discrepancies, proxy assumptions]
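The folder layout and CITATION.md template above can be scaffolded with a short script. This is a minimal sketch, not part of the skill itself: the function name `init_citation` and its parameters are hypothetical, and it assumes the stub should start in the NEEDS CONFIRM state until values have been checked against the PDF.

```python
from pathlib import Path

DASH = "\u2014"  # separator character used by the CITATION.md template


def init_citation(root, folder_name, heading, doi="", pdf_present=False):
    """Create Citation_Sources/<folder_name>/CITATION.md with empty fields.

    Hypothetical helper: all names here are illustrative assumptions.
    """
    folder = Path(root) / "Citation_Sources" / folder_name
    folder.mkdir(parents=True, exist_ok=True)

    if pdf_present:
        pdf_line = "✅ Confirmed"
    else:
        pdf_line = f"❌ NOT DOWNLOADED {DASH} [reason]"

    lines = [
        f"# {heading}",
        "**Title**: ",
        "**Authors**: ",
        "**Journal**: ",
        f"**DOI**: {doi}",
        "**Data used**: ",
        f"**PDF**: {pdf_line}",
        # Assumption: a new record is unconfirmed until someone checks it.
        f"**Status**: ⚠️ NEEDS CONFIRM {DASH} no values checked yet",
        "**Notes**: ",
    ]

    citation = folder / "CITATION.md"
    if not citation.exists():  # never overwrite an existing record
        citation.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return citation
```

Refusing to overwrite an existing file is a deliberate choice here: a CITATION.md that already holds extracted values is part of the audit trail and should only be edited by hand.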
add
When adding a new citation:
Citation_Sources/AuthorLastName_Year_Journal_ShortTitle/
When extracting data from a paper, record ALL of the following for each value:
Value: 0.65 W/m·K
Paper: Cheng et al. 2021
DOI: 10.1002/smll.202101693
Location: Table 2, row 3
Method: TDTR (time-domain thermoreflectance)
Data type: Experimental
Verified: YES — value confirmed in Table 2
Never record a value without filling in the Location, Data type, and Verified fields.
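One way to enforce the rule above is to reject a record at construction time if any of the three mandatory fields is blank. The sketch below assumes a Python dataclass named `ProvenanceRecord`; the class and field names are illustrative, chosen to mirror the fields listed above.

```python
from dataclasses import dataclass

# Fields the workflow says must never be left empty.
REQUIRED = ("location", "data_type", "verified")


@dataclass
class ProvenanceRecord:
    value: float
    unit: str
    paper: str
    doi: str
    location: str
    method: str
    data_type: str
    verified: str

    def __post_init__(self):
        # Treat whitespace-only strings as missing.
        missing = [n for n in REQUIRED if not str(getattr(self, n)).strip()]
        if missing:
            raise ValueError(
                "missing required provenance fields: " + ", ".join(missing)
            )


# Example built from the record shown above.
record = ProvenanceRecord(
    value=0.65,
    unit="W/m·K",
    paper="Cheng et al. 2021",
    doi="10.1002/smll.202101693",
    location="Table 2, row 3",
    method="TDTR (time-domain thermoreflectance)",
    data_type="Experimental",
    verified="YES",
)
```

Constructing a record with an empty `location`, `data_type`, or `verified` raises `ValueError`, so an incomplete entry cannot silently reach the dataset.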
For each data point:
If the paper is behind a paywall and you cannot verify:
⚠️ NEEDS CONFIRM — paywall
Verify consistency at every step:
Value in PDF → Value in CITATION.md → Value in data table/CSV → Value in manuscript
Any mismatch at any step is a flag.
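The chain check can be sketched as a comparison of each stage with the next. The stage keys (`pdf`, `citation_md`, `data_table`, `manuscript`) and the function name `find_mismatches` are assumptions made for this example; a missing value at any stage is treated as a mismatch.

```python
import math

# Order of the provenance chain described above.
STAGES = ("pdf", "citation_md", "data_table", "manuscript")


def find_mismatches(values, rel_tol=1e-9):
    """Return the adjacent stage pairs whose values disagree or are missing.

    `values` maps a stage name to the number recorded at that stage.
    """
    flags = []
    for a, b in zip(STAGES, STAGES[1:]):
        va, vb = values.get(a), values.get(b)
        if va is None or vb is None or not math.isclose(va, vb, rel_tol=rel_tol):
            flags.append((a, b))
    return flags
```

Comparing adjacent stages rather than everything against the PDF shows where in the chain a value changed, which is usually the first thing needed to fix it.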
Mark any value with one of these status levels:
| Status | Meaning | Action |
|---|---|---|
| VERIFIED | Found exact value in cited paper at stated location | Include in dataset |
| APPROXIMATE | Value is close but not exact (e.g., read from figure) | Include with note |
| UNVERIFIED | Cannot find value in cited paper | Flag; do not use without user approval |
| MISATTRIBUTED | Cited paper does not contain this data at all | Remove from dataset, alert user immediately |
| ESTIMATED | Value was calculated or estimated, not directly measured | Include with clear label |
| ⚠️ NEEDS CONFIRM | PDF not available (paywall) or value needs double-check | Flag for manual verification |
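The status table maps naturally onto an enumeration plus a small inclusion rule. The following is a sketch under the assumption that NEEDS CONFIRM values stay out of the dataset until someone verifies them manually; the names `Status` and `may_include` are hypothetical.

```python
from enum import Enum


class Status(Enum):
    VERIFIED = "VERIFIED"
    APPROXIMATE = "APPROXIMATE"
    UNVERIFIED = "UNVERIFIED"
    MISATTRIBUTED = "MISATTRIBUTED"
    ESTIMATED = "ESTIMATED"
    NEEDS_CONFIRM = "NEEDS CONFIRM"


# Statuses the table allows in the dataset (with a note or label if required).
INCLUDABLE = {Status.VERIFIED, Status.APPROXIMATE, Status.ESTIMATED}


def may_include(status, user_approved=False):
    """Decide whether a value with this status may enter the dataset."""
    if status in INCLUDABLE:
        return True
    if status is Status.UNVERIFIED:
        # Only usable with explicit user approval.
        return user_approved
    # MISATTRIBUTED is removed; NEEDS CONFIRM waits for manual verification.
    return False
```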
When multiple sources report different values for the same quantity:
When building compiled datasets, always include provenance columns:
CSV format:
Material,Property,Value,Unit,Source_Paper,DOI,Source_Location,Method,Data_Type,Verified,Notes
Li6PS5Cl,kappa,0.69,W/m·K,Cheng 2021,10.1002/smll.202101693,Table 2,TDTR,experimental,YES,
Li3InCl6,v_longitudinal,2800,m/s,Asano 2018,10.1002/adma.201803075,NOT FOUND,Unknown,unknown,MISATTRIBUTED,Paper contains no Li3InCl6 sound velocity data
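With the provenance columns in place, unconfirmed rows can be picked out mechanically. The sketch below uses Python's standard `csv` module on the two example rows above; the function name `flagged_rows` is an assumption, and it treats anything other than `YES` in the `Verified` column as needing attention.

```python
import csv
import io

SAMPLE = """Material,Property,Value,Unit,Source_Paper,DOI,Source_Location,Method,Data_Type,Verified,Notes
Li6PS5Cl,kappa,0.69,W/m·K,Cheng 2021,10.1002/smll.202101693,Table 2,TDTR,experimental,YES,
Li3InCl6,v_longitudinal,2800,m/s,Asano 2018,10.1002/adma.201803075,NOT FOUND,Unknown,unknown,MISATTRIBUTED,Paper contains no Li3InCl6 sound velocity data
"""


def flagged_rows(csv_text):
    """Return every row whose Verified column is not exactly YES."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["Verified"].strip().upper() != "YES"]
```

Calling `flagged_rows(SAMPLE)` returns only the Li3InCl6 row, which is the one marked MISATTRIBUTED.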
JSON format:
{
  "material": "Li6PS5Cl",
  "property": "thermal_conductivity",
  "value": 0.69,
  "unit": "W/m·K",
  "source": {
    "paper": "Cheng et al. 2021",
    "doi": "10.1002/smll.202101693",
    "location": "Table 2, row 5",
    "method": "TDTR",
    "dataType": "experimental",
    "verified": true
  }
}
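The two formats carry the same fields, so a flat CSV row can be converted into the nested JSON shape. This is a sketch with a hypothetical function name, `row_to_record`; it copies field values unchanged (it does not rename `kappa` to `thermal_conductivity`, for example) and assumes `verified` is true only when the CSV column reads `YES`.

```python
import json


def row_to_record(row):
    """Convert one CSV row (as a dict) into the nested JSON structure."""
    return {
        "material": row["Material"],
        "property": row["Property"],
        "value": float(row["Value"]),
        "unit": row["Unit"],
        "source": {
            "paper": row["Source_Paper"],
            "doi": row["DOI"],
            "location": row["Source_Location"],
            "method": row["Method"],
            "dataType": row["Data_Type"],
            # Any status other than YES is stored as not verified.
            "verified": row["Verified"].strip().upper() == "YES",
        },
    }


def to_json(row):
    # ensure_ascii=False keeps units such as W/m·K readable in the output.
    return json.dumps(row_to_record(row), ensure_ascii=False, indent=2)
```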
audit
Scan all CITATION.md files and generate a report:
## Audit Report — [Project Name]
Date: [timestamp]
### Summary
- Total sources: [N]
- PDFs confirmed: [N] / [N]
- Values verified: [N] / [N]
- Needs confirmation: [N]
- Missing PDFs: [N]
### Source Details
| Paper | PDF | Values | Verified | Status |
|---|---|---|---|---|
| Cheng 2021 | ✅ | 3 | 3/3 | CONFIRMED |
| Asano 2018 | ✅ | 2 | 1/2 | ⚠️ 1 MISATTRIBUTED |
| Wang 2014 | ❌ | 4 | 0/4 | ⚠️ NEEDS CONFIRM |
### Flagged Values
- Li3InCl6 v_longitudinal: MISATTRIBUTED to Asano 2018 — paper contains no LIC data
- LGPS density: conflicting values (2.0 vs 1.9 g/cm³) between Wang 2014 and Kamaya 2011
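The source-level counts in the Summary section can be gathered by reading the `**PDF**` and `**Status**` lines of each CITATION.md. The sketch below is an assumption about how such a scan might look, with hypothetical names `read_field` and `audit`; it does not count individual values, since that would require parsing the free-form `**Data used**` field.

```python
from pathlib import Path


def read_field(text, name):
    """Return the text after '**name**:' on its line, or '' if absent."""
    prefix = f"**{name}**:"
    for line in text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return ""


def audit(citation_root):
    """Summarise every <citation_root>/<paper folder>/CITATION.md file."""
    summary = {
        "total_sources": 0,
        "pdfs_confirmed": 0,
        "missing_pdfs": 0,
        "needs_confirmation": 0,
    }
    for path in sorted(Path(citation_root).glob("*/CITATION.md")):
        text = path.read_text(encoding="utf-8")
        summary["total_sources"] += 1
        if read_field(text, "PDF").startswith("✅"):
            summary["pdfs_confirmed"] += 1
        else:
            summary["missing_pdfs"] += 1
        if "NEEDS CONFIRM" in read_field(text, "Status"):
            summary["needs_confirmation"] += 1
    return summary
```

Pass the Citation_Sources/ directory as `citation_root`; the returned dictionary fills the first few lines of the report's Summary.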
export
Generate a summary table of all data values and their provenance:
## Data Provenance Summary — [Project Name]
| Material | Property | Value | Unit | Source | Location | Data Type | Status |
|---|---|---|---|---|---|---|---|
| LLZTO | κ | 0.42 | W/m·K | Muy 2019 | Table 1 | experimental | VERIFIED |
| LAGP | v_avg | 4700 | m/s | Rohde 2021 | Table S2 | experimental | VERIFIED |
| Li3InCl6 | v_avg | 1849 | m/s | Qiu 2025 | Table 1 | DFT | VERIFIED |
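A table in this layout can be produced from a list of records. The sketch assumes each record is a dictionary keyed by the column headers shown above; the function name `provenance_table` is hypothetical.

```python
COLUMNS = [
    "Material", "Property", "Value", "Unit",
    "Source", "Location", "Data Type", "Status",
]


def provenance_table(records):
    """Render records as a pipe-delimited table with the columns above."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "---|" * len(COLUMNS),
    ]
    for rec in records:
        # A missing key raises KeyError, so incomplete records are not exported.
        lines.append("| " + " | ".join(str(rec[col]) for col in COLUMNS) + " |")
    return "\n".join(lines)
```

Letting a missing column raise `KeyError` is intentional: a row without a Source or Location should stop the export instead of appearing with a blank cell.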
Watch for these indicators of unreliable data: