Pubmed Verifier
v2.1.4Batch PubMed citation verifier — detect AI-fabricated references in one click. Five-state verdict: ✅ Correct / ⚠️ Mismatch / 🔶 Partial / ❌ Invalid / ❓ Unkno...
PubMed Citation Verifier v2.1
Five-state batch verification of PMID citations via PubMed E-utilities API, with SQLite caching, CSV support, and Crossref DOI verification.
Verdict Types
| Icon | Verdict | Meaning |
|---|---|---|
| ✅ | Correct | PMID exists AND matches claimed paper (title + author/journal) |
| ⚠️ | Mismatch | PMID exists but points to a different paper (most common AI hallucination!) |
| 🔶 | Partial | Some metadata matches (e.g., author+journal match but title differs) |
| ❌ | Invalid | PMID not found in PubMed |
| ❓ | Unknown | Insufficient claimed metadata for cross-check |
Quick Start
# Verify all PMIDs in a project directory (auto-parses citation context)
python3 scripts/verify_pmids.py --source /path/to/project --output report.html
# Verify specific PMIDs
python3 verify_pmids.py --pmids 31018962,22213727
# Verify with explicit claimed metadata (JSON)
python3 verify_pmids.py --claims '[{"pmid":"34078778","title":"JIA pathogenesis","authors":["Zaripova"],"journal":"Pediatr Rheumatol Online J","year":"2021"}]' --output report.html
# Verify with claims file (JSON or CSV)
python3 verify_pmids.py --claims-file claims.csv --suggest --output report.html
# Verify + DOI cross-check via Crossref
python3 verify_pmids.py --source /path/to/files --verify-doi --output report.html
# Full pipeline with all features
python3 verify_pmids.py --source /path/to/files --verify-doi --suggest --output report.html
What's New in v2.1
| Feature | Description |
|---|---|
| SQLite Cache | Verified PMIDs cached locally at ~/.cache/pubmed-verifier/cache.db. 30-day default expiry. Re-runs on large projects take seconds instead of minutes. |
| CSV Claims | --claims-file now accepts .csv files in addition to JSON. Auto-detects format. Semicolon or pipe-delimited authors supported. |
| Crossref DOI | --verify-doi cross-references article DOIs via Crossref API for extra confidence. |
| Retry Logic | Automatic 3-retry with exponential backoff (1s→2s→4s) on transient API failures. Zero external dependencies. |
| Dual Fuzzy Matching | Title matching uses word-level Jaccard overlap (≥50%) + SequenceMatcher (≥90%) as supplementary. |
How It Works
1. Extract PMIDs + Parse Citation Context
Scans files for common PMID patterns (PMID: 12345678, PubMed URLs, etc.).
Automatically parses surrounding citation text to extract claimed metadata:
- Author surnames (e.g.,
Ravelli A, Martini A→["Ravelli", "Martini"]) - Paper title (between author and journal)
- Journal name (from
<i>...</i>tags or position) - Publication year (
20xx/19xx)
Supported file types: .html, .md, .txt, .json, .htm
2. Fetch PubMed Metadata (with Cache)
Each PMID queried via PubMed esummary API. Results cached in SQLite for 30 days (configurable via --cache-days). Use --no-cache to force fresh queries.
Batch requests (50/call, 0.4s delay, 3-retry with exponential backoff).
3. Cross-Check: Claimed vs Actual
Dual-strategy fuzzy matching:
- Primary: Word-level Jaccard overlap ≥ 50% (handles word reordering, abbreviation expansion)
- Supplementary: SequenceMatcher ratio ≥ 90% (catches edge cases)
| Field | Match Logic |
|---|---|
| Title | Word overlap ≥ 50% OR SequenceMatcher ≥ 90% |
| Authors | ≥1 surname hit for single author claim; ≥2 for multiple |
| Journal | Containment match (handles abbreviations) |
| Year | Exact match |
Verdict determination:
title_match AND (author_match OR journal_match)→ ✅ Correctauthor_match AND journal_match AND NOT title_match→ 🔶 Partial- Otherwise → ⚠️ Mismatch
4. Crossref DOI Verification (Optional, --verify-doi)
For articles with DOIs, queries Crossref API to cross-verify title/journal/year as an independent data source.
5. Auto-Suggest (Optional, --suggest)
For mismatches, searches PubMed using claimed metadata to suggest correct PMIDs (top 3 candidates).
6. Topic Relevance (Optional, --match-keywords)
⚠️ Note: This checks topic relevance only (via filename keywords), NOT PMID correctness. Auxiliary screening tool.
7. Report Output
| Format | Flag | Use case |
|---|---|---|
| HTML | --output report.html | Visual review with claimed vs actual comparison, verdict column |
| JSON | --output report.json | Programmatic processing |
| Text | default (no --output) | Quick terminal review |
Using --claims / --claims-file
When you have explicit claimed metadata (e.g., from AI-generated documents):
JSON array format:
[
{
"pmid": "34078778",
"title": "Juvenile idiopathic arthritis: from pathogenesis to clinical practice",
"authors": ["Zaripova LN", "Midgley A", "Beresford MW"],
"journal": "Pediatr Rheumatol Online J",
"year": "2021"
}
]
CSV format (claims.csv):
pmid,title,authors,journal,year
34078778,JIA pathogenesis,Zaripova LN;Midgley A,Pediatr Rheumatol Online J,2021
31018962,FMF classification criteria,Lidar M|Lancet,,2014
CLI Reference
python3 scripts/verify_pmids.py [OPTIONS]
Options:
--source PATH File or directory to scan for PMIDs
--pmids P1,P2,... Comma-separated PMIDs to verify directly
--claims JSON JSON string with claimed metadata
--claims-file FILE JSON or CSV file with claimed metadata
--verify-doi Also verify DOIs via Crossref
--suggest Auto-suggest correct PMIDs for mismatches
--match-keywords Check topic relevance (auxiliary)
--threshold FLOAT Keyword match threshold (default: 0.2)
--no-cache Disable cache, always query API
--cache-days N Cache validity in days (default: 30)
--output FILE Output file (.json or .html)
--format FORMAT Output format: json|html|text (default: text)
Fixing Mismatched/Invalid PMIDs
- Check the report's "Suggested" column (if using
--suggest) - Use PubMed search to find the correct article:
from scripts.verify_pmids import search_pubmed results = search_pubmed('Ravelli[au] AND juvenile idiopathic arthritis AND Lancet[jour]') for r in results: print(r["pmid"], r.get("title","")) - Verify the replacement and update the source file
API Limits
- No API key required (public E-utilities)
- 3 requests/second without API key
- Script enforces 0.4s delay between batches, 3-retry with backoff
- Batch size: 50 PMIDs per request
- Cache reduces repeated API calls significantly
Performance
| Scenario | First run | Cached run |
|---|---|---|
| 225 PMIDs (MedWiki-Rheum) | ~7 min | ~5 sec |
| Single PMID | ~2s | ~1.4s |
Files
| File | Purpose |
|---|---|
scripts/verify_pmids.py | Main verification script (v2.1, 1058 lines, zero external dependencies) |
references/api_examples.md | PubMed E-utilities API examples |
