Install
openclaw skills install sci-data-extractor
AI-powered tool for extracting structured data from scientific literature PDFs.
You are a professional scientific literature data extraction assistant, helping users extract structured data from scientific paper PDFs.
Install Python dependencies (choose one method):
Method 1: Using uv (Recommended - Fastest)
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
cd /path/to/sci-data-extractor
uv venv
source .venv/bin/activate # Linux/macOS
# or .venv\Scripts\activate # Windows
uv pip install -r requirements.txt
Method 2: Using conda (Best for scientific/research users)
cd /path/to/sci-data-extractor
conda create -n sci-data-extractor python=3.11 -y
conda activate sci-data-extractor
pip install -r requirements.txt
Method 3: Using pip directly (Built-in, no extra installation)
cd /path/to/sci-data-extractor
pip install -r requirements.txt
Configure API credentials:
# Copy example configuration
cp .env.example .env
# Edit .env and add your API key
# Get API key from: https://console.anthropic.com/
EXTRACTOR_API_KEY=your-api-key-here
EXTRACTOR_BASE_URL=https://api.anthropic.com
EXTRACTOR_MODEL=claude-sonnet-4-5-20250929
EXTRACTOR_MAX_TOKENS=16384
Optional: Configure Mathpix OCR (for high-precision OCR):
# Get credentials from: https://api.mathpix.com/
MATHPIX_APP_ID=your-mathpix-app-id
MATHPIX_APP_KEY=your-mathpix-app-key
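The settings above can be read from the environment in a few lines of Python. This is a minimal sketch, not the code in extractor.py: the `ExtractorSettings` class and `load_settings` function are hypothetical names, treating a missing API key as an error is an assumption, and the base URL fallback is simply the value from the example configuration. The model, temperature, and token defaults are the ones this document lists.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ExtractorSettings:
    """Hypothetical container for the documented environment variables."""
    api_key: str
    base_url: str
    model: str
    temperature: float
    max_tokens: int
    mathpix_app_id: Optional[str]
    mathpix_app_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> ExtractorSettings:
    """Build settings from a mapping of environment variables."""
    api_key = env.get("EXTRACTOR_API_KEY", "")
    if not api_key:
        # Assumption: the tool cannot call the LLM without a key.
        raise ValueError("EXTRACTOR_API_KEY is not set")
    return ExtractorSettings(
        api_key=api_key,
        # Assumption: fall back to the URL shown in the example .env file.
        base_url=env.get("EXTRACTOR_BASE_URL", "https://api.anthropic.com"),
        model=env.get("EXTRACTOR_MODEL", "claude-sonnet-4-5-20250929"),
        temperature=float(env.get("EXTRACTOR_TEMPERATURE", "0.1")),
        max_tokens=int(env.get("EXTRACTOR_MAX_TOKENS", "16384")),
        # Mathpix credentials are optional, so None means "not configured".
        mathpix_app_id=env.get("MATHPIX_APP_ID"),
        mathpix_app_key=env.get("MATHPIX_APP_KEY"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to check without touching the real environment.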
Check that the command-line tool runs:
python extractor.py --help
When users request data extraction:
python extractor.py input.pdf --template enzyme -o output.md
Fields: Enzyme, Organism, Substrate, Km, Unit_Km, Kcat, Unit_Kcat, Kcat_Km, Unit_Kcat_Km, Temperature, pH, Mutant, Cosubstrate
Fields: Experiment, Condition, Result, Unit, Standard_Deviation, Sample_Size, p_value
Fields: Author, Year, Journal, Title, DOI, Key_Findings, Methodology
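A field list like the ones above can serve as a quick sanity check on an extracted table. The sketch below assumes that CSV output carries a header row whose column names match the field names exactly; the source does not confirm this, and `missing_columns` is a hypothetical helper, not part of the tool. Only the enzyme kinetics field list is used here.

```python
import csv
import io
from typing import List, Sequence

# Field names copied from the enzyme kinetics field list in this document.
ENZYME_FIELDS = [
    "Enzyme", "Organism", "Substrate", "Km", "Unit_Km", "Kcat", "Unit_Kcat",
    "Kcat_Km", "Unit_Kcat_Km", "Temperature", "pH", "Mutant", "Cosubstrate",
]


def missing_columns(csv_text: str,
                    expected: Sequence[str] = ENZYME_FIELDS) -> List[str]:
    """Return the expected field names absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]
```

An empty result means every expected column is present; any names returned point to fields the extraction did not produce.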
Users can set the following environment variables, either in the shell or in the .env file:
EXTRACTOR_API_KEY: LLM API key
EXTRACTOR_BASE_URL: API endpoint
EXTRACTOR_MODEL: Model name (default: claude-sonnet-4-5-20250929)
EXTRACTOR_TEMPERATURE: Temperature parameter (default: 0.1)
EXTRACTOR_MAX_TOKENS: Maximum output tokens (default: 16384)
MATHPIX_APP_ID: Mathpix OCR App ID (optional)
MATHPIX_APP_KEY: Mathpix OCR Key (optional)
Example command for enzyme kinetics extraction:
python extractor.py paper.pdf --template enzyme -o results.md
Example for custom extraction:
python extractor.py paper.pdf -p "Extract all protein structures with PDB IDs" -o custom.md
Example for CSV output:
python extractor.py paper.pdf --template enzyme -o results.csv --format csv
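To process a folder of papers, the CSV command above can be generated once per PDF. This is a rough sketch that uses only the flags shown in this document (`--template`, `-o`, `--format`); the `build_command` and `build_batch` helpers and the directory layout are hypothetical, and it assumes extractor.py is in the current working directory.

```python
import sys
from pathlib import Path
from typing import List


def build_command(pdf: Path, out_dir: Path, template: str = "enzyme") -> List[str]:
    """Build the argument list for one extractor.py run with CSV output."""
    out_file = out_dir / (pdf.stem + ".csv")  # paper.pdf -> paper.csv
    return [
        sys.executable, "extractor.py", str(pdf),
        "--template", template,
        "-o", str(out_file),
        "--format", "csv",
    ]


def build_batch(pdf_dir: Path, out_dir: Path) -> List[List[str]]:
    """Build one command per PDF in pdf_dir, in sorted order."""
    return [build_command(pdf, out_dir) for pdf in sorted(pdf_dir.glob("*.pdf"))]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` so that a failed extraction stops the batch instead of passing silently.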