{"skill":{"slug":"clinical-data-extractor","displayName":"Clinical Data Extractor","summary":"Extract structured clinical trial data from URLs or PDFs including drug name, manufacturer, indication, phase, trial info, efficacy and safety tables, and sa...","description":"---\nname: clinical-data-extractor\ndescription: Extract clinical trial data from pharmaceutical conference websites or PDF documents. Use when user provides a URL or PDF file containing innovative drug clinical trial data and needs structured extraction of: drug name, manufacturer, indication, clinical phase, trial name, conference, efficacy and safety data (presented as tables), and markdown output to \"药品名称@适应症.md\" file.\n---\n\n# Clinical Data Extractor\n\n## Overview\n\nThis skill enables extracting structured clinical trial data from pharmaceutical conference websites (ASCO, ESMO, EHA, etc.) and saving it as a markdown file with standardized format.\n\n## Configuration\n\n**输出路径**: `~/.openclaw/workspace`\n**命名格式**: `{药品名称}@{适应症}.md`\n**文件名清理规则**:\n- 替换空格为下划线\n- 移除特殊字符\n\n**常见终点缩写列表** - 以下缩写在表格中无需写中文全称：\n- ORR (客观缓解率)\n- cORR (确认缓解率)\n- DCR (疾病控制率)\n- mPFS/rPFS (中位无进展生存期)\n- mOS (中位总生存期)\n- mDOR (中位缓解持续时间)\n- PSA50/PSA90 (PSA缓解率)\n- CR (完全缓解)\n- PR (部分缓解)\n- SD (疾病稳定)\n- PD (疾病进展)\n- AE (不良事件)\n\n⚠️ 如需修改配置，请直接编辑本配置区域\n⚠️ 注意：`~` 需在执行时展开为实际用户 home directory\n⚠️ 注意：不在列表中的终点数据应写中文全称以清晰说明\n⚠️ 注意：本技能会尝试自动提取网页图片，但对于受限平台（微信公众号等）需要手动截图\n\n## Requirements\n\nThis skill requires the following tools to be available in the OpenClaw runtime environment:\n\n### Core Dependencies\n- **browser**: Built-in OpenClaw tool for webpage automation and content extraction (no installation required)\n  - Requires: Chrome browser installed on the host system\n  - Usage: `browser action=start profile=openclaw target=host`\n- **read/write**: Built-in OpenClaw tools for file operations (no installation required)\n\n### Browser System Requirements\nFor browser automation to work correctly:\n- **Chrome browser**: Must be 
installed on the host system (typically at `/usr/bin/google-chrome` or similar path)\n- **Display server**: Desktop environment with X11/Wayland (for non-headless mode) or headless mode support\n- **Network connectivity**: Required for loading webpages\n\n### Optional Dependencies (For PDF Processing)\nThe following tools are used for PDF extraction. The skill will attempt each method in order:\n\n1. **nano-pdf CLI** (recommended)\n   - Installation: Usually pre-installed with OpenClaw\n   - Alternative: If unavailable, file size reduction or OCR may be needed for scanned PDFs\n\n2. **pdftotext** (poppler-utils)\n   - Installation (Debian/Ubuntu): `sudo apt-get install poppler-utils`\n   - Installation (macOS): `brew install poppler`\n   - Used as fallback if nano-pdf is not available\n\n### Filesystem Requirements\n- **Write access to user home directory**: The skill creates markdown files and image files in the configured output path (default: `~/.openclaw/workspace`)\n\n### Configuration Flexibility\nAll configuration options are defined in the **Configuration** section above and can be modified without reinstalling the skill:\n- **输出路径** (Output path)\n- **命名格式** (Filename format)\n- **文件名清理规则** (Filename sanitization rules)\n- **常见终点缩写列表** (Common endpoint abbreviations)\n\n## When to Use\n\nUse this skill when:\n\n1. **User provides a URL** to a pharmaceutical conference website or clinical trial publication (ASCO, ESMO, EHA, WCLC, AACR, etc.) containing clinical trial data\n2. **User provides a PDF file** containing clinical trial data\n3. **User requests to extract structured clinical trial data** from webpages or PDFs\n4. 
User mentions keywords like \"临床数据\", \"临床试验\", \"clinical data\", \"clinical trial\", or similar requests\n\n**Examples of trigger phrases:**\n- \"提取临床数据\"\n- \"把这份PDF里的临床试验信息整理一下\"\n- \"Extract clinical trial data from this URL/PDF\"\n\n## Workflow\n\n### Step 1: Detect Input Type and Extract Content\n\nDetermine if user provided a **URL** or a **PDF file**.\n\n#### Case A: User provides a URL\n\nUse the **built-in browser** to open and extract page content:\n\n1. **Start browser** (if not already running):\n   ```bash\n   browser action=start profile=openclaw target=host\n   ```\n\n2. **Navigate to URL**:\n   ```bash\n   browser action=open targetUrl=<provided-url>\n   ```\n\n3. **Capture page snapshot** to extract content:\n   ```bash\n   browser action=snapshot format=markdown\n   ```\n\n4. **Optional: Take screenshot** for visual reference:\n   ```bash\n   browser action=screenshot fullPage=true\n   ```\n\n#### Case B: User provides a PDF file\n\nExtract text content from the PDF. Two approaches available:\n\n**Approach 1: Use nano-pdf CLI (read-only)**\n```bash\nnano-pdf --file <path-to-pdf> --action read\n```\n\n**Approach 2: Use nano-pdf with natural language instructions**\n```bash\nnano-pdf --file <path-to-pdf> --action edit --instruction \"Extract all text content from this PDF, focusing on clinical trial data including drug name, indication, phase, efficacy, and safety results\"\n```\n\nNote: The extracted PDF content will be in raw text format. You may need to clean up formatting before proceeding to extraction.\n\n### Step 2: Extract Key Information\n\nAnalyze the fetched content and extract the following fields. Leave blank if information is not available:\n\n1. **药品名称** (Drug Name)\n2. **生产厂家** (Manufacturer)\n3. **适应症** (Indication)\n4. **临床阶段** (Clinical Phase)\n5. **临床名称** (Trial Name)\n6. **学术会议** (Academic Conference)\n7. 
**药品有效性和安全性** (Efficacy and Safety)\n\n#### Handling Clinical Data Images\n\n**网页图片**：\n- **公开网站（ASCO、ESMO、EHA 等）**：识别网页中直接显示的临床数据图片（如疗效曲线、安全性图表），尝试提取图片 URL 并引用\n  ```markdown\n  ![临床数据描述](图片URL)\n  ```\n- **受限平台（微信公众号等）**：这些平台通常会禁止图片链接的外部访问，无法直接提取图片 URL。在此情况下：\n  - 在文档中添加图片说明，提示用户手动截图\n  - 提供参考图片的描述（如\"疗效数据图\"、\"安全性汇总表\"等）\n  - 如果需要获取原图，建议用户手动截图保存\n\n**图片处理的两种方式**：\n\n**方式一：自动提取（适用于公开网站）**\n```markdown\n![疗效曲线图](https://esmo.org/.../survival_curve.png)\n```\n\n**方式二：手动截图（适用于受限平台或提取失败时）**\n```markdown\n## 临床数据图片\n\n⚠️ 无法自动提取图片（受限平台或提取失败），建议手动截图保存。\n\n参考图片描述：\n1. 疗效数据图（如 rPFS 曲线、OS 曲线）\n2. 安全性汇总表（AE 发生率、严重 AE）\n\n截图保存路径示例：\n![疗效曲线](./药品名称_疗效曲线.png)\n```\n\n**PDF 图片**：\n- 识别 PDF 中直接显示临床数据的图片页面\n- **重要**：PDF 中的图片无法自动提取，需要手动截图保存\n- 使用截图工具保存图片到输出目录（与 markdown 文件同目录）\n- 在 markdown 中用本地路径引用：\n  ```markdown\n  ![临床数据描述](./图片文件名.png)\n  ```\n\nFor effectiveness and safety data, present findings in **markdown table format**:\n\n```markdown\n## 药品有效性和安全性\n\n| 指标 | ABC001 | 对照组 | HR | p-value |\n|------|----------------|--------|------|------|\n| N | 100 | 50 | - | - |\n| ORR | 41.4% | 25.3% | - | <0.0001 |\n| cORR | 34.5% | - | - | <0.0001 |\n| DCR | 87.9% | - | - | <0.0001 |\n| mPFS | 11.3 | 6.8 |  0.62  | <0.0001 |\n| mOS | 22.1 | 14.2 |  0.73  | <0.0001 |\n| 最常见AE | 恶心、血液事件（1-2级） | - | - | - |\n```\n\n**多剂量组示例**：\n```markdown\n| 指标 |  AAB001 2mg | AAB001 4mg | AAB001 6mg |  Placebo |\n|------|----------|--------------|--------------|--------------|\n| N | 50 | 50 | 50 | 50 |\n| OS | 12.1 | 14.2 | 17.3 | 0.2 |\n| OS p-value | <0.0001 | <0.0001 | <0.0001 | - |\n| PFS | 12.1 | 14.2 | 17.3 | 0.2 |\n| PFS p-value | <0.0001 | <0.0001 | <0.0001 | - |\n```\n\n**表格格式规范**：\n- 表格内容第一行必须列出各组入组人数，指标列写\"N\"\n- **关键原则**：确保同一列的数据与该列标题对应的cohort一致\n- **重要规则**：必须明确标注cohort的具体信息（如剂量组、治疗方案等），避免使用\"最大剂量组\"、\"高剂量组\"等笼统表述\n  - ❌ 错误：`AAB001 (最大效果)` 或 `高剂量组`\n  - ✅ 正确：`AAB001 6mg` 或 `对照组`\n- 不同终点可能基于不同分析人群（如总人群 vs 可评估人群），需分别分列\n- 缺乏的数据标注 \"N/A\"，不要将不同人群的数据混用\n- 合并主要终点、次要终点、安全性到一个表格\n- 
列名：`[\"指标\", \"实验组1\", \"实验组2\", ...]` 或 `[\"指标\", \"实验组\", \"对照组\"]`（如有对照）\n- 常见终点使用英文缩写（见 **Configuration** 中的\"常见终点缩写列表\"）\n- 不常见的终点写中文全称\n- 不要写95% CI置信区间\n- 时间指标（PFS/OS/DOR等）只写数字，不写单位（如 `11.3` 而非 `11.3个月`）\n- 百分比保留一位小数（如 `41.4%`）\n- 数值不存在的用 `N/A` 或 `NE`（未成熟/未评估）表示\n- 可在数值后用括号标注实际样本量（如 `11.3 (N=82)`）\n\n### Step 3: Save as Markdown File\n\nGenerate output file using the configuration from **Configuration** section:\n\n1. **Filename format**: Follow the **命名格式** from Configuration section\n2. **Sanitize filename**:\n   - Replace spaces with underscores\n   - Remove special characters\n3. **Final save path**: Use the **输出路径** from Configuration section, followed by the generated `{filename}`, then expand `~` to actual home directory.\n\n### Step 4: Generate Expert Commentary (Optional but Recommended)\n\nFrom a medical/pharmaceutical expert perspective, provide a concise analysis of the clinical trial data. This section should be clearly marked as \"（仅供参考）\" (For reference only).\n\nKey aspects to analyze:\n\n1. **Efficacy Evaluation**\n   - Did primary endpoints reach statistical significance?\n   - Are the effect sizes clinically meaningful?\n   - How does it compare to existing therapies in the same indication?\n\n2. **Safety Considerations**\n   - Is the safety profile acceptable?\n   - Any concerning AEs?\n   - How does it compare to the drug class safety profile?\n\n3. **Study Design Assessment**\n   - Is the trial design appropriate?\n   - Is sample size adequate?\n   - Are control groups appropriate?\n   - Any limitations?\n\n4. **Clinical Prospects**\n   - What's the potential for FDA/NMPA approval?\n   - Commercial potential?\n   - What clinical development pathway comes next?\n\n5. **Cautions & Limitations**\n   - Data limitations\n   - What still needs to be validated\n\nProvide concise, objective analysis (3-6 bullet points). 
Avoid over-optimistic language.\n\n### Step 5: File Content Structure\n\nThe generated markdown file should follow this template:\n\n```markdown\n# {药品名称} - {适应症} 临床数据\n\n## 基本信息\n\n| 字段 | 内容 |\n|------|------|\n| 药品名称 | {药品名称} |\n| 生产厂家 | {生产厂家} |\n| 适应症 | {适应症} |\n| 临床阶段 | {临床阶段} |\n| 临床名称 | {临床名称} |\n| 学术会议 | {学术会议} |\n\n## 药品有效性和安全性\n\n| 指标 | {实验组名称} |\n|------|--------------|\n| 主要终点数据... | 值 |\n| 次要终点数据... | 值 |\n| 安全性数据... | 值 |\n\n## 试验设计\n\n| 设计要素 | 内容 |\n|----------|------|\n| 研究类型 | ... |\n| 入组人数 | ... |\n\n## 临床数据图片\n\n{网页图片链接或PDF截图引用}\n\n## 专家点评\n（仅供参考）\n\n从药学/医学专家角度分析该临床数据的意义：\n\n- **疗效评价**：[分析主要终点结果是否达到临床意义，对比同类药物]\n- **安全性考量**：[分析安全性概况，关注关键AE]\n- **研究设计评价**：[研究设计是否合理、样本量是否充足、对照组选择等]\n- **临床前景**：[基于当前数据评估药物商业化潜力及后续研究方向]\n- **注意事项**：[数据的局限性、需要进一步验证的点等]\n\n## 数据来源\n\n{URL或PDF路径}\n提取时间: {当前日期}\n```\n\n## Tips\n\n- **配置修改**: 输出路径、命名格式、常见缩写列表在 **Configuration** 区域定义，直接编辑即可修改\n- **输出文件位置**: 查看 **Configuration** 区域的 `输出路径` 设置\n- **终点缩写规则**: 只对 **Configuration** 中\"常见终点缩写列表\"内的缩写使用英文，其他终点写中文全称\n- **浏览器使用**: 网页提取使用内置浏览器，启动时指定 `target=host` 参数。如果浏览器未运行，skill 会自动启动\n- Use memory_search to check if similar drugs have been processed before extracting\n- If the content (webpage or PDF) contains multiple drugs or trials, clarify with user which one to extract\n- For complex clinical endpoints, preserve original terminology and units\n- **图片处理注意事项**:\n  - 网页图片：提取原始 URL，在 markdown 中直接引用 `![描述](URL)` 或 `<URL>`（避免大图预览）\n  - PDF 图片：截图保存到与 markdown 同目录，使用相对路径引用 `![描述](./文件名.png)`\n  - 图片命名：使用药品名称+序号，如 `PD-1抑制剂_图表1.png`\n  - 只有临床数据相关的图片需要保存，装饰性图片可以忽略\n- **PDF 处理注意事项**:\n  - PDF 提取的文本格式可能比较混乱，需要适当清理换行和空格\n  - 表格数据在 PDF 中可能无法完整保留，需要根据上下文推断\n  - 如果 PDF 是扫描图片，nano-pdf 可能无法提取文本，需要先 OCR 处理\n  - 对于大型 PDF 文件，可以先使用 `--action read` 快速提取全文内容\n","topics":["PDF"],"tags":{"latest":"1.0.4"},"stats":{"comments":0,"downloads":1203,"installsAllTime":45,"installsCurrent":1,"stars":0,"versions":5},"createdAt":1772023145453,"updatedAt":1778993154783},"latestVersion":{"version":"1.0.4","createdAt":1772030895802,"changelog":"**Summary: Updated to use browser-based extraction and clarified system requirements.**\n\n- Replaced web_fetch dependency with browser-based automation for webpage content extraction.\n- Provided step-by-step instructions using the built-in browser tool for URL workflows.\n- Added explicit system and Chrome browser requirements for webpage extraction.\n- Clarified that Chrome must be installed and detailed necessary host environment setup.\n- No changes to core data extraction, PDF, or markdown output logic.","license":null},"metadata":null,"owner":{"handle":"abinww","userId":"s176414bjbdm0wmkqn8smezt0x83g9fr","displayName":"Abin","image":"https://avatars.githubusercontent.com/u/23470117?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: review.llm_review","engineVersion":"v2.4.24","updatedAt":1779958910533}}