Contract Clause Extractor

Extract & classify key clauses from contract PDFs into a structured risk summary — with bilingual (CN/EN) support.

Install

openclaw skills install @harrylabsj/contract-clause-extractor

Contract Clause Extractor (合同条款提取器)

Turn dense contract PDFs into structured, scannable clause summaries with risk ratings. Extract key clauses across 12 standard categories, flag hidden risks, compare multiple contracts side-by-side, and generate bilingual clause translations — all without replacing legal counsel.

Core Capabilities

  • Multi-format contract ingestion: Parse PDF, DOCX, plain text, or scanned images (OCR) in Chinese and English
  • 12-category clause classification: Auto-classify every clause into standard legal categories with hierarchical numbering awareness
  • Traffic-light risk annotation: 🔴 High risk | 🟡 Medium risk | 🟢 Low risk per clause with explanatory reasoning
  • Hidden risk detection: Flag overly broad indemnities, unilateral termination rights, unreasonable jurisdiction clauses, and missing standard protections
  • Multi-contract comparison: Align and diff clauses across 2+ contracts for quick discrepancy spotting
  • Bilingual extraction: Extract key clauses in CN→EN or EN→CN with terminology preservation
  • Modification suggestion engine: Generate plain-language modification proposals for risky clauses

Workflow (9 Steps)

Step 1: Contract Ingestion

Input: User uploads contract PDF/DOCX, provides URL, or pastes text. Supports single or multiple files for comparison mode.
Action: Identify document structure: page layout, clause numbering pattern (1.1 / Article 1 / 第一条), table presence, signature blocks.
Output: Parsed document with structural metadata. If scanned/image PDF, trigger OCR pipeline.
Logic: Auto-detect language (Chinese, English, or mixed). Handle password-protected PDFs by requesting password.
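
The language auto-detection in this step could be approximated by counting character classes. The sketch below is a rough illustration, not the skill's actual detector: the function name `detect_language`, the 20% threshold, and the choice to look only at the basic CJK block are all assumptions.

```python
import re

# Basic CJK Unified Ideographs block; assumed sufficient to spot Chinese text.
_CJK = re.compile(r"[\u4e00-\u9fff]")
_LATIN = re.compile(r"[A-Za-z]")


def detect_language(text: str, mixed_threshold: float = 0.2) -> str:
    """Return 'zh', 'en', 'mixed', or 'unknown' from simple character counts."""
    cjk = len(_CJK.findall(text))
    latin = len(_LATIN.findall(text))
    total = cjk + latin
    if total == 0:
        return "unknown"
    # If the less common script still makes up a sizeable share, call it mixed.
    if min(cjk, latin) / total >= mixed_threshold:
        return "mixed"
    return "zh" if cjk > latin else "en"


print(detect_language("第一条 Payment Terms 付款条款"))  # prints "mixed"
```

Comparing letters against ideographs favors English, because one English word spans several letters. A real detector would probably weigh words or use a dedicated language-identification library.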

Step 2: Clause Segmentation

Input: Parsed document.
Action: Segment by clause boundaries using numbering patterns, heading styles, and semantic breaks. Preserve parent-child hierarchy for nested clauses.
Output: Indexed clause list with numbering + raw text + parent reference.
Logic: Handle non-standard numbering (Chinese legal: 一、 / (一) / 1. / (1)). Handle cross-page clause splits.
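
One way to picture numbering-based segmentation is a line scanner that recognizes heading styles and tracks nesting. The sketch below is a hypothetical illustration: the regexes, the `Clause` dataclass, and the rule that a style's nesting level is set by the order in which it first appears are assumptions, and heading styles and semantic breaks are not covered.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

_CN_NUM = "一二三四五六七八九十百零〇"
# (style, pattern) pairs tried in order; all patterns are illustrative assumptions.
_STYLES = [
    ("cn_article", re.compile(rf"^第[{_CN_NUM}\d]+条")),
    ("en_article", re.compile(r"^Article\s+\d+", re.IGNORECASE)),
    ("cn_comma", re.compile(rf"^[{_CN_NUM}]+、")),
    ("cn_paren", re.compile(rf"^[((][{_CN_NUM}]+[))]")),
    ("decimal", re.compile(r"^\d+(?:\.\d+)+")),
    ("digit_dot", re.compile(r"^\d+\.(?!\d)")),
    ("digit_paren", re.compile(r"^[((]\d+[))]")),
]


@dataclass
class Clause:
    index: int
    number: str
    text: str
    parent: Optional[int]  # index of the parent clause, None at top level


def _match_heading(line: str):
    for style, pattern in _STYLES:
        m = pattern.match(line)
        if m:
            number = m.group(0)
            if style == "decimal":  # treat 1.1 and 1.1.1 as different depths
                style = f"decimal{number.count('.')}"
            return style, number
    return None


def segment(text: str) -> List[Clause]:
    clauses: List[Clause] = []
    level_of = {}  # numbering style -> nesting level, by order of first appearance
    stack = []     # (level, clause index) of the currently open ancestors
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = _match_heading(line)
        if heading is None:
            # Unnumbered line: treat it as a continuation of the previous clause
            # (text before the first heading is dropped in this sketch).
            if clauses:
                clauses[-1].text += " " + line
            continue
        style, number = heading
        level = level_of.setdefault(style, len(level_of) + 1)
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1] if stack else None
        clauses.append(Clause(len(clauses), number, line[len(number):].strip(), parent))
        stack.append((level, clauses[-1].index))
    return clauses
```

Appending unnumbered lines to the previous clause also rejoins a clause that was split across a page break, provided page headers and footers were stripped earlier.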

Step 3: Clause Classification

Input: Segmented clauses. Action: Classify each clause into one of 12 standard categories using LLM semantic matching:

  1. Payment Terms (付款条款) — amounts, schedules, milestones, late fees
  2. Delivery/Performance (交付/履约条款) — scope, timeline, acceptance criteria
  3. Breach & Penalties (违约责任) — liquidated damages, remedies, cure periods
  4. Confidentiality (保密条款) — scope, duration, exclusions, return/destruction
  5. Intellectual Property (知识产权) — ownership, licensing, work-for-hire, background IP
  6. Non-Compete / Non-Solicit (竞业限制) — scope, duration, geographic limits
  7. Jurisdiction & Dispute Resolution (管辖权/争议解决) — governing law, venue, arbitration
  8. Termination (终止条款) — termination for cause, convenience, effects of termination
  9. Force Majeure (不可抗力) — definition, notice, consequences
  10. Liability & Indemnity Caps (赔偿上限) — total liability, damages exclusions, indemnification scope
  11. Acceptance Criteria (验收标准) — testing, UAT, defect remediation
  12. Renewal & Term (续约/期限) — initial term, auto-renewal, notice periods

Output: Clauses grouped by category with confidence scores.
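
To make the shape of "category plus confidence score" concrete, here is a keyword-overlap stand-in. It is explicitly not the LLM semantic matching described above: the keyword lists, the `classify` function, and the ambiguity margin are assumptions chosen only to show what the output might look like, including a clause tagged with more than one category.

```python
from typing import List, Tuple

# Hypothetical keyword lists, loosely derived from the category descriptions.
CATEGORY_KEYWORDS = {
    "Payment Terms": ["payment", "invoice", "late fee", "milestone", "付款"],
    "Delivery/Performance": ["deliver", "performance", "timeline", "交付", "履约"],
    "Breach & Penalties": ["breach", "liquidated damages", "cure period", "违约"],
    "Confidentiality": ["confidential", "non-disclosure", "保密"],
    "Intellectual Property": ["intellectual property", "license", "work-for-hire", "知识产权"],
    "Non-Compete / Non-Solicit": ["non-compete", "non-solicit", "竞业"],
    "Jurisdiction & Dispute Resolution": ["governing law", "arbitration", "jurisdiction", "管辖", "仲裁"],
    "Termination": ["terminate", "termination", "终止", "解除"],
    "Force Majeure": ["force majeure", "不可抗力"],
    "Liability & Indemnity Caps": ["liability", "indemnif", "赔偿"],
    "Acceptance Criteria": ["acceptance", "defect", "验收"],
    "Renewal & Term": ["renew", "initial term", "续约", "期限"],
}


def classify(clause_text: str, ambiguity_margin: float = 0.25) -> List[Tuple[str, float]]:
    """Return (category, confidence) pairs, best first.

    Confidence is the share of keyword hits that fell in that category.
    Categories within `ambiguity_margin` of the best score are all returned,
    so a clause that spans topics is tagged with several categories.
    """
    text = clause_text.lower()
    hits = {cat: sum(kw in text for kw in kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return []
    scored = sorted(
        ((cat, n / total) for cat, n in hits.items() if n),
        key=lambda pair: pair[1],
        reverse=True,
    )
    best = scored[0][1]
    return [(cat, round(conf, 2)) for cat, conf in scored if best - conf <= ambiguity_margin]


print(classify("乙方应在收到发票后90日内付款"))  # [('Payment Terms', 1.0)]
```

Substring matching is crude (a keyword can match inside an unrelated word), which is one reason a semantic model is the better fit for the real classifier.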

Step 4: Risk Annotation

Input: Classified clauses. Action: Score each clause on risk level:

  • 🔴 High Risk: Unlimited liability, one-sided termination, unreasonable jurisdiction, missing standard protections, IP grab, excessive penalty ratios
  • 🟡 Medium Risk: Ambiguous language, unbalanced but market-standard terms, narrow cure periods, broad force majeure
  • 🟢 Low Risk: Balanced terms, boilerplate with no unusual provisions, standard commercial terms

Output: Each clause tagged with risk level + brief explanation of why.

Step 5: Hidden Risk Detection

Input: Entire contract + risk-annotated clauses. Action: Pattern-based scanning for structural risks:

  • Overly broad indemnification (e.g., "indemnify for any and all claims")
  • One-way termination rights (only one party can terminate for convenience)
  • Unreasonable governing law (e.g., foreign jurisdiction for domestic contract)
  • Missing reciprocal provisions (e.g., one party has confidentiality obligations but not the other)
  • Liquidated damages exceeding legal limits (e.g., more than 30% above the actual loss, the level PRC courts generally treat as excessive)
  • Automatic renewal without notice

Output: "Hidden Risk Alerts" section with specific clause references and severity rating.
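
A few of the patterns above lend themselves to simple text rules. The following is a minimal sketch under stated assumptions: the regexes, the party names "Party A"/"Party B", the severity labels, and the function `scan_hidden_risks` are all hypothetical, and only three of the listed patterns are covered.

```python
import re
from typing import Dict, List, Tuple

BROAD_INDEMNITY = re.compile(r"indemnif\w*[^.]{0,120}\bany and all\b", re.IGNORECASE)
AUTO_RENEW = re.compile(r"\bautomatically\s+renew", re.IGNORECASE)
TERMINATE = re.compile(r"\b(either party|party [ab])\s+may\s+terminate\b", re.IGNORECASE)


def scan_hidden_risks(clauses: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Scan clauses (reference -> text) and return (clause_ref, alert, severity)."""
    alerts = []
    terminators = {}  # party -> first clause that grants it a termination right
    for ref, text in clauses.items():
        if BROAD_INDEMNITY.search(text):
            alerts.append((ref, "Overly broad indemnification", "high"))
        if AUTO_RENEW.search(text) and "notice" not in text.lower():
            alerts.append((ref, "Automatic renewal without notice", "medium"))
        for m in TERMINATE.finditer(text):
            terminators.setdefault(m.group(1).lower(), ref)
    # Structural check: exactly one named party holds a termination right.
    if len(terminators) == 1 and "either party" not in terminators:
        party, ref = next(iter(terminators.items()))
        alerts.append((ref, f"One-way termination right ({party} only)", "high"))
    return alerts


sample = {
    "8.2": "Party A shall indemnify Party B for any and all claims arising hereunder.",
    "10.1": "Party B may terminate this Agreement at any time.",
    "12.3": "This Agreement shall automatically renew for successive one-year terms.",
}
for alert in scan_hidden_risks(sample):
    print(alert)
```

The one-way termination check runs after the loop because it depends on the whole contract rather than any single clause, which matches the "entire contract" input of this step.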

Step 6: Clause Summarization

Input: Risk-annotated clauses. Action: Generate a structured extraction table:

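| # | Clause Category | Original Text (excerpt) | Summary | Risk | Modification Suggestion |
|---|---|---|---|---|---|
| 1 | Payment | "乙方应在收到发票后90日内付款" (Party B shall pay within 90 days of receiving the invoice) | 90-day payment term | 🟡 | Negotiate to 30 days standard |
| 2 | Liability | "赔偿上限为合同金额的1倍" (Liability is capped at 1× the contract value) | Liability cap = 1× contract value | 🟢 | Standard protection |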

Output: Complete extraction table. Option to export as CSV/XLSX.
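
The CSV export could look roughly like the sketch below, which uses Python's standard `csv` module. The column names mirror the extraction table; the function name `to_csv` and the row dictionaries are assumptions, and XLSX export is not covered.

```python
import csv
import io
from typing import Dict, List

COLUMNS = [
    "#",
    "Clause Category",
    "Original Text (excerpt)",
    "Summary",
    "Risk",
    "Modification Suggestion",
]


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize extraction-table rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    {
        "#": 1,
        "Clause Category": "Payment",
        "Original Text (excerpt)": "乙方应在收到发票后90日内付款",
        "Summary": "90-day payment term",
        "Risk": "🟡",
        "Modification Suggestion": "Negotiate to 30 days standard",
    },
    {
        "#": 2,
        "Clause Category": "Liability",
        "Original Text (excerpt)": "赔偿上限为合同金额的1倍",
        "Summary": "Liability cap = 1× contract value",
        "Risk": "🟢",
        "Modification Suggestion": "Standard protection",
    },
]
print(to_csv(rows))
```

When writing the result to a file meant for Excel, opening the file with `encoding="utf-8-sig"` is a common way to make the Chinese text and emoji display correctly.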

Step 7: Multi-Contract Comparison (if applicable)

Input: 2+ contracts with their extraction tables. Action: Align clauses by category, then:

  • Identify clauses present in Contract A but missing in Contract B
  • Detect wording differences in matching clauses
  • Flag clauses where risk levels differ between contracts

Output: Side-by-side comparison table with diff highlights.
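
The align-then-diff idea can be sketched with the standard library's `difflib`. This is an illustration under simplifying assumptions: each contract is reduced to one clause per category, and the input shape (category mapped to `text` and `risk`) and the function name `compare_contracts` are hypothetical.

```python
import difflib
from typing import Dict, List


def compare_contracts(a: Dict[str, dict], b: Dict[str, dict]) -> List[dict]:
    """Align two contracts by category and report presence, wording, and risk differences."""
    rows = []
    for category in sorted(set(a) | set(b)):
        ca, cb = a.get(category), b.get(category)
        if ca is None or cb is None:
            status = "missing in A" if ca is None else "missing in B"
            rows.append({"category": category, "status": status})
            continue
        # Character-level similarity in [0, 1]; 1.0 means identical wording.
        ratio = difflib.SequenceMatcher(None, ca["text"], cb["text"]).ratio()
        rows.append({
            "category": category,
            "status": "identical" if ca["text"] == cb["text"] else "wording differs",
            "similarity": round(ratio, 2),
            "risk_changed": ca["risk"] != cb["risk"],
        })
    return rows


contract_a = {
    "Payment Terms": {"text": "Payment due within 90 days.", "risk": "medium"},
    "Termination": {"text": "Either party may terminate.", "risk": "low"},
}
contract_b = {
    "Payment Terms": {"text": "Payment due within 30 days.", "risk": "low"},
    "Confidentiality": {"text": "Each party keeps information secret.", "risk": "low"},
}
for row in compare_contracts(contract_a, contract_b):
    print(row)
```

For the highlighted diff itself, `difflib.ndiff` or `difflib.unified_diff` over the two clause texts would be a natural next step.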

Step 8: Bilingual Extraction (optional)

Input: Extraction table + target language selection.
Action: Translate clause summaries and key terms while preserving legal terminology consistency. Build an ad-hoc bilingual term glossary for the document.
Output: Bilingual extraction table (Original → Summary in Target Language). Key terms glossary.
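
Terminology consistency can be checked mechanically once term pairs have been collected. The sketch below assumes the translation step yields (source term, translation) pairs. The function name `build_glossary` and the "first translation wins" rule are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_glossary(
    term_pairs: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Build a per-document glossary and report terms translated inconsistently."""
    seen = defaultdict(list)
    for source, target in term_pairs:
        if target not in seen[source]:
            seen[source].append(target)
    # Keep the first translation seen; list every term that had more than one.
    glossary = {source: targets[0] for source, targets in seen.items()}
    conflicts = {source: targets for source, targets in seen.items() if len(targets) > 1}
    return glossary, conflicts


pairs = [
    ("保密信息", "Confidential Information"),
    ("接收方", "Receiving Party"),
    ("接收方", "Recipient"),
]
glossary, conflicts = build_glossary(pairs)
print(glossary)
print(conflicts)
```

Terms that land in `conflicts` are good candidates for a "translation may need legal review" flag.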

Step 9: Report Generation

Input: All analysis results. Action: Compile into a comprehensive extraction report:

  1. Executive Summary: Contract type, parties, date, overall risk score
  2. Risk Summary: Count of 🔴/🟡/🟢 clauses, top 5 risks
  3. Clause Extraction Table: Full categorized table
  4. Hidden Risk Alerts: Specific warnings
  5. Modification Playbook: Prioritized negotiation recommendations
  6. Export: Markdown (editable), PDF (shareable), JSON (API consumption)

Output: Complete extraction report.
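
For the JSON export, the report might be assembled along these lines. The field names, the `build_report` function, and the shape of each clause record are assumptions. The overall risk score is passed in through the metadata rather than computed, since its formula is not specified here. The disclaimer text is the one required under Security Requirements.

```python
import json
from collections import Counter
from typing import Dict, List

SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}
DISCLAIMER = (
    "⚠️ This is automated clause extraction for reference only. It is NOT legal "
    "advice. Consult a qualified lawyer before making contractual decisions."
)


def build_report(meta: Dict[str, str], clauses: List[dict], alerts: List[str]) -> dict:
    """Compile the analysis results into one JSON-serializable report."""
    counts = Counter(c["risk"] for c in clauses)
    ranked = sorted(clauses, key=lambda c: SEVERITY_ORDER[c["risk"]])
    return {
        "executive_summary": meta,
        "risk_summary": {
            "counts": {level: counts.get(level, 0) for level in SEVERITY_ORDER},
            # Top risks: highest severity first, low-risk clauses excluded, capped at 5.
            "top_risks": [c for c in ranked if c["risk"] != "low"][:5],
        },
        "clause_table": clauses,
        "hidden_risk_alerts": alerts,
        "disclaimer": DISCLAIMER,
    }


clauses = [
    {"category": "Payment", "risk": "medium", "summary": "Net-90 terms"},
    {"category": "Indemnity", "risk": "high", "summary": "Unlimited indemnity"},
    {"category": "Renewal", "risk": "low", "summary": "Standard"},
]
report = build_report({"contract_type": "Supply Agreement"}, clauses, ["No mutual confidentiality"])
print(json.dumps(report, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps Chinese text and emoji readable in the JSON output instead of escaping them.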

Sample Prompts

Prompt 1: Single Contract Quick Scan

User: "Quickly extract the key clauses from this contract and flag the risk points [upload: supply-agreement.pdf]"
Expected Output:

Executive Summary: Supply Agreement | Parties: Company A and Company B | Term: 1 year | Overall Risk: 🟡 Medium

Clause Extraction (18 clauses, 12 categories):
🔴 High Risk (2):
  - Indemnity: "甲方承担一切赔偿责任" (Party A bears all liability for compensation) - Unlimited indemnity, one-sided
  - Termination: "乙方可随时终止合同" (Party B may terminate the contract at any time) - Unilateral termination without cause

🟡 Medium Risk (5):
  - Payment: Net-90 terms, market standard is Net-30
  - Force Majeure: Overly broad definition includes "market conditions"

🟢 Low Risk (11): Standard commercial terms

⚠️ Hidden Risk Alert: No confidentiality clause for Party A (imbalanced)
Top 3 Modification Priorities: 1. Cap indemnity 2. Add mutual termination 3. Shorten payment to Net-30

Prompt 2: Multi-Contract Comparison

User: "Compare the key differences between these two contracts [upload: contract-v1.pdf, contract-v2.pdf]"
Expected Output: Side-by-side comparison table with 7 categories showing differences, highlighting where v2 is more/less favorable than v1, with a "verdict" column indicating which version is preferred per category.

Prompt 3: Hidden Risk Deep-Dive

User: "I'm afraid to sign this 30-page service agreement. Help me find any traps [upload: service-agreement.docx]"
Expected Output: Hidden risk report focused on 6 structural risk patterns, each with: the offending clause text, why it's problematic, and suggested alternative wording.

Prompt 4: Bilingual Extraction

User: "Extract the core clauses of this Chinese contract and translate them into English for our overseas legal team [upload: nda-zh.pdf]"
Expected Output: Bilingual table with Chinese original + English summary for key clauses. Glossary: 保密信息→Confidential Information, 接收方→Receiving Party, etc. Flag terms where translation may create ambiguity.

Prompt 5: Missing Clause Audit

User: "Check whether this contract is missing any clauses that a standard commercial contract should have [upload: vendor-contract.pdf]"
Expected Output: Checklist of 12 standard clause categories with ✓/✗ status. For missing categories, explain the risk of omission and suggest a model clause.

Prompt 6: Negotiation Prep

User: "I'm negotiating this contract with a supplier tomorrow. Help me prepare the negotiation points [upload: draft-contract.docx]"
Expected Output: Prioritized negotiation playbook: Tier 1 (non-negotiable risks → must fix), Tier 2 (market-standard adjustments → push for), Tier 3 (nice-to-have → concede gracefully), with talking points for each.

Real Task Examples

Example 1: Startup Vendor Contract

Scenario: Early-stage startup receives a 15-page SaaS vendor agreement. No in-house legal.
Input: Upload PDF of vendor contract.
Concern: "As a small company, could a big vendor's contract take advantage of us?"
Steps:

  1. Parse → 15 pages, 42 clauses, CN/EN bilingual.
  2. Classify → 11 of 12 categories covered; Acceptance Criteria is missing.
  3. Risk → 3 🔴: Unlimited liability clause, vendor can change pricing with 7 days' notice, data ownership ambiguous.
  4. Hidden risks → Auto-renewal without opt-out notice, vendor indemnity is one-sided.
  5. Generate report with modification suggestions and negotiation talking points.

Output: "⚠️ Key risk: the data ownership clause is ambiguous, so the vendor may be able to use your user data. Suggested revision: 'All Customer Data remains Customer's exclusive property.'"
Time: ~30 seconds.

Example 2: Employment Contract Check (Individual)

Scenario: Job seeker receives offer + employment contract. Wants to understand restrictions.
Input: "Review this employment contract for me, focusing on the non-compete and intellectual property clauses [upload: employment-contract.pdf]"
Steps:

  1. Focus: Non-compete, IP assignment, termination notice period.
  2. Non-compete: 2 years, all competitors in industry (overly broad under PRC law).
  3. IP: All IP assigned to company, including pre-existing IP (background IP; 🔴 risk).
  4. Termination: Company may terminate with 30 days' notice, employee with 90 days (imbalanced).

Output: "Non-compete: the scope is too broad; suggest limiting it to direct competitors. Intellectual property: ask to exclude IP that predates employment. Termination notice period: unequal; suggest 30 days for both parties."

Example 3: Lease Agreement Quick Check

Scenario: User about to sign a 24-month commercial lease.
Input: "This is an office lease. Help me extract the key information [upload: lease-agreement.pdf]"
Steps:

  1. Classify: Payment (rent + deposit), Termination (early exit penalty), Renewal, Maintenance obligations.
  2. Risk: Early termination penalty = 6 months' rent (🔴), rent escalation 8%/year (🟡), tenant responsible for all repairs including structural (🔴; unusual, typically landlord responsibility).
  3. Missing: Force majeure clause (risk during pandemic scenarios).

Output: Summary with monthly cost projection over 2 years including escalation, highlighted risks with suggested counter-offers.
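
The cost projection in this example is simple compounding arithmetic. The sketch below assumes a hypothetical starting rent of 10,000 per month (the example gives no figure) and assumes the 8% escalation applies once, at the start of month 13.

```python
from typing import List


def project_rent(monthly_rent: float, months: int, annual_escalation: float) -> List[float]:
    """Return the rent due in each month, escalating once per completed lease year."""
    schedule = []
    for month in range(1, months + 1):
        years_elapsed = (month - 1) // 12
        schedule.append(round(monthly_rent * (1 + annual_escalation) ** years_elapsed, 2))
    return schedule


schedule = project_rent(10_000, 24, 0.08)
print(schedule[0], schedule[12])  # 10000.0 10800.0
print(sum(schedule))              # 249600.0
```

Under these assumptions the 24-month total is 12 × 10,000 + 12 × 10,800 = 249,600. A six-month early-termination penalty at the year-two rate would add 64,800, which helps quantify the 🔴 flag.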

🚀 First-Success Path (3 Steps)

  1. Run contract-clause-extractor.sh classify contract.pdf to parse the contract and extract all clauses into 12 categories
  2. Run contract-clause-extractor.sh risk contract.pdf to annotate each clause with 🔴/🟡/🟢 risk levels
  3. Run contract-clause-extractor.sh summarize contract.pdf to see the structured extraction table with modification suggestions

Boundary Conditions

| Condition | Behavior |
|---|---|
| Contract >100 pages | Process in chunks; summarize by chapter, flag time estimate |
| Scanned/image PDF (no text layer) | Trigger OCR; warn of possible extraction errors |
| Password-protected PDF | Request password; never attempt to crack |
| Non-contract document uploaded | Detect and warn: "This does not appear to be a legal contract" |
| Contract in unsupported language | Attempt processing; flag lower confidence for non-CN/EN languages |
| Handwritten annotations in PDF | Flag as "may contain markings"; OCR may miss handwritten text |
| Corrupted/unreadable PDF | Error with suggested fixes (re-export, convert format) |
| Multiple unrelated contracts in one PDF | Auto-detect and offer to process separately |
| User asks for legal advice | Redirect: "This is clause extraction + risk flagging, not legal advice. Consult a qualified lawyer." |

Error Handling

| Error Code | Scenario | Handling |
|---|---|---|
| E-PARSE-FAIL | PDF structure cannot be parsed | Offer manual text input; suggest re-exporting PDF from source |
| E-OCR-FAIL | OCR on scanned document fails | Return images with note; suggest higher-quality scan |
| E-PASSWORD | Password-protected PDF without password | Prompt for password; never attempt brute-force |
| E-NO-CLAUSES | Document has no detectable clause structure | Process as paragraph-level; flag as "unstructured document" |
| E-UNSUPPORTED-FORMAT | Uploaded file is not PDF/DOCX/TXT | List supported formats; suggest conversion |
| E-AMBIGUOUS-CLASSIFICATION | Clause spans multiple categories | Tag with multiple categories; flag for human review |
| E-BILINGUAL-CONFIDENCE | Low confidence on legal term translation | Mark with ⚠️ "Translation may need legal review" |
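
In code, these error codes could be carried by a single exception type. The sketch below is hypothetical: the class names, the `check_format` helper, and the message format are assumptions. Only three of the codes are shown, with handling text copied from the table.

```python
from enum import Enum
from pathlib import Path


class ErrorCode(Enum):
    PARSE_FAIL = "E-PARSE-FAIL"
    PASSWORD = "E-PASSWORD"
    UNSUPPORTED_FORMAT = "E-UNSUPPORTED-FORMAT"


# Handling text taken from the error table above.
HANDLING = {
    ErrorCode.PARSE_FAIL: "Offer manual text input; suggest re-exporting PDF from source",
    ErrorCode.PASSWORD: "Prompt for password; never attempt brute-force",
    ErrorCode.UNSUPPORTED_FORMAT: "List supported formats; suggest conversion",
}

SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt"}


class ExtractionError(Exception):
    """Raised when a document cannot be processed; carries one of the error codes."""

    def __init__(self, code: ErrorCode, detail: str = ""):
        self.code = code
        self.detail = detail
        super().__init__(f"{code.value}: {detail}" if detail else code.value)


def check_format(filename: str) -> None:
    """Reject files whose extension is not PDF/DOCX/TXT."""
    if Path(filename).suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ExtractionError(ErrorCode.UNSUPPORTED_FORMAT, filename)


def user_message(err: ExtractionError) -> str:
    return f"[{err.code.value}] {HANDLING[err.code]}"


try:
    check_format("sheet.xlsx")
except ExtractionError as err:
    print(user_message(err))  # [E-UNSUPPORTED-FORMAT] List supported formats; suggest conversion
```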

Security Requirements

  • Document confidentiality: Contract contents processed locally; never sent to external services for storage. Session-only processing.
  • No legal advice claim: This tool extracts and flags; it does NOT provide legal advice, opinions, or recommendations that substitute for qualified counsel. Always include disclaimer.
  • Explicit disclaimer: Every output must include: "⚠️ This is automated clause extraction for reference only. It is NOT legal advice. Consult a qualified lawyer before making contractual decisions."
  • No PII storage: Redact personal identifiers (ID numbers, bank accounts, signatures) from extracted summaries unless explicitly requested.
  • Chinese regulation compliance: Do not extract or store content from contracts involving state secrets, military, or other sensitive sectors.
  • Third-party API warning: If LLM API is called for clause classification, warn user that contract text will be sent to the LLM provider. Offer local-only mode for sensitive contracts.
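
The PII redaction requirement could be approximated with pattern matching for the numeric identifiers. This sketch rests on assumptions: it treats any standalone 18-character sequence (17 digits plus a digit or X) as a PRC resident ID number and any other standalone run of 16 to 19 digits as a bank account number. Signatures are images and are out of scope for a text-level pass like this one.

```python
import re

# 17 digits followed by a digit or X, not embedded in a longer digit run.
_ID_NUMBER = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
# Assumed bank account/card length of 16 to 19 digits.
_BANK_ACCOUNT = re.compile(r"(?<!\d)\d{16,19}(?!\d)")


def redact(text: str) -> str:
    """Replace ID-like and account-like numbers with placeholders."""
    # IDs first: an 18-digit string would otherwise match the account pattern too.
    text = _ID_NUMBER.sub("[REDACTED-ID]", text)
    return _BANK_ACCOUNT.sub("[REDACTED-ACCOUNT]", text)


print(redact("ID 110101199003071234, account 6222021234567890123"))
# ID [REDACTED-ID], account [REDACTED-ACCOUNT]
```

An 18-digit account number is labeled as an ID under these rules. It is still redacted, but the label should not be relied on to tell the two apart.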