PDF Contract Redactor
PDF contract redaction tool. Use when the user needs to redact sensitive information from scanned PDF contracts. The tool performs OCR to extract text, ident...
MIT-0 · Free to use, modify, and redistribute. No attribution required.
by chan @chayjan
Security Scan
OpenClaw
Benign
high confidence
Purpose & Capability
The name/description match the included script and SKILL.md: the code converts PDF pages to images, calls Alibaba Cloud OCR, matches field names to nearby values, and draws black rectangles over value areas. Requiring Alibaba OCR credentials (provided at runtime) is coherent with the stated purpose.
Instruction Scope
SKILL.md instructions stay within the redaction task and the script follows them. Minor mismatches: SKILL.md says it will 'retry with exponential backoff' on API failures but the script's AliyunOCRClient simply catches exceptions and returns an empty list (no backoff). SKILL.md demonstrates passing credentials as CLI args (and the script expects them) — functionally OK but a security practice concern because command-line args are visible in process lists/shell history.
Install Mechanism
No install spec; the skill is instruction+script only. Declared Python dependencies (pymupdf, pillow, requests) are appropriate and proportional to the task and are standard packages from PyPI. Nothing is downloaded from arbitrary URLs or written to unexpected locations.
Credentials
The only sensitive credentials used are Alibaba AccessKey ID and Secret, which the script legitimately needs to call the OCR API. The registry metadata does not declare env vars but the SKILL.md and script expect the credentials as CLI args — this is coherent but risky (exposes secrets in ps/command history). The script does not require or attempt to read unrelated credentials or system config.
Persistence & Privilege
The skill is not always-enabled, does not modify other skills or system configurations, and writes only local output files (<name>_redacted.pdf and <name>_fields.json). It does not request persistent elevated privileges.
Assessment
This skill appears to do what it says, but consider these practical cautions before using it:
- Credentials: The script expects your Alibaba AccessKey ID and Secret as command-line arguments; passing secrets on the command line can expose them via process listings and shell history. Prefer modifying the script to read credentials from a protected environment variable, a config file with restricted permissions, or a secrets manager.
- Redaction effectiveness: The tool overlays black rectangles on the original PDF pages. If the original PDF contains underlying selectable/searchable text or metadata, that underlying text may remain accessible even after the visual overlay. Verify redaction by attempting to select/copy text from the redacted PDF and consider flattening the PDF or exporting a rasterized final PDF to ensure irreversible removal.
- Error handling: SKILL.md mentions exponential backoff for OCR failures, but the script does not retry. If an API call fails, the text for that page may be silently dropped. Test with non-sensitive documents first.
- Privacy: The images are sent to Alibaba's OCR endpoint; only use with documents you are allowed to upload to that external service. If documents are highly sensitive, consider an offline OCR alternative.
- Validation: Run the tool on sample contracts and confirm that the fields you need are matched and redacted correctly (edge cases with layout/coordinates may cause false negatives/positives).
If you need stronger guarantees (no residual text, no external network calls), either modify the tool to use a local OCR engine and to flatten outputs, or withhold highly sensitive documents from being processed by cloud OCR services.
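The credential caution above can be addressed without touching the OCR logic. The following is a minimal sketch, not part of the published script: the environment variable names `ALIYUN_ACCESS_KEY_ID` and `ALIYUN_ACCESS_KEY_SECRET` and the helper `load_credentials` are hypothetical.

```python
import os


def load_credentials(env=None):
    """Return (access_key_id, access_key_secret) read from the environment.

    The variable names are hypothetical; the published script only accepts
    the credentials as command-line arguments.
    """
    env = os.environ if env is None else env
    key_id = env.get("ALIYUN_ACCESS_KEY_ID", "").strip()
    key_secret = env.get("ALIYUN_ACCESS_KEY_SECRET", "").strip()
    if not key_id or not key_secret:
        raise RuntimeError(
            "Set ALIYUN_ACCESS_KEY_ID and ALIYUN_ACCESS_KEY_SECRET "
            "instead of passing secrets on the command line."
        )
    return key_id, key_secret
```

Reading the values this way keeps them out of process listings and shell history, which is the exposure the scan flags.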
Current version: v1.0.0
SKILL.md
PDF Contract Redactor
Redact sensitive values from scanned PDF contracts while preserving field names.
What It Does
- OCR Recognition: Uses Alibaba Cloud OCR to extract text and positions from scanned PDFs
- Field-Value Matching: Finds field names (e.g., "合同金额", contract amount) and their corresponding values (e.g., "45640元", 45,640 yuan)
- Selective Redaction: Covers only the values with black boxes, keeping field names readable
Workflow
Step 1: PDF to Images
Convert PDF pages to high-resolution PNG images (200 DPI) for OCR.
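A minimal sketch of this step using PyMuPDF, which is among the declared dependencies. The function names `zoom_for_dpi` and `render_pages` are illustrative and not taken from the script. PDF page units are 72 per inch, so a 200 DPI render corresponds to a zoom factor of 200 / 72.

```python
def zoom_for_dpi(dpi):
    """PDF user space is 72 units per inch, so rendering at `dpi` scales by dpi / 72."""
    return dpi / 72.0


def render_pages(pdf_path, dpi=200):
    """Yield (page_index, png_bytes) for every page of the PDF."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            # Render the page to an RGB raster at the requested resolution.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
            yield index, pix.tobytes("png")
```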
Step 2: OCR with Alibaba Cloud
Call Alibaba Cloud OCR API to get:
- All text blocks
- Bounding box coordinates for each text block
- Confidence scores
Step 3: Match Fields to Values
For each field in the field list:
- Find the field name text block
- Look for the corresponding value in:
  - Right side: Same row, to the right of field name
  - Below: Next row, aligned with field name
- Record field-value pair with both bounding boxes
Step 4: Generate Redacted PDF
For each matched value:
- Convert image coordinates to PDF coordinates
- Draw black rectangle over the value area
- Keep field name area unchanged
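The drawing part of this step might look like the sketch below, assuming the value rectangles have already been converted to PDF points and the pages are unrotated. The function name `cover_values` and the `rects_by_page` structure are hypothetical.

```python
def cover_values(pdf_path, out_path, rects_by_page):
    """Draw filled black rectangles over value areas and save a new PDF.

    rects_by_page maps a 0-based page index to a list of (x0, y0, x1, y1)
    rectangles already expressed in PDF points.
    """
    import fitz  # PyMuPDF

    black = (0, 0, 0)
    with fitz.open(pdf_path) as doc:
        for page_index, rects in rects_by_page.items():
            page = doc[page_index]
            for rect in rects:
                # Stroke and fill in black so the value is fully covered.
                page.draw_rect(fitz.Rect(*rect), color=black, fill=black)
        doc.save(out_path)
```

A drawn rectangle only covers what lies beneath it. Where the content must actually be removed, PyMuPDF's redaction annotations (`page.add_redact_annot` followed by `page.apply_redactions`) are the library's mechanism for that.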
Field List
The following field names are searched for as literal Chinese strings in the OCR text, and their values are redacted:
- 法务部归档编号, 归档时间, 申请人工号, 申请人姓名, 申请人部门
- 申请人部门负责人, 所涉项目名称(如有), 所涉项目编号(如有)
- 对方编号(如有), 合同编号, 合同名称, 合同甲方名称, 合同乙方名称
- 合同相对方, 相对方所属行业, 相对方是否为世界500强
- 相对方是央企/国企, 相对方是否为涉密单位, 业务类别, 合同类别
- 合同类型, 合同状态, 扫描件状态, 对方是否签章, 我方是否签章
- 销售、采购标的(非一起译填), 语种, 单价, 合同金额(元), 币种
- 支付/收款方式, 付款/收款条件, 合同结算周期, 是否使用公司模板
- 用章主体, 印章类型, 签订时间, 合同开始时间, 合同到期时间
- 收支类型, 我方联系人姓名, 我方联系人电话, 对方联系人姓名
- 对方联系人电话, 对方邮寄地址, 归档状态, 开票名称, 开票账号
- 开票银行, 收款名称, 收款账号, 收款银行, 验收时间, 验收标准
- 合同是否自动续期, 合同续期时间, 合同特殊约定
- 协议内是否有结算单, 结算单(如有)内容是否填写
Usage
Prerequisites
- Alibaba Cloud account with OCR service enabled
- AccessKey ID and AccessKey Secret
Running the Tool
python scripts/redact_contract.py <input.pdf> <access_key_id> <access_key_secret> [output.pdf]
Example:
python scripts/redact_contract.py contract.pdf LTAIxxx xxx contract_redacted.pdf
Output
- <name>_redacted.pdf: Redacted PDF with values covered
- <name>_fields.json: JSON file listing all matched field-value pairs
Implementation Notes
OCR API
Uses Alibaba Cloud "通用文字识别-高精度版" (General Text Recognition, High-Precision Edition; RecognizeAdvanced API):
- Endpoint: https://ocr.aliyuncs.com
- Returns text content and quadrilateral coordinates
- Supports automatic rotation detection
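Because the API returns quadrilaterals while the matching rules work on axis-aligned boxes, a conversion step is needed somewhere. A minimal sketch, assuming each quadrilateral arrives as four `(x, y)` corner points; the exact response field names are not shown here, so `quad_to_bbox` and its input shape are assumptions:

```python
def quad_to_bbox(quad):
    """Collapse four (x, y) corner points into an axis-aligned (x0, y0, x1, y1) box."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return (min(xs), min(ys), max(xs), max(ys))


# A slightly skewed text block, corners listed clockwise from top-left.
skewed = [(100, 52), (300, 48), (301, 78), (101, 82)]
print(quad_to_bbox(skewed))  # (100, 48, 301, 82)
```

Taking the min and max over all corners yields a box that fully contains a skewed block, which errs on the side of covering slightly more than the text itself.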
Field-Value Matching Logic
# For a field at (fx0, fy0, fx1, fy1)
# Look for values that are:
# 1. To the right: vx0 > fx1 and |vy0 - fy0| < field_height * 2
# 2. Below: vy0 > fy1 and vx0 >= fx0 - field_width * 0.3
# Choose the closest match
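The rules above can be sketched as a small function. "Closest" is not defined in the notes, so this sketch assumes a simple gap-plus-offset distance; `find_value` and that distance measure are illustrative rather than the script's actual implementation.

```python
def find_value(field_box, candidates):
    """Return the candidate box closest to the field, or None if none qualifies.

    Boxes are (x0, y0, x1, y1) in image pixels with y growing downward.
    """
    fx0, fy0, fx1, fy1 = field_box
    width, height = fx1 - fx0, fy1 - fy0
    best, best_dist = None, float("inf")
    for box in candidates:
        vx0, vy0 = box[0], box[1]
        dists = []
        # Rule 1: to the right of the field, on roughly the same row.
        if vx0 > fx1 and abs(vy0 - fy0) < height * 2:
            dists.append((vx0 - fx1) + abs(vy0 - fy0))
        # Rule 2: below the field, not starting far to its left.
        if vy0 > fy1 and vx0 >= fx0 - width * 0.3:
            dists.append((vy0 - fy1) + abs(vx0 - fx0))
        if dists and min(dists) < best_dist:
            best, best_dist = box, min(dists)
    return best


field = (100, 100, 220, 130)
right = (240, 102, 360, 132)   # same row, 20 px to the right
below = (100, 160, 260, 190)   # next row, left-aligned
print(find_value(field, [below, right]))  # (240, 102, 360, 132)
```

In the example, the block to the right scores 22 and the block below scores 30, so the right-hand block is chosen.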
Coordinate Transformation
OCR returns coordinates in image space (200 DPI).
Convert to PDF space (72 DPI) using scale factor: scale = 72 / 200 = 0.36
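As a worked example of the scale factor, the sketch below converts one box from image pixels to PDF points. It assumes an unrotated page rendered in full with the origin at the top-left in both spaces; `image_box_to_pdf` is a hypothetical helper name.

```python
def image_box_to_pdf(box, dpi=200):
    """Scale an (x0, y0, x1, y1) box from image pixels to PDF points (72 per inch)."""
    scale = 72.0 / dpi  # 0.36 at 200 DPI
    return tuple(round(v * scale, 2) for v in box)


print(image_box_to_pdf((500, 1000, 900, 1050)))  # (180.0, 360.0, 324.0, 378.0)
```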
Dependencies
pip install pymupdf pillow requests
Error Handling
- If OCR API fails, retry with exponential backoff
- If field not found, skip silently (don't fail entire document)
- If value not found for a field, log warning and continue
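The retry behavior in the first bullet could be implemented along these lines. This is a generic sketch using only the standard library; the helper name `call_with_backoff`, the attempt count, and the delays are assumptions rather than values from the script.

```python
import random
import time


def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on exception wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter avoids synchronized retries.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Wrapping the OCR request in such a helper means a transient failure delays a page instead of silently dropping its text. The `sleep` parameter exists so the waits can be stubbed out in tests.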
