academic-pdf-redaction

v0.1.0

Redact text from PDF documents for blind review anonymization

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for lnj22/paper-anonymizer-academic-pdf-redaction.

Install the skill "academic-pdf-redaction" (lnj22/paper-anonymizer-academic-pdf-redaction) from ClawHub.
Skill page: https://clawhub.ai/lnj22/paper-anonymizer-academic-pdf-redaction
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install paper-anonymizer-academic-pdf-redaction

ClawHub CLI

npx clawhub@latest install paper-anonymizer-academic-pdf-redaction

Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description (redact PDFs for blind review) matches the instructions: they show how to find patterns in PDF text and apply redactions. No unrelated credentials, binaries, or config paths are requested.
Instruction Scope
Instructions remain narrowly scoped to opening, searching, redacting, and verifying PDFs locally. Issues: (1) The top-level rule asks to 'Check that 80%+ of original text remains' but the included verify_redaction() raises only if retention < 70% (inconsistent thresholds). (2) Pattern examples (e.g., '*@*.edu') are presented as wildcard patterns, but the sample PyMuPDF code uses page.search_for(pattern) which matches literal text, not shell-style wildcards—this could cause incorrect behavior if copy/pasted. (3) The heuristic for locating the References page (first page containing 'references') can misidentify sections. These are functional correctness concerns, not scope creep or exfiltration.
Install Mechanism
No install spec and no code files — the skill is instruction-only. It recommends using PyMuPDF but does not attempt to download or execute anything itself, minimizing install risk.
Credentials
The skill requests no environment variables, credentials, or config paths. All operations are local file I/O consistent with its purpose.
Persistence & Privilege
always:false and default invocation settings are used. The skill does not request persistent privileges or modify other skills or system configs.
Assessment
This is a coherent, instruction-only guide for redacting PDFs locally. Before using it: (1) Test on copies of documents — the verify function and thresholds are inconsistent (80% vs 70%) and the 1000-character check may be inappropriate for short papers. (2) Update pattern handling to use proper regular expressions when needed (PyMuPDF's page.search_for() expects literal text; wildcard patterns like '*@*.edu' won't work as written). (3) Be careful with the 'References' detection heuristic (it can misidentify pages). (4) Install PyMuPDF (fitz) from the official package index if you intend to run the code, and review dependency sources. There are no hidden network calls or credential requests in the instructions, but treat this as guidance (not packaged code) and review any code you run locally.


MIT-0

PDF Redaction for Blind Review

Redact identifying information from academic papers for blind review.

CRITICAL RULES

  1. PRESERVE References section - Self-citations MUST remain intact
  2. ONLY redact specific text matches - Never redact entire pages/regions
  3. VERIFY output - Check that 80%+ of original text remains

Common Pitfalls to AVOID

# ❌ WRONG - This removes ALL text from the page:
for block in page.get_text("blocks"):
    page.add_redact_annot(fitz.Rect(block[:4]))

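# ❌ WRONG - Drawing rectangles over text (the text underneath stays extractable):
page.draw_rect(fitz.Rect(0, 0, 600, 100), fill=(0,0,0))

# ✅ CORRECT - Only redact specific search matches:
for rect in page.search_for("John Smith"):
    page.add_redact_annot(rect)
page.apply_redactions()  # annotations remove nothing until they are applied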

Patterns to Redact (Before References Only)

IMPORTANT: Use FULL names/phrases, not partial matches!

  • ✅ "John Smith" (full name)
  • ❌ "Smith" (partial - would incorrectly match "Smith et al." citations in References)
  1. Author names - FULL names only (e.g., "John Smith", not just "Smith")
  2. Affiliations - Universities, companies (e.g., "Duke University")
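  3. Email addresses - Pattern: *@*.edu, *@*.com (shape only: page.search_for() matches literal text, so pass each actual address)
  4. Venue names - Conference/workshop names (e.g., "ICML 2024", "ICML Workshop")
  5. arXiv identifiers - Pattern: arXiv:XXXX.XXXXX (shape only; pass the actual identifier)
  6. DOIs - Pattern: 10.XXXX/... (shape only; pass the actual DOI)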
  7. Acknowledgement names - Names in "Acknowledgements" section
  8. Equal contribution footnotes - e.g., "Equal contribution", "* Equal contribution"
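
Because page.search_for() takes literal text, the email, arXiv, and DOI shapes in the list have to be turned into concrete strings first. The sketch below is one way to do that with regular expressions over a page's extracted text. The regexes and the names SHAPE_REGEXES and literal_matches are illustrative assumptions, not part of the skill, and the patterns will likely need tuning for a given paper.

```python
import re

# Hypothetical regexes for the shapes listed above; adjust them to your papers.
SHAPE_REGEXES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "arxiv": re.compile(r"arXiv:\d{4}\.\d{4,5}(?:v\d+)?"),
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
}


def literal_matches(page_text: str) -> list[str]:
    """Return the distinct literal strings in page_text that match any shape."""
    found = []
    for regex in SHAPE_REGEXES.values():
        for match in regex.finditer(page_text):
            # Drop sentence punctuation that the regex may have swallowed
            literal = match.group(0).rstrip(".,;)")
            if literal not in found:
                found.append(literal)
    return found


sample = "Contact: jane.doe@duke.edu. Preprint arXiv:2401.01234v2, DOI 10.1234/abcd.5678."
print(literal_matches(sample))
```

Each returned string can then be appended to the patterns list for that page, so the redaction itself still goes through exact search matches only.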

PyMuPDF (fitz) - Recommended Approach

import os

import fitz  # PyMuPDF

def redact_with_pymupdf(input_path: str, output_path: str, patterns: list[str]):
    """Redact specific patterns from PDF using PyMuPDF."""
    doc = fitz.open(input_path)

    # Find References page - stop redacting there.
    # Heuristic: the first page whose text contains "references". An earlier
    # in-text mention of the word ends redaction too soon, so check the result.
    references_page = None
    for i, page in enumerate(doc):
        if "references" in page.get_text().lower():
            references_page = i
            break

    for page_num, page in enumerate(doc):
        if references_page is not None and page_num >= references_page:
            break  # References section and everything after it stay untouched

        for pattern in patterns:
            # ONLY redact exact search matches (literal text, no wildcards or regex)
            for rect in page.search_for(pattern):
                page.add_redact_annot(rect, fill=(0, 0, 0))
        page.apply_redactions()

    # dirname is "" for a bare filename, and os.makedirs("") raises
    out_dir = os.path.dirname(output_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    doc.save(output_path)
    doc.close()

    # MUST verify after saving
    verify_redaction(input_path, output_path)
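
The "first page containing references" lookup can trigger on an ordinary sentence that merely mentions references. A stricter variant, sketched below, only accepts a line that looks like a section heading. The function name find_references_page and the accepted heading forms ("References", "7. References", "Bibliography") are assumptions for illustration; papers with unusual headings will need a different pattern.

```python
import re
from typing import Optional

# Assumed heading forms: "References", "REFERENCES", "7. References", "Bibliography"
HEADING = re.compile(r"^\s*(?:\d+\.?\s+)?(?:references|bibliography)\s*$", re.IGNORECASE)


def find_references_page(page_texts: list[str]) -> Optional[int]:
    """Return the index of the first page with a standalone References heading."""
    for index, text in enumerate(page_texts):
        # Only a line consisting of the heading alone counts as a hit
        if any(HEADING.match(line) for line in text.splitlines()):
            return index
    return None


pages = [
    "Introduction\nSee the references for prior work.",
    "5 Conclusion\nWe presented a method.",
    "References\n[1] Smith et al. 2020.",
]
print(find_references_page(pages))
```

With PyMuPDF the input would be something like [page.get_text() for page in doc]. Taking the first heading match keeps the behaviour conservative: when in doubt, less of the document is redacted and the References section is left alone.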

REQUIRED: Verification Function

Always run this after ANY redaction to catch errors early:

import fitz  # PyMuPDF

def verify_redaction(original_path, output_path):
    """Verify redaction didn't corrupt the PDF."""
    orig = fitz.open(original_path)
    redc = fitz.open(output_path)
    try:
        orig_pages, redc_pages = len(orig), len(redc)
        orig_len = sum(len(p.get_text()) for p in orig)
        redc_len = sum(len(p.get_text()) for p in redc)
    finally:
        # Close both files even if text extraction fails
        orig.close()
        redc.close()

    if orig_len == 0:
        raise ValueError("Original has no extractable text; retention cannot be computed")
    retained = redc_len / orig_len

    print(f"Original: {orig_pages} pages, {orig_len} chars")
    print(f"Redacted: {redc_pages} pages, {redc_len} chars")
    print(f"Retained: {retained:.1%}")

    # DEFENSIVE CHECKS - fail fast if something went wrong
    if redc_pages != orig_pages:
        raise ValueError(f"Page count changed: {orig_pages} -> {redc_pages}")
    if redc_len < 1000:
        raise ValueError(f"PDF corrupted: only {redc_len} chars remain!")
    if retained < 0.7:
        raise ValueError(f"Too much removed: kept only {retained:.0%}")
    if retained < 0.8:
        # Rule 3 asks for 80%+; 70-80% does not raise but deserves a manual look
        print(f"Warning: kept {retained:.0%}, below the 80% target")

    print("✓ Verification passed")
