Vector Text Fixer

Fix garbled text in PDF/SVG vector graphics for final editing in AI. Detect, replace and repair garbled text in vector graphic files while maintaining original formatting and layout.

Install

openclaw skills install vector-text-fixer

Vector Text Fixer

Fixes garbled text in PDF/SVG vector graphics to make them editable in AI tools.

Features

  • Garbled Text Detection: Automatically identifies garbled text in PDF/SVG files
  • Smart Repair: Infers the original text from the surrounding context
  • Batch Processing: Processes every file in a folder in one run
  • Format Preservation: Repaired files keep their original vector format and layout
  • AI-assisted Editing: Outputs an intermediate format that can be imported into AI editors

Supported Scenarios

1. PDF Garbled Text Repair

  • Box/question mark issues caused by font embedding problems
  • Garbled text caused by encoding conversion errors
  • Abnormal characters generated by missing font substitution
  • Multi-language mixed encoding issues
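
As an illustration of the "encoding conversion errors" case, here is a minimal sketch of reversing one common mistake: UTF-8 bytes that were decoded as Latin-1. The function name repair_mojibake and the default codec pair are assumptions made for this example, not part of the tool's documented API.

```python
def repair_mojibake(text: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Undo a wrong-codec decode: re-encode with the wrong codec, then decode correctly.

    Hypothetical helper for illustration only.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not reversible with this codec pair; leave it untouched.
        return text


# Simulate the damage: UTF-8 bytes read as Latin-1.
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)                   # cafÃ©
print(repair_mojibake(garbled))  # café
```

The round trip only works when no bytes were lost. Text that already contains U+FFFD has lost information and cannot be recovered this way.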

2. SVG Garbled Text Repair

  • Text entity encoding errors
  • Special character escaping issues
  • Display abnormalities caused by invalid font references
  • XML encoding declaration errors
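
The "special character escaping" case often shows up as double-escaped entities such as &amp;amp;. The sketch below unescapes repeatedly until the text stops changing. The helper name unescape_fully and the round limit are assumptions for this example. Note that html.unescape recognises HTML5 entity names, which is a superset of the five predefined XML entities.

```python
import html


def unescape_fully(text: str, max_rounds: int = 3) -> str:
    """Decode entities repeatedly until the text is stable (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


print(unescape_fully("R&amp;amp;D &amp;#169; 2024"))  # R&D © 2024
```

Text that is meant to display a literal entity (for example documentation about "&amp;") would be over-decoded, so a real repair step would need a way to opt out.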

Usage

Command Line

# Fix a single PDF file
python scripts/main.py --input document.pdf --output fixed.pdf

# Fix a single SVG file
python scripts/main.py --input diagram.svg --output fixed.svg

# Batch process folder
python scripts/main.py --batch ./input_folder --output ./output_folder

# Interactive repair (manually specify replacement content)
python scripts/main.py --input doc.pdf --interactive

# Export as editable format (JSON)
python scripts/main.py --input doc.pdf --export-json editable.json

Python API

from scripts.main import VectorTextFixer

# Create fixer instance
fixer = VectorTextFixer()

# Fix PDF
result = fixer.fix_pdf("input.pdf", "output.pdf")

# Fix SVG
result = fixer.fix_svg("input.svg", "output.svg")

# Batch processing
results = fixer.batch_fix("./input_folder", "./output_folder")

# Get text map (for AI editing)
text_map = fixer.extract_text_map("input.pdf")

Input Parameters

Parameter            Type  Required  Description
--input              str   Yes*      Input file path (PDF or SVG)
--batch              str   No        Batch processing input folder
--output             str   Yes*      Output file/folder path
--interactive        bool  No        Enable interactive repair mode
--export-json        str   No        Export editable JSON format
--encoding           str   No        Specify source file encoding (default: auto-detect)
--font-substitution  dict  No        Font replacement mapping
--repair-level       str   No        Repair level: minimal, standard, aggressive (default: standard)

* At least one of --input or --batch is required. The command-line examples above omit --output when --interactive or --export-json is used.
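
The parameter table could be wired up with argparse roughly as follows. This is a sketch under assumptions, not the contents of scripts/main.py: the function name parse_args is hypothetical, --encoding uses None to stand for auto-detect, and --font-substitution is left out because the source does not say how a dict is passed on the command line.

```python
import argparse


def parse_args(argv=None):
    """Parse the documented flags (hypothetical wiring, not the tool's actual code)."""
    parser = argparse.ArgumentParser(description="Fix garbled text in PDF/SVG files")
    parser.add_argument("--input", help="Input file path (PDF or SVG)")
    parser.add_argument("--batch", help="Batch processing input folder")
    parser.add_argument("--output", help="Output file/folder path")
    parser.add_argument("--interactive", action="store_true",
                        help="Enable interactive repair mode")
    parser.add_argument("--export-json", help="Export editable JSON format")
    parser.add_argument("--encoding", default=None,
                        help="Source file encoding (None means auto-detect here)")
    parser.add_argument("--repair-level", default="standard",
                        choices=["minimal", "standard", "aggressive"])
    args = parser.parse_args(argv)
    # The table's footnote: at least one of --input and --batch must be given.
    if not (args.input or args.batch):
        parser.error("at least one of --input and --batch is required")
    return args


args = parse_args(["--input", "doc.pdf", "--output", "fixed.pdf"])
print(args.input, args.repair_level)  # doc.pdf standard
```

A plain check after parsing is used instead of a mutually exclusive group because the footnote says "at least one", which allows both flags together.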

Output Format

Repaired PDF/SVG

  • Maintains original vector format
  • Garbled text replaced with readable content
  • Fonts and layout remain unchanged

JSON Export Format

{
  "file_type": "pdf",
  "pages": [
    {
      "page_num": 1,
      "text_blocks": [
        {
          "id": "tb_001",
          "bbox": [100, 200, 300, 220],
          "original_text": "�����",
          "detected_encoding": "UTF-8",
          "confidence": 0.3,
          "suggested_fix": "Sample Text"
        }
      ]
    }
  ],
  "fonts_used": ["Arial", "SimSun"],
  "repair_summary": {
    "total_blocks": 15,
    "fixed_blocks": 12,
    "skipped_blocks": 3
  }
}
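
A consumer of this export might start by listing the blocks that deserve a manual look. The sketch below assumes that a low "confidence" value means the block needs review, which the source does not state explicitly. The function name blocks_needing_review and the 0.5 threshold are likewise hypothetical.

```python
def blocks_needing_review(export: dict, threshold: float = 0.5) -> list:
    """Return (page_num, block id, suggested_fix) for low-confidence blocks."""
    flagged = []
    for page in export.get("pages", []):
        for block in page.get("text_blocks", []):
            # Assumption: a missing confidence is treated as 0.0 (needs review).
            if block.get("confidence", 0.0) < threshold:
                flagged.append(
                    (page["page_num"], block["id"], block.get("suggested_fix"))
                )
    return flagged


# Same shape as the JSON export shown above, trimmed to the fields used here.
sample = {
    "file_type": "pdf",
    "pages": [
        {
            "page_num": 1,
            "text_blocks": [
                {"id": "tb_001", "confidence": 0.3, "suggested_fix": "Sample Text"},
                {"id": "tb_002", "confidence": 0.9, "suggested_fix": "Title"},
            ],
        }
    ],
}
print(blocks_needing_review(sample))  # [(1, 'tb_001', 'Sample Text')]
```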

Garbled Text Detection Rules

The tool uses the following rules to detect garbled text:

  1. Replacement Character Detection: Identifies U+FFFD (�) and box characters
  2. Control Character Filtering: Excludes non-printing control characters
  3. Encoding Consistency: Detects anomalies caused by mixed encodings
  4. Font Fallback Detection: Identifies substitution characters generated due to missing fonts
  5. Probability Model: Estimates the probability that text is garbled from its character frequencies
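
A few of these rules can be approximated with the standard library alone. The following is a rough sketch, not the tool's detector: the names garbled_score and is_garbled, the set of box characters, and the 0.3 threshold are all assumptions. It covers rule 1, reads rule 2 as "ignore control characters when scoring", and treats unassigned and private-use code points as a stand-in for rule 4.

```python
import unicodedata

REPLACEMENT = "\ufffd"
# Assumed set of "box" glyphs: white square, white vertical rectangle, ballot box.
BOX_CHARS = {"\u25a1", "\u25af", "\u2610"}


def _is_suspicious(ch: str) -> bool:
    # Cn = unassigned, Co = private use; both often indicate a missing glyph mapping.
    return (
        ch == REPLACEMENT
        or ch in BOX_CHARS
        or unicodedata.category(ch) in ("Cn", "Co")
    )


def garbled_score(text: str) -> float:
    """Fraction of non-control characters that look suspicious (0.0 to 1.0)."""
    visible = [ch for ch in text if unicodedata.category(ch) != "Cc"]
    if not visible:
        return 0.0
    return sum(_is_suspicious(ch) for ch in visible) / len(visible)


def is_garbled(text: str, threshold: float = 0.3) -> bool:
    return garbled_score(text) >= threshold


print(garbled_score("\ufffd" * 5))    # 1.0
print(garbled_score("Sample Text"))   # 0.0
```

Encoding consistency and the frequency-based probability model are not attempted here; they would need language-specific reference data.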

Repair Strategies

Minimal

  • Only repairs obvious errors (replacement characters, null bytes)
  • Maintains maximum integrity of original text
  • Suitable for minor garbled text issues

Standard

  • Repairs common encoding issues
  • Smart font replacement
  • Balances repair rate and accuracy

Aggressive

  • Comprehensive text re-encoding
  • Uses OCR-assisted recognition
  • Suitable for severely garbled documents

Examples

Fix Single Page PDF

Input:

python scripts/main.py --input report.pdf --output fixed_report.pdf

Output:

✓ Processing: report.pdf
✓ Detected 5 garbled text blocks
✓ Fixed 4 blocks automatically
⚠ 1 block requires manual review
✓ Output saved: fixed_report.pdf
✓ Report saved: fixed_report_repair_log.json

Export Editable JSON

Input:

python scripts/main.py --input diagram.svg --export-json editable.json

Output JSON Structure:

{
  "file_type": "svg",
  "svg_info": {
    "width": 800,
    "height": 600,
    "viewBox": "0 0 800 600"
  },
  "text_elements": [
    {
      "id": "text_1",
      "x": 100,
      "y": 200,
      "font_family": "Arial",
      "font_size": 14,
      "original": "�����",
      "user_editable": "",
      "confidence": 0.25
    }
  ]
}
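
To show how entries like those in "text_elements" could be collected, here is a sketch that walks an SVG with the standard library's ElementTree. It is not the tool's extract_text_map. The function name extract_text_elements is hypothetical, and it assumes x and y are plain numbers (no units, no coordinate lists).

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def extract_text_elements(svg_source: str) -> list:
    """Collect <text> elements into dicts shaped like the JSON example above."""
    root = ET.fromstring(svg_source)
    elements = []
    for index, node in enumerate(root.iter(f"{SVG_NS}text"), start=1):
        elements.append({
            "id": node.get("id", f"text_{index}"),
            "x": float(node.get("x", 0)),
            "y": float(node.get("y", 0)),
            "font_family": node.get("font-family"),
            # itertext() also picks up text inside nested <tspan> children.
            "original": "".join(node.itertext()),
            "user_editable": "",
        })
    return elements


svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="800" height="600">'
    '<text id="text_1" x="100" y="200" font-family="Arial">\ufffd\ufffd</text>'
    "</svg>"
)
found = extract_text_elements(svg)
print(found[0]["id"], found[0]["x"], found[0]["y"])  # text_1 100.0 200.0
```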

Dependencies

pdfplumber>=0.10.0      # PDF parsing
PyMuPDF>=1.23.0         # PDF processing (fitz)
cairosvg>=2.7.0         # SVG conversion
beautifulsoup4>=4.12.0  # SVG parsing
fonttools>=4.40.0       # Font processing
chardet>=5.0.0          # Encoding detection
Pillow>=10.0.0          # Image processing

Limitations

  • Encrypted PDFs require password unlock before processing
  • Severely damaged vector files may not be fully repairable
  • Some rare fonts may not map correctly
  • Scanned PDFs require OCR recognition first

Version Information

  • Version: 1.0.0
  • Last Updated: 2026-02-06
  • Status: Ready for use

Risk Assessment

Risk Indicator         Assessment                             Level
Code Execution         Python/R scripts executed locally      Medium
Network Access         No external API calls                  Low
File System Access     Read input files, write output files   Medium
Instruction Tampering  Standard prompt guidelines             Low
Data Exposure          Output files saved to workspace        Low

Security Checklist

  • No hardcoded credentials or API keys
  • No unauthorized file system access (../)
  • Output does not expose sensitive information
  • Prompt injection protections in place
  • Input file paths validated (no ../ traversal)
  • Output directory restricted to workspace
  • Script execution in sandboxed environment
  • Error messages sanitized (no stack traces exposed)
  • Dependencies audited
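
The path-related items in this checklist (no ../ traversal, output restricted to the workspace) could be enforced with a check along these lines. This is a sketch under assumptions: the function name resolve_inside is hypothetical, and the source does not describe how the tool actually validates paths.

```python
from pathlib import Path


def resolve_inside(workspace: str, user_path: str) -> Path:
    """Resolve user_path against workspace and reject anything that escapes it."""
    base = Path(workspace).resolve()
    # Joining with an absolute user_path discards base, so that case is caught too.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate


print(resolve_inside(".", "output/fixed.pdf").name)  # fixed.pdf
```

Resolving before comparing matters: a check on the raw string would miss paths such as "out/../../secret.pdf" and symlinks that point outside the workspace.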

Prerequisites

# Python dependencies
pip install -r requirements.txt

Evaluation Criteria

Success Metrics

  • Successfully executes main functionality
  • Output meets quality standards
  • Handles edge cases gracefully
  • Performance is acceptable

Test Cases

  1. Basic Functionality: Standard input → Expected output
  2. Edge Case: Invalid input → Graceful error handling
  3. Performance: Large dataset → Acceptable processing time

Lifecycle Status

  • Current Stage: Draft
  • Next Review Date: 2026-03-06
  • Known Issues: None
  • Planned Improvements:
    • Performance optimization
    • Additional feature support