# PDF Libraries Reference This file provides quick reference for the Python libraries used in pdf-tools skill. ## pdfplumber **Installation:** `pip3 install pdfplumber` **Best for:** Text extraction, table extraction, page layout analysis **Key features:** - Extract text with precise positioning - Extract tables automatically - Get page dimensions and metadata - Access visual elements (lines, rectangles, curves) **Basic usage:** ```python import pdfplumber with pdfplumber.open('file.pdf') as pdf: # Get total pages total_pages = len(pdf.pages) # Extract text from a page page = pdf.pages[0] text = page.extract_text() # Extract tables tables = page.extract_tables() # Get page dimensions width = page.width height = page.height ``` ## PyPDF2 **Installation:** `pip3 install PyPDF2` **Best for:** PDF manipulation (merge, split, rotate, watermark) **Key features:** - Merge multiple PDFs - Split PDFs into separate files - Rotate pages - Add watermarks - Encrypt/decrypt PDFs **Basic usage:** ```python from PyPDF2 import PdfReader, PdfWriter, PdfMerger # Reading reader = PdfReader('input.pdf') pages = reader.pages metadata = reader.metadata # Writing writer = PdfWriter() writer.add_page(reader.pages[0]) with open('output.pdf', 'wb') as f: writer.write(f) # Merging merger = PdfMerger() merger.append('file1.pdf') merger.append('file2.pdf') merger.write('merged.pdf') merger.close() # Rotating page = reader.pages[0] page.rotate(90) # clockwise 90 degrees ``` ## ReportLab (optional, for creating PDFs from scratch) **Installation:** `pip3 install reportlab` **Best for:** Generating new PDFs programmatically **Basic usage:** ```python from reportlab.pdfgen import canvas from reportlab.lib.pagesizes import letter c = canvas.Canvas("output.pdf", pagesize=letter) c.drawString(100, 750, "Hello World") c.save() ``` ## Common Patterns ### Extract text from specific pages ```python import pdfplumber with pdfplumber.open('document.pdf') as pdf: for page_num in [0, 2, 4]: # 0-indexed page = pdf.pages[page_num] text = page.extract_text() print(f"Page {page_num + 1}:\n{text}") ``` ### Merge PDFs with page ranges ```python from PyPDF2 import PdfMerger merger = PdfMerger() merger.append('file1.pdf', pages=(0, 3)) # First 3 pages merger.append('file2.pdf') # All pages merger.write('output.pdf') merger.close() ``` ### Extract and save all pages as separate PDFs ```python from PyPDF2 import PdfReader, PdfWriter reader = PdfReader('input.pdf') for i, page in enumerate(reader.pages): writer = PdfWriter() writer.add_page(page) with open(f'page_{i+1}.pdf', 'wb') as f: writer.write(f) ``` ## Text Editing ### Add text overlay to a page ```python from PyPDF2 import PdfReader, PdfWriter from reportlab.pdfgen import canvas from reportlab.lib.pagesizes import letter from io import BytesIO # Create overlay packet = BytesIO() can = canvas.Canvas(packet, pagesize=letter) can.setFont("Helvetica", 12) can.drawString(100, 700, "New Text Here") can.save() # Merge with existing PDF packet.seek(0) overlay = PdfReader(packet) reader = PdfReader('input.pdf') writer = PdfWriter() page = reader.pages[0] page.merge_page(overlay.pages[0]) writer.add_page(page) with open('output.pdf', 'wb') as f: writer.write(f) ``` ### Simple text replacement (limited) ```python # Note: PDF format is complex, this works for simple cases only with open('input.pdf', 'rb') as f: pdf_data = f.read() # Replace at byte level modified = pdf_data.replace(b'Old Text', b'New Text') with open('output.pdf', 'wb') as f: f.write(modified) ``` **Important:** Direct text replacement in PDFs is unreliable due to PDF format complexity (text encoding, positioning, fonts). For reliable editing: 1. Extract text → Edit → Regenerate PDF, or 2. Use overlay method to add new text on top