Install
openclaw skills install pdf-text-extractor
Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
Vernox Utility Skill - Perfect for document digitization.
PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).
clawhub install pdf-text-extractor
const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});
console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Words: ${result.wordCount}`);
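The `wordCount` printed above depends on what counts as a word. The `countWords` options documented in the API reference (`minWordLength`, default 3, and `excludeNumbers`) suggest a rule along the following lines. This is only a sketch under that assumption: `countWordsSketch` is a hypothetical name, the punctuation stripping and the use of `text.length` for `charCount` are guesses, and the skill's real implementation may differ.

```javascript
// Hypothetical helper; NOT the skill's actual countWords implementation.
function countWordsSketch(text, { minWordLength = 3, excludeNumbers = false } = {}) {
  // Split on runs of whitespace and drop empty strings.
  const tokens = text.split(/\s+/).filter(Boolean);
  const words = tokens.filter((token) => {
    // Strip leading/trailing punctuation so "days:" is measured as "days".
    const core = token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '');
    if (core.length < minWordLength) return false;
    // Assumption: a "number" is a token made only of digits, dots, and commas.
    if (excludeNumbers && /^[\d.,]+$/.test(core)) return false;
    return true;
  });
  // Assumption: charCount is the raw string length, whitespace included.
  return { wordCount: words.length, charCount: text.length };
}

const sample = 'Invoice 12345 is due in 30 days: pay 1,250.00 USD';
console.log(countWordsSketch(sample).wordCount);                           // 7
console.log(countWordsSketch(sample, { excludeNumbers: true }).wordCount); // 5
```

With the default minimum length of 3, short tokens such as "is", "in", and "30" are not counted, which is worth keeping in mind when comparing counts against other tools.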
const batch = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});
// extractBatch returns an object; the per-file results are in batch.results
console.log(`Extracted ${batch.results.length} PDFs`);
const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
// OCR will be used (scanned document detected)
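The comment above says OCR is used when a scanned document is detected, but how the skill detects one is not documented. A common heuristic is to check whether the PDF's embedded text layer yields enough characters per page, and fall back to OCR when it does not. The sketch below illustrates that idea only; `chooseMethod` and the `minCharsPerPage` threshold are hypothetical and not part of the skill's API.

```javascript
// Hypothetical heuristic; the skill's real detection logic is not documented.
// pageTexts: one string per page, as read from the PDF's embedded text layer.
function chooseMethod(pageTexts, minCharsPerPage = 20) {
  // No pages with a text layer at all: assume a scanned document.
  if (pageTexts.length === 0) return 'ocr';
  const totalChars = pageTexts.reduce((sum, t) => sum + t.trim().length, 0);
  const average = totalChars / pageTexts.length;
  // Returns the same labels as the documented `method` field.
  return average >= minCharsPerPage ? 'text' : 'ocr';
}

console.log(chooseMethod(['AGREEMENT This contract between the parties...', 'Section 2. Term and termination.'])); // text
console.log(chooseMethod(['', ' ', ''])); // ocr
```

The return values match the `method` field listed in the `extractText` return value, so a caller could compare the two when debugging unexpected OCR runs.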
extractText
Extract text content from a single PDF file.
Parameters:
pdfPath (string, required): Path to PDF file
options (object, optional): Extraction options
  outputFormat (string): 'text' | 'json' | 'markdown' | 'html'
  ocr (boolean): Enable OCR for scanned docs
  language (string): OCR language code ('eng', 'spa', 'fra', 'deu')
  preserveFormatting (boolean): Keep headings/structure
  minConfidence (number): Minimum OCR confidence score (0-100)
Returns:
text (string): Extracted text content
pages (number): Number of pages processed
wordCount (number): Total word count
charCount (number): Total character count
language (string): Detected language
metadata (object): PDF metadata (title, author, creation date)
method (string): 'text' or 'ocr' (extraction method)

extractBatch
Extract text from multiple PDF files at once.
Parameters:
pdfFiles (array, required): Array of PDF file paths
options (object, optional): Same as extractText
Returns:
results (array): Array of extraction results
totalPages (number): Total pages across all PDFs
successCount (number): Successfully extracted
failureCount (number): Failed extractions
errors (array): Error details for failures

countWords
Count words in extracted text.
Parameters:
text (string, required): Text to count
options (object, optional):
  minWordLength (number): Minimum characters per word (default: 3)
  excludeNumbers (boolean): Don't count numbers as words
  countByPage (boolean): Return word count per page
Returns:
wordCount (number): Total word count
charCount (number): Total character count
pageCounts (array): Word count per page
averageWordsPerPage (number): Average words per page

detectLanguage
Detect the language of extracted text.
Parameters:
text (string, required): Text to analyze
minConfidence (number): Minimum confidence for detection
Returns:
language (string): Detected language code
languageName (string): Full language name
confidence (number): Confidence score (0-100)

config.json:
{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
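The `batch` settings above cap how many PDFs are processed at once and how long each may take. The README does not show how the skill enforces them, so the following is a sketch of one plausible approach: a fixed number of worker "lanes" pull from a shared queue, and each call is raced against a timer. `runLimited` and its result shape are hypothetical names for illustration, and the timeout here only stops waiting; it does not cancel the underlying work.

```javascript
// Hypothetical helper: runs `worker` over `items` with at most `maxConcurrent`
// calls in flight, and marks any call that exceeds `timeoutSeconds` as failed.
async function runLimited(items, worker, { maxConcurrent = 3, timeoutSeconds = 30 } = {}) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Race the worker's promise against a timer, clearing the timer afterwards.
  const withTimeout = (promise) => {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('timed out')), timeoutSeconds * 1000);
    });
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
  };

  // Each lane claims items one at a time until the list is exhausted.
  async function lane() {
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = { ok: true, value: await withTimeout(worker(items[i])) };
      } catch (error) {
        results[i] = { ok: false, error: error instanceof Error ? error.message : String(error) };
      }
    }
  }

  const laneCount = Math.min(maxConcurrent, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results; // same order as `items`
}

// Usage with a stand-in worker (a real caller would pass an extraction function).
const fakeExtract = (path) =>
  new Promise((resolve) => setTimeout(() => resolve(`text of ${path}`), 10));

runLimited(['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf'], fakeExtract, { maxConcurrent: 3, timeoutSeconds: 30 })
  .then((out) => console.log(out.filter((r) => r.ok).length)); // 4
```

Failures are recorded per item instead of rejecting the whole batch, which mirrors the `successCount` / `failureCount` / `errors` fields in the documented `extractBatch` return value.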
const invoice = await extractText({ pdfPath: './invoice.pdf' });
console.log(invoice.text);
// "INVOICE #12345 Date: 2026-02-04..."
const contract = await extractText({
  pdfPath: './scanned-contract.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
console.log(contract.text);
// "AGREEMENT This contract between..."
const docs = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf',
    './doc4.pdf'
  ]
});
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
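The example above prints only the success ratio. The documented `extractBatch` return value also carries `totalPages`, `failureCount`, and `errors`, so a fuller report could be built as sketched below. `summarizeBatch` is a hypothetical helper, and the shape of each `errors` entry (`file` and `message`) is an assumption, since the README only describes it as "error details for failures". The total is computed from the two counts because the README does not say whether `results` includes failed files.

```javascript
// Hypothetical helper that formats an object shaped like the documented
// extractBatch return value. The { file, message } error shape is assumed.
function summarizeBatch(batch) {
  const total = batch.successCount + batch.failureCount;
  const lines = [
    `Processed ${batch.successCount}/${total} documents`,
    `Total pages: ${batch.totalPages}`,
  ];
  for (const err of batch.errors ?? []) {
    lines.push(`Failed ${err.file}: ${err.message}`);
  }
  return lines.join('\n');
}

// Mock data standing in for a real extractBatch result.
const mockBatch = {
  results: [{}, {}, {}],
  totalPages: 12,
  successCount: 3,
  failureCount: 1,
  errors: [{ file: './doc4.pdf', message: 'File not found' }],
};
console.log(summarizeBatch(mockBatch));
// Processed 3/4 documents
// Total pages: 12
// Failed ./doc4.pdf: File not found
```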
License: MIT
Extract text from PDFs. Fast, accurate, zero dependencies. 🔮