Install
openclaw skills install map-reduce-llm
Structured Map-Reduce workflow for analyzing large documents with LLMs when the content exceeds the model's context window. Covers intelligent chunking strategies (including Linux split and Windows PowerShell equivalents), prompt templates, hierarchical and agentic variants, and quality control.
For initial low-level reconnaissance (e.g., identifying structure, extracting specific patterns, or statistics), combine this pattern with traditional scripting tools (grep, awk, Python single-pass processing) as a preparatory step.
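As a rough illustration of the single-pass reconnaissance idea, the sketch below counts lines, words, blank lines, and heading levels in one pass. It assumes Markdown-style `#` headings; the function name `scan` and the statistics it collects are illustrative choices, not part of the skill.

```python
import re
from collections import Counter

# Assumption: headings look like Markdown ("# ", "## ", ...). Adjust for other formats.
HEADING = re.compile(r"^(#{1,6})\s+\S")

def scan(lines):
    """Make one pass over an iterable of lines and return basic structure statistics."""
    stats = Counter()
    longest = 0
    for line in lines:
        stats["lines"] += 1
        stats["words"] += len(line.split())
        if not line.strip():
            stats["blank_lines"] += 1
        match = HEADING.match(line)
        if match:
            # Count headings per level, e.g. "h2" for "## Section".
            stats[f"h{len(match.group(1))}"] += 1
        longest = max(longest, len(line.rstrip("\n")))
    stats["longest_line_chars"] = longest
    return dict(stats)
```

Because `scan` accepts any iterable of lines, an open file handle can be passed directly (for example `open("big_document.txt", encoding="utf-8", errors="replace")`), so the file is never loaded into memory at once. Many headings suggest pattern-based splitting; none suggest falling back to line-based chunks.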
Avoid naive fixed-size splits whenever the document has natural structure.
Before deciding how to chunk a large file, always inspect its type and basic characteristics.
# Basic file type detection (strongly recommended to run first)
file big_document.txt
file -i big_document.txt # Show MIME type + character encoding (e.g. utf-8, iso-8859-1, ascii)
# Brief output
file -b big_document.txt
# File size information
ls -lh big_document.txt # Human-readable size
du -h big_document.txt
wc -c big_document.txt # Total bytes
wc -l big_document.txt # Total lines
wc -w big_document.txt # Total words
# Peek at the beginning to determine if it is text or binary
head -c 200 big_document.txt | cat -A
# Basic file information
Get-Item "big_document.txt" | Select-Object Name, Length, Extension, LastWriteTime
# Show size in MB
(Get-Item "big_document.txt").Length / 1MB
# Line count (-ReadCount 0 is faster than the default, but it loads the whole file into memory)
(Get-Content "big_document.txt" -ReadCount 0).Count
# Character encoding reference (PowerShell does not have a direct equivalent to file -i)
Get-Content "big_document.txt" -TotalCount 5 -Encoding UTF8 # Try different encodings
# Preview the beginning of the file
Get-Content "big_document.txt" -TotalCount 10
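# Peek at raw bytes (to determine if it is a text file); reads only the first 100 bytes
# PowerShell 6+; on Windows PowerShell 5.1 use -Encoding Byte instead of -AsByteStream
Get-Content "big_document.txt" -AsByteStream -TotalCount 100 | Format-Hex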
Why this matters
# 1. Split by number of lines (fastest and most common)
split -l 1000 -d --additional-suffix=.txt big_document.txt chunk_
# Result: chunk_00.txt, chunk_01.txt, chunk_02.txt ... (1000 lines each)
# 2. Split by approximate byte size (-b can cut a line or a multi-byte character in half; use -C 500K to keep whole lines)
split -b 500K -d big_document.txt chunk_
# 3. Split using a pattern (more semantic, e.g. before every heading)
csplit -z -f section_ -b "%03d.txt" big_document.txt '/^## /' '{*}'
# Useful reconnaissance commands
wc -l big_document.txt
head -n 100 big_document.txt
tail -n 50 big_document.txt
PowerShell does not have a built-in split command. The following scripts can be used to split large text files.
Simple version (easier to read):
# Split by number of lines (1000 lines per chunk)
$inputFile = "big_document.txt"
$linesPerChunk = 1000
$baseName = "chunk"
$chunkIndex = 0
$currentChunk = @()
Get-Content $inputFile | ForEach-Object {
    $currentChunk += $_
    if ($currentChunk.Count -eq $linesPerChunk) {
        $currentChunk | Set-Content ("{0}_{1:D3}.txt" -f $baseName, $chunkIndex)
        $currentChunk = @()
        $chunkIndex++
    }
}
# Write any remaining lines
if ($currentChunk.Count -gt 0) {
    $currentChunk | Set-Content ("{0}_{1:D3}.txt" -f $baseName, $chunkIndex)
}
More efficient version for very large files (uses StreamReader):
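$inputFile = "big_document.txt"
$linesPerChunk = 1000
# .NET resolves relative paths against the process working directory, not the
# current PowerShell location, so build absolute paths first
$inputPath = (Resolve-Path $inputFile).Path
$outputDir = (Get-Location).Path
$reader = [System.IO.StreamReader]::new($inputPath)
$chunkIndex = 0
$lineCount = 0
$writer = $null
try {
    while (-not $reader.EndOfStream) {
        # Start a new chunk file every $linesPerChunk lines
        if ($lineCount % $linesPerChunk -eq 0) {
            if ($writer) { $writer.Dispose() }
            $chunkPath = Join-Path $outputDir ("chunk_{0:D3}.txt" -f $chunkIndex)
            $writer = [System.IO.StreamWriter]::new($chunkPath)
            $chunkIndex++
        }
        $writer.WriteLine($reader.ReadLine())
        $lineCount++
    }
} finally {
    if ($writer) { $writer.Dispose() }
    $reader.Dispose()
}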
Recommendations
- The split command is very fast and simple, making it ideal for extremely large files.
- Store all chunks and metadata in a dedicated run folder.
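One possible way to combine overlapping chunks with a dedicated run folder is sketched below. The folder layout (`chunks/` plus `manifest.json`), the function names, and the default sizes are all assumptions made for illustration; the skill itself does not prescribe them.

```python
import json
from pathlib import Path

def chunk_with_overlap(lines, lines_per_chunk=1000, overlap=50):
    """Yield (start_index, chunk_lines); consecutive chunks share `overlap` lines.

    `lines` must be a list (it is sliced and its length is needed).
    """
    if not 0 <= overlap < lines_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than lines_per_chunk")
    step = lines_per_chunk - overlap
    for start in range(0, len(lines), step):
        yield start, lines[start:start + lines_per_chunk]
        if start + lines_per_chunk >= len(lines):
            break  # the last chunk already reaches the end of the document

def write_run_folder(lines, run_dir, **chunk_options):
    """Write chunks to <run_dir>/chunks and record their line ranges in manifest.json."""
    run = Path(run_dir)
    (run / "chunks").mkdir(parents=True, exist_ok=True)
    manifest = []
    for idx, (start, chunk) in enumerate(chunk_with_overlap(lines, **chunk_options)):
        name = f"chunk_{idx:03d}.txt"
        (run / "chunks" / name).write_text("".join(chunk), encoding="utf-8")
        manifest.append({
            "chunk_id": idx,
            "file": name,
            "first_line": start + 1,        # 1-based, inclusive
            "last_line": start + len(chunk),
        })
    (run / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Recording line ranges in the manifest keeps every chunk traceable back to its position in the source, which is useful later when a report cites chunk IDs.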
Each chunk is processed in isolation with a narrowly scoped prompt.
You are an expert analyst. Carefully read the following document chunk.
CHUNK {chunk_id} of {total_chunks}
SOURCE METADATA: {optional_heading_or_location}
TASK:
{specific_map_task}
Example tasks:
- "Create a dense bullet-point summary of the main facts, decisions, arguments, and data points. Include speaker names and approximate timestamps if present."
- "Extract all action items, owners, and deadlines mentioned."
- "List key technical decisions and the rationale provided."
OUTPUT REQUIREMENTS (strict):
- Output ONLY in the requested format below.
- Every point must be self-contained.
- Reference the chunk ID where relevant.
- Do not add any text outside the required format.
DOCUMENT CHUNK:
{chunk_text}
Critical best practice: Force Map outputs to be highly structured (consistent bullet format, JSON, or key-value pairs). This makes the subsequent Reduce step dramatically more reliable and reduces hallucinations.
Save every Map result with its chunk ID for later reference.
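A minimal sketch of the Map loop might look like the following. It assumes the Map output format is a JSON array of strings, and `call_llm` is a hypothetical callable (prompt in, text out) standing in for whatever model client is used; the shortened template and file naming are likewise illustrative.

```python
import json
from pathlib import Path

# Shortened version of the Map template above; the JSON output rule is an assumption.
MAP_PROMPT = (
    "You are an expert analyst. Carefully read the following document chunk.\n"
    "CHUNK {chunk_id} of {total_chunks}\n"
    "TASK:\n{specific_map_task}\n"
    "OUTPUT REQUIREMENTS (strict):\n"
    "- Output ONLY a JSON array of strings, one self-contained point per item.\n"
    "DOCUMENT CHUNK:\n{chunk_text}"
)

def run_map(chunks, task, call_llm, out_dir):
    """Run the Map prompt on each chunk and save every result under its chunk ID.

    `call_llm` is hypothetical: any function that takes a prompt string and returns text.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for chunk_id, text in enumerate(chunks, start=1):
        prompt = MAP_PROMPT.format(
            chunk_id=chunk_id,
            total_chunks=len(chunks),
            specific_map_task=task,
            chunk_text=text,
        )
        raw = call_llm(prompt)
        # Validate the structure so malformed outputs are flagged before Reduce.
        try:
            points = json.loads(raw)
            valid = isinstance(points, list) and all(isinstance(p, str) for p in points)
        except json.JSONDecodeError:
            points, valid = None, False
        record = {
            "chunk_id": chunk_id,
            "valid": valid,
            "points": points if valid else None,
            "raw": raw,  # keep the raw text for debugging and retries
        }
        (out / f"map_{chunk_id:03d}.json").write_text(
            json.dumps(record, indent=2), encoding="utf-8"
        )
        results[chunk_id] = record
    return results
```

Records with `"valid": False` can be retried or inspected by hand instead of silently flowing into the Reduce step.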
This is where the model gains a global view by reading the distilled Map outputs.
You are a senior analyst synthesizing a very large document.
You have been given independent analysis results (Map outputs) from {total_chunks} chunks that together cover the entire source material.
OVERALL OBJECTIVE: {overall_goal_or_question}
INSTRUCTIONS:
1. Carefully review every Map output.
2. Produce a single coherent synthesis.
3. Explicitly call out connections, patterns, contradictions, or themes that appear across multiple chunks.
4. Clearly state when information is missing or conflicting.
5. Follow the exact output format requested.
MAP OUTPUTS:
{all_map_outputs_formatted_with_chunk_ids}
REQUIRED FINAL OUTPUT FORMAT:
{desired_final_structure}
Example structure:
- Executive Summary (200-400 words)
- Key Themes and Insights (bulleted, with cross-chunk references)
- Detailed Findings (organized by topic)
- Open Questions and Limitations
- Traceability: For each major claim, note the chunk IDs that support it
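To make the `{all_map_outputs_formatted_with_chunk_ids}` placeholder concrete, here is one way the Reduce prompt could be assembled. It assumes Map outputs are held as a dict from chunk ID to a list of bullet strings, and the `[CHUNK n]` label format is an illustrative choice; the template is abbreviated.

```python
# Abbreviated Reduce template; the INSTRUCTIONS block is omitted here for brevity.
REDUCE_PROMPT = """You are a senior analyst synthesizing a very large document.
You have been given independent analysis results (Map outputs) from {total_chunks} chunks that together cover the entire source material.
OVERALL OBJECTIVE: {overall_goal_or_question}
MAP OUTPUTS:
{all_map_outputs_formatted_with_chunk_ids}
REQUIRED FINAL OUTPUT FORMAT:
{desired_final_structure}"""

def format_map_outputs(map_outputs):
    """Render {chunk_id: [points]} as labeled blocks, ordered by chunk ID."""
    blocks = []
    for chunk_id in sorted(map_outputs):
        bullets = "\n".join(f"- {point}" for point in map_outputs[chunk_id])
        blocks.append(f"[CHUNK {chunk_id}]\n{bullets}")
    return "\n\n".join(blocks)

def build_reduce_prompt(map_outputs, goal, final_structure):
    """Fill the Reduce template with the goal, labeled Map outputs, and output format."""
    return REDUCE_PROMPT.format(
        total_chunks=len(map_outputs),
        overall_goal_or_question=goal,
        all_map_outputs_formatted_with_chunk_ids=format_map_outputs(map_outputs),
        desired_final_structure=final_structure,
    )
```

Labeling every block with its chunk ID is what lets the final report cite supporting chunks in its Traceability section.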
After the Reduce step:
Map → structured Collapse (merge semantically related outputs) → Reduce. This intermediate step helps preserve relationships that would otherwise be lost.
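The Collapse step can be sketched as a loop that merges groups of Map outputs until the combined text fits a size budget. This version groups adjacent outputs as a simple stand-in for semantic relatedness, measures size in characters rather than tokens, and takes a hypothetical `merge` callable (list of strings in, one string out), which in practice would typically be an LLM call.

```python
def collapse(outputs, merge, max_chars=8000, group_size=4):
    """Merge Map outputs group by group until their total size fits max_chars.

    Assumptions: adjacent outputs are treated as related, and `merge` is a
    hypothetical function that condenses a list of strings into one string.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    level = list(outputs)
    # Each round shrinks the number of items, so the loop always terminates.
    while len(level) > 1 and sum(len(item) for item in level) > max_chars:
        level = [
            merge(level[i:i + group_size])
            for i in range(0, len(level), group_size)
        ]
    return level
```

The returned list is what the final Reduce prompt receives. Keeping each intermediate `level` on disk would preserve the partial results for later inspection.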
During the Map phase, give agents access to additional tools (search within chunk, code execution, external lookup) so they can enrich the per-chunk analysis.
Always keep intermediate artifacts (chunks, Map outputs, partial Reduces). This makes the process debuggable, resumable, and auditable.
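Keeping artifacts on disk is what makes a run resumable. As a sketch, the helper below lists chunks that still lack a readable Map output, assuming a hypothetical layout of `<run_dir>/chunks/chunk_NNN.txt` and `<run_dir>/map/chunk_NNN.json`.

```python
import json
from pathlib import Path

def pending_chunks(run_dir):
    """Return names of chunk files with no parseable Map output yet.

    Assumed layout (hypothetical): chunks in <run_dir>/chunks, outputs in <run_dir>/map.
    """
    run = Path(run_dir)
    pending = []
    for chunk_file in sorted((run / "chunks").glob("chunk_*.txt")):
        out_file = run / "map" / (chunk_file.stem + ".json")
        try:
            # A missing or truncated file means the chunk must be (re)processed.
            json.loads(out_file.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            pending.append(chunk_file.name)
    return pending
```

After an interrupted run, only the chunks returned by `pending_chunks` need to go through the Map phase again.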
State each chunk's position in the Map prompt (chunk X of Y) so the model can calibrate the scope of its output.

| Pitfall | Mitigation |
|---|---|
| Important ideas cut at chunk boundaries | Use overlap + structure-aware chunking |
| Map outputs are too long or inconsistent | Enforce strict length limits and exact formats (bullets or JSON) |
| Reduce step ignores some chunks | Number chunks clearly and instruct the model to consider every Map output |
| Fabricated cross-chunk connections | Require the Reduce output to cite specific chunk IDs for every claim |
| High token cost | Use a cheaper/faster model for the Map phase; reserve the strongest model for Reduce |
| Loss of important nuance | Keep original chunks easily accessible so the final report can support drill-down |
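
Two of the mitigations above (citing chunk IDs and considering every Map output) can be checked mechanically. The sketch below assumes the Reduce report cites chunks as `[chunk N]` with 1-based IDs; that citation format and the function name are illustrative assumptions.

```python
import re

# Assumption: the report cites sources as "[chunk 3]" (case-insensitive).
CITATION = re.compile(r"\[chunk\s+(\d+)\]", re.IGNORECASE)

def check_citations(report, total_chunks):
    """Return (unknown_ids, uncited_ids) for a Reduce report.

    unknown_ids: cited IDs outside 1..total_chunks (possible fabrication).
    uncited_ids: valid chunk IDs the report never cites (possibly ignored).
    """
    cited = {int(n) for n in CITATION.findall(report)}
    valid = set(range(1, total_chunks + 1))
    return sorted(cited - valid), sorted(valid - cited)
```

A non-empty `unknown_ids` list is a strong signal to re-run or review the Reduce step. Uncited chunks are not necessarily an error, but they are worth a quick look.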
Always produce at minimum:
This pattern is training-free, works with any LLM, scales to documents of essentially arbitrary length (especially when using the sub-agent variant), and produces auditable results.
This workflow is effective whenever you need to perform deep analysis or synthesis on text that does not fit in a single LLM context window.