Install
openclaw skills install rag-chunking-optimizerOptimize RAG pipeline chunking strategy — analyze documents, recommend chunk sizes, splitting methods, overlap settings, and metadata enrichment for maximum retrieval quality.
openclaw skills install rag-chunking-optimizerAnalyze documents and recommend optimal chunking strategies for RAG (Retrieval-Augmented Generation) pipelines. Evaluates chunk sizes, splitting methods, overlap settings, metadata enrichment, and retrieval quality. Use when building or optimizing RAG applications.
"Optimize chunking for my knowledge base documents"
"What's the best chunk size for these technical docs?"
"Analyze my current chunking strategy for retrieval quality"
"Help me set up semantic chunking for my RAG pipeline"
"Compare chunking strategies for my document types"
Profile the document corpus:
# Analyze document types and sizes
find docs/ -type f \( -name "*.md" -o -name "*.txt" -o -name "*.pdf" -o -name "*.html" \) -exec wc -l {} + | sort -rn | head -20
# Check document structure patterns
for f in docs/*.md; do
echo "=== $f ==="
grep -c "^#" "$f" # heading count
grep -c "^```" "$f" # code block count
wc -w < "$f" # word count
done
Classify documents by type:
Evaluate strategies against the corpus:
Fixed-size chunking:
Recursive character splitting:
\n\n → \n → . → → ``Semantic chunking:
Heading-based (markdown/HTML):
Code-aware chunking:
Sliding window with context:
Factors that determine optimal chunk size:
text-embedding-3-small: 256-512 tokens optimaltext-embedding-ada-002: 512-1024 tokensvoyage-3: 512-1024 tokensBAAI/bge-large: 256-512 tokensDetermine optimal overlap:
Recommend metadata to attach to each chunk:
Evaluate chunking quality:
Design experiments to compare strategies:
Test matrix:
- Chunk sizes: [256, 512, 768, 1024] tokens
- Overlap: [0, 64, 128] tokens
- Method: [recursive, semantic, heading-based]
- Embedding model: [ada-002, voyage-3]
Evaluation set: 50 representative queries with known answers
Metrics: MRR@5, Recall@10, Answer accuracy, Latency
## RAG Chunking Analysis
**Corpus:** 234 documents, 1.2M tokens total
**Current strategy:** Fixed 1024 tokens, 100 overlap
**Recommendation:** Switch to heading-based + semantic hybrid
### Document Profile
| Type | Count | Avg Length | Recommended Strategy |
|------|-------|------------|---------------------|
| API docs | 89 | 2,400 tokens | Heading-based (H2 splits) |
| Tutorials | 45 | 5,800 tokens | Semantic chunking |
| Blog posts | 67 | 1,600 tokens | Recursive, 512 tokens |
| Changelogs | 33 | 800 tokens | Version-based splits |
### Recommended Configuration
```python
# Primary strategy: heading-based for structured docs
structured_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[("##", "Section"), ("###", "Subsection")],
strip_headers=False
)
# Fallback: recursive for unstructured content
recursive_splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=64,
separators=["\n\n", "\n", ". ", " "]
)
| Metric | Current | Projected |
|---|---|---|
| MRR@5 | 0.62 | 0.78 (+26%) |
| Recall@10 | 0.71 | 0.89 (+25%) |
| Avg chunk coherence | 0.54 | 0.82 (+52%) |
| Storage overhead | 1.0x | 1.12x |