Install

openclaw skills install rag-pipeline-starter

Production-grade RAG pipeline setup for 50K-500K row datasets: document chunking, embedding benchmarking, vector indexing, and retrieval tuning.
This skill provides a complete toolkit for building and optimizing RAG (Retrieval-Augmented Generation) pipelines. It analyzes your data, recommends a chunking strategy, benchmarks embedding models against your domain, and helps tune retrieval parameters for better accuracy.
chunking_analyzer.py

Analyzes documents and recommends a chunking strategy based on content structure.
Usage:
# Assess data and get strategy recommendation
python chunking_analyzer.py --assess ./data
# Apply chunking strategy to documents
python chunking_analyzer.py --strategy recursive --input ./data/doc.txt --output ./chunks/ --chunk-size 500 --overlap 50
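The recursive strategy splits on the coarsest separator first (paragraphs, then lines, sentences, words) and falls back to fixed-size slices when no separator remains. A minimal sketch of the idea, not the script's actual implementation:

```python
def recursive_chunk(text, chunk_size=500, overlap=50,
                    separators=("\n\n", "\n", ". ", " ")):
    """Recursively split `text` on the coarsest separator present,
    merging pieces back up to `chunk_size` and carrying `overlap`
    characters between consecutive chunks."""
    if len(text) <= chunk_size:
        return [text]
    # pick the first separator that actually occurs in the text
    for sep in separators:
        if sep in text:
            break
    else:
        # no separator left: fall back to fixed-size slicing
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # start the next chunk with the tail of the previous one
            current = current[-overlap:] + part if current else part
    if current:
        chunks.append(current)
    # pieces may still be oversized; recurse with finer separators
    result = []
    for c in chunks:
        if len(c) > chunk_size:
            result.extend(recursive_chunk(c, chunk_size, overlap, separators[1:]))
        else:
            result.append(c)
    return result
```

Each chunk begins with the last `overlap` characters of its predecessor, so a sentence cut at a boundary still appears whole in one of the two chunks.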
Options:
--assess <dir> - Analyze documents and recommend a strategy
--strategy <name> - Chunking strategy: fixed, semantic, recursive, hierarchical
--input <path> - Input file or directory
--output <dir> - Output directory for chunks
--chunk-size <int> - Chunk size (default: 500)
--overlap <int> - Overlap between chunks (default: 50)

embedding_benchmark.py

Tests multiple embedding models on your data to find the best fit for your domain.
Usage:
python embedding_benchmark.py --data ./chunks/ --domain finance --output results.json
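In essence, the benchmark embeds your chunks with each candidate model, runs test queries against them, and scores recall@k. A self-contained sketch of that loop with a pluggable embed function; the real script's model-loading calls are unknown here, so a toy bag-of-words embedder stands in:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_k(embed, chunks, queries, k=5):
    """queries: (query_text, index_of_relevant_chunk) pairs."""
    vectors = [embed(c) for c in chunks]
    hits = 0
    for query, relevant in queries:
        qv = embed(query)
        ranked = sorted(range(len(chunks)),
                        key=lambda i: cosine(qv, vectors[i]),
                        reverse=True)
        hits += relevant in ranked[:k]
    return hits / len(queries)

def benchmark(models, chunks, queries, k=5):
    """models: {model_name: embed_function}; returns recall@k per model."""
    return {name: recall_at_k(embed, chunks, queries, k)
            for name, embed in models.items()}
```

Swapping the toy embedder for real model calls turns this into a domain-specific comparison: the model with the highest recall@k on your own queries is the one worth indexing with.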
Options:
--embeddings <models> - Embedding models to test (space-separated)
--data <dir> - Directory with chunked text files (required)
--domain <name> - Domain name for context-specific recommendations
--output <file> - Output file for results (JSON)

Supported Embeddings:
retrieval_tuner.py

Optimizes retrieval parameters (top-k, similarity threshold) for your specific use case.
Usage:
python retrieval_tuner.py --index ./vector_store/ --queries ./test_queries.json --output tuning_results.json
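Conceptually the tuner is a grid search: for each (top-k, threshold) pair it runs the test queries and keeps the combination with the best mean F1. A sketch under that assumption, with the search function left pluggable (its signature here is illustrative, not the script's real API):

```python
def f1_score(retrieved, expected):
    """F1 between retrieved and expected document-id sets."""
    retrieved, expected = set(retrieved), set(expected)
    tp = len(retrieved & expected)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(retrieved), tp / len(expected)
    return 2 * precision * recall / (precision + recall)

def tune_retrieval(search, queries, k_values, thresholds):
    """Grid-search (top_k, threshold) for the best mean F1.

    search(query, top_k, threshold) -> list of doc ids.
    queries: (query_text, expected_doc_ids) pairs."""
    best = {"top_k": None, "threshold": None, "f1": -1.0}
    for k in k_values:
        for t in thresholds:
            mean_f1 = sum(f1_score(search(q, k, t), exp)
                          for q, exp in queries) / len(queries)
            if mean_f1 > best["f1"]:
                best = {"top_k": k, "threshold": t, "f1": mean_f1}
    return best
```

The grid is small (20 top-k values times a handful of thresholds), so exhaustive search is cheap; the cost is dominated by running the queries themselves.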
Options:
--index <dir> - Vector store index directory
--queries <file> - JSON file with test queries and expected results
--output <file> - Output file for tuning results
--top-k-range <min> <max> - Range of top-k values to test (default: 1 20)
--threshold-range <min> <max> <step> - Similarity threshold range

vector_store_manager.py

Manages vector store operations: create, update, search, and maintain indexes.
Usage:
# Create index from chunks
python vector_store_manager.py --create --chunks ./chunks/ --index ./vector_store/ --embedding sentence-transformers/all-MiniLM-L6-v2
# Search index
python vector_store_manager.py --search --index ./vector_store/ --query "your search query" --top-k 5
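Under the hood, a vector store pairs each chunk with its embedding and answers queries by similarity ranking. A minimal in-memory sketch; the real script additionally persists the index to disk and loads the named embedding model, both of which are elided here, with the embed callable left as an assumption:

```python
import math

class VectorStore:
    """In-memory index mapping doc ids to embeddings of their text."""

    def __init__(self, embed):
        self.embed = embed                  # callable: text -> list[float]
        self.ids, self.vectors = [], []

    def add(self, doc_id, text):
        """Index one document (mirrors --create / --update)."""
        self.ids.append(doc_id)
        self.vectors.append(self.embed(text))

    def search(self, query, top_k=5):
        """Return (doc_id, score) pairs ranked by cosine similarity."""
        qv = self.embed(query)
        scored = sorted(((self._cosine(qv, v), i)
                         for i, v in enumerate(self.vectors)), reverse=True)
        return [(self.ids[i], score) for score, i in scored[:top_k]]

    def stats(self):
        """Basic index statistics (mirrors --stats)."""
        return {"documents": len(self.ids),
                "dimensions": len(self.vectors[0]) if self.vectors else 0}

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
```

A linear scan like this is fine at the 50K-500K scale this skill targets only for modest query volumes; beyond that, an approximate-nearest-neighbor index is the usual upgrade.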
Options:
--create - Create new index from chunks
--chunks <dir> - Directory with chunked text files
--index <dir> - Vector store directory
--embedding <model> - Embedding model to use
--search - Search existing index
--query <text> - Search query
--top-k <int> - Number of results to return (default: 5)
--update - Update index with new documents
--stats - Show index statistics

Pricing

Free tier (this skill): core chunking and embedding benchmark tools.
Paid guide ($49): Complete production RAG setup with:
Quickstart

1. Assess your data:
python chunking_analyzer.py --assess ./your_data/

2. Apply the recommended chunking strategy:
python chunking_analyzer.py --strategy recursive --input ./data/ --output ./chunks/

3. Benchmark embeddings:
python embedding_benchmark.py --data ./chunks/ --domain your_domain

4. Create the vector store:
python vector_store_manager.py --create --chunks ./chunks/ --index ./vector_store/ --embedding <recommended_model>

5. Tune retrieval (optional):
python retrieval_tuner.py --index ./vector_store/ --queries ./test_queries.json
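The whole pipeline compresses into one end-to-end sketch: chunk, embed, index, search. A toy bag-of-words embedder stands in for whichever model the benchmark recommends, and a fixed word-window chunker simplifies the recursive strategy:

```python
import math

def chunk(text, size=6, overlap=2):
    """Fixed-size word windows with overlap (simplified chunking)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def make_embed(corpus):
    """Bag-of-words embedder over the corpus vocabulary (toy stand-in)."""
    vocab = sorted({w for text in corpus for w in text.lower().split()})
    return lambda text: [text.lower().split().count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1-2. chunk the documents
doc = ("the fox runs through the forest at dawn " * 4
       + "markets rallied after the earnings report came out " * 4)
chunks = chunk(doc)

# 3-4. embed each chunk and build the index
embed = make_embed(chunks)
index = [(c, embed(c)) for c in chunks]

# 5. retrieve by cosine similarity
def search(query, top_k=3):
    qv = embed(query)
    ranked = sorted(index, key=lambda e: cosine(qv, e[1]), reverse=True)
    return [c for c, _ in ranked[:top_k]]
```

Each numbered comment corresponds to a quickstart step; in the real pipeline the toy pieces are replaced by the scripts above, but the data flow between stages is the same.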
Files

chunking_analyzer.py - Document analysis and chunking
embedding_benchmark.py - Embedding model benchmarking
retrieval_tuner.py - Retrieval parameter optimization
vector_store_manager.py - Vector store operations
skill.json - Skill metadata