RAG Pipeline Starter

v1.0.0

Set up and optimize RAG pipelines for large datasets (50K-500K rows) with document chunking, embedding benchmarking, vector indexing, and retrieval tuning.


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for abhinas90/rag-pipeline-starter.

Prompt preview (Install & Setup):
Install the skill "RAG Pipeline Starter" (abhinas90/rag-pipeline-starter) from ClawHub.
Skill page: https://clawhub.ai/abhinas90/rag-pipeline-starter
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install rag-pipeline-starter

ClawHub CLI


npx clawhub@latest install rag-pipeline-starter
Security Scan
VirusTotal
Benign
OpenClaw
Benign
high confidence
Purpose & Capability
Name/description match what the code provides: chunking analyzer, embedding benchmark, retrieval tuner, and a simple vector store manager. Required resources (none) and included scripts are proportionate to the stated purpose.
Instruction Scope
SKILL.md instructs running the included Python scripts on local data and creating local indexes. The runtime instructions only reference local files/directories and the included scripts; they do not instruct the agent to read unrelated system files, access credentials, or send data to external endpoints.
Install Mechanism
No install spec is provided; the skill is instruction+code only. This minimizes installation risk — nothing is downloaded or written by an installer. The only runtime requirement is Python 3.8+ and typical Python packages (numpy, sentence-transformers optionally).
Credentials
The skill requests no environment variables or credentials. The code reads and writes local files (chunks, indexes) which is expected for its purpose. There are no references to network endpoints, cloud credentials, or unrelated secrets.
Persistence & Privilege
The skill does not set always: true, and it does not modify other skills or system-wide agent configuration. It persists only to its own index directories/files when run, which is expected behavior for a local vector store manager.
Assessment
What to consider before installing/running:

  • The package is instruction+code only and runs entirely on local files. The code makes no network calls and requests no credentials, which reduces exfiltration risk.
  • The scripts create and modify files under the directories you pass as --output, --index, or --chunks. Run them in a controlled workspace or sandbox when testing, and avoid pointing them at sensitive system directories.
  • The embedding benchmark is mostly a mock/demo implementation. A small bug (a function-name mismatch: compute_similarity__mock vs. compute_similarity_mock) may cause runtime errors; expect to edit the code before production use. The recommendation logic also picks a strategy from the first analyzed document rather than aggregating across all documents; review it if you need different behavior.
  • If you plug in real (paid) embedding providers, you will need to manage API keys yourself; this skill does not request or manage credentials. Keep keys out of plain text and use secure storage.
  • Best practice: inspect the files locally (you already have them), run on a small sample dataset first, and use a restricted environment (container or VM) if you are unsure.

Given the available materials, the skill appears internally consistent and implements the features it claims; no indicators of data exfiltration or unrelated privileges were found.

Like a lobster shell, security has layers — review code before you run it.

latest: vk972tqh6as4ygbsecc8xnk9esd852cjm
64 downloads · 0 stars · 1 version
Updated 1w ago
v1.0.0
MIT-0

RAG Pipeline Starter

Production-grade RAG pipeline setup with chunking strategies, embedding benchmarks, and retrieval tuning for 50K-500K row datasets.

Overview

This skill provides a complete toolkit for building and optimizing RAG (Retrieval-Augmented Generation) pipelines. It analyzes your data, recommends optimal chunking strategies, benchmarks embedding models, and helps tune retrieval parameters for maximum accuracy.

When to Use

  • Building a new RAG system from scratch
  • Optimizing an existing RAG pipeline's retrieval quality
  • Choosing the right embedding model for your domain
  • Processing large document collections (50K-500K rows)
  • Balancing speed vs. accuracy for your use case

Scripts

chunking_analyzer.py

Analyzes documents and recommends optimal chunking strategies based on content structure.

Usage:

# Assess data and get strategy recommendation
python chunking_analyzer.py --assess ./data

# Apply chunking strategy to documents
python chunking_analyzer.py --strategy recursive --input ./data/doc.txt --output ./chunks/ --chunk-size 500 --overlap 50

Options:

  • --assess <dir> - Analyze documents and recommend strategy
  • --strategy <name> - Chunking strategy: fixed, semantic, recursive, hierarchical
  • --input <path> - Input file or directory
  • --output <dir> - Output directory for chunks
  • --chunk-size <int> - Chunk size (default: 500)
  • --overlap <int> - Overlap between chunks (default: 50)
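The chunker's implementation is not shown on this page, but the way --chunk-size and --overlap interact can be sketched with a minimal fixed-size chunker (the chunk_fixed helper below is hypothetical, not code from the skill):

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; adjacent chunks share `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances per chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_fixed(text, chunk_size=500, overlap=50)
# 3 chunks; the last 50 chars of each chunk repeat at the start of the next.
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk; the recursive and semantic strategies refine where boundaries fall rather than changing this basic window arithmetic.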

embedding_benchmark.py

Tests multiple embedding models on your data to find the best fit for your domain.

Usage:

python embedding_benchmark.py --data ./chunks/ --domain finance --output results.json

Options:

  • --embeddings <models> - Embedding models to test (space-separated)
  • --data <dir> - Directory with chunked text files (required)
  • --domain <name> - Domain name for context-specific recommendations
  • --output <file> - Output file for results (JSON)

Supported Embeddings:

  • sentence-transformers/all-MiniLM-L6-v2 (384 dims, fast, free)
  • sentence-transformers/all-mpnet-base-v2 (768 dims, medium, free)
  • openai/text-embedding-ada-002 (1536 dims, fast, paid)
  • cohere/embed-english-v3.0 (1024 dims, fast, paid)
  • bm25 (sparse, fast, free)
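Per the security review above, the benchmark's similarity scoring is largely a mock/demo, which lets it run without API keys. A deterministic mock embedder plus cosine similarity is enough to exercise that plumbing; the sketch below (embed_mock and cosine_similarity are hypothetical names, not the skill's functions) shows the idea:

```python
import hashlib
import math

def embed_mock(text: str, dims: int = 384) -> list[float]:
    """Deterministic pseudo-embedding: hash-seeded LCG, no model download needed."""
    state = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    vec = []
    for _ in range(dims):
        # 64-bit linear congruential step; values land in [-0.5, 0.5)
        state = (state * 6364136223846793005 + 1442695040888963407) % 2**64
        vec.append(state / 2**64 - 0.5)
    return vec

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Identical texts map to identical vectors, so similarity is exactly 1.0.
sim = cosine_similarity(embed_mock("quarterly revenue"), embed_mock("quarterly revenue"))
```

A mock like this only validates dimensions, timing, and I/O; judging retrieval quality for your domain still requires the real models listed above.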

retrieval_tuner.py

Optimizes retrieval parameters (top-k, similarity threshold) for your specific use case.

Usage:

python retrieval_tuner.py --index ./vector_store/ --queries ./test_queries.json --output tuning_results.json

Options:

  • --index <dir> - Vector store index directory
  • --queries <file> - JSON file with test queries and expected results
  • --output <file> - Output file for tuning results
  • --top-k-range <min> <max> - Range of top-k values to test (default: 1 20)
  • --threshold-range <min> <max> <step> - Similarity threshold range
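The exact query-file JSON shape is not documented here, but the core of the tuning loop is a sweep over candidate top-k values scored against known-relevant documents. A minimal sketch (tune_top_k and its input shape are assumptions, not the script's actual format):

```python
def tune_top_k(results: dict, k_range=range(1, 21)) -> tuple[int, float]:
    """Return the top-k value maximizing mean recall over test queries.

    `results` maps query id -> (ranked doc ids, set of relevant doc ids).
    """
    best_k, best_recall = 0, -1.0
    for k in k_range:
        recalls = []
        for ranked, relevant in results.values():
            hits = len(set(ranked[:k]) & relevant)
            recalls.append(hits / len(relevant) if relevant else 1.0)
        mean_recall = sum(recalls) / len(recalls)
        if mean_recall > best_recall:  # strict '>' keeps the smallest k on ties
            best_k, best_recall = k, mean_recall
    return best_k, best_recall

results = {
    "q1": (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    "q2": (["d5", "d6", "d7"], {"d6"}),
}
best_k, recall = tune_top_k(results)  # best_k == 3: all relevant docs recovered
```

Sweeping a similarity threshold works the same way, trading recall against the number of low-score chunks passed on to the generator.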

vector_store_manager.py

Manages vector store operations: create, update, search, and maintain indexes.

Usage:

# Create index from chunks
python vector_store_manager.py --create --chunks ./chunks/ --index ./vector_store/ --embedding sentence-transformers/all-MiniLM-L6-v2

# Search index
python vector_store_manager.py --search --index ./vector_store/ --query "your search query" --top-k 5

Options:

  • --create - Create new index from chunks
  • --chunks <dir> - Directory with chunked text files
  • --index <dir> - Vector store directory
  • --embedding <model> - Embedding model to use
  • --search - Search existing index
  • --query <text> - Search query
  • --top-k <int> - Number of results to return (default: 5)
  • --update - Update index with new documents
  • --stats - Show index statistics
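The on-disk index format is internal to the script, but the search path reduces to normalized vectors and a cosine top-k. A minimal in-memory sketch with numpy (SimpleVectorStore is hypothetical; the real script persists to the --index directory):

```python
import numpy as np

class SimpleVectorStore:
    """Minimal in-memory store: normalized vectors plus cosine top-k search."""

    def __init__(self):
        self.ids: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, doc_id: str, vector) -> None:
        v = np.asarray(vector, dtype=np.float64)
        self.ids.append(doc_id)
        self.vectors.append(v / np.linalg.norm(v))  # normalize once, at insert

    def search(self, query, top_k: int = 5) -> list[tuple[str, float]]:
        q = np.asarray(query, dtype=np.float64)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q  # dot of unit vectors == cosine
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.ids[i], float(scores[i])) for i in order]

store = SimpleVectorStore()
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
store.add("c", [1.0, 1.0])
hits = store.search([1.0, 0.1], top_k=2)  # "a" ranks first
```

At the 50K-500K row scale this skill targets, a brute-force matrix product like this is often still fast enough; approximate indexes (e.g. FAISS) only become necessary at larger scales or tighter latency budgets.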

Pricing Strategy

Free tier (this skill): Core chunking + embedding benchmark tools
Paid guide ($49): Complete production RAG setup with:

  • Multi-modal document processing
  • Hybrid search (dense + sparse)
  • Re-ranking pipeline
  • Evaluation framework
  • Deployment scripts

Workflow

  1. Assess your data

    python chunking_analyzer.py --assess ./your_data/
    
  2. Apply chunking strategy

    python chunking_analyzer.py --strategy recursive --input ./data/ --output ./chunks/
    
  3. Benchmark embeddings

    python embedding_benchmark.py --data ./chunks/ --domain your_domain
    
  4. Create vector store

    python vector_store_manager.py --create --chunks ./chunks/ --index ./vector_store/ --embedding <recommended_model>
    
  5. Tune retrieval (optional)

    python retrieval_tuner.py --index ./vector_store/ --queries ./test_queries.json
    

Requirements

  • Python 3.8+
  • Dependencies: numpy, sentence-transformers (optional for real embeddings)

Files

  • chunking_analyzer.py - Document analysis and chunking
  • embedding_benchmark.py - Embedding model benchmarking
  • retrieval_tuner.py - Retrieval parameter optimization
  • vector_store_manager.py - Vector store operations
  • skill.json - Skill metadata
