RAG Pipeline Starter

v1.0.0

Set up and optimize RAG pipelines for large datasets (50K-500K rows) with document chunking, embedding benchmarking, vector indexing, and retrieval tuning.

by @abhinas90

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for abhinas90/rag-pipeline-starter.

Install the skill "RAG Pipeline Starter" (abhinas90/rag-pipeline-starter) from ClawHub.
Skill page: https://clawhub.ai/abhinas90/rag-pipeline-starter
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install rag-pipeline-starter

ClawHub CLI

npx clawhub@latest install rag-pipeline-starter

Security Scan

VirusTotal

Benign

OpenClaw

Benign

high confidence

✓

Purpose & Capability

Name/description match what the code provides: chunking analyzer, embedding benchmark, retrieval tuner, and a simple vector store manager. Required resources (none) and included scripts are proportionate to the stated purpose.

✓

Instruction Scope

SKILL.md instructs running the included Python scripts on local data and creating local indexes. The runtime instructions only reference local files/directories and the included scripts; they do not instruct the agent to read unrelated system files, access credentials, or send data to external endpoints.

✓

Install Mechanism

No install spec is provided; the skill is instruction+code only. This minimizes installation risk, because nothing is downloaded or written by an installer. The only runtime requirements are Python 3.8+ and common Python packages (numpy, and optionally sentence-transformers).

✓

Credentials

The skill requests no environment variables or credentials. The code reads and writes local files (chunks, indexes) which is expected for its purpose. There are no references to network endpoints, cloud credentials, or unrelated secrets.

✓

Persistence & Privilege

The skill does not set always: true and does not modify other skills or system-wide agent configuration. When run, it persists data only to its own index directories and files, which is expected behavior for a local vector store manager.

Assessment

What to consider before installing or running:
- The package is instruction+code only and runs entirely on local files. There are no network calls or credential requests in the code, which reduces exfiltration risk.
- The scripts create and modify files under the directories you pass as --output, --index, or --chunks. Run them in a controlled workspace or sandbox if you are testing, and avoid pointing them at sensitive system directories.
- The embedding benchmark is mostly a mock/demo implementation. There is a small bug (a function name mismatch: compute_similarity__mock vs. compute_similarity_mock) that may cause runtime errors, so expect to fix the code before production use. The recommendation logic also picks a strategy from the first analyzed document rather than aggregating across all documents; review this if you need different behavior.
- If you plan to plug in real (paid) embedding providers, you will need to manage API keys yourself; this skill does not request or manage credentials. Keep keys out of plain text and use secure storage.
- Best practice: inspect the files locally (you already have them), run on a small sample dataset first, and run under a restricted environment (container or VM) if you are unsure.

Given the available materials, the skill appears internally consistent and implements the features it claims; no indicators of data exfiltration or unrelated privileges were found.

latestvk972tqh6as4ygbsecc8xnk9esd852cjm

64 downloads

0 stars

1 version

Updated 1w ago

MIT-0

RAG Pipeline Starter

Production-grade RAG pipeline setup with chunking strategies, embedding benchmarks, and retrieval tuning for 50K-500K row datasets.

Overview

This skill provides a toolkit for building and optimizing RAG (Retrieval-Augmented Generation) pipelines. It analyzes your data, recommends a chunking strategy, benchmarks embedding models, and helps tune retrieval parameters (top-k and similarity threshold) to improve retrieval accuracy.

When to Use

Building a new RAG system from scratch
Optimizing an existing RAG pipeline's retrieval quality
Choosing the right embedding model for your domain
Processing large document collections (50K-500K rows)
Balancing speed against accuracy for your use case

Scripts

chunking_analyzer.py

Analyzes documents and recommends optimal chunking strategies based on content structure.

Usage:

# Assess data and get strategy recommendation
python chunking_analyzer.py --assess ./data

# Apply chunking strategy to documents
python chunking_analyzer.py --strategy recursive --input ./data/doc.txt --output ./chunks/ --chunk-size 500 --overlap 50

Options:

--assess <dir> - Analyze documents and recommend strategy
--strategy <name> - Chunking strategy: fixed, semantic, recursive, hierarchical
--input <path> - Input file or directory
--output <dir> - Output directory for chunks
--chunk-size <int> - Chunk size (default: 500)
--overlap <int> - Overlap between chunks (default: 50)
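
To make the --chunk-size and --overlap options concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not taken from chunking_analyzer.py: the function name chunk_fixed is hypothetical, and it assumes sizes are measured in characters, which the skill's documentation does not specify.

```python
from typing import List


def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size characters.

    Each window repeats the last `overlap` characters of the previous one.
    Assumption: sizes are in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, each window advances by 450 characters, so a 1,200-character document yields three chunks starting at offsets 0, 450, and 900.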

embedding_benchmark.py

Tests multiple embedding models on your data to find the best fit for your domain.

Usage:

python embedding_benchmark.py --data ./chunks/ --domain finance --output results.json

Options:

--embeddings <models> - Embedding models to test (space-separated)
--data <dir> - Directory with chunked text files (required)
--domain <name> - Domain name for context-specific recommendations
--output <file> - Output file for results (JSON)

Supported Embeddings:

sentence-transformers/all-MiniLM-L6-v2 (384 dims, fast, free)
sentence-transformers/all-mpnet-base-v2 (768 dims, medium, free)
openai/text-embedding-ada-002 (1536 dims, fast, paid)
cohere/embed-english-v3.0 (1024 dims, fast, paid)
bm25 (sparse, fast, free)
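
The bm25 entry differs from the others in the list: it is a sparse lexical scorer, not a dense embedding model. The sketch below shows standard Okapi BM25 scoring so the contrast is clear. It is an illustration under assumptions, not the skill's code: the function name bm25_scores, the whitespace tokenizer, and the k1 and b defaults are all choices made here.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25.

    Assumptions: lowercase whitespace tokenization, k1=1.5, b=0.75.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency, saturated by k1 and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because the score depends only on term overlap, BM25 needs no model download or API key, which is consistent with its "fast, free" label above.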

retrieval_tuner.py

Optimizes retrieval parameters (top-k, similarity threshold) for your specific use case.

Usage:

python retrieval_tuner.py --index ./vector_store/ --queries ./test_queries.json --output tuning_results.json

Options:

--index <dir> - Vector store index directory
--queries <file> - JSON file with test queries and expected results
--output <file> - Output file for tuning results
--top-k-range <min> <max> - Range of top-k values to test (default: 1 20)
--threshold-range <min> <max> <step> - Similarity threshold range
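
One way to picture what a top-k sweep measures is mean recall@k over a set of test queries. The sketch below is an assumption about the general technique, not a description of retrieval_tuner.py: the function names and the "ranked"/"expected" record layout are hypothetical, and the real schema of the queries file is not documented here.

```python
from typing import Dict, List, Optional, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], expected: Set[str], k: int) -> float:
    """Fraction of the expected IDs that appear in the first k results."""
    if not expected:
        return 0.0
    return len(set(ranked_ids[:k]) & expected) / len(expected)


def sweep_top_k(runs: List[dict], k_min: int = 1, k_max: int = 20) -> Dict[int, float]:
    """Mean recall@k across all runs for each k in [k_min, k_max].

    Each run is a hypothetical record: {"ranked": [...], "expected": [...]}.
    """
    results = {}
    for k in range(k_min, k_max + 1):
        total = sum(recall_at_k(r["ranked"], set(r["expected"]), k) for r in runs)
        results[k] = total / len(runs)
    return results


def smallest_k_reaching(results: Dict[int, float], target: float) -> Optional[int]:
    """Smallest k whose mean recall meets the target, or None if none does."""
    for k in sorted(results):
        if results[k] >= target:
            return k
    return None
```

Picking the smallest k that reaches a target recall keeps the context passed to the generator short, which is one way to trade speed against accuracy. A threshold sweep can follow the same pattern by filtering results on score before computing recall.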

vector_store_manager.py

Manages vector store operations: create, update, search, and maintain indexes.

Usage:

# Create index from chunks
python vector_store_manager.py --create --chunks ./chunks/ --index ./vector_store/ --embedding sentence-transformers/all-MiniLM-L6-v2

# Search index
python vector_store_manager.py --search --index ./vector_store/ --query "your search query" --top-k 5

Options:

--create - Create new index from chunks
--chunks <dir> - Directory with chunked text files
--index <dir> - Vector store directory
--embedding <model> - Embedding model to use
--search - Search existing index
--query <text> - Search query
--top-k <int> - Number of results to return (default: 5)
--update - Update index with new documents
--stats - Show index statistics
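
The --search, --query, and --top-k options correspond to a nearest-neighbor lookup. A minimal sketch of that idea, assuming cosine similarity over an in-memory list of (id, vector) pairs, looks like this. The names cosine and search, the threshold parameter, and the index layout are hypothetical; the on-disk format used by vector_store_manager.py is not documented here.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    top_k: int = 5,
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return up to top_k (id, score) pairs with score >= threshold, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    hits = [hit for hit in scored if hit[1] >= threshold]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:top_k]
```

This brute-force scan is linear in the number of stored vectors. It is enough to illustrate the interface, but at the upper end of the 50K-500K row range an approximate nearest-neighbor index is the usual choice.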

Pricing Strategy

Free tier (this skill): Core chunking + embedding benchmark tools
Paid guide ($49): Complete production RAG setup with:

Multi-modal document processing
Hybrid search (dense + sparse)
Re-ranking pipeline
Evaluation framework
Deployment scripts

Workflow

Assess your data

python chunking_analyzer.py --assess ./your_data/

Apply chunking strategy

python chunking_analyzer.py --strategy recursive --input ./data/ --output ./chunks/

Benchmark embeddings

python embedding_benchmark.py --data ./chunks/ --domain your_domain

Create vector store

python vector_store_manager.py --create --chunks ./chunks/ --index ./vector_store/ --embedding <recommended_model>

Tune retrieval (optional)

python retrieval_tuner.py --index ./vector_store/ --queries ./test_queries.json
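
The first four workflow steps can also be collected into a small driver. The sketch below only builds the command lines, using the flags documented above; the helper name build_pipeline_commands is hypothetical, and it hard-codes the recursive strategy, whereas in practice you would read the recommendation from the assess step before choosing.

```python
import sys
from typing import List


def build_pipeline_commands(
    data_dir: str,
    chunks_dir: str,
    index_dir: str,
    model: str,
    domain: str,
) -> List[List[str]]:
    """Build argv lists for the assess, chunk, benchmark, and index steps.

    Assumption: the skill's scripts sit in the current working directory.
    """
    py = sys.executable  # run the scripts with the current interpreter
    return [
        [py, "chunking_analyzer.py", "--assess", data_dir],
        [py, "chunking_analyzer.py", "--strategy", "recursive",
         "--input", data_dir, "--output", chunks_dir],
        [py, "embedding_benchmark.py", "--data", chunks_dir, "--domain", domain],
        [py, "vector_store_manager.py", "--create", "--chunks", chunks_dir,
         "--index", index_dir, "--embedding", model],
    ]
```

Each list can be passed to subprocess.run(cmd, check=True) so that a failing step stops the run. Try it on a small sample directory first, since the scripts write to the output and index directories you pass in.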

Requirements

Python 3.8+
Dependencies: numpy; sentence-transformers (optional, needed only for real embeddings)

Files

chunking_analyzer.py - Document analysis and chunking
embedding_benchmark.py - Embedding model benchmarking
retrieval_tuner.py - Retrieval parameter optimization
vector_store_manager.py - Vector store operations
skill.json - Skill metadata
