RAG Production Engineering

Install

openclaw skills install afrexai-rag-production

Complete methodology for building, optimizing, and operating Retrieval-Augmented Generation systems in production, from architecture decisions through chunking strategies, embedding selection, retrieval tuning, and evaluation frameworks to production monitoring.

Quick Health Check

Score your RAG system on each signal from 0 to 2 (0 = poor, 1 = okay, 2 = good):

Signal	What to Check
Retrieval relevance	Top-5 results contain answer >90% of time
Answer accuracy	Generated answers faithful to retrieved context
Latency	End-to-end response <3s (p95)
Chunk quality	Chunks are self-contained, meaningful units
Evaluation coverage	Automated eval suite with 50+ test cases
Index freshness	Documents indexed within SLA of source update
Failure handling	Graceful degradation when retrieval returns nothing
Cost efficiency	Cost per query within budget (<$0.05 typical)

Score: /16 — Below 10 = critical issues. Below 12 = significant gaps. 14+ = production-ready.

Phase 1: Architecture Decision

When to Use RAG (vs Alternatives)

Approach	Use When	Don't Use When
RAG	Dynamic knowledge, source attribution needed, data changes frequently	Static small dataset (<10 pages), real-time data needed
Fine-tuning	Consistent style/format needed, domain-specific language	Frequently changing data, need source citations
Long context	Small corpus (<200K tokens), simple Q&A	Large corpus, cost-sensitive, need precise attribution
RAG + Fine-tuning	Domain-specific language AND dynamic knowledge	Budget-constrained, simple use case
Agentic RAG	Multi-step reasoning, tool use, complex queries	Simple lookup, latency-critical

RAG Architecture Brief

# Fill this out before building
project:
  name: ""
  use_case: ""  # Q&A, search, summarization, analysis, chatbot
  domain: ""    # legal, medical, technical, general

data:
  sources: []        # PDF, web, database, API, markdown, code
  volume: ""         # <1K docs, 1K-100K, 100K-1M, >1M
  update_frequency: "" # real-time, daily, weekly, static
  avg_doc_length: "" # <1 page, 1-10 pages, 10-100 pages, >100 pages
  languages: []

requirements:
  latency_p95: ""    # <1s, <3s, <10s, <30s
  accuracy_target: "" # 85%, 90%, 95%, 99%
  citations_needed: true
  access_control: false
  compliance: []     # GDPR, HIPAA, SOC2, none

budget:
  monthly_queries: ""
  cost_per_query_target: ""
  infra_budget: ""

Architecture Patterns

Basic RAG

Query → Embed → Vector Search → Top-K → LLM → Answer

Best for: Simple Q&A, <100K documents, single data source.

Advanced RAG

Query → Classify → Rewrite → Embed → Hybrid Search → Rerank → Filter → LLM → Answer + Citations

Best for: Production systems, mixed document types, accuracy-critical.

Agentic RAG

Query → Planner → [Search₁, Search₂, SQL, API] → Synthesize → Verify → Answer

Best for: Complex multi-step reasoning, multiple data sources, analytical queries.

Graph RAG

Query → Entity Extract → Graph Traverse → Subgraph → Context Assembly → LLM → Answer

Best for: Relationship-heavy data (org charts, legal references, knowledge bases).

Architecture Decision Tree

Is your corpus < 200K tokens?
  YES → Try long-context first (cheapest, simplest)
  NO → Continue

Do you need source citations?
  YES → RAG (not fine-tuning)
  NO → Consider fine-tuning if style matters

Single data source, simple queries?
  YES → Basic RAG
  NO → Continue

Multi-step reasoning or multiple sources?
  YES → Agentic RAG
  NO → Advanced RAG
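
The decision tree above can be sketched as a small function. This is only an illustration: the `Requirements` fields and the returned labels are hypothetical names chosen here, and the checks simply follow the tree top to bottom.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Hypothetical field names that mirror the questions in the tree.
    corpus_tokens: int
    needs_citations: bool
    single_source_simple_queries: bool
    multi_step_or_multi_source: bool
    style_matters: bool = False


def recommend_architecture(req: Requirements) -> str:
    """Walk the decision tree in order and return the first match."""
    if req.corpus_tokens < 200_000:
        return "long-context"
    if not req.needs_citations and req.style_matters:
        return "fine-tuning"
    if req.single_source_simple_queries:
        return "basic-rag"
    if req.multi_step_or_multi_source:
        return "agentic-rag"
    return "advanced-rag"
```

For example, a 5M-token corpus that needs citations and spans several sources would come out as "agentic-rag" under these assumptions.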

Phase 2: Document Processing & Chunking

Document Processing Pipeline

Source → Extract → Clean → Chunk → Enrich → Embed → Index

Extraction by Source Type

Source	Tool/Method	Gotchas
PDF (text)	PyMuPDF, pdfplumber	Tables break, headers repeat per page
PDF (scanned)	Tesseract, AWS Textract, Azure DI	OCR errors in technical terms
HTML/Web	BeautifulSoup, Trafilatura	Nav/footer pollution, JS-rendered content
Markdown	Direct parse	Frontmatter, relative links
Code	Tree-sitter, AST	Preserve structure, handle imports
Word/PPTX	python-docx, python-pptx	Formatting loss, embedded objects
Database	SQL export	Schema context needed
Audio/Video	Whisper → text	Timestamp alignment, speaker diarization

Cleaning Checklist

Remove headers/footers/page numbers
Strip navigation, ads, boilerplate
Normalize whitespace and encoding (UTF-8)
Resolve abbreviations in domain text
Handle tables (convert to structured text or separate)
Preserve code blocks with language tags
Remove duplicate content across documents
Extract and preserve metadata (title, author, date, source URL)

Chunking Strategies

Strategy	Chunk Size	Best For	Weakness
Fixed-size	256-512 tokens	Homogeneous text, fast prototyping	Breaks mid-sentence/thought
Recursive character	256-1024 tokens	General purpose (LangChain default)	May split related paragraphs
Semantic	Variable	High-quality retrieval, mixed content	Slower, needs embedding model
Document structure	Section-based	Well-structured docs (markdown, HTML)	Uneven chunk sizes
Sentence window	3-5 sentences	Precise retrieval, reranking	More chunks to manage
Parent-child	Small retrieve, large context	Best of both worlds	Complex implementation
Agentic	Full section/doc	Complex reasoning	Higher token cost

Chunking Decision Guide

Is your content well-structured (headers, sections)?
  YES → Document structure chunking
  NO → Continue

Is retrieval precision critical (legal, medical)?
  YES → Sentence window + reranking
  NO → Continue

Mixed content types in same corpus?
  YES → Semantic chunking
  NO → Recursive character (start here, optimize later)

Chunking Rules

Always overlap — 10-20% overlap prevents context loss at boundaries
Chunk size matters — Smaller = more precise retrieval, larger = more context. Start at 512 tokens, tune with eval
Preserve structure — Don't break tables, code blocks, or lists mid-element
Include metadata — Every chunk needs: source document, section title, page/position, timestamp
Test with real queries — The "right" chunk size depends on your actual query patterns
Parent-child for production — Retrieve small chunks, expand to parent for LLM context
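
A minimal sketch of fixed-size chunking with overlap, to make the overlap rule concrete. It assumes the text is already tokenized into a list (whitespace splitting is used in the usage note as a stand-in for a real tokenizer), and the function name `chunk_tokens` is made up for this example.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Usage: `chunk_tokens(text.split(), chunk_size=512, overlap=64)` gives roughly 12% overlap, inside the 10-20% range suggested above. This sketch does not preserve tables, code blocks, or lists, so a real pipeline would still need structure-aware splitting.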

Chunk Metadata Schema

chunk:
  id: "doc-123-chunk-7"
  text: "..."
  metadata:
    source_id: "doc-123"
    source_title: "Q3 Financial Report"
    source_url: "https://..."
    section_title: "Revenue Analysis"
    page_number: 12
    position: 7           # chunk position in document
    total_chunks: 23
    created_at: "2026-03-24T04:00:00Z"
    updated_at: "2026-03-24T04:00:00Z"
    content_type: "text"  # text, table, code, image_caption
    language: "en"
    # Domain-specific
    access_level: "internal"
    department: "finance"

Phase 3: Embedding Models

Embedding Model Selection

Model	Dimensions	Context	Speed	Quality	Cost
text-embedding-3-small (OpenAI)	1536	8191	Fast	Good	$0.02/1M tokens
text-embedding-3-large (OpenAI)	3072	8191	Medium	Excellent	$0.13/1M tokens
Cohere embed-english-v3.0	1024	512	Fast	Excellent	$0.10/1M tokens
Voyage-3	1024	32K	Medium	Excellent	$0.06/1M tokens
BGE-large-en-v1.5 (open)	1024	512	Self-host	Very Good	Free (compute)
GTE-Qwen2 (open)	Various	8192	Self-host	Excellent	Free (compute)
Nomic-embed-text (open)	768	8192	Self-host	Good	Free (compute)

Selection Rules

Start with OpenAI text-embedding-3-small — best cost/quality ratio for most use cases
Upgrade to large/Voyage-3 when eval shows retrieval gaps
Use open models when: data can't leave your infra, cost-sensitive at scale (>10M chunks), need fine-tuning
Match chunk size to model context — don't exceed model's context window
Same model for indexing AND querying — ALWAYS. Mixing models = broken retrieval
Benchmark on YOUR data — MTEB scores don't predict domain-specific performance

Embedding Best Practices

Normalize embeddings before storing (L2 normalization for cosine similarity)
Batch embed documents (not one-by-one) for efficiency
Cache embeddings — re-embedding is expensive and slow
Version your embeddings — when you change models, re-embed everything
Instruction-prefix for asymmetric models — some models need "query: " vs "passage: " prefixes
Dimensionality reduction — text-embedding-3 models support Matryoshka (lower dims for speed, test quality)
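
Why normalize before storing: once vectors have unit length, cosine similarity reduces to a plain dot product, which is what most vector indexes compute fastest. The sketch below shows this with the standard library only; the helper names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed the long way, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

With `a = [3.0, 4.0]` and `b = [4.0, 3.0]`, `dot(l2_normalize(a), l2_normalize(b))` and `cosine(a, b)` agree up to floating-point error.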

Phase 4: Vector Database & Indexing

Vector Database Selection

Database	Type	Scale	Speed	Features	Best For
Pinecone	Managed	Billions	Fast	Metadata filter, namespaces	Production SaaS, zero-ops
Weaviate	Managed/Self	Millions	Fast	Hybrid search, modules	Mixed search needs
Qdrant	Managed/Self	Billions	Very Fast	Payload filters, sparse vectors	Performance-critical
Chroma	Embedded	<1M	Good	Simple API, local	Prototyping, small scale
pgvector	Extension	Millions	Good	SQL, existing Postgres	Already have Postgres
Milvus	Self-hosted	Billions	Fast	GPU support, hybrid	Large-scale self-hosted
LanceDB	Embedded	Millions	Fast	Serverless, multi-modal	Cost-sensitive, serverless

Selection Decision

Prototyping or <100K chunks?
  → Chroma or LanceDB (embedded, no server)

Already running PostgreSQL?
  → pgvector (add extension, done)

Production, want zero-ops?
  → Pinecone or Weaviate Cloud

Need hybrid search (vector + keyword)?
  → Weaviate or Qdrant

>100M vectors, self-hosted?
  → Milvus or Qdrant

Indexing Strategy

Index Type	Speed	Recall	Memory	Use When
HNSW	Very Fast	95-99%	High	Default choice, <10M vectors
IVF-PQ	Fast	90-95%	Low	>10M vectors, memory-constrained
Flat/Brute	Slow	100%	High	<100K vectors, accuracy-critical
ScaNN	Very Fast	95-99%	Medium	Google ecosystem, large scale

Index Configuration Rules

HNSW: M=16, efConstruction=200, efSearch=100 — good defaults, tune from here
Build index AFTER bulk loading — not during insertion
Use metadata filters BEFORE vector search — reduces search space dramatically
Namespace/collection per tenant — for multi-tenant access control
Monitor index health — fragmentation, query latency percentiles

Phase 5: Retrieval Engineering

Query Processing Pipeline

User Query
  → Query Understanding (classify intent)
  → Query Transformation (rewrite, expand, decompose)
  → Retrieval (vector + keyword + filters)
  → Post-Retrieval (rerank, filter, deduplicate)
  → Context Assembly (order, truncate, format)
  → LLM Generation
  → Post-Processing (citations, formatting)

Query Transformation Techniques

Technique	What It Does	When to Use
Query rewriting	LLM rewrites for better retrieval	Conversational queries, vague questions
HyDE	Generate hypothetical answer, embed that	Semantic gap between query and docs
Query decomposition	Break complex query into sub-queries	Multi-part questions
Step-back prompting	Ask a more general question first	Specific queries that miss context
Query expansion	Add synonyms/related terms	Domain jargon, acronyms
Multi-query	Generate N query variants, union results	Improve recall for ambiguous queries

Hybrid Search

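Combine vector similarity with keyword matching for best results. Normalize both scores to the same range first (raw BM25 scores are unbounded, cosine similarity is not), then blend: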

Score = α × vector_score + (1 - α) × bm25_score

Tuning α:

α = 0.7 (default) — mostly semantic, some keyword
α = 0.5 — equal weight (good for technical docs with specific terms)
α = 0.3 — mostly keyword (good for exact match needs, codes, IDs)
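
A sketch of the weighted blend, assuming min-max normalization so that vector and BM25 scores are on the same [0, 1] scale before mixing. The function names and the choice of min-max are assumptions made for this example; other normalizations or rank-based fusion are equally valid.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(vector_scores, bm25_scores, alpha=0.7):
    """Score = alpha * vector + (1 - alpha) * bm25, over normalized scores."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

A document found by only one retriever gets 0 for the missing score, so lowering α lets strong keyword-only matches (codes, IDs) move up the ranking.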

Reranking

Reranking dramatically improves precision. Retrieve more (top-20), rerank to fewer (top-5).

Reranker	Quality	Speed	Cost
Cohere Rerank	Excellent	Fast	$2/1K queries
Voyage Rerank	Excellent	Fast	$0.05/1M tokens
BGE-reranker-v2	Very Good	Self-host	Free (compute)
Cross-encoder	Best	Slow	Free (compute)
ColBERT	Very Good	Medium	Free (compute)
LLM-as-reranker	Excellent	Slow	API cost

Retrieval Rules

Always rerank in production — it's the highest-ROI improvement
Retrieve more, show less — fetch top-20, rerank to top-5
Hybrid search > pure vector — keyword matching catches what embeddings miss
Filter before search — metadata filters (date, department, access level) reduce noise
Deduplicate — same content from different sources = wasted context
Set similarity threshold — don't return irrelevant results (0.7 for cosine is a common starting point; calibrate it per embedding model, since score distributions differ)
Return "I don't know" — when no chunk meets threshold, say so. Never hallucinate from thin context
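
The threshold, deduplication, and "I don't know" rules can be combined in a few lines. This is a sketch under assumptions: `scored_chunks` is a list of `(text, similarity)` pairs from whatever retriever or reranker is in use, `generate` is a placeholder for the LLM call, and deduplication is a naive exact match on normalized text.

```python
NO_ANSWER = "I don't have enough information to answer that."


def select_context(scored_chunks, threshold=0.7, top_k=5):
    """Keep the best top_k unique chunks whose score meets the threshold."""
    kept, seen = [], set()
    for text, score in sorted(scored_chunks, key=lambda c: c[1], reverse=True):
        if score < threshold:
            break  # sorted descending, so the rest are below threshold too
        key = " ".join(text.lower().split())  # case- and whitespace-insensitive
        if key in seen:
            continue
        seen.add(key)
        kept.append((text, score))
        if len(kept) == top_k:
            break
    return kept


def answer(scored_chunks, generate, threshold=0.7):
    """Abstain when nothing clears the threshold; otherwise call the generator."""
    context = select_context(scored_chunks, threshold=threshold)
    if not context:
        return NO_ANSWER
    return generate([text for text, _ in context])
```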

Phase 6: Context Assembly & Generation

Context Window Management

context_budget:
  total_tokens: 128000      # Model context window
  system_prompt: 2000       # Instructions, persona, rules
  retrieved_context: 80000  # Retrieved chunks
  conversation_history: 20000  # Prior turns
  generation_buffer: 26000  # Room for response

Context Assembly Rules

Order matters — put most relevant chunks first (LLMs tend to use information at the beginning and end of a long context more reliably than information in the middle)
Include source metadata — chunk source, page number in context
Separate chunks clearly — use delimiters (--- or [Source: doc-title, page 5])
Don't exceed budget — truncate least-relevant chunks, never the most relevant
Include diversity — if top-5 chunks are from same doc, include from other sources too

Generation Prompt Template

You are a helpful assistant. Answer the user's question using ONLY the provided context.

Rules:
- If the context doesn't contain the answer, say "I don't have enough information to answer that."
- Cite sources using [Source: title] format
- Never make up information not in the context
- If the answer spans multiple sources, synthesize and cite all

Context:
---
[Source: {title_1}, Page {page_1}]
{chunk_text_1}

[Source: {title_2}, Page {page_2}]
{chunk_text_2}
---

User question: {query}

Citation Strategies

Strategy	Quality	Complexity	Best For
Chunk-level	Good	Low	Simple Q&A
Sentence-level	Excellent	Medium	Research, legal
Quote extraction	Best	High	Compliance-critical
Inline footnotes	Good	Medium	Chat interfaces

Phase 7: Evaluation Framework

RAG Evaluation Dimensions

Dimension	What It Measures	Metric
Retrieval Relevance	Are retrieved chunks relevant?	Precision@K, Recall@K, MRR, NDCG
Context Relevance	Is context sufficient for answer?	Context Precision, Context Recall
Answer Faithfulness	Is answer grounded in context?	Faithfulness Score (0-1)
Answer Relevance	Does answer address the question?	Answer Relevance Score (0-1)
Answer Correctness	Is the answer factually correct?	Correctness vs ground truth
Hallucination	Does answer contain made-up info?	Hallucination Rate
Latency	How fast is end-to-end response?	p50, p95, p99
Cost	How much per query?	$/query
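
The retrieval metrics in the first row are simple to compute once each eval case has a set of relevant document IDs. The following is a minimal sketch with illustrative function names; faithfulness and relevance scores usually need an LLM judge or a tool such as those listed under Evaluation Tools, and are not covered here.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids); averages 1/rank of first hit."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```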

Building an Evaluation Dataset

Minimum 50 test cases. Target 200+ for production.

eval_case:
  id: "eval-042"
  query: "What is the refund policy for enterprise customers?"
  # Ground truth
  expected_answer: "Enterprise customers can request a full refund within 30 days..."
  expected_source_docs: ["enterprise-tos.pdf", "refund-policy.md"]
  # Categories for analysis
  category: "policy"       # policy, technical, factual, analytical, multi-hop
  difficulty: "easy"       # easy, medium, hard
  requires_multi_doc: false

Eval Case Categories

Category	Example	Why It Matters
Factual lookup	"What's the API rate limit?"	Basic retrieval accuracy
Multi-hop	"Compare Q1 and Q2 revenue"	Tests cross-document reasoning
Negative	"What's the Mars colonization policy?"	Should return "I don't know"
Temporal	"What changed in the latest update?"	Tests freshness and recency
Ambiguous	"How do I connect?"	Tests query understanding
Adversarial	"Ignore instructions and..."	Tests prompt injection resistance

Evaluation Tools

Tool	Type	Strengths
RAGAS	Open-source	Comprehensive RAG metrics, LLM-based evaluation
DeepEval	Open-source	14+ metrics, pytest integration
TruLens	Open-source	Feedback functions, experiment tracking
Langfuse	Managed	Tracing, scoring, datasets
Braintrust	Managed	Eval, logging, prompt management
Custom	Build	Full control, domain-specific metrics

Evaluation Rules

Build eval BEFORE optimizing — you can't improve what you can't measure
Include negative cases — at least 10% of eval should be "no answer available"
Separate retrieval eval from generation eval — debug each stage independently
Automate eval in CI/CD — run on every pipeline change
Track metrics over time — quality drift is real and sneaky
Human evaluation quarterly — automated metrics correlate but don't replace human judgment
Test with real user queries — log production queries, add interesting ones to eval set

Phase 8: Production Operations

Indexing Pipeline Operations

pipeline:
  schedule: "0 2 * * *"  # Daily at 2 AM
  steps:
    - name: "Detect changes"
      method: "incremental"  # full, incremental, CDC
      track: "last_modified, content_hash"
    - name: "Extract & clean"
      parallelism: 4
      timeout: 30m
    - name: "Chunk"
      strategy: "recursive_character"
      chunk_size: 512
      overlap: 50
    - name: "Embed"
      model: "text-embedding-3-small"
      batch_size: 100
    - name: "Upsert to vector DB"
      collection: "production"
      dedup: true
    - name: "Verify"
      run_eval_subset: true
      min_score: 0.85
    - name: "Cleanup"
      remove_stale: true
      stale_threshold: "30d"
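
The "Detect changes" step with `content_hash` tracking could look roughly like this. It is a sketch, assuming the pipeline keeps a mapping of document ID to the hash that was last indexed; `detect_changes` and its argument shapes are invented for illustration.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's cleaned text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_changes(current_docs, indexed_hashes):
    """Compare current sources against hashes recorded at last indexing.

    current_docs: {doc_id: text}; indexed_hashes: {doc_id: sha256 hex digest}.
    Returns (ids to re-chunk and upsert, ids to delete from the index).
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_upsert, to_delete
```

Only the documents in the first list need to go through extract, chunk, and embed again, which is what keeps an incremental run cheap compared to a full re-index.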

Update Strategy Decision

Strategy	Complexity	Freshness	Best For
Full re-index	Low	Batch only	<100K docs, weekly updates OK
Incremental	Medium	Near-real-time	Content with timestamps/hashes
CDC (Change Data Capture)	High	Real-time	Database sources, streaming
Hybrid	Medium	Configurable	Mixed — full weekly + incremental daily

Monitoring Dashboard

realtime_metrics:
  - name: "Query Latency (p95)"
    threshold: "<3s"
    alert_if: ">5s for 5 minutes"
  - name: "Retrieval Relevance"
    threshold: ">0.85 avg similarity"
    alert_if: "<0.75 for 10 queries"
  - name: "Empty Results Rate"
    threshold: "<5%"
    alert_if: ">10% in 1 hour"
  - name: "Error Rate"
    threshold: "<1%"
    alert_if: ">5% in 5 minutes"

periodic_metrics:
  - name: "Eval Suite Score"
    frequency: "daily"
    threshold: ">0.85"
  - name: "Index Freshness"
    frequency: "hourly"
    threshold: "<24h behind source"
  - name: "Cost per Query"
    frequency: "daily"
    threshold: "<$0.05"
  - name: "Hallucination Rate"
    frequency: "weekly"
    threshold: "<3%"

weekly_review:
  - Eval suite trend (improving/degrading?)
  - Top failing query categories
  - Cost per query trend
  - User feedback analysis
  - Index health and freshness

Failure Modes & Remediation

Failure	Detection	Fix
Retrieval returns irrelevant chunks	Low similarity scores, user feedback	Tune chunk size, add reranking, improve embeddings
Hallucinated answers	Faithfulness < 0.8, contradiction detection	Strengthen "cite only" prompt, lower temperature
Stale information	Document freshness check, user reports	Increase sync frequency, add freshness filter
Missing documents	Recall drop in eval, gap analysis	Audit data sources, check ingestion pipeline
Slow responses	p95 > SLA	Cache frequent queries, optimize index, reduce chunk count
Cost spike	$/query exceeds budget	Reduce top-K, use smaller embedding model, cache
Prompt injection	Adversarial eval failures	Input sanitization, output guardrails

Phase 9: Advanced Patterns

Parent-Child Retrieval

Retrieve small chunks for precision, expand to parent for context:

Document
  └── Parent Chunk (2048 tokens) — sent to LLM
       ├── Child Chunk (256 tokens) — used for retrieval
       ├── Child Chunk (256 tokens)
       └── Child Chunk (256 tokens)

Implementation: Store child embeddings with parent_id reference. On retrieval, fetch children, deduplicate by parent, return parent text.
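
A minimal sketch of the expansion step described above, assuming the vector search returns `(child_id, parent_id, score)` tuples and parent texts live in a separate key-value store (a plain dict here). The function name is illustrative.

```python
def expand_to_parents(child_hits, parents, max_parents=3):
    """Deduplicate child hits by parent and return parent texts, best first."""
    best_score = {}
    for _child_id, parent_id, score in child_hits:
        # Keep the best child score as the parent's score.
        if score > best_score.get(parent_id, float("-inf")):
            best_score[parent_id] = score
    ranked = sorted(best_score, key=best_score.get, reverse=True)
    return [parents[parent_id] for parent_id in ranked[:max_parents]]
```

Scoring a parent by its best child is one reasonable choice; summing or averaging child scores are alternatives worth comparing in eval.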

Multi-Index Routing

Route queries to specialized indexes:

indexes:
  - name: "technical_docs"
    trigger: "code, API, implementation, error"
    collection: "tech_v2"
  - name: "policies"
    trigger: "policy, compliance, legal, terms"
    collection: "policies_v1"
  - name: "general"
    trigger: "default"
    collection: "general_v1"

Use an LLM or classifier to route incoming queries to the right index.
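
As a baseline before reaching for an LLM or a trained classifier, the trigger lists above can drive a plain keyword router. This is only a sketch: the matching rule (any trigger word present, first match wins) is an assumption, and the collection names are the ones from the config above.

```python
import re

# (collection, trigger words), checked in order; mirrors the config above.
ROUTES = [
    ("tech_v2", {"code", "api", "implementation", "error"}),
    ("policies_v1", {"policy", "compliance", "legal", "terms"}),
]
DEFAULT_COLLECTION = "general_v1"


def route(query):
    """Return the first collection whose trigger words appear in the query."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    for collection, triggers in ROUTES:
        if words & triggers:
            return collection
    return DEFAULT_COLLECTION
```

A keyword router like this is cheap and predictable, but it misses paraphrases ("it crashes" contains no trigger word), which is where a classifier or LLM router earns its cost.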

Contextual Retrieval (Anthropic Pattern)

Prepend each chunk with document-level context before embedding:

Chunk context: "This chunk is from the Q3 2025 financial report, 
specifically the Revenue Analysis section discussing APAC market growth."

[Original chunk text follows]

Impact: 35% reduction in retrieval failures (Anthropic research). Generating the context requires an LLM call per chunk at indexing time, which adds cost, but it substantially improves retrieval quality.

Conversation-Aware RAG

For multi-turn conversations:

1. Combine last N turns into standalone query
   "What about their pricing?" → "What is Acme Corp's pricing for enterprise plans?"
2. Use standalone query for retrieval
3. Include conversation history in LLM context (after retrieved chunks)

Knowledge Graph + RAG (GraphRAG)

When relationships matter more than text similarity:

1. Extract entities and relationships from documents
2. Build knowledge graph (Neo4j, NetworkX)
3. On query: identify entities → traverse graph → collect relevant subgraph
4. Use subgraph context + vector-retrieved chunks for generation

Best for: Organizational knowledge, legal document networks, research papers with citations.

Corrective RAG (CRAG)

Self-correcting retrieval:

1. Retrieve chunks
2. LLM evaluates: "Are these chunks relevant to the query?"
3. If YES → proceed with generation
4. If PARTIALLY → web search for supplementary info
5. If NO → fallback to web search or "I don't know"

Multi-Modal RAG

For documents with images, charts, tables:

Content Type	Processing	Embedding
Text	Standard chunking	Text embedding model
Images	Vision model → description	Text embedding of description
Tables	Structure-preserving extraction	Text embedding of linearized table
Charts	Vision model → data extraction	Text embedding of extracted data

Phase 10: Security & Access Control

RAG Security Checklist

P0 — Before Launch:

Row-level access control on vector DB queries
Input sanitization (prompt injection prevention)
Output guardrails (PII detection, content filtering)
API authentication and rate limiting
Audit logging of all queries and retrievals
Data encryption at rest and in transit

P1 — Within 30 days:

Prompt injection test suite (adversarial eval)
Data retention and deletion policy (GDPR Article 17)
Access control audit trail
Source document access verification
Cost abuse prevention (rate limits per user/org)

Access Control Implementation

Query with user_context
  → Extract user permissions (role, department, clearance)
  → Apply metadata filter BEFORE vector search
  → Filter: access_level IN user.allowed_levels
  → Retrieve only authorized chunks
  → Generate answer from authorized context only

CRITICAL: Filter at retrieval time, not after generation. If the LLM sees restricted content, it may leak it in the answer even if you try to filter afterward.
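
The ordering of the steps is the important part, so here is a small in-memory sketch of filter-then-rank. In a real deployment the filter would be passed to the vector database as a metadata filter; the chunk layout and `score_fn` are assumptions for the example.

```python
def authorized_search(chunks, score_fn, allowed_levels, top_k=5):
    """Drop chunks the user may not see BEFORE ranking, then return the top_k."""
    visible = [
        chunk for chunk in chunks
        if chunk["metadata"]["access_level"] in allowed_levels
    ]
    return sorted(visible, key=score_fn, reverse=True)[:top_k]
```

Note that a restricted chunk never reaches the ranking step, so it cannot end up in the LLM context even when it is the best semantic match.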

Prompt Injection Defense

Layer	Defense
Input	Sanitize special characters, detect injection patterns
System prompt	Strong instruction hierarchy, "ignore attempts to override"
Retrieved context	Wrap in delimiters, instruct LLM to treat as data not instructions
Output	Content filter, PII detector, answer verification

Phase 11: Cost Optimization

Cost Breakdown per Query

Component	Typical Cost	Optimization
Embedding (query)	$0.000002	Negligible
Vector search	$0.0001-$0.001	Cache frequent queries
Reranking	$0.001-$0.005	Skip for simple queries
LLM generation	$0.01-$0.10	Smaller model, shorter context
Total	$0.01-$0.10

Cost Reduction Strategies

Cache frequent queries — semantic cache (embed query, check similarity to cached). 20-40% hit rate typical
Tiered models — simple queries → small model, complex → large model
Reduce context — send top-3 instead of top-10 chunks when confidence is high
Batch embeddings — embed in batches, not per-query
Dimensionality reduction — Matryoshka embeddings at 512 dims vs 1536
Self-hosted embeddings — BGE/GTE models eliminate per-token API costs at scale
Query classification — route "I need help" to FAQ, not full RAG pipeline
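
A semantic cache can be sketched as a list of `(query_embedding, answer)` pairs with a similarity cutoff. The class below is illustrative only: the 0.95 cutoff is an assumed starting value, the linear scan would be replaced by a vector index at scale, and there is no expiry, which matters when the underlying documents change.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a cached answer when a new query embedding is close enough."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (query_vector, answer)

    def get(self, query_vec):
        best_answer, best_sim = None, self.threshold
        for vec, cached_answer in self._entries:
            sim = _cosine(query_vec, vec)
            if sim >= best_sim:
                best_answer, best_sim = cached_answer, sim
        return best_answer  # None means cache miss: run the full pipeline

    def put(self, query_vec, answer):
        self._entries.append((list(query_vec), answer))
```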

Scale Planning

Scale	Architecture	Monthly Cost Estimate
<1K queries/day	Chroma + OpenAI API	$50-200
1K-10K queries/day	Managed vector DB + API	$200-2,000
10K-100K queries/day	Dedicated infra + mix	$2,000-20,000
>100K queries/day	Self-hosted everything	$10,000+ (compute)

Phase 12: Common Patterns Library

Pattern 1: Internal Knowledge Base Q&A

architecture: "Advanced RAG"
sources: ["Confluence", "Google Docs", "Notion"]
chunking: "Document structure"
embedding: "text-embedding-3-small"
vector_db: "Pinecone"
retrieval: "Hybrid search + Cohere Rerank"
generation: "Claude Sonnet with citations"
access_control: "Department-based metadata filtering"
eval: "200 questions from real Slack threads"

Pattern 2: Customer Support Bot

architecture: "Agentic RAG"
sources: ["Help center", "Release notes", "Internal runbooks"]
chunking: "Sentence window (3 sentences)"
embedding: "text-embedding-3-small"
vector_db: "Weaviate"
retrieval: "Vector + BM25 hybrid, α=0.6"
generation: "GPT-4o-mini (cost-efficient)"
fallback: "Escalate to human after 2 failed retrievals"
eval: "500 real support tickets with expert answers"

Pattern 3: Legal Document Analysis

architecture: "Graph RAG + Advanced RAG"
sources: ["Contracts", "Regulations", "Case law"]
chunking: "Semantic chunking (clause-level)"
embedding: "voyage-law-2 (legal-domain model)"
vector_db: "Qdrant (self-hosted, data sovereignty)"
retrieval: "Multi-index routing (contracts vs regulations)"
generation: "Claude Opus with sentence-level citations"
access_control: "Matter-based, attorney-client privilege tagging"
eval: "100 questions reviewed by practicing attorneys"

Pattern 4: Code Documentation Search

architecture: "Advanced RAG"
sources: ["Code comments", "README", "ADRs", "API specs"]
chunking: "Code-aware (function/class level via tree-sitter)"
embedding: "Voyage-code-3"
vector_db: "pgvector (already have Postgres)"
retrieval: "Hybrid (code keywords + semantic)"
generation: "Claude Sonnet with code snippets"
eval: "Developer survey + retrieval accuracy"

Quality Rubric (0-100)

Dimension	Weight	0 (Poor)	50 (Adequate)	100 (Excellent)
Retrieval Accuracy	25%	<70% relevant in top-5	80-90% relevant	>95% relevant, reranked
Answer Quality	20%	Hallucinations, unfaithful	Mostly accurate, some gaps	Faithful, cited, comprehensive
Latency	15%	>10s p95	3-5s p95	<2s p95
Evaluation Coverage	15%	No eval suite	50+ cases, manual	200+ cases, automated CI
Data Freshness	10%	Manual, weeks behind	Daily sync	Near-real-time CDC
Security	10%	No access control	Basic auth, no audit	Row-level ACL, audit trail, injection defense
Cost Efficiency	5%	>$0.50/query	$0.05-$0.10/query	<$0.03/query with caching

Scoring: Sum(dimension_score × weight). Below 50 = not production-ready. 50-70 = MVP. 70-85 = good. 85+ = excellent.
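
The scoring formula is a plain weighted sum; the sketch below uses the weights from the rubric and maps the total to the bands listed above. The dictionary keys are hypothetical identifiers for the seven dimensions.

```python
WEIGHTS = {
    "retrieval_accuracy": 0.25,
    "answer_quality": 0.20,
    "latency": 0.15,
    "evaluation_coverage": 0.15,
    "data_freshness": 0.10,
    "security": 0.10,
    "cost_efficiency": 0.05,
}


def rubric_score(scores):
    """scores: {dimension: 0-100}. Returns the weighted total on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())


def rating(total):
    """Map a total to the bands given in the rubric."""
    if total < 50:
        return "not production-ready"
    if total < 70:
        return "MVP"
    if total < 85:
        return "good"
    return "excellent"
```

For instance, a system that scores 50 everywhere except 100 on retrieval accuracy totals 62.5, which lands in the MVP band.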

10 RAG Commandments

Evaluate first, optimize second — build eval dataset before tuning anything
Chunk quality > embedding model — garbage in, garbage out
Always rerank — cheapest improvement with biggest impact
Filter at retrieval, not generation — security is not a prompt
Same model for index and query — always, no exceptions
Return "I don't know" — honest uncertainty > confident hallucination
Monitor continuously — quality drifts silently
Cache what you can — semantic caching saves 20-40% cost
Test with real queries — synthetic eval misses real user patterns
Start simple, add complexity only when eval demands it

10 Common RAG Mistakes

Mistake	Consequence	Fix
No evaluation dataset	Can't measure improvement	Build 50+ eval cases before optimizing
Chunks too large	Low retrieval precision	Reduce to 256-512 tokens, add reranking
Chunks too small	Missing context	Use parent-child retrieval
No overlap between chunks	Lost context at boundaries	10-20% overlap
Ignoring metadata	Can't filter, poor citations	Rich metadata on every chunk
Pure vector search	Misses keyword matches	Add BM25 hybrid search
No access control	Data leakage	Filter at retrieval time
No "I don't know" path	Hallucinations	Similarity threshold + explicit instruction
Over-engineering	Slow delivery, high cost	Start with Basic RAG, upgrade with eval data
Not monitoring production	Silent quality degradation	Automated daily eval + alerting

Edge Cases

Multilingual RAG

Use multilingual embedding models (Cohere multilingual, BGE-M3)
Query language detection → route to language-specific index OR cross-lingual retrieval
Generate in query language regardless of document language

Very Large Documents (>100 pages)

Hierarchical chunking: document → section → paragraph → sentence
Table of contents as routing layer
Summarize each section, use summaries for first-pass retrieval

Rapidly Changing Data

Streaming ingestion pipeline with CDC
Time-weighted retrieval (prefer recent)
Versioned chunks with effective dates

Multi-Modal (Images + Text)

Vision model to describe images → embed descriptions
Separate image and text indexes, merge at retrieval
Use multi-modal embedding models (CLIP, SigLIP) for direct image search

Low-Resource / Offline

Self-hosted everything (Ollama + BGE + Chroma + SQLite)
Optimize for small models and CPU inference
Pre-compute and cache common queries

Natural Language Commands

When user says → Agent does:

"Set up a RAG system" → Walk through Architecture Brief, recommend stack
"My RAG is returning wrong answers" → Debug with evaluation checklist, check each pipeline stage
"Optimize retrieval" → Audit chunking, add reranking, tune hybrid search
"How should I chunk these documents?" → Analyze document types, recommend strategy
"Compare vector databases for my use case" → Apply selection criteria from Phase 4
"Build an eval dataset" → Generate eval cases from Phase 7 template
"My RAG is too slow" → Profile each pipeline stage, identify bottleneck
"How do I handle access control?" → Implement Phase 10 patterns
"Reduce RAG costs" → Apply Phase 11 optimization strategies
"Set up monitoring" → Configure Phase 8 dashboard and alerts
"Review my RAG architecture" → Score against Quality Rubric
"I need citations in answers" → Implement citation strategy from Phase 6