Install
openclaw skills install qdrant-ingestion-best-practices
This skill package provides comprehensive, production-grade guidance for building RAG (Retrieval-Augmented Generation) pipelines using Qdrant as the vector store. It covers everything from data ingestion and chunking to hybrid retrieval, metadata standards, and access control patterns.
All detailed guidance lives in the guides/ subfolder. Always read the relevant guide(s) before writing code or designing a pipeline. Use the Quick Decision Guide below to determine which guides to load.
| Guide | Path | When to Read |
|---|---|---|
| RAG Pipeline Overview | guides/01-rag-pipeline-overview.md | Start here. Architecture, decisions, model selection. |
| Metadata Schema Standards | guides/02-metadata-schema.md | Designing chunk payloads and payload index strategy. |
| Data Classification & Collections | guides/03-data-classification.md | Multi-collection design, sensitivity tiers, tenancy. |
| Source Normalization | guides/04-source-normalization.md | Pre-processing rules by source type before chunking. |
| Chunking Standards | guides/05-chunking-standards.md | Chunk size, overlap, strategy by content type. |
| Embedding Models | guides/06-embedding-models.md | Dense vs hybrid model selection and configuration. |
| Ingestion Pipeline | guides/07-ingestion-pipeline.md | Full pipeline steps, idempotency, upsert patterns. |
| Retrieval Architecture | guides/08-retrieval-architecture.md | Hybrid search, RRF, reranking, filter application. |
| Access Control Patterns | guides/09-access-control.md | Payload-based filtering, separation of concerns. |
| Operational Standards | guides/10-operational-standards.md | Lifecycle, retention, observability, conformance. |
| Quick Reference | QUICK_REFERENCE.md | Cheat sheet: model dims, chunk sizes, RRF params. |
Read guides/06-embedding-models.md for full details. Quick answer:
Need hybrid (semantic + keyword)?
→ BAAI/BGE-M3 (dense 1024-dim + SPLADE-style learned sparse weights, single model pass)
Need dense-only (simpler pipeline)?
→ text-embedding-3-small (OpenAI, 1536-dim, cost-efficient)
→ text-embedding-3-large (OpenAI, 3072-dim, highest quality dense)
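The quick answer above can be written down as a small lookup, which is handy when the collection's vector size has to match the chosen model. This is only a sketch: `EmbeddingChoice`, `choose_model`, and the `prefer_quality` flag are hypothetical names introduced for illustration, and the dimensions are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingChoice:
    model: str
    dense_dim: int
    sparse: bool  # True when the model also emits sparse vectors


# Values copied from the quick answer above; the dict keys are illustrative.
MODELS = {
    "hybrid": EmbeddingChoice("BAAI/BGE-M3", 1024, True),
    "dense_small": EmbeddingChoice("text-embedding-3-small", 1536, False),
    "dense_large": EmbeddingChoice("text-embedding-3-large", 3072, False),
}


def choose_model(need_hybrid: bool, prefer_quality: bool = False) -> EmbeddingChoice:
    """Pick a model following the quick decision rules (hypothetical helper)."""
    if need_hybrid:
        return MODELS["hybrid"]
    return MODELS["dense_large"] if prefer_quality else MODELS["dense_small"]
```

The returned `dense_dim` is the value a collection's dense vector configuration would need to match.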
Read guides/05-chunking-standards.md for full code. Quick answer:
Conversational (Slack, short messages) → 150–300 tokens, 30 overlap, sentence window
Email threads → Split at reply boundary first, then 200–400 tokens
Meeting transcripts → Split at speaker turns, 200 tokens, 20 overlap
Documents / PDFs → Hierarchical paragraph, 300–500 tokens, 50 overlap
Tasks / Tickets → One chunk per task, max 512 tokens
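To make the size and overlap numbers above concrete, here is a minimal sketch of a fixed-size token window with overlap. It is a simplification, not the sentence-window or hierarchical-paragraph strategies named above, and it assumes the input is already tokenized. `ChunkConfig`, `CHUNK_CONFIGS`, and `sliding_window_chunks` are hypothetical names; the sizes use the upper bound of each listed range.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ChunkConfig:
    max_tokens: int
    overlap: int


# Upper bounds of the ranges listed above; keys are illustrative labels.
CHUNK_CONFIGS = {
    "conversational": ChunkConfig(max_tokens=300, overlap=30),
    "transcript": ChunkConfig(max_tokens=200, overlap=20),
    "document": ChunkConfig(max_tokens=500, overlap=50),
}


def sliding_window_chunks(tokens: Sequence[str], config: ChunkConfig) -> List[Sequence[str]]:
    """Split tokens into windows of max_tokens, each sharing `overlap` tokens with the previous one."""
    if config.overlap >= config.max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = config.max_tokens - config.overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + config.max_tokens])
        # Stop once the current window has reached the end of the input.
        if start + config.max_tokens >= len(tokens):
            break
    return chunks
```

A real pipeline would count tokens with the embedding model's own tokenizer and would split at structural boundaries (reply, speaker turn, paragraph) before applying a window like this.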
Read guides/03-data-classification.md. Justify collections by: security boundary, query pattern, scale/index tuning, or lifecycle difference. Do NOT create a collection per data source. Standard setup = 3 collections: company_memory, restricted_memory, pii_memory.
Read guides in this order: 01 → 02 → 03 → 06 → 07 → 08
Read: 08 → 05 → 06 → 10
Read: 09 → 03 → 02 (focus on governance fields)
Read: 04 → 05 → 02 → 07
Read: 07 → 10 → 02
model_inferred fields must not be the sole basis for security decisions.

Every ingestion pipeline must execute these stages in this exact order:
1. Source capture — fetch raw content + metadata from source API
2. Normalization — apply universal + source-specific rules (→ guide 04)
3. Document hash — SHA-256 of full normalized document text
4. Change detection — compare hash to stored hash; skip steps 5–8 if unchanged
5. Chunking — apply strategy for content type (→ guide 05)
6. Chunk hashing — SHA-256 per chunk from normalized chunk text
7. Embedding — dense ± sparse vectors (→ guide 06)
8. Upsert to Qdrant — full metadata payload (→ guide 07)
9. Stale chunk cleanup — delete chunks whose chunk_index is now out of range
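Stages 3 to 6 and stage 9 can be sketched with the standard library alone. The example below assumes the previous document hash and chunk count are available from some store; `plan_ingestion`, its parameters, and the returned dictionary shape are hypothetical, and embedding plus the Qdrant upsert are deliberately left out.

```python
import hashlib
from typing import Callable, List, Optional


def sha256_hex(text: str) -> str:
    """SHA-256 of UTF-8 encoded text, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingestion(
    normalized_text: str,
    chunker: Callable[[str], List[str]],
    stored_hash: Optional[str],
    stored_chunk_count: int,
) -> dict:
    """Decide what work is needed for one document (hypothetical helper)."""
    doc_hash = sha256_hex(normalized_text)  # stage 3
    if doc_hash == stored_hash:  # stage 4: unchanged, skip chunking onward
        return {"doc_hash": doc_hash, "changed": False,
                "chunk_hashes": [], "stale_indexes": []}
    chunks = chunker(normalized_text)  # stage 5
    chunk_hashes = [sha256_hex(chunk) for chunk in chunks]  # stage 6
    # Stage 9: chunk indexes that existed before but are now out of range.
    stale_indexes = list(range(len(chunks), stored_chunk_count))
    return {"doc_hash": doc_hash, "changed": True,
            "chunk_hashes": chunk_hashes, "stale_indexes": stale_indexes}
```

Running the change check before chunking is what makes re-ingestion of unchanged documents cheap: no chunking, embedding, or upsert work is planned for them.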
| Tier | Examples | Collection |
|---|---|---|
| public | Marketing, public docs | company_memory |
| internal | Slack, all-hands, project docs | company_memory |
| restricted | Executive email, finance, legal | restricted_memory |
| confidential | Salary, PII, health records | pii_memory |
Default when nothing matches: internal
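The tier table and its default can be captured as a small routing function. This is an illustrative sketch: `resolve_collection` is a hypothetical helper, and treating an empty or unrecognized tier as "nothing matches" is an assumption about how the default applies.

```python
from typing import Optional, Tuple

# Tier to collection mapping, copied from the table above.
TIER_TO_COLLECTION = {
    "public": "company_memory",
    "internal": "company_memory",
    "restricted": "restricted_memory",
    "confidential": "pii_memory",
}
DEFAULT_TIER = "internal"


def resolve_collection(tier: Optional[str]) -> Tuple[str, str]:
    """Return (effective_tier, collection), falling back to the default tier."""
    normalized = (tier or "").strip().lower()
    if normalized not in TIER_TO_COLLECTION:
        # Assumption: missing or unknown tiers count as "nothing matches".
        normalized = DEFAULT_TIER
    return normalized, TIER_TO_COLLECTION[normalized]
```

Returning the effective tier alongside the collection lets the caller store the tier in the chunk payload, so the fallback is visible rather than silent.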
Query → Embed (BGE-M3: dense + sparse in one pass)
→ Dense search top-20 ─┐
→ Sparse search top-20 ─┤ (access filters applied inside each branch)
▼
RRF fusion (k=60)
▼
Top 10–15 results
▼
Optional: cross-encoder rerank → top 5–8
▼
Return with attribution
See guides/08-retrieval-architecture.md for full implementation code.
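For intuition about the fusion step in the flow above, Reciprocal Rank Fusion can be sketched in a few lines of plain Python. Each result contributes 1 / (k + rank) for every list it appears in, with k=60 as stated above. `rrf_fuse` and its `limit` parameter are hypothetical names, and this client-side version is only an illustration, not the implementation from the guide.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(
    ranked_lists: Sequence[Sequence[str]],
    k: int = 60,
    limit: int = 15,
) -> List[Tuple[str, float]]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion; ranks start at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in ranked_lists:
        for rank, point_id in enumerate(ranked_ids, start=1):
            scores[point_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:limit]


# Usage: IDs returned by the dense and sparse branches (already access-filtered).
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = rrf_fuse([dense_hits, sparse_hits])
```

Because RRF uses only ranks, the dense and sparse scores never need to be normalized against each other, which is why it suits combining two branches with different scoring scales.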