# Atlas Vector Search Integration Patterns

## Index Definition

Atlas Vector Search indexes are defined as JSON and created via the MongoDB driver or Atlas UI.

### Basic Index Definition

```json
{
  "name": "vector_index",
  "type": "vectorSearch",
  "definition": {
    "fields": [
      {
        "type": "vector",
        "path": "embedding",
        "numDimensions": 1024,
        "similarity": "cosine"
      }
    ]
  }
}
```

### Index with Pre-filter Fields

```json
{
  "name": "vector_index",
  "type": "vectorSearch",
  "definition": {
    "fields": [
      {
        "type": "vector",
        "path": "embedding",
        "numDimensions": 1024,
        "similarity": "cosine"
      },
      {
        "type": "filter",
        "path": "category"
      },
      {
        "type": "filter",
        "path": "status"
      }
    ]
  }
}
```

### Similarity Functions

| Function | Best For | Notes |
|----------|----------|-------|
| `cosine` | General purpose | Normalized, direction-based. Default choice. |
| `dotProduct` | Pre-normalized vectors | Faster than cosine when vectors are unit length |
| `euclidean` | Absolute distance matters | Sensitive to vector magnitude |

**Recommendation:** Use `cosine` unless you have a specific reason not to.

## $vectorSearch Aggregation Stage

### Basic Vector Search

```javascript
db.collection.aggregate([
  {
    $vectorSearch: {
      index: "vector_index",
      path: "embedding",
      queryVector: [0.1, 0.2, ...],  // Your query embedding
      numCandidates: 150,
      limit: 10
    }
  },
  {
    $project: {
      _id: 1,
      text: 1,
      score: { $meta: "vectorSearchScore" }
    }
  }
])
```

### Vector Search with Pre-filters

```javascript
db.collection.aggregate([
  {
    $vectorSearch: {
      index: "vector_index",
      path: "embedding",
      queryVector: [0.1, 0.2, ...],
      numCandidates: 150,
      limit: 10,
      filter: {
        $and: [
          { category: { $eq: "technology" } },
          { status: { $eq: "published" } }
        ]
      }
    }
  },
  {
    $project: {
      text: 1,
      category: 1,
      score: { $meta: "vectorSearchScore" }
    }
  }
])
```

**Important:** Filter fields must be declared in the index definition with `"type": "filter"`.

### Supported Filter Operators

- `$eq`, `$ne`
- `$gt`, `$gte`, `$lt`, `$lte`
- `$in`, `$nin`
- `$and`, `$or`

## Hybrid Search (Vector + Full-Text)

Combine vector search with Atlas Search for hybrid retrieval using `$unionWith` or reciprocal rank fusion.

### Reciprocal Rank Fusion (RRF)

```javascript
// Stage 1: Vector search results
const vectorResults = await collection.aggregate([
  {
    $vectorSearch: {
      index: "vector_index",
      path: "embedding",
      queryVector: queryEmbedding,
      numCandidates: 100,
      limit: 20
    }
  },
  { $addFields: { vs_score: { $meta: "vectorSearchScore" } } },
  { $project: { text: 1, vs_score: 1 } }
]).toArray();

// Stage 2: Text search results
const textResults = await collection.aggregate([
  {
    $search: {
      index: "text_index",
      text: { query: "search terms", path: "text" }
    }
  },
  { $limit: 20 },
  { $addFields: { ts_score: { $meta: "searchScore" } } },
  { $project: { text: 1, ts_score: 1 } }
]).toArray();

// Combine with RRF
const k = 60; // RRF constant
const combined = {};
vectorResults.forEach((doc, i) => {
  const id = doc._id.toString();
  combined[id] = combined[id] || { ...doc, rrf: 0 };
  combined[id].rrf += 1 / (k + i + 1);
});
textResults.forEach((doc, i) => {
  const id = doc._id.toString();
  combined[id] = combined[id] || { ...doc, rrf: 0 };
  combined[id].rrf += 1 / (k + i + 1);
});
const finalResults = Object.values(combined).sort((a, b) => b.rrf - a.rrf);
```

## Best Practices

### 1. Always Use `input_type`

```bash
# For queries (what you're searching for)
voyageai embed "how to scale MongoDB" --input-type query

# For documents (what you're indexing)
voyageai embed --file article.txt --input-type document
```

This asymmetric encoding improves retrieval quality by 3-5% on average.

### 2. Dimension Selection

| Dimensions | Storage per Vector | Use Case |
|------------|-------------------|----------|
| 256 | 1 KB | Mobile/edge, very large collections |
| 512 | 2 KB | Good balance for large collections |
| 1024 | 4 KB | **Recommended default** |
| 2048 | 8 KB | Maximum quality, smaller collections |

### 3. numCandidates Tuning

- `numCandidates` should be 10-20× your `limit`
- Higher values → better recall, slower queries
- Start with `numCandidates: 150, limit: 10` and adjust

### 4. Batch Embedding for Storage

When storing many documents, batch your embeddings:
```bash
# Embed multiple texts at once (up to 128 per request)
cat documents.txt | voyageai embed --input-type document --json
```

### 5. Model Compatibility

All voyage-4 series models produce compatible embeddings. Strategy:
- Index documents with `voyage-4-large` (best quality)
- Query with `voyage-4-lite` (cheapest, fastest)
- Results are still high quality due to shared embedding space

## Example Pipeline: End-to-End

```bash
# 1. Create the collection index
voyageai index create \
  --db semanticapp \
  --collection knowledge \
  --field embedding \
  --dimensions 1024 \
  --similarity cosine \
  --index-name knowledge_vector

# 2. Ingest documents
for file in docs/*.txt; do
  voyageai store \
    --db semanticapp \
    --collection knowledge \
    --field embedding \
    --file "$file" \
    --model voyage-4-large \
    --metadata "{\"source\": \"$file\"}"
done

# 3. Search
voyageai search \
  --query "How does sharding work in MongoDB?" \
  --db semanticapp \
  --collection knowledge \
  --index knowledge_vector \
  --field embedding \
  --model voyage-4 \
  --limit 5

# 4. Optional: Rerank the results for better precision
voyageai search \
  --query "How does sharding work?" \
  --db semanticapp \
  --collection knowledge \
  --index knowledge_vector \
  --field embedding \
  --limit 20 --json | \
  jq '[.[].text]' | \
  voyageai rerank --query "How does sharding work?" --documents-file /dev/stdin --top-k 5
```
