Smart Code Search

Install

openclaw skills install smart-code-search

Smart Code Search

Search code and docs by meaning, not just strings.

Powered by ColGREP and NextPlaid from LightOn, the engine behind the #1 ranked code retrieval model on MTEB and the #1 retriever on BrowseComp-Plus, a benchmark derived from OpenAI's BrowseComp agentic search benchmark.

grep finds strings. This finds intent. Ask "payment capture logic" and get results from files that never contain those exact words — because it understands what your code does, not just what it says.

Why This Exists

Every developer has been here: you know what you're looking for but not where it lives. You chain 4 different grep -r attempts, guess filenames, scroll through directory trees. Coding agents are even worse — they grep, miss things, hallucinate file paths, waste tokens exploring blind.

ColGREP fixes this with multi-vector semantic search. It parses your code with Tree-sitter, embeds each function/method/class with token-level vectors, and ranks results by meaning. The model is 17M parameters, runs on CPU, and returns results in under a second.

The Numbers

Metric                      Value
MTEB Code Leaderboard       #1 (LateOn-Code)
BrowseComp-Plus             87.59% accuracy, beating all models up to 8B params (blog)
vs grep in coding agents    70% win rate head-to-head
Model size                  17M params, roughly 470× smaller than competing 8B models
Search latency              200–900ms on CPU
API cost                    $0. Forever. Runs 100% local
Privacy                     Code never leaves your machine

Install

brew install lightonai/tap/colgrep

Verify: colgrep --version

Quick Start

1. Index Your Project

cd /path/to/project
colgrep init

That's it. ColGREP parses every file with Tree-sitter, builds multi-vector embeddings on CPU, and stores the index in .colgrep/. Indexing takes 30–60 seconds for ~1000 files. After this, the index auto-updates on every search: changed files are detected and re-indexed automatically.

2. Search

colgrep "natural language description of what you want"

Results are ranked by semantic relevance score. Higher = better match.

Examples:

colgrep "authentication middleware token validation"
colgrep "database migration rollback strategy"
colgrep "React form validation with error display"
colgrep "webhook retry logic with exponential backoff"

3. Combine Regex + Semantics

Filter files by regex pattern first, then rank semantically:

colgrep -e "async.*await" "error handling patterns"
colgrep -e "def test_" "payment capture edge cases"
colgrep -e "\.tsx$" "patient dashboard layout"
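The two-stage idea (hard regex filter first, semantic ranking second) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `filter_then_rank`, `make_toy_scorer`, and the sample `units` are hypothetical names invented here, and the keyword-overlap scorer merely stands in for ColGREP's real semantic model.

```python
import re
from typing import Callable, Dict, List, Tuple


def filter_then_rank(
    units: Dict[str, str],
    pattern: str,
    score: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Keep units whose text matches `pattern`, then sort survivors by `score`."""
    regex = re.compile(pattern)
    # Stage 1: the regex is a hard filter; non-matching units are dropped.
    candidates = {name: text for name, text in units.items() if regex.search(text)}
    # Stage 2: only the survivors are scored and ranked, best first.
    ranked = [(name, score(text)) for name, text in candidates.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def make_toy_scorer(query: str) -> Callable[[str], float]:
    """Hypothetical stand-in for a semantic scorer: counts query words found in the text."""
    words = query.lower().split()
    return lambda text: float(sum(1 for word in words if word in text.lower()))


# Hypothetical code units keyed by name.
units = {
    "test_capture_declined": "def test_capture_declined():\n    assert capture(payment, card='declined') is False",
    "test_login": "def test_login():\n    assert login('a', 'b')",
    "capture": "def capture(payment, card):\n    return gateway.capture(payment)",
}

results = filter_then_rank(units, r"def test_", make_toy_scorer("payment capture edge cases"))
for name, value in results:
    print(f"{name} (score: {value:.2f})")
```

Note that `capture` never reaches the ranking stage even though it mentions both "payment" and "capture": the regex filter removes it before any scoring happens, which is the point of combining the two.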

Search Options

colgrep "query"              # Default output: file:lines (score: X.XX)
colgrep "query" --json       # JSON output for piping to other tools
colgrep "query" -n 5         # Top 5 results only
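For scripting, the options above can be driven from Python. The sketch below uses only the flags shown in this section; combining `--json` with `-n` in one call is an assumption, as are the helper names `build_command` and `search`. The JSON schema is not described here, so the decoded value is returned untouched rather than mapped onto guessed field names.

```python
import json
import subprocess
from typing import Any, List, Optional


def build_command(query: str, top_n: Optional[int] = None, as_json: bool = False) -> List[str]:
    """Assemble a colgrep argv list from the options listed above."""
    argv = ["colgrep", query]
    if as_json:
        argv.append("--json")
    if top_n is not None:
        argv += ["-n", str(top_n)]
    return argv


def search(query: str, top_n: int = 5) -> Any:
    """Run colgrep in the current directory and decode its JSON output.

    Requires colgrep on PATH and an indexed project. The structure of the
    decoded value is whatever colgrep emits; no field names are assumed.
    """
    completed = subprocess.run(
        build_command(query, top_n=top_n, as_json=True),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


# Building the command does not require colgrep to be installed.
print(build_command("payment capture logic", top_n=5, as_json=True))
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems when the query contains spaces or quotes.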

When to Use This vs grep

You know...                          Use
The exact string or function name    grep -r "functionName"
The concept but not the words        colgrep "what it does"
A pattern + a concept                colgrep -e "pattern" "meaning"
Where something is implemented       colgrep "description of behavior"
How a feature works across files     colgrep "feature workflow"

Coding Agent Integration

ColGREP provides built-in integration with popular coding agents. After installing, restart your agent to enable semantic search:

  • Claude Code: colgrep --install-claude-code
  • OpenCode: colgrep --install-opencode
  • Codex: colgrep --install-codex

These commands register ColGREP as a search tool within the agent. The agent will automatically use semantic search when navigating indexed projects.

Multi-Project Setup

Index each project independently. Search from the project directory:

cd ~/code/api && colgrep init
cd ~/code/frontend && colgrep init
cd ~/code/infrastructure && colgrep init
cd ~/docs && colgrep init

# Search each independently
cd ~/code/api && colgrep "payment processing service"
cd ~/code/frontend && colgrep "checkout form validation"

Works great for monorepos, microservices, documentation vaults, and any directory with text/code files.

How It Works

ColGREP uses ColBERT late-interaction retrieval, a fundamentally different approach from traditional single-vector embeddings:

  1. Tree-sitter parses your code into structured units (functions, methods, classes, signatures)
  2. LateOn-Code-edge (17M params) creates multiple token-level embeddings per code unit — not one lossy summary vector
  3. NextPlaid stores these in a quantized, memory-mapped Rust index
  4. At search time, query tokens interact with document tokens for fine-grained relevance scoring

This is why a 17M model beats 8B models — late interaction preserves token-level semantics that single-vector approaches compress away. Read the full technical story: The Bloated Retriever Era Is Over
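The scoring in step 4 can be illustrated with the MaxSim operator used by ColBERT-style late interaction: each query token takes its best similarity against any document token, and those maxima are summed. The sketch below is a toy under stated assumptions (2-D hand-written vectors, plain dot products, no quantization); it is not ColGREP's or NextPlaid's actual implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-D "token embeddings": two query tokens, two candidate code units.
query = [[1.0, 0.0], [0.0, 1.0]]
unit_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # has a close match for each query token
unit_b = [[0.9, 0.1], [0.8, 0.2]]              # matches only the first query token well

score_a = maxsim(query, unit_a)
score_b = maxsim(query, unit_b)
print(f"unit_a: {score_a:.2f}, unit_b: {score_b:.2f}")
```

With these toy vectors `unit_a` outranks `unit_b`, because every query token finds a strong partner in it. A single pooled vector per unit would blur that per-token evidence, which is the loss that late interaction is designed to avoid.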

Interpreting Scores

  • 6.0+ — Near-exact conceptual match. The code does exactly what you described.
  • 5.0–6.0 — Strong semantic match. Highly relevant code.
  • 4.0–5.0 — Good match. Related code worth reviewing.
  • 3.0–4.0 — Weak match. May or may not be relevant.
  • Below 3.0 — Likely noise. Ignore these results.
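When post-processing results in a script, the bands above can be encoded as a small lookup. This is a minimal sketch: `describe_score` is a hypothetical helper, and treating each band's lower bound as inclusive is an assumption, since the list does not say which band a boundary value such as 5.0 belongs to.

```python
def describe_score(score: float) -> str:
    """Map a relevance score to the band labels listed above.

    Assumption: lower bounds are inclusive, so 5.0 counts as "strong".
    """
    if score >= 6.0:
        return "near-exact"
    if score >= 5.0:
        return "strong"
    if score >= 4.0:
        return "good"
    if score >= 3.0:
        return "weak"
    return "noise"


for value in (6.4, 5.2, 4.0, 3.5, 2.1):
    print(f"{value:.1f} -> {describe_score(value)}")
```

A script could use this to drop everything labeled "noise" before passing results to another tool.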

Troubleshooting

"Index is being updated by another process": Another colgrep instance is updating the index. The current search uses the existing index, so this message is safe to ignore.

Re-index from scratch:

rm -rf .colgrep/ && colgrep init

Add to .gitignore:

.colgrep/
