OpenMark AI Model Router
Benchmark-driven model routing for OpenClaw. Uses your own benchmark data to route tasks to better-fit models and reduce overuse of expensive flagship models.
Install
openclaw plugins install clawhub:openmark-router
Benchmark-driven model routing for OpenClaw, powered by OpenMark AI.
Use benchmark-driven routing instead of guessing, complexity heuristics, or manual model switching. The router makes one lightweight semantic classification call to identify the task category, then deterministically picks the best model and fallbacks from your OpenMark benchmark data. The routed model generates the real reply. In workloads where many requests do not actually need a flagship model, this can materially reduce spend while preserving or improving task quality. Some workloads may see savings on the order of 50-80% or more, depending on task mix, provider availability, and benchmark results.
Install (recommended)
openclaw plugins install openmark-router
openclaw gateway restart
For normal published-plugin installs, that's it. The plugin auto-registers as a provider, sets openmark/auto as your default model, and starts routing.
Published package note:
- the shipped runtime package includes compiled dist/ files plus the bundled Python router under scripts/
- the Python router requires Python 3.8+ on the host
Install From Source (development)
Ignore this subsection unless you intentionally cloned this repository and want to run the plugin from source instead of installing the published plugin.
npm install
npm run build
openclaw gateway restart
After any install or update, restart the OpenClaw gateway so the newly built plugin files are loaded.
If you change your real default model and want the router to use it for classification or passthrough, restart the gateway again so the plugin captures the new value.
Local Dashboard
The router now ships with a small local dashboard on the same embedded server.
- Open http://127.0.0.1:2098/dashboard by default, or use your configured port.
- It currently shows router health, detected providers, benchmark categories/freshness, and a small config summary.
- It lets you edit routing_strategy and show_routing_card.
- It can import benchmark CSVs directly into the configured benchmark directory and includes guidance for exporting those CSVs from OpenMark AI.
- It can surface benchmark category descriptions directly from the dashboard.
- It can delete imported benchmark CSVs from the configured benchmark directory.
30-Second Example
You send:
Write me a LinkedIn post about our product launch
The router classifies the task, checks your benchmark results, chooses the best model for that category, and you see:
Routed to gpt-5.4-nano (openai) — Content Creation Benchmark
Benchmark: 92.9% score | $0.002731/call | 30.28s
Why this route: better score than gemini-3.1-pro, 97.6% cheaper, 4.2x faster
Over 10K calls: $27.31 vs $1148.36
Strategy: balanced | Benchmark data: fresh
[actual response from gpt-5.4-nano follows here...]
The routed model genuinely answers. The classifier does not generate the final user-visible answer.
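The cost figures on the card follow from simple arithmetic. The sketch below re-derives them from the numbers shown above; the helper names are illustrative and are not part of the plugin.

```python
def projected_cost(cost_per_call: float, calls: int) -> float:
    """Total spend for a given number of calls at a fixed per-call cost."""
    return cost_per_call * calls


def percent_cheaper(routed_total: float, baseline_total: float) -> float:
    """How much cheaper the routed total is, as a percentage of the baseline."""
    return (1 - routed_total / baseline_total) * 100


routed_total = projected_cost(0.002731, 10_000)  # per-call cost shown on the card
baseline_total = 1148.36                         # 10K-call comparison figure shown on the card
savings = percent_cheaper(routed_total, baseline_total)

print(f"${routed_total:.2f} vs ${baseline_total:.2f} ({savings:.1f}% cheaper)")
```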
Validated Surfaces
The seamless same-turn routing flow has been explicitly validated in:
- OpenClaw CLI
- OpenClaw Telegram
Other OpenClaw surfaces may also work when they pass through the same provider and hook flow, but they have not been exhaustively validated yet.
Why This Is Useful
OpenMark Router helps you benchmark your real tasks, find better-fit models for them, and often reduce spend substantially. The biggest win usually comes from stopping the habit of sending every request to one expensive flagship model when your own benchmark data shows a smaller or cheaper model is already good enough for that category.
- Benchmark-driven, not heuristic routing: task selection is based on semantic classification plus benchmark results on your own tasks, not on a simplistic "simple vs complex" split.
- Better model choice without manual switching: the router picks the best benchmarked candidate and fallbacks on the fly.
- Routing adapts to what the user actually has configured: it detects available providers/hosts, prefers direct provider keys first, and can fall back to OpenRouter keys when benchmark rows include them and OpenRouter is available.
- Single-provider setups still benefit: you do not need a multi-provider stack. You can benchmark and route within one provider only if that is what your OpenClaw install uses.
- Classifier cost can stay negligible: the classifier call is isolated and lightweight.
- A safe default path still exists: Messages that don't have corresponding benchmarked tasks continue through the passthrough/default model path.
- Routing cards add useful visibility: users can see which model was chosen, why, and when routing happened.
- OpenClaw-owned auth flow: users do not hand provider API keys directly to the plugin. OpenClaw handles provider authentication, request formatting, and model execution.
- Your real default model is preserved: the plugin captures your existing OpenClaw default model before switching the runtime default to openmark/auto.
Quick Start
- Benchmark your recurring tasks on OpenMark AI (100+ models)
- Export -- click Export -> OpenClaw on the Results tab
- Place the CSV in ~/.openclaw/workspace/plugins/openmark-router/benchmarks/
- Done -- the router activates automatically
You can still place CSVs manually, but the local dashboard now also provides a CSV import flow with OpenMark export guidance.
How It Works
The plugin uses an internal two-phase architecture: Phase 1 classifies and routes, Phase 2 generates the real reply from the optimal model. To the user, this still appears as a single reply. OpenClaw handles provider authentication and API formatting, and the plugin does not make direct provider API calls itself.
Turn 1 — Classification & Routing
──────────────────────────────────
User sends message
|
v
OpenClaw routes to openmark/auto (registered provider)
|
v
Embedded server classifies via OpenClaw gateway loopback:
- Sends ONLY the current user message + category names
- Uses your configured classifier model, or your captured default model if unset
- Returns the matching category name
|
v
Deterministic model selection (router.py, ~60ms after classification):
- Loads benchmark data, ranks models by strategy
- Computes optimal model + fallbacks for this run
- Returns pre-formatted routing card
|
v
OpenClaw immediately runs the real reply with the routed model stack
- Full session context, system prompt, conversation history
- Authentication and streaming handled by OpenClaw
|
v
User receives one reply containing the routing card plus the answer from the best model for their task
|
v
Compatibility fallback only: if the internal rerun path is unavailable, the plugin can still persist the route and ask for a follow-up message
No direct provider API calls. Classification goes through the OpenClaw gateway. The seamless hot path uses in-memory model overrides for the internal rerun, and the compatibility fallback can still write OpenClaw-managed routing state when needed. Provider authentication and model execution remain inside OpenClaw.
Runtime Access
The plugin is not a passive benchmark viewer. To implement same-turn routing and compatibility fallback behavior, it intentionally does the following:
- executes the bundled scripts/router.py via a local Python subprocess
- reads and updates ~/.openclaw/openclaw.json when injecting the provider and when compatibility fallback needs a persisted routed model stack
- reads and updates ~/.openclaw/agents/main/sessions/ state for temporary session model binding and snapshot continuity during same-turn routed turns
- temporarily overrides runtime model selection inside OpenClaw hooks for the duration of a routed reply
What it does not do:
- it does not ask users to paste provider API keys into plugin-specific settings
- it does not make outbound provider API requests itself; OpenClaw handles provider auth/execution
- the Python routing engine does not depend on third-party pip packages and does not need external network calls to rank benchmark rows
What Happens Per Message
- Route match: the plugin classifies the message, chooses the benchmark winner plus fallbacks, and the routed model answers in the same visible reply flow.
- No route match: the plugin keeps the message on the passthrough/default path. The answer still happens on the same turn.
- Short messages: very short messages skip the classifier and stay on the passthrough/default path.
- Slash commands and internal OpenClaw prompts: bypass routing completely.
- Compatibility fallback: only used when the seamless internal rerun path is unavailable.
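The per-message decision can be pictured as a small dispatcher. This is a minimal sketch and not the plugin's code: the function name, the min_chars threshold, and its value of 12 are hypothetical (the README does not state what counts as "very short"). Detection of internal OpenClaw prompts is left out because the source does not describe it.

```python
from typing import Callable, Optional


def choose_path(message: str,
                classify: Callable[[str], Optional[str]],
                min_chars: int = 12) -> str:
    """Return "bypass", "passthrough", or "routed" for one incoming message."""
    text = message.strip()
    if text.startswith("/"):
        return "bypass"        # slash commands skip routing entirely
    if len(text) < min_chars:
        return "passthrough"   # too short: do not spend a classifier call
    category = classify(text)  # classifier returns a category name or None
    return "routed" if category else "passthrough"


# Stand-in classifier for illustration: always matches one category.
always_match = lambda _msg: "Content Creation Benchmark"
print(choose_path("Write me a LinkedIn post about our product launch", always_match))
```

Passing the classifier in as a callable keeps the short-message check ahead of the classification call, so skipped messages cost nothing.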
Why Custom Benchmarking Matters
Every routing solution -- from complexity tiering to auto-classifiers -- does blanket categorization. This breaks because:
- Generic classification is too broad: "email tasks" lumps cold outreach, complaint triage, and legal notices together. Model performance varies dramatically across these subtypes.
- Generic benchmarks are equally broad: MMLU, Arena Elo, and HumanEval test general capabilities. A model scoring well on "writing" tells you nothing about your email templates with your tone requirements.
- Real cost is invisible in pricing: Published $/M token rates are misleading. Tokenization differs, chain-of-thought inflates output tokens. A model at $0.60/M can cost more per call than one at $3/M.
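To make the last point concrete, here is a small worked example. The token counts are hypothetical, and only output tokens are counted to keep the arithmetic simple.

```python
def cost_per_call(output_tokens: int, usd_per_million_tokens: float) -> float:
    """Per-call cost from output token count and the published $/M rate."""
    return output_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical: a cheap-per-token model that emits long chain-of-thought output.
verbose_cheap = cost_per_call(12_000, 0.60)
# Hypothetical: a pricier-per-token model that answers tersely.
terse_pricey = cost_per_call(2_000, 3.00)

print(f"$0.60/M model: ${verbose_cheap:.4f} per call")
print(f"$3.00/M model: ${terse_pricey:.4f} per call")
```

Under these assumed token counts, the model with the lower published rate ends up costing more per call.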
When you benchmark on OpenMark AI, you test models on your specific task, with your prompts, against your criteria.
Routing Engine
The full system is not a heuristic router. An LLM first performs semantic task classification on the current message. After that category is known, the model-selection step is fully deterministic and uses benchmark data plus the selected strategy to rank the available candidates. The deterministic model-selection phase itself completes in ~60ms.
6-Step Cascade Sort
- Incomplete models pushed to bottom
- Score descending
- Accuracy per dollar descending
- Accuracy per minute descending
- Cost ascending
- Model name alphabetical (deterministic tiebreaker)
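A cascade like this maps naturally onto a tuple sort key. The following is a sketch under stated assumptions, not the code in router.py: the Row fields are hypothetical, and "accuracy per dollar" and "accuracy per minute" are assumed to mean score divided by cost and score divided by latency in minutes.

```python
from dataclasses import dataclass


@dataclass
class Row:
    name: str
    score: float          # benchmark score, percent
    cost: float           # USD per call
    seconds: float        # latency per call
    complete: bool = True


def cascade_key(r: Row):
    acc_per_dollar = r.score / r.cost if r.cost > 0 else float("inf")
    acc_per_minute = r.score / (r.seconds / 60) if r.seconds > 0 else float("inf")
    # Tuples compare element by element, so each entry only breaks ties
    # left by the entries before it. Negation turns ascending into descending.
    return (not r.complete, -r.score, -acc_per_dollar, -acc_per_minute, r.cost, r.name)


rows = [
    Row("model-b", 90.0, 0.010, 10.0),
    Row("model-a", 90.0, 0.010, 10.0),
    Row("model-c", 95.0, 0.050, 20.0, complete=False),
    Row("model-d", 90.0, 0.005, 12.0),
]
ranked = sorted(rows, key=cascade_key)
print([r.name for r in ranked])
```

In this sample, model-c has the highest score but sorts last because it is incomplete, and model-a beats model-b only through the alphabetical tiebreaker.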
Five Strategies
| Strategy | What It Optimizes |
|---|---|
| balanced | Weighted: accuracy (40%) + cost-efficiency (20%) + speed (25%) + stability (15%) |
| best_score | Highest benchmark accuracy |
| best_cost_efficiency | Best accuracy per dollar among viable models |
| best_under_budget | Highest score within your cost ceiling |
| best_under_latency | Highest score within your latency ceiling |
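As an illustration of the balanced weighting, the sketch below combines four component values using the percentages from the table. It assumes each component has already been normalized to a 0-1 range; how the router actually normalizes them is not described here, and the function and key names are hypothetical.

```python
# Weights taken from the balanced row of the strategy table.
WEIGHTS = {
    "accuracy": 0.40,
    "cost_efficiency": 0.20,
    "speed": 0.25,
    "stability": 0.15,
}


def balanced_score(components: dict) -> float:
    """Weighted sum of components, each assumed to be normalized to 0-1."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


# Hypothetical normalized values for one model.
example = {"accuracy": 0.9, "cost_efficiency": 1.0, "speed": 0.5, "stability": 0.8}
print(round(balanced_score(example), 3))
```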
Viability Floor
floor = max(top_score - 15pp, top_score * 0.5) -- models below the floor are excluded from routing.
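The floor formula can be sketched directly in code. The function names are illustrative, and models scoring exactly at the floor are assumed to be kept, since only models below it are excluded.

```python
def viability_floor(top_score: float) -> float:
    """floor = max(top_score - 15pp, top_score * 0.5), scores in percent."""
    return max(top_score - 15.0, top_score * 0.5)


def viable_models(scores: dict) -> dict:
    """Keep only models at or above the floor derived from the top score."""
    floor = viability_floor(max(scores.values()))
    return {model: score for model, score in scores.items() if score >= floor}


# Hypothetical scores: the floor is about 77.9, so model-c is excluded.
print(viable_models({"model-a": 92.9, "model-b": 80.0, "model-c": 70.0}))
```

The 15-point term sets the floor whenever the top score is above 30%. Below that, the 50% term takes over, which keeps the floor from collapsing toward zero on very hard benchmarks.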
Fallbacks
The router provides ranked fallback models from the same benchmark. On the seamless hot path, those fallbacks are injected into the effective OpenClaw model config for that reply run. If the internal rerun path is unavailable, the compatibility fallback still persists the routed primary + fallbacks to OpenClaw's config.
Configuration
Edit config.json in the plugin directory.
config.json is the source of truth for routing behavior. The plugin metadata schema exposes the core knobs, but the full router configuration lives in config.json.
The local dashboard can edit a small subset of those settings, currently routing_strategy and show_routing_card.
| Field | Default | Description |
|---|---|---|
| classifier_model | "" (your default model) | Override: model for the isolated classification call. Leave empty to use your captured default model. |
| no_route_passthrough | "" (your default model) | Override: model for unrouted messages and short-message passthrough. Leave empty to use your captured default model. |
| routing_strategy | balanced | Default ranking strategy |
| port | 2098 | Embedded server port |
| gateway_port | 18789 | OpenClaw gateway port (for classification calls via loopback) |
| show_routing_card | true | Prepend routing card to responses |
| restore_delay_s | 30 | Fallback restore timer (used only if hook-based restore is unavailable) |
| benchmarks_dir | benchmarks | Benchmark directory relative to the plugin root |
| cost_ceiling | null | Max cost per run (for best_under_budget) |
| latency_ceiling_s | null | Max seconds per run (for best_under_latency) |
| freshness_warning_days | 30 | Warn when benchmark data is older than this |
| min_completion_pct | 80 | Skip models below this completion rate |
| min_stability_threshold | 10.0 | Skip models with variance above this |
| fallback_count | 2 | Number of fallback models |
By default, the router uses your existing default model for both classification and passthrough. The plugin captures your default model on startup (before setting openmark/auto).
The TypeScript plugin reads the runtime-facing fields such as ports, classifier/passthrough models, routing-card display, and restore timing. The Python routing engine reads the benchmark-selection fields such as strategy, ceilings, freshness, benchmark directory, stability filters, and fallback count from the same config.json.
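For scripts that need to inspect the same settings, a reader along the following lines may be handy. It is a sketch, not the plugin's loader: the DEFAULTS dictionary mirrors the table above, and merging user values over defaults is an assumption about how missing keys are treated.

```python
import json
from pathlib import Path

# Defaults copied from the configuration table.
DEFAULTS = {
    "classifier_model": "",
    "no_route_passthrough": "",
    "routing_strategy": "balanced",
    "port": 2098,
    "gateway_port": 18789,
    "show_routing_card": True,
    "restore_delay_s": 30,
    "benchmarks_dir": "benchmarks",
    "cost_ceiling": None,
    "latency_ceiling_s": None,
    "freshness_warning_days": 30,
    "min_completion_pct": 80,
    "min_stability_threshold": 10.0,
    "fallback_count": 2,
}


def load_config(path: str) -> dict:
    """Read config.json and fill any missing keys from DEFAULTS."""
    config_file = Path(path)
    user = json.loads(config_file.read_text(encoding="utf-8")) if config_file.exists() else {}
    return {**DEFAULTS, **user}


# A missing file falls back to the defaults.
print(load_config("config.json")["routing_strategy"])
```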
Classifier Recommendations
The plugin does not prefer expensive models for classification. The classifier call is intentionally isolated and lightweight, so a small cheap model is usually the right choice.
Good classifier candidates include:
- google/gemini-3.1-flash-lite-preview
- openai/gpt-5.4-nano
- openai/gpt-5.4-mini
- anthropic/claude-haiku-4-5
Other models can work too. The important rule is: use a model id that OpenClaw itself can resolve.
For example, this mattered during validation:
- anthropic/claude-haiku-4-5 worked
- anthropic/claude-haiku-4.5 failed because OpenClaw did not recognize that id
If you leave classifier_model empty, the plugin uses your captured default model instead.
Provider Access And OpenRouter Fallback
The router does not blindly trust every model listed in a benchmark CSV.
- It first tries to detect which providers/hosts your OpenClaw install can currently use.
- When a benchmark row includes both a direct provider key and an OpenRouter key, the router prefers the direct provider key first.
- If that direct provider is not available but openrouter is available, the router falls back to the OpenRouter key for that same model.
In practice, this means:
- direct provider access is preferred when you already have it configured
- OpenRouter can expand coverage when the benchmark includes OC OR Key values and your OpenClaw setup has OpenRouter access
- users do not need every benchmarked model to be available through direct APIs if an OpenRouter path exists for those rows
- users can also benchmark only one provider and still get useful routing within that provider's model lineup
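The direct-first, OpenRouter-second preference can be sketched as a small selection function. This is an illustration only: the function name is hypothetical, and it assumes the provider is the segment before the first "/" in the direct key, as in openai/gpt-5.4.

```python
from typing import Optional, Set


def pick_model_key(oc_key: str, oc_or_key: str, available: Set[str]) -> Optional[str]:
    """Prefer the direct provider key; fall back to the OpenRouter key."""
    if oc_key and oc_key.split("/", 1)[0] in available:
        return oc_key
    if oc_or_key and "openrouter" in available:
        return oc_or_key
    return None  # this benchmark row cannot be executed on the current setup


# Only OpenRouter is configured, so the OpenRouter key is chosen.
print(pick_model_key("openai/gpt-5.4", "openrouter/openai/gpt-5.4", {"openrouter"}))
```

Returning None for rows with no usable key lets the caller drop that candidate before ranking.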
This also means the router is still useful when your setup is based on subscriptions, hosted access, OAuth-backed providers, or even free-model-heavy setups, as long as OpenClaw can execute the model ids involved. On many of those setups, flagship models also tend to be the first ones to feel expensive, constrained, or practically rate-limited, so routing can stretch the value of the access you already have by reserving premium models for the tasks that actually need them.
Important billing caveat:
- the router can improve model choice and often reduce spend substantially based on your benchmark data
- in some workloads, especially when teams are overusing flagship models for routine tasks, benchmark-driven routing can reduce model spend by around 50-80% or more
- provider billing, quota, and rate-limit behavior still depend on the provider and account type, so treat the exact savings as workload-dependent rather than guaranteed
Important limitation:
- provider detection is best-effort
- it is based on what OpenClaw reports for configured providers/hosts
- exact model execution still depends on OpenClaw accepting the exact model id string
So provider/host access is the first filter, but canonical model ids still matter. A provider may be available while a specific model string is still rejected if the id is not the exact OpenClaw-supported form.
Troubleshooting Notes
Classifier Fallback Chain
The normal classifier path is:
- isolated simple-completion call through OpenClaw
- subagent fallback only if that isolated path fails
The subagent path exists as a resilience fallback, not as the preferred architecture. If you see subagent-related classifier logs, treat that as a troubleshooting clue that the isolated simple-completion path could not prepare or execute the selected classifier model cleanly.
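The shape of that chain is a plain try-then-fallback. The sketch below is generic and does not reflect the plugin's TypeScript implementation; both classifier callables and the log message are placeholders.

```python
from typing import Callable, List, Optional

Classifier = Callable[[str, List[str]], Optional[str]]


def classify_with_fallback(message: str,
                           categories: List[str],
                           isolated: Classifier,
                           subagent: Classifier,
                           log: Callable[[str], None] = print) -> Optional[str]:
    """Try the isolated completion path first; use the subagent path only on failure."""
    try:
        return isolated(message, categories)
    except Exception as exc:
        # Seeing this log line is the troubleshooting clue described above.
        log(f"isolated classifier failed ({exc}); using subagent fallback")
        return subagent(message, categories)


def broken_isolated(message, categories):
    raise RuntimeError("model id not recognized")  # placeholder failure


result = classify_with_fallback(
    "Sort this email", ["Email Classification Benchmark"],
    broken_isolated, lambda m, c: c[0],
)
print(result)
```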
Provider Detection Commands
These are the commands used to inspect provider/host availability:
openclaw models status --json
python scripts/router.py --detect-providers --force-detect
Run the Python command from the plugin root:
cd /path/to/openmark-router
python scripts/router.py --detect-providers --force-detect
Or, without changing directories, pass an absolute path to the script:
python /absolute/path/to/openmark-router/scripts/router.py --detect-providers --force-detect
Dashboard Scope
Current dashboard:
- router health/version
- detected providers/hosts currently available in OpenClaw
- active benchmark categories and freshness visibility
- routing_strategy selection
- show_routing_card toggle
- benchmark CSV import into the configured benchmarks directory
- benchmark CSV deletion from the configured benchmarks directory
- category-description modal from the loaded benchmark metadata
Still planned for later:
- manual routed-model lock/unlock
- richer OpenMark-branded UX and broader controls
CSV Format
Use Export -> OpenClaw on OpenMark AI. The CSV includes metadata headers and dual model keys:
# task_name: email_classification
# display_name: Email Classification Benchmark
# description: Classifies emails by intent, priority, and category.
"Model","Provider","OC Key","OC OR Key","Score (%)",...
- OC Key: direct provider model key (e.g., openai/gpt-5.4)
- OC OR Key: OpenRouter model key (e.g., openrouter/openai/gpt-5.4)
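A file in this shape can be read with the Python standard library. The following is a minimal sketch and not the bundled loader.py: it uses only the five columns shown above (the real export has more), and the data row and its score are made up for illustration.

```python
import csv
import io


def parse_benchmark_csv(text: str):
    """Split '# key: value' metadata lines from the CSV table that follows."""
    meta, data_lines = {}, []
    for line in text.splitlines():
        if line.startswith("#"):
            key, _, value = line[1:].partition(":")
            meta[key.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))
    return meta, rows


# Hypothetical sample limited to the columns visible in this README.
sample = '''# task_name: email_classification
# display_name: Email Classification Benchmark
# description: Classifies emails by intent, priority, and category.
"Model","Provider","OC Key","OC OR Key","Score (%)"
"gpt-5.4","openai","openai/gpt-5.4","openrouter/openai/gpt-5.4","91.0"
'''

meta, rows = parse_benchmark_csv(sample)
print(meta["display_name"], rows[0]["OC Key"])
```

Splitting on the first colon only keeps descriptions that contain colons or commas intact. Quoted fields that span multiple lines are not handled by this line-based split.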
Project Structure
openmark-router/
  src/
    index.ts               # Plugin entry: registerProvider + registerService
    server.ts              # Embedded HTTP server (OpenAI-compatible)
    classifier.ts          # Isolated LLM classifier
    provider-inject.ts     # Auto-configure provider in openclaw.json
    router-bridge.ts       # TypeScript bridge to router.py subprocess
    types.ts               # Shared types
  scripts/
    router.py              # Core routing engine (1200+ lines)
    loader.py              # OpenMark CSV parser
    adapter.py             # Model ID translation
  dist/                    # Compiled JS
  benchmarks/
    examples/
      chatbot_potential.csv  # Sample CSV
  config.json              # Routing configuration
  package.json             # Plugin package
  openclaw.plugin.json     # Plugin metadata
  clawhub.json             # ClawHub metadata
  LICENSE                  # Apache-2.0
Trust and Security
- OpenClaw-managed credentials: users do not provide provider API keys directly to the plugin, and all model calls go through the OpenClaw gateway/runtime.
- Clean install: Python stdlib only for routing engine. No pip dependencies.
- Declared local writes: The plugin may update openclaw.json plus main-session state under ~/.openclaw/agents/main/sessions/ to preserve same-turn routing continuity and compatibility fallback behavior.
- No network requests from routing engine: All benchmark data is local CSV files.
- Isolated classification: The classifier call uses only the current user message and category names -- no main-session history, no routed-model context, no borrowed system prompt state.
- No provider preference: Classification and passthrough can use any model that OpenClaw can resolve. Small cheap models are encouraged.
- Best-effort provider filtering: The router attempts to detect which providers/hosts are available in OpenClaw and filters benchmark candidates accordingly. Direct provider keys are preferred first, with OpenRouter fallback when available for the same row.
Requirements
- Python 3.8+ (stdlib only)
- Node.js 18+ (for plugin runtime)
- OpenClaw 2026.3.28+
Links
- OpenMark AI (benchmark platform): openmark.ai
- OpenClaw (agent framework): github.com/openclaw/openclaw
- ClawHub (marketplace): clawhub.ai
License
The repository is currently Apache-2.0 licensed -- see LICENSE.
Additional repo notes:
- TRADEMARK.md: OpenMark brand and affiliation note
