OpenMark AI Model Router

Benchmark-driven model routing for OpenClaw. Uses your own benchmark data to route tasks to better-fit models and reduce overuse of expensive flagship models.

Marc Kean Paker@openmarkai

Install

openclaw plugins install clawhub:openmark-router


Benchmark-driven model routing for OpenClaw, powered by OpenMark AI.

Use benchmark-driven routing instead of guessing, complexity heuristics, or manual model switching. The router makes one lightweight semantic classification call to identify the task category, then deterministically picks the best model and its fallbacks from your OpenMark benchmark data. The routed model, not the classifier, generates the real reply. In workloads where many requests do not actually need a flagship model, this can materially reduce spend while preserving or improving task quality; some workloads may see savings on the order of 50-80% or more, depending on task mix, provider availability, and benchmark results.

Install (recommended)

```bash
openclaw plugins install openmark-router
openclaw gateway restart
```

For normal published-plugin installs, that's it. The plugin auto-registers as a provider, sets openmark/auto as your default model, and starts routing.

Published package note:

the shipped runtime package includes compiled dist/ files plus the bundled Python router under scripts/
the Python router requires Python 3.8+ on the host

Install From Source (development)

Ignore this subsection unless you intentionally cloned this repository and want to run the plugin from source instead of installing the published plugin.

```bash
npm install
npm run build
openclaw gateway restart
```

After any install or update, restart the OpenClaw gateway so the newly built plugin files are loaded.

If you change your real default model and want the router to use it for classification or passthrough, restart the gateway again so the plugin captures the new value.

Local Dashboard

The router now ships with a small local dashboard on the same embedded server.

Open http://127.0.0.1:2098/dashboard by default, or use your configured port.
It currently shows router health, detected providers, benchmark categories/freshness, and a small config summary.
It lets you edit routing_strategy and show_routing_card.
It can import benchmark CSVs directly into the configured benchmark directory and includes guidance for exporting those CSVs from OpenMark AI.
It can surface benchmark category descriptions directly from the dashboard.
It can delete imported benchmark CSVs from the configured benchmark directory.

30-Second Example

You send:

Write me a LinkedIn post about our product launch

The router classifies the task, checks your benchmark results, chooses the best model for that category, and you see:

```text
Routed to gpt-5.4-nano (openai) — Content Creation Benchmark
Benchmark: 92.9% score  |  $0.002731/call  |  30.28s

Why this route: better score than gemini-3.1-pro, 97.6% cheaper, 4.2x faster
Over 10K calls: $27.31 vs $1148.36

Strategy: balanced  |  Benchmark data: fresh

[actual response from gpt-5.4-nano follows here...]
```

The routed model genuinely answers. The classifier does not generate the final user-visible answer.
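The savings line on the card is plain per-call arithmetic. As a quick sanity check, the sketch below recomputes it from the card's own numbers. The comparison model's per-call cost is inferred by dividing its 10K-call total by 10,000, and the helper names are illustrative, not part of the plugin. The 4.2x speed figure cannot be rechecked this way because the card does not show the comparison model's latency.

```python
def cost_over_calls(cost_per_call: float, calls: int) -> float:
    """Projected spend for a number of calls at a fixed per-call cost."""
    return cost_per_call * calls


def percent_cheaper(routed_cost: float, baseline_cost: float) -> float:
    """How much cheaper the routed model is than the baseline, in percent."""
    return (1.0 - routed_cost / baseline_cost) * 100.0


# Per-call cost of the routed model, taken from the example card.
routed = 0.002731
# Baseline per-call cost, inferred from "$1148.36 over 10K calls" (an assumption).
baseline = 1148.36 / 10_000

routed_10k = cost_over_calls(routed, 10_000)
baseline_10k = cost_over_calls(baseline, 10_000)
savings = percent_cheaper(routed, baseline)

print(f"Over 10K calls: ${routed_10k:.2f} vs ${baseline_10k:.2f}")
print(f"{savings:.1f}% cheaper")
```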

Validated Surfaces

The seamless same-turn routing flow has been explicitly validated in:

OpenClaw CLI
OpenClaw Telegram

Other OpenClaw surfaces may also work when they pass through the same provider and hook flow, but they have not been exhaustively validated yet.

Why This Is Useful

OpenMark Router helps you benchmark your real tasks, find better-fit models for them, and often reduce spend substantially. The biggest win usually comes from stopping the habit of sending every request to one expensive flagship model when your own benchmark data shows a smaller or cheaper model is already good enough for that category.

Benchmark-driven, not heuristic routing: model selection is based on semantic classification plus benchmark results on your own tasks, not on a simplistic simple-vs-complex split.
Better model choice without manual switching: the router picks the best benchmarked candidate and fallbacks on the fly.
Routing adapts to what the user actually has configured: it detects available providers/hosts, prefers direct provider keys first, and can fall back to OpenRouter keys when benchmark rows include them and OpenRouter is available.
Single-provider setups still benefit: you do not need a multi-provider stack. If your OpenClaw install uses only one provider, you can benchmark and route within that provider alone.
Classifier cost can stay negligible: the classifier call is isolated and lightweight.
A safe default path still exists: messages that do not match a benchmarked task category continue through the passthrough/default model path.
Routing cards add useful visibility: users can see which model was chosen, why, and when routing happened.
OpenClaw-owned auth flow: users do not hand provider API keys directly to the plugin. OpenClaw handles provider authentication, request formatting, and model execution.
Your real default model is preserved: the plugin captures your existing OpenClaw default model before switching the runtime default to openmark/auto.

Quick Start

1. Benchmark your recurring tasks on OpenMark AI (100+ models).
2. Export: click Export -> OpenClaw on the Results tab.
3. Place the CSV in ~/.openclaw/workspace/plugins/openmark-router/benchmarks/
4. Done: the router activates automatically.

You can still place CSVs manually, but the local dashboard now also provides a CSV import flow with OpenMark export guidance.

How It Works

The plugin uses an internal two-phase architecture: Phase 1 classifies and routes; Phase 2 generates the real reply with the selected model. To the user, this still appears as a single reply. OpenClaw handles provider authentication and API formatting; the plugin makes no direct provider API calls itself.

```text
Turn 1 — Classification & Routing
──────────────────────────────────
User sends message
    |
    v
OpenClaw routes to openmark/auto (registered provider)
    |
    v
Embedded server classifies via OpenClaw gateway loopback:
  - Sends ONLY the current user message + category names
  - Uses your configured classifier model, or your captured default model if unset
  - Returns the matching category name
    |
    v
Deterministic model selection (router.py, ~60ms after classification):
  - Loads benchmark data, ranks models by strategy
  - Computes optimal model + fallbacks for this run
  - Returns pre-formatted routing card
    |
    v
OpenClaw immediately runs the real reply with the routed model stack
  - Full session context, system prompt, conversation history
  - Authentication and streaming handled by OpenClaw
    |
    v
User receives one reply containing the routing card plus the answer from the best model for their task
    |
    v
Compatibility fallback only: if the internal rerun path is unavailable, the plugin can still persist the route and ask for a follow-up message
```

No direct provider API calls. Classification goes through the OpenClaw gateway. The seamless hot path uses in-memory model overrides for the internal rerun, and the compatibility fallback can still write OpenClaw-managed routing state when needed. Provider authentication and model execution remain inside OpenClaw.

Runtime Access

The plugin is not a passive benchmark viewer. To implement same-turn routing and compatibility fallback behavior, it intentionally does the following:

executes the bundled scripts/router.py via a local Python subprocess
reads and updates ~/.openclaw/openclaw.json when injecting the provider and when compatibility fallback needs a persisted routed model stack
reads and updates ~/.openclaw/agents/main/sessions/ state for temporary session model binding and snapshot continuity during same-turn routed turns
temporarily overrides runtime model selection inside OpenClaw hooks for the duration of a routed reply

What it does not do:

it does not ask users to paste provider API keys into plugin-specific settings
it does not make outbound provider API requests itself; OpenClaw handles provider auth/execution
the Python routing engine does not depend on third-party pip packages and does not need external network calls to rank benchmark rows

What Happens Per Message

Route match: the plugin classifies the message, chooses the benchmark winner plus fallbacks, and the routed model answers in the same visible reply flow.
No route match: the plugin keeps the message on the passthrough/default path. The answer still happens on the same turn.
Short messages: very short messages skip the classifier and stay on the passthrough/default path.
Slash commands and internal OpenClaw prompts: bypass routing completely.
Compatibility fallback: only used when the seamless internal rerun path is unavailable.
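The per-message rules above can be read as a small dispatch function. This is a hypothetical Python sketch, not the plugin's TypeScript code: `SHORT_MESSAGE_CHARS`, `route_message`, and the stand-in keyword classifier are all assumptions. The real classifier is an LLM call through the OpenClaw gateway, and internal OpenClaw prompts and the compatibility fallback are not modeled.

```python
from typing import Callable, List, Optional

# Hypothetical cutoff; the README does not state the real short-message threshold.
SHORT_MESSAGE_CHARS = 12


def route_message(message: str,
                  classify: Callable[[str, List[str]], Optional[str]],
                  categories: List[str]) -> str:
    """Return 'bypass', 'passthrough', or 'route:<category>' for one message."""
    text = message.strip()
    # Slash commands bypass routing completely.
    if text.startswith("/"):
        return "bypass"
    # Very short messages skip the classifier and use the passthrough/default model.
    if len(text) < SHORT_MESSAGE_CHARS:
        return "passthrough"
    # One isolated classifier call: only the message and the category names.
    category = classify(text, categories)
    if category in categories:
        return "route:" + category
    # No benchmarked category matched: stay on the passthrough/default path.
    return "passthrough"


# Stand-in classifier for the example; the real one is an LLM call via OpenClaw.
def keyword_classifier(text: str, categories: List[str]) -> Optional[str]:
    return "content_creation" if "post" in text.lower() else None


cats = ["content_creation", "email_classification"]
print(route_message("/status", keyword_classifier, cats))
print(route_message("thanks!", keyword_classifier, cats))
print(route_message("Write me a LinkedIn post about our product launch", keyword_classifier, cats))
print(route_message("What is the capital of France?", keyword_classifier, cats))
```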

Why Custom Benchmarking Matters

Most routing approaches, from complexity tiering to auto-classifiers, rely on blanket categorization. This breaks down because:

Generic classification is too broad: "email tasks" lumps cold outreach, complaint triage, and legal notices together. Model performance varies dramatically across these subtypes.
Generic benchmarks are equally broad: MMLU, Arena Elo, and HumanEval test general capabilities. A model scoring well on "writing" tells you nothing about your email templates with your tone requirements.
Real cost is invisible in pricing: published $/M token rates can be misleading. Tokenization differs between models, and chain-of-thought reasoning inflates output tokens, so a model at $0.60/M can cost more per call than one at $3/M.
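A quick worked example makes the last point concrete. The token counts below are made up, and the sketch assumes one blended $/M rate per model, whereas real pricing usually charges input and output tokens separately.

```python
def cost_per_call(input_tokens: int, output_tokens: int, usd_per_million: float) -> float:
    """Cost of one call at a single blended $/M-token rate (a simplification)."""
    return (input_tokens + output_tokens) * usd_per_million / 1_000_000


# Illustrative token counts, not measurements: the "cheap" model tokenizes the
# same prompt into more tokens and emits a long chain of thought before answering.
cheap = cost_per_call(input_tokens=1_500, output_tokens=12_000, usd_per_million=0.60)
pricey = cost_per_call(input_tokens=1_000, output_tokens=1_200, usd_per_million=3.00)

print(f"$0.60/M model: ${cheap:.4f} per call")
print(f"$3/M model:    ${pricey:.4f} per call")
```

With these assumed counts, the $0.60/M model costs $0.0081 per call and the $3/M model $0.0066, so the nominally cheaper model loses once tokenization and reasoning overhead are counted.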

When you benchmark on OpenMark AI, you test models on your specific task, with your prompts, against your criteria.

Routing Engine

The full system is not a heuristic router. An LLM first performs semantic task classification on the current message. After that category is known, the model-selection step is fully deterministic and uses benchmark data plus the selected strategy to rank the available candidates. The deterministic model-selection phase itself completes in ~60ms.

6-Step Cascade Sort

1. Incomplete models pushed to bottom
2. Score descending
3. Accuracy per dollar descending
4. Accuracy per minute descending
5. Cost ascending
6. Model name alphabetical (deterministic tiebreaker)
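The cascade above maps naturally onto a tuple sort key. The sketch below is an illustration under assumptions, not the code in `router.py`: the `BenchmarkRow` fields (including a boolean `complete` flag standing in for "incomplete") are hypothetical, and accuracy per minute is taken as score divided by latency in minutes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BenchmarkRow:
    # Hypothetical fields; the real CSV columns and router.py types may differ.
    model: str
    score: float      # benchmark score, percent
    cost: float       # USD per call
    latency_s: float  # seconds per call
    complete: bool    # False if the model did not finish every benchmark run


def per_unit(score: float, denom: float) -> float:
    """score / denom, treating a zero denominator as infinitely efficient."""
    return score / denom if denom > 0 else float("inf")


def cascade_key(row: BenchmarkRow) -> Tuple:
    # sorted() is ascending, so the "descending" criteria are negated.
    return (
        not row.complete,                            # 1. incomplete models last
        -row.score,                                  # 2. score, descending
        -per_unit(row.score, row.cost),              # 3. accuracy per dollar, descending
        -per_unit(row.score, row.latency_s / 60.0),  # 4. accuracy per minute, descending
        row.cost,                                    # 5. cost, ascending
        row.model,                                   # 6. name, alphabetical tiebreaker
    )


# Illustrative rows, not real benchmark results.
rows: List[BenchmarkRow] = [
    BenchmarkRow("b-model", 90.0, 0.010, 20.0, True),
    BenchmarkRow("a-model", 90.0, 0.010, 20.0, True),
    BenchmarkRow("fast", 90.0, 0.010, 10.0, True),
    BenchmarkRow("top", 95.0, 0.050, 40.0, False),
    BenchmarkRow("cheap", 90.0, 0.005, 20.0, True),
]
print([r.model for r in sorted(rows, key=cascade_key)])
```

Because Python compares tuples element by element, each criterion only matters when all earlier ones tie, which is exactly the cascade behavior: here the incomplete `top` model sorts last despite its higher score.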

Five Strategies

| Strategy | What It Optimizes |
|---|---|
| `balanced` | Weighted: accuracy (40%) + cost-efficiency (20%) + speed (25%) + stability (15%) |
| `best_score` | Highest benchmark accuracy |
| `best_cost_efficiency` | Best accuracy per dollar among viable models |
| `best_under_budget` | Highest score within your cost ceiling |
| `best_under_latency` | Highest score within your latency ceiling |
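Here is one way the five strategies could choose a winner. It is a sketch, not the router's implementation: rows are assumed to have already passed the viability floor, costs are assumed positive, the `variance` field stands in for stability, and min-max normalization of each balanced-score component is an assumption, since the README gives only the weights.

```python
from typing import Dict, List, Optional

# Weights from the strategy table; the normalization below is an assumption.
BALANCED_WEIGHTS = {"accuracy": 0.40, "cost_eff": 0.20, "speed": 0.25, "stability": 0.15}


def normalize(values: List[float], higher_is_better: bool = True) -> List[float]:
    """Min-max scale to [0, 1]; if all values are equal, every entry gets 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def pick(rows: List[Dict], strategy: str,
         cost_ceiling: Optional[float] = None,
         latency_ceiling_s: Optional[float] = None) -> Optional[Dict]:
    """Pick a winner from rows that already passed the viability floor (cost > 0)."""
    # Ceilings only apply to their own strategy; an unset ceiling filters nothing.
    if strategy == "best_under_budget" and cost_ceiling is not None:
        rows = [r for r in rows if r["cost"] <= cost_ceiling]
    if strategy == "best_under_latency" and latency_ceiling_s is not None:
        rows = [r for r in rows if r["latency_s"] <= latency_ceiling_s]
    if not rows:
        return None
    if strategy in ("best_score", "best_under_budget", "best_under_latency"):
        return max(rows, key=lambda r: r["score"])
    if strategy == "best_cost_efficiency":
        return max(rows, key=lambda r: r["score"] / r["cost"])
    if strategy == "balanced":
        acc = normalize([r["score"] for r in rows])
        eff = normalize([r["score"] / r["cost"] for r in rows])
        spd = normalize([r["latency_s"] for r in rows], higher_is_better=False)
        stb = normalize([r["variance"] for r in rows], higher_is_better=False)
        w = BALANCED_WEIGHTS
        totals = [w["accuracy"] * a + w["cost_eff"] * e + w["speed"] * s + w["stability"] * t
                  for a, e, s, t in zip(acc, eff, spd, stb)]
        return rows[totals.index(max(totals))]
    raise ValueError("unknown strategy: " + strategy)


# Illustrative rows, not real benchmark results.
rows = [
    {"model": "flagship", "score": 95.0, "cost": 0.100, "latency_s": 60.0, "variance": 2.0},
    {"model": "mini", "score": 90.0, "cost": 0.005, "latency_s": 15.0, "variance": 3.0},
    {"model": "nano", "score": 85.0, "cost": 0.001, "latency_s": 8.0, "variance": 6.0},
]
for s in ("balanced", "best_score", "best_cost_efficiency"):
    print(s, "->", pick(rows, s)["model"])
print("best_under_budget ->", pick(rows, "best_under_budget", cost_ceiling=0.01)["model"])
print("best_under_latency ->", pick(rows, "best_under_latency", latency_ceiling_s=10.0)["model"])
```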

Viability Floor

`floor = max(top_score - 15pp, top_score * 0.5)`, where pp means percentage points. Models scoring below the floor are excluded from routing.
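The floor formula is easy to check numerically. In this sketch (function names are illustrative, not the router's), note that the absolute 15-point term dominates whenever the top score is above 30%; the relative 50% term only takes over for weak categories.

```python
from typing import Dict, List


def viability_floor(top_score: float) -> float:
    """floor = max(top_score - 15 percentage points, top_score * 0.5)."""
    return max(top_score - 15.0, top_score * 0.5)


def apply_floor(rows: List[Dict]) -> List[Dict]:
    """Drop rows scoring below the floor derived from the best score in the category."""
    if not rows:
        return []
    floor = viability_floor(max(r["score"] for r in rows))
    return [r for r in rows if r["score"] >= floor]


print(viability_floor(92.9))  # 15pp term wins, about 77.9
print(viability_floor(20.0))  # 50% term wins: 10.0
```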

Fallbacks

The router provides ranked fallback models from the same benchmark. On the seamless hot path, those fallbacks are injected into the effective OpenClaw model config for that reply run. If the internal rerun path is unavailable, the compatibility fallback still persists the routed primary + fallbacks to OpenClaw's config.
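Selecting fallbacks amounts to slicing the ranked list, sized by `fallback_count` (default 2). The helper names and the `{"primary": ..., "fallbacks": [...]}` dict shape below are hypothetical and do not reflect OpenClaw's actual model config schema.

```python
from typing import Dict, List, Tuple


def primary_and_fallbacks(ranked: List[str], fallback_count: int = 2) -> Tuple[str, List[str]]:
    """Split a ranked, already-filtered model list into a primary and its fallbacks."""
    if not ranked:
        raise ValueError("no viable model for this category")
    return ranked[0], ranked[1:1 + fallback_count]


def routed_stack(ranked: List[str], fallback_count: int = 2) -> Dict[str, object]:
    # Hypothetical shape for illustration; not OpenClaw's actual config schema.
    primary, fallbacks = primary_and_fallbacks(ranked, fallback_count)
    return {"primary": primary, "fallbacks": fallbacks}


ranked = ["openai/gpt-5.4-nano", "openai/gpt-5.4-mini",
          "anthropic/claude-haiku-4-5", "google/gemini-3.1-flash-lite-preview"]
print(routed_stack(ranked))
```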

Configuration

Edit config.json in the plugin directory.

config.json is the source of truth for routing behavior. The plugin metadata schema exposes the core knobs, but the full router configuration lives in config.json. The local dashboard can edit a small subset of those settings, currently routing_strategy and show_routing_card.

| Field | Default | Description |
|---|---|---|
| `classifier_model` | `""` (your default model) | Override: model for the isolated classification call. Leave empty to use your captured default model. |
| `no_route_passthrough` | `""` (your default model) | Override: model for unrouted messages and short-message passthrough. Leave empty to use your captured default model. |
| `routing_strategy` | `balanced` | Default ranking strategy |
| `port` | `2098` | Embedded server port |
| `gateway_port` | `18789` | OpenClaw gateway port (for classification calls via loopback) |
| `show_routing_card` | `true` | Prepend routing card to responses |
| `restore_delay_s` | `30` | Fallback restore timer (used only if hook-based restore is unavailable) |
| `benchmarks_dir` | `benchmarks` | Benchmark directory relative to the plugin root |
| `cost_ceiling` | `null` | Max cost per run (for `best_under_budget`) |
| `latency_ceiling_s` | `null` | Max seconds per run (for `best_under_latency`) |
| `freshness_warning_days` | `30` | Warn when benchmark data is older than this |
| `min_completion_pct` | `80` | Skip models below this completion rate |
| `min_stability_threshold` | `10.0` | Skip models with variance above this |
| `fallback_count` | `2` | Number of fallback models |

By default, the router uses your existing default model for both classification and passthrough. The plugin captures your default model on startup (before setting openmark/auto). The TypeScript plugin reads the runtime-facing fields such as ports, classifier/passthrough models, routing-card display, and restore timing. The Python routing engine reads the benchmark-selection fields such as strategy, ceilings, freshness, benchmark directory, stability filters, and fallback count from the same config.json.
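To make the defaults and the captured-default-model behavior concrete, here is a sketch of merging config.json over the table's defaults. `load_config` and `DEFAULTS` are hypothetical; the real plugin splits these fields between the TypeScript runtime and the Python engine rather than loading them in one place.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Defaults copied from the configuration table above.
DEFAULTS: Dict[str, Any] = {
    "classifier_model": "",
    "no_route_passthrough": "",
    "routing_strategy": "balanced",
    "port": 2098,
    "gateway_port": 18789,
    "show_routing_card": True,
    "restore_delay_s": 30,
    "benchmarks_dir": "benchmarks",
    "cost_ceiling": None,
    "latency_ceiling_s": None,
    "freshness_warning_days": 30,
    "min_completion_pct": 80,
    "min_stability_threshold": 10.0,
    "fallback_count": 2,
}


def load_config(plugin_root: Path, captured_default_model: str) -> Dict[str, Any]:
    """Hypothetical loader: user values from config.json override the defaults."""
    path = plugin_root / "config.json"
    user = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    cfg = {**DEFAULTS, **user}
    # Empty classifier/passthrough models fall back to the captured default model.
    for key in ("classifier_model", "no_route_passthrough"):
        if not cfg[key]:
            cfg[key] = captured_default_model
    # benchmarks_dir is resolved relative to the plugin root.
    cfg["benchmarks_dir"] = str(plugin_root / cfg["benchmarks_dir"])
    return cfg
```

Keys missing from config.json keep their defaults, so under this sketch a file containing only `{"routing_strategy": "best_score"}` would change the strategy and nothing else.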

Classifier Recommendations

The plugin does not require an expensive model for classification. The classifier call is intentionally isolated and lightweight, so a small, cheap model is usually the right choice.

Good classifier candidates include:

google/gemini-3.1-flash-lite-preview
openai/gpt-5.4-nano
openai/gpt-5.4-mini
anthropic/claude-haiku-4-5

Other models can work too. The important rule is: use a model id that OpenClaw itself can resolve.

For example, this mattered during validation:

anthropic/claude-haiku-4-5 worked
anthropic/claude-haiku-4.5 failed because OpenClaw did not recognize that id

If you leave classifier_model empty, the plugin uses your captured default model instead.

Provider Access And OpenRouter Fallback

The router does not blindly trust every model listed in a benchmark CSV.

It first tries to detect which providers/hosts your OpenClaw install can currently use.
When a benchmark row includes both a direct provider key and an OpenRouter key, the router prefers the direct provider key first.
If that direct provider is not available but openrouter is available, the router falls back to the OpenRouter key for that same model.
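The preference order can be sketched as a small resolver. This is an illustration, not the router's code: the `available` set stands in for whatever provider detection reports, and taking the provider name from the prefix of the OC Key is an assumption based on the key formats shown in the CSV section.

```python
from typing import Optional, Set


def resolve_model_key(oc_key: str, oc_or_key: str, available: Set[str]) -> Optional[str]:
    """Choose the model id to run for one benchmark row, or None if unreachable."""
    # Assumption: the provider is the segment before the first "/" in the OC Key.
    provider = oc_key.split("/", 1)[0] if oc_key else ""
    if provider and provider in available:
        return oc_key        # direct provider access is preferred
    if oc_or_key and "openrouter" in available:
        return oc_or_key     # same model, reached through OpenRouter
    return None              # neither path is available: drop this row


row = ("openai/gpt-5.4", "openrouter/openai/gpt-5.4")
print(resolve_model_key(*row, available={"openai", "openrouter"}))
print(resolve_model_key(*row, available={"anthropic", "openrouter"}))
print(resolve_model_key(*row, available={"anthropic"}))
```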

In practice, this means:

direct provider access is preferred when you already have it configured
OpenRouter can expand coverage when the benchmark includes OC OR Key values and your OpenClaw setup has OpenRouter access
users do not need every benchmarked model to be available through direct APIs if an OpenRouter path exists for those rows
users can also benchmark only one provider and still get useful routing within that provider's model lineup

This also means the router is still useful when your setup is based on subscriptions, hosted access, OAuth-backed providers, or even free-model-heavy setups, as long as OpenClaw can execute the model ids involved. On many of those setups, flagship models also tend to be the first ones to feel expensive, constrained, or practically rate-limited, so routing can stretch the value of the access you already have by reserving premium models for the tasks that actually need them.

Important billing caveat:

the router can improve model choice and often reduce spend substantially based on your benchmark data
in some workloads, especially when teams are overusing flagship models for routine tasks, benchmark-driven routing can reduce model spend by around 50-80% or more
provider billing, quota, and rate-limit behavior still depend on the provider and account type, so treat the exact savings as workload-dependent rather than guaranteed

Important limitation:

provider detection is best-effort
it is based on what OpenClaw reports for configured providers/hosts
exact model execution still depends on OpenClaw accepting the exact model id string

So provider/host access is the first filter, but canonical model ids still matter. A provider may be available while a specific model string is still rejected if the id is not the exact OpenClaw-supported form.

Troubleshooting Notes

Classifier Fallback Chain

The normal classifier path is:

isolated simple-completion call through OpenClaw
subagent fallback only if that isolated path fails

The subagent path exists as a resilience fallback, not as the preferred architecture. If you see subagent-related classifier logs, treat that as a troubleshooting clue that the isolated simple-completion path could not prepare or execute the selected classifier model cleanly.
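The two-step chain amounts to a try-then-fallback wrapper. In this sketch the function names, the log wording, and returning `None` (no route, so passthrough) when both paths fail are assumptions; the real chain lives in the TypeScript classifier and goes through OpenClaw.

```python
from typing import Callable, List, Optional

Classifier = Callable[[str, List[str]], Optional[str]]


def classify_with_fallback(message: str, categories: List[str],
                           isolated: Classifier, subagent: Classifier,
                           log: Callable[[str], None] = print) -> Optional[str]:
    """Try the isolated simple-completion path first; use the subagent path only if it fails."""
    try:
        return isolated(message, categories)
    except Exception as exc:  # e.g. the classifier model could not be prepared or executed
        log(f"isolated classifier failed ({exc}); falling back to subagent")
    try:
        return subagent(message, categories)
    except Exception as exc:
        # Assumption: with no category, the message stays on the passthrough path.
        log(f"subagent classifier failed ({exc}); no route")
        return None


def broken_isolated(message: str, categories: List[str]) -> Optional[str]:
    raise RuntimeError("classifier model could not be prepared")


def working_subagent(message: str, categories: List[str]) -> Optional[str]:
    return categories[0]


logs: List[str] = []
result = classify_with_fallback("Triage this customer complaint email",
                                ["email_classification"],
                                broken_isolated, working_subagent, log=logs.append)
print(result, logs)
```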

Provider Detection Commands

These are the commands used to inspect provider/host availability:

```bash
openclaw models status --json
python scripts/router.py --detect-providers --force-detect
```

Run the Python command from the plugin root:

```bash
cd /path/to/openmark-router
python scripts/router.py --detect-providers --force-detect
```

Or, without changing directories, pass an absolute path to the script:

```bash
python /absolute/path/to/openmark-router/scripts/router.py --detect-providers --force-detect
```

Dashboard Scope

Current dashboard:

router health/version
detected providers/hosts currently available in OpenClaw
active benchmark categories and freshness visibility
routing_strategy selection
show_routing_card toggle
benchmark CSV import into the configured benchmarks directory
benchmark CSV deletion from the configured benchmarks directory
category-description modal from the loaded benchmark metadata

Still planned for later:

manual routed-model lock/unlock
richer OpenMark-branded UX and broader controls

CSV Format

Use Export -> OpenClaw on OpenMark AI. The CSV includes metadata headers and dual model keys:

```csv
# task_name: email_classification
# display_name: Email Classification Benchmark
# description: Classifies emails by intent, priority, and category.
"Model","Provider","OC Key","OC OR Key","Score (%)",...
```

OC Key: direct provider model key (e.g., openai/gpt-5.4)
OC OR Key: OpenRouter model key (e.g., openrouter/openai/gpt-5.4)
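Files in this format can be read with the Python standard library alone, which fits the router's no-pip-dependencies design. The parser below is a hypothetical sketch, not `loader.py`: it assumes metadata lines always start with `#` and use `key: value`, and the sample keeps only the columns shown above (a real export has more, elided as `...`) with an illustrative score value.

```python
import csv
import io
from typing import Dict, List, Tuple


def parse_openmark_csv(text: str) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Split '# key: value' metadata lines from the CSV body and parse both."""
    meta: Dict[str, str] = {}
    body_lines: List[str] = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Split on the first colon only, so descriptions may contain colons.
            key, sep, value = line[1:].partition(":")
            if sep:
                meta[key.strip()] = value.strip()
        else:
            body_lines.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(body_lines))))
    return meta, rows


# Sample with an illustrative row; values are not real benchmark results.
SAMPLE = '''# task_name: email_classification
# display_name: Email Classification Benchmark
# description: Classifies emails by intent, priority, and category.
"Model","Provider","OC Key","OC OR Key","Score (%)"
"GPT-5.4","OpenAI","openai/gpt-5.4","openrouter/openai/gpt-5.4","91.2"
'''

meta, rows = parse_openmark_csv(SAMPLE)
print(meta["display_name"])
print(rows[0]["OC Key"], rows[0]["OC OR Key"])
```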

Project Structure

```text
openmark-router/
  src/
    index.ts               # Plugin entry: registerProvider + registerService
    server.ts              # Embedded HTTP server (OpenAI-compatible)
    classifier.ts          # Isolated LLM classifier
    provider-inject.ts     # Auto-configure provider in openclaw.json
    router-bridge.ts       # TypeScript bridge to router.py subprocess
    types.ts               # Shared types
  scripts/
    router.py              # Core routing engine (1200+ lines)
    loader.py              # OpenMark CSV parser
    adapter.py             # Model ID translation
  dist/                    # Compiled JS
  benchmarks/
    examples/
      chatbot_potential.csv  # Sample CSV
  config.json              # Routing configuration
  package.json             # Plugin package
  openclaw.plugin.json     # Plugin metadata
  clawhub.json             # ClawHub metadata
  LICENSE                  # Apache-2.0
```

Trust and Security

OpenClaw-managed credentials: users do not provide provider API keys directly to the plugin, and all model calls go through the OpenClaw gateway/runtime.
Clean install: the routing engine uses only the Python standard library, with no pip dependencies.
Declared local writes: The plugin may update openclaw.json plus main-session state under ~/.openclaw/agents/main/sessions/ to preserve same-turn routing continuity and compatibility fallback behavior.
No network requests from routing engine: All benchmark data is local CSV files.
Isolated classification: The classifier call uses only the current user message and category names -- no main-session history, no routed-model context, no borrowed system prompt state.
No provider preference: Classification and passthrough can use any model that OpenClaw can resolve. Small cheap models are encouraged.
Best-effort provider filtering: The router attempts to detect which providers/hosts are available in OpenClaw and filters benchmark candidates accordingly. Direct provider keys are preferred first, with OpenRouter fallback when available for the same row.

Requirements

Python 3.8+ (stdlib only)
Node.js 18+ (for plugin runtime)
OpenClaw 2026.3.28+

License

The repository is currently Apache-2.0 licensed -- see LICENSE.

Additional repo notes:

TRADEMARK.md: OpenMark brand and affiliation note

OpenMark AI Model Router

Install

OpenMark AI Model Router

Install (recommended)

Install From Source (development)

After any install or update, restart the OpenClaw gateway so the newly built plugin files are loaded.

Local Dashboard

30-Second Example

Validated Surfaces

Why This Is Useful

Quick Start

How It Works

Runtime Access

What Happens Per Message

Why Custom Benchmarking Matters

Routing Engine

6-Step Cascade Sort

Five Strategies

Viability Floor

Fallbacks

Configuration

Classifier Recommendations

Provider Access And OpenRouter Fallback

Troubleshooting Notes

Classifier Fallback Chain

Provider Detection Commands

Dashboard Scope

CSV Format

Project Structure

Trust and Security

Requirements

Links

License