Task-type based quantization routing plugin for OpenClaw that routes requests across 4bit, 8bit, and 16bit targets to balance cost and accuracy.
README
QuantClaw: Precision Where It Matters for OpenClaw
[Blog] | [GitHub] | [Paper] (Coming soon)
QuantClaw is a plug-and-play task-type quantization routing plugin for OpenClaw. It classifies each incoming request, maps it to a precision tier (4bit, 8bit, or 16bit), and routes the request to the matching model target, balancing quality, latency, and cost without asking users to choose a precision manually.
🚀 Quick Start
Install
# Prerequisite: OpenClaw is already installed.
# Install from Clawhub (recommended)
openclaw plugins install clawhub:@sparkengineai/quantclaw
# If OpenClaw is running from a source checkout and the CLI is not on PATH:
cd /path/to/openclaw
node openclaw.mjs plugins install @sparkengineai/quantclaw
# Or install from source
git clone https://github.com/SparkEngineAI/QuantClaw-plugin.git ./quantclaw
openclaw plugins install ./quantclaw
# If the OpenClaw CLI is not on PATH:
cd /path/to/openclaw
node openclaw.mjs plugins install /path/to/quantclaw
Create or bootstrap the runtime config
QuantClaw reads its runtime config from:
~/.openclaw/quantclaw.json
If the file does not exist, starting OpenClaw with the plugin enabled will generate a default quantclaw.json. If you are working from this repository directly, you can also start from the provided example:
cp config.example.json ~/.openclaw/quantclaw.json
Edit the detector chain and targets
{
"quant": {
"enabled": true,
"detectors": ["ruleDetector", "loadModelDetector"],
"judge": {
"endpoint": "http://127.0.0.1:8000",
"model": "BAAI/bge-m3",
"providerType": "openai-compatible",
"apiKey": "",
"cacheTtlMs": 300000
}
}
}
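The cacheTtlMs field suggests that judge classifications are cached for a fixed time window (300000 ms, i.e. five minutes, in the example above). As an illustration only, not QuantClaw's actual implementation, a minimal TTL cache keyed by prompt might look like:

```python
import time

class TTLCache:
    """Minimal time-to-live cache, keyed by prompt text.

    Illustrative sketch; QuantClaw's real cache may differ.
    """

    def __init__(self, ttl_ms: int = 300000):
        self.ttl_s = ttl_ms / 1000.0
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # evict stale entry
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl_s)

cache = TTLCache(ttl_ms=300000)
cache.put("fix this bug", "coding")
print(cache.get("fix this bug"))  # cached classification: coding
```

A cache like this keeps repeated or retried prompts from hitting the judge endpoint on every request.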
Start OpenClaw and open the dashboard
http://127.0.0.1:18789/plugins/quantclaw/stats
⚙️ Configuration Notes
The runtime schema supports:
- ordered detectors: ruleDetector, loadModelDetector
- per-task-type id, description, precision, keywords, and patterns
- per-tier model targets with independent provider, model, endpoint, API key, and pricing
- model-level pricing overrides for cost reporting
- hot reload when ~/.openclaw/quantclaw.json changes
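Hot reload can be approximated by polling the config file's modification time. The sketch below is illustrative (QuantClaw may instead use event-based file watching):

```python
import json
import os

class ConfigReloader:
    """Reload a JSON config whenever its mtime changes.

    Illustrative sketch, not QuantClaw's actual reload mechanism.
    """

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self.config = None

    def poll(self):
        """Re-read the file if it changed; return True on reload."""
        mtime = os.path.getmtime(self.path)
        if mtime == self._mtime:
            return False
        with open(self.path, "r", encoding="utf-8") as f:
            self.config = json.load(f)
        self._mtime = mtime
        return True
```

Calling poll() on a timer gives the same observable behavior as hot reload: edits to ~/.openclaw/quantclaw.json take effect without restarting OpenClaw.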
Example taskTypes config:
{
"taskTypes": [
{
"id": "coding",
"precision": "16bit",
"description": "code review, bug analysis, implementation, debugging, kernels, async behavior, web development",
"keywords": ["code", "debug", "bug", "Python", "CUDA", "编程", "代码"],
"patterns": [
"fix the bug in this repository",
"(?=.*(?:refactor|重构))(?=.*(?:typescript|ts|node)).*"
]
}
],
"defaultTaskType": "standard"
}
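A rule detector over this shape can be sketched as a first-match pass: check each task type's keywords, then its regex patterns, and fall back to defaultTaskType. This is an illustrative reading of the fields above, not the plugin's actual ruleDetector code:

```python
import re

# Assumed to mirror the taskTypes entries in quantclaw.json.
TASK_TYPES = [
    {
        "id": "coding",
        "precision": "16bit",
        "keywords": ["code", "debug", "bug", "Python", "CUDA"],
        "patterns": [
            r"fix the bug in this repository",
            r"(?=.*(?:refactor))(?=.*(?:typescript|ts|node)).*",
        ],
    },
]
DEFAULT_TASK_TYPE = "standard"

def classify(prompt, task_types=TASK_TYPES, default=DEFAULT_TASK_TYPE):
    """Return the first task type whose keywords or patterns match."""
    lowered = prompt.lower()
    for task in task_types:
        if any(kw.lower() in lowered for kw in task.get("keywords", [])):
            return task["id"]
        if any(re.search(p, lowered) for p in task.get("patterns", [])):
            return task["id"]
    return default

print(classify("Please fix the bug in this repository"))  # coding
print(classify("Summarize this article"))                 # standard
```

The lookahead pattern shows why patterns complement keywords: it fires only when both a refactoring term and a TypeScript/Node term appear, which plain keyword matching cannot express.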
Example targets config:
{
"targets": {
"4bit": {
"provider": "quantclaw-4bit",
"model": "glm-4.7-flash-int4-autoround",
"endpoint": "https://api.example.com/v1",
"apiKey": "${QC_4BIT_API_KEY}",
"displayName": "4-bit Target",
"pricing": {
"inputPer1M": 0.051,
"outputPer1M": 0.34
}
},
"16bit": {
"provider": "quantclaw-16bit",
"model": "glm-4.7-flash",
"endpoint": "https://api.openai.com/v1",
"apiKey": "${QC_16BIT_API_KEY}",
"displayName": "16-bit Target",
"pricing": {
"inputPer1M": 0.06,
"outputPer1M": 0.4
}
}
}
}
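The ${QC_4BIT_API_KEY}-style placeholders suggest environment-variable expansion when targets are loaded. A minimal sketch of that substitution (assumed behavior, not confirmed against the plugin source):

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env(value, env=os.environ):
    """Replace ${VAR} references with environment values ('' if unset)."""
    return _ENV_REF.sub(lambda m: env.get(m.group(1), ""), value)

os.environ["QC_16BIT_API_KEY"] = "sk-demo"
print(expand_env("${QC_16BIT_API_KEY}"))  # sk-demo
```

Keeping keys in the environment rather than in quantclaw.json means the config file can be committed or shared without leaking credentials.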
Example modelPricing overrides:
{
"modelPricing": {
"glm-4.7-flash": {
"inputPer1M": 0.06,
"outputPer1M": 0.4
},
"glm-4.7-flash-int4-autoround": {
"inputPer1M": 0.051,
"outputPer1M": 0.34
}
}
}
Target-level pricing is used first for that precision tier. If it is absent, QuantClaw falls back to modelPricing for cost reporting.
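That lookup order can be written out as a small cost helper, a sketch built from the pricing fields shown above:

```python
def lookup_pricing(tier_cfg, model_pricing):
    """Target-level pricing wins; fall back to modelPricing by model name."""
    if "pricing" in tier_cfg:
        return tier_cfg["pricing"]
    return model_pricing.get(tier_cfg["model"])

def estimate_cost(input_tokens, output_tokens, pricing):
    """USD cost given per-million-token rates."""
    return (input_tokens / 1e6) * pricing["inputPer1M"] \
         + (output_tokens / 1e6) * pricing["outputPer1M"]

# No target-level pricing here, so modelPricing supplies the rates.
tier = {"model": "glm-4.7-flash"}
model_pricing = {"glm-4.7-flash": {"inputPer1M": 0.06, "outputPer1M": 0.4}}
p = lookup_pricing(tier, model_pricing)
print(estimate_cost(1_000_000, 500_000, p))  # ~0.26 USD
```

With the example rates, a request that consumes 1M input and 0.5M output tokens on the 16-bit tier costs about $0.26, versus roughly $0.22 on the 4-bit tier, which is the cost gap the router exploits.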
🧠 loadModelDetector Backends
loadModelDetector supports either a local embedding-based router exposed through an OpenAI-compatible API or a regular OpenAI-compatible LLM judge.
Build a local embedding router index:
python router/embedding_task_router.py --model-name BAAI/bge-m3 --device cuda --config-path ~/.openclaw/quantclaw.json --output-dir ./embedding_router_index-bge-m3 build --print-summary
Serve that router as an OpenAI-compatible endpoint:
python router/embedding_task_router_server.py --model-name BAAI/bge-m3 --device cuda --output-dir ./embedding_router_index-bge-m3 --port 8012
If your machine does not have a GPU, change --device cuda to --device cpu.
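Conceptually, the embedding router indexes a vector per task type (built from its description and keywords) and routes each query to the nearest task by cosine similarity. A toy, stdlib-only sketch of that retrieval step (the real index stores BAAI/bge-m3 embeddings, not 3-dimensional vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route(query_vec, index):
    """Return the task id whose indexed vector is most similar."""
    return max(index, key=lambda task_id: cosine(query_vec, index[task_id]))

# Toy 3-d "embeddings"; build_summary in the real router prints the
# actual per-task index built from quantclaw.json.
index = {"coding": [1.0, 0.1, 0.0], "standard": [0.0, 0.2, 1.0]}
print(route([0.9, 0.0, 0.1], index))  # coding
```

Unlike the keyword-based ruleDetector, this catches paraphrases ("my script throws an exception") that share no literal keyword with the task description.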
If you do not want to run the local embedding router, you can point quant.judge.endpoint at any OpenAI-compatible LLM endpoint instead.
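In that setup the judge presumably issues a standard OpenAI-compatible chat-completions call asking the model to pick a task id. The request shape below follows the OpenAI-compatible convention, but the prompt wording and response handling are illustrative assumptions, not QuantClaw's actual protocol:

```python
import json

def build_judge_request(prompt, model, task_ids):
    """Build an OpenAI-compatible /v1/chat/completions request body
    asking the judge model to pick one task id. Illustrative only."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Classify the user request into exactly one of: "
                        + ", ".join(task_ids)},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0,  # deterministic labels suit routing
    }

body = build_judge_request("fix this CUDA kernel",
                           "my-judge-model", ["coding", "standard"])
print(json.dumps(body, indent=2))
```

POSTing this body to quant.judge.endpoint + "/v1/chat/completions" with the configured apiKey is all an OpenAI-compatible judge needs.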
🙏 Acknowledgements
We especially acknowledge:
👥 Core Contributors
Manyi Zhang, Ji-Fu Li*, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai (Project Lead), Xiaobo Xia
📖 Citation
If QuantClaw helps your research, engineering work, or benchmark studies, please cite:
@misc{QuantClawBlog,
title = {QuantClaw: Precision Where It Matters for OpenClaw},
url = {https://sparkengineai.github.io/QuantClaw/},
author = {SparkEngineAI Team},
month = {April},
year = {2026}
}
Capabilities
- configSchema: Yes
- Executes code: Yes
- HTTP routes: 0
- Runtime ID: quantclaw
Compatibility
- Built with OpenClaw version: >=2026.3.22
- Plugin API range: >=2026.3.22
Verification
- Tier: source linked
- Scope: artifact only
- Summary: Validated package structure and linked the release to source metadata.
- Commit: ce3bd4c14dcb
- Tag: ce3bd4c14dcbdbfbcbce6c04de6b0d24bfe25fba
- Provenance: No
- Scan status: suspicious
Tags
- latest
- 2026.4.11
