Task-type based quantization routing plugin for OpenClaw that routes requests across 4bit, 8bit, and 16bit targets to balance cost and accuracy.
README
QuantClaw: Precision Where It Matters for OpenClaw
[Blog] | [GitHub] | [Paper] (Coming soon)
QuantClaw is a plug-and-play task-type quantization routing plugin for OpenClaw. It classifies each incoming request, maps it to a precision tier (4bit, 8bit, or 16bit), and routes the request to the matching model target, balancing quality, latency, and cost without asking users to choose a precision manually.
🚀 Quick Start
Install
# Prerequisite: OpenClaw is already installed.
# Install from Clawhub (recommended)
openclaw plugins install clawhub:@sparkengineai/quantclaw
# If OpenClaw is running from a source checkout and the CLI is not on PATH:
cd /path/to/openclaw
node openclaw.mjs plugins install @sparkengineai/quantclaw
# Or install from source
git clone https://github.com/SparkEngineAI/QuantClaw-plugin.git ./quantclaw
openclaw plugins install ./quantclaw
# If the OpenClaw CLI is not on PATH:
cd /path/to/openclaw
node openclaw.mjs plugins install /path/to/quantclaw
Create or bootstrap the runtime config
QuantClaw reads its runtime config from:
~/.openclaw/quantclaw.json
If the file does not exist, starting OpenClaw with the plugin enabled will generate a default quantclaw.json. If you are working from this repository directly, you can also start from the provided example:
cp config.example.json ~/.openclaw/quantclaw.json
Edit the detector chain and targets
{
"quant": {
"enabled": true,
"detectors": ["ruleDetector", "loadModelDetector"],
"judge": {
"endpoint": "http://127.0.0.1:8000",
"model": "BAAI/bge-m3",
"providerType": "openai-compatible",
"apiKey": "",
"cacheTtlMs": 300000
}
}
}
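The cacheTtlMs field suggests that judge classifications are cached for a fixed time window (300000 ms, i.e. five minutes, in the example above). As an illustration only, not QuantClaw's actual implementation, a minimal TTL cache keyed by prompt might look like:

```python
import time

class TTLCache:
    """Minimal time-to-live cache, keyed by prompt text.

    Illustrative sketch; QuantClaw's real cache may differ.
    """

    def __init__(self, ttl_ms: int = 300000):
        self.ttl_s = ttl_ms / 1000.0
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # evict stale entry
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl_s)

cache = TTLCache(ttl_ms=300000)
cache.put("fix this bug", "coding")
print(cache.get("fix this bug"))  # cached classification: coding
```

A cache like this keeps repeated or retried prompts from hitting the judge endpoint on every request.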
Start OpenClaw and open the dashboard
http://127.0.0.1:18789/plugins/quantclaw/stats
⚙️ Configuration Notes
The runtime schema supports:
- ordered detectors: ruleDetector, loadModelDetector
- per-task-type id, description, precision, keywords, and patterns
- per-tier model targets with independent provider, model, endpoint, API key, and pricing
- model-level pricing overrides for cost reporting
- hot reload when ~/.openclaw/quantclaw.json changes
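Hot reload can be approximated by polling the config file's modification time. The sketch below is illustrative (QuantClaw may instead use event-based file watching):

```python
import json
import os

class ConfigReloader:
    """Reload a JSON config whenever its mtime changes.

    Illustrative sketch, not QuantClaw's actual reload mechanism.
    """

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self.config = None

    def poll(self):
        """Re-read the file if it changed; return True on reload."""
        mtime = os.path.getmtime(self.path)
        if mtime == self._mtime:
            return False
        with open(self.path, "r", encoding="utf-8") as f:
            self.config = json.load(f)
        self._mtime = mtime
        return True
```

Calling poll() on a timer gives the same observable behavior as hot reload: edits to ~/.openclaw/quantclaw.json take effect without restarting OpenClaw.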
Example taskTypes config:
{
"taskTypes": [
{
"id": "coding",
"precision": "16bit",
"description": "code review, bug analysis, implementation, debugging, kernels, async behavior, web development",
"keywords": ["code", "debug", "bug", "Python", "CUDA", "编程", "代码"],
"patterns": [
"fix the bug in this repository",
"(?=.*(?:refactor|重构))(?=.*(?:typescript|ts|node)).*"
]
}
],
"defaultTaskType": "standard"
}
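A rule detector over this shape can be sketched as a first-match pass: check each task type's keywords, then its regex patterns, and fall back to defaultTaskType. This is an illustrative reading of the fields above, not the plugin's actual ruleDetector code:

```python
import re

# Assumed to mirror the taskTypes entries in quantclaw.json.
TASK_TYPES = [
    {
        "id": "coding",
        "precision": "16bit",
        "keywords": ["code", "debug", "bug", "Python", "CUDA"],
        "patterns": [
            r"fix the bug in this repository",
            r"(?=.*(?:refactor))(?=.*(?:typescript|ts|node)).*",
        ],
    },
]
DEFAULT_TASK_TYPE = "standard"

def classify(prompt, task_types=TASK_TYPES, default=DEFAULT_TASK_TYPE):
    """Return the first task type whose keywords or patterns match."""
    lowered = prompt.lower()
    for task in task_types:
        if any(kw.lower() in lowered for kw in task.get("keywords", [])):
            return task["id"]
        if any(re.search(p, lowered) for p in task.get("patterns", [])):
            return task["id"]
    return default

print(classify("Please fix the bug in this repository"))  # coding
print(classify("Summarize this article"))                 # standard
```

The lookahead pattern shows why patterns complement keywords: it fires only when both a refactoring term and a TypeScript/Node term appear, which plain keyword matching cannot express.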
Example targets config:
{
"targets": {
"4bit": {
"provider": "quantclaw-4bit",
"model": "glm-4.7-flash-int4-autoround",
"endpoint": "https://api.example.com/v1",
"apiKey": "${QC_4BIT_API_KEY}",
"displayName": "4-bit Target",
"pricing": {
"inputPer1M": 0.051,
"outputPer1M": 0.34
}
},
"16bit": {
"provider": "quantclaw-16bit",
"model": "glm-4.7-flash",
"endpoint": "https://api.openai.com/v1",
"apiKey": "${QC_16BIT_API_KEY}",
"displayName": "16-bit Target",
"pricing": {
"inputPer1M": 0.06,
"outputPer1M": 0.4
}
}
}
}
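The ${QC_4BIT_API_KEY}-style placeholders suggest environment-variable expansion when targets are loaded. A minimal sketch of that substitution (assumed behavior, not confirmed against the plugin source):

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env(value, env=os.environ):
    """Replace ${VAR} references with environment values ('' if unset)."""
    return _ENV_REF.sub(lambda m: env.get(m.group(1), ""), value)

os.environ["QC_16BIT_API_KEY"] = "sk-demo"
print(expand_env("${QC_16BIT_API_KEY}"))  # sk-demo
```

Keeping keys in the environment rather than in quantclaw.json means the config file can be committed or shared without leaking credentials.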
Example modelPricing overrides:
{
"modelPricing": {
"glm-4.7-flash": {
"inputPer1M": 0.06,
"outputPer1M": 0.4
},
"glm-4.7-flash-int4-autoround": {
"inputPer1M": 0.051,
"outputPer1M": 0.34
}
}
}
Target-level pricing is used first for that precision tier. If it is absent, QuantClaw falls back to modelPricing for cost reporting.
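That lookup order can be written out as a small cost helper, a sketch built from the pricing fields shown above:

```python
def lookup_pricing(tier_cfg, model_pricing):
    """Target-level pricing wins; fall back to modelPricing by model name."""
    if "pricing" in tier_cfg:
        return tier_cfg["pricing"]
    return model_pricing.get(tier_cfg["model"])

def estimate_cost(input_tokens, output_tokens, pricing):
    """USD cost given per-million-token rates."""
    return (input_tokens / 1e6) * pricing["inputPer1M"] \
         + (output_tokens / 1e6) * pricing["outputPer1M"]

# No target-level pricing here, so modelPricing supplies the rates.
tier = {"model": "glm-4.7-flash"}
model_pricing = {"glm-4.7-flash": {"inputPer1M": 0.06, "outputPer1M": 0.4}}
p = lookup_pricing(tier, model_pricing)
print(estimate_cost(1_000_000, 500_000, p))  # ~0.26 USD
```

With the example rates, a request that consumes 1M input and 0.5M output tokens on the 16-bit tier costs about $0.26, versus roughly $0.22 on the 4-bit tier, which is the cost gap the router exploits.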
🧠 loadModelDetector Backends
loadModelDetector supports either a local embedding-based router exposed through an OpenAI-compatible API or a regular OpenAI-compatible LLM judge.
Build a local embedding router index:
python router/embedding_task_router.py --model-name BAAI/bge-m3 --device cuda --config-path ~/.openclaw/quantclaw.json --output-dir ./embedding_router_index-bge-m3 build --print-summary
Serve that router as an OpenAI-compatible endpoint:
python router/embedding_task_router_server.py --model-name BAAI/bge-m3 --device cuda --output-dir ./embedding_router_index-bge-m3 --port 8012
If your machine does not have a GPU, change --device cuda to --device cpu.
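Conceptually, the embedding router indexes a vector per task type (built from its description and keywords) and routes each query to the nearest task by cosine similarity. A toy, stdlib-only sketch of that retrieval step (the real index stores BAAI/bge-m3 embeddings, not 3-dimensional vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route(query_vec, index):
    """Return the task id whose indexed vector is most similar."""
    return max(index, key=lambda task_id: cosine(query_vec, index[task_id]))

# Toy 3-d "embeddings"; build_summary in the real router prints the
# actual per-task index built from quantclaw.json.
index = {"coding": [1.0, 0.1, 0.0], "standard": [0.0, 0.2, 1.0]}
print(route([0.9, 0.0, 0.1], index))  # coding
```

Unlike the keyword-based ruleDetector, this catches paraphrases ("my script throws an exception") that share no literal keyword with the task description.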
If you do not want to run the local embedding router, you can point quant.judge.endpoint at any OpenAI-compatible LLM endpoint instead.
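In that setup the judge presumably issues a standard OpenAI-compatible chat-completions call asking the model to pick a task id. The request shape below follows the OpenAI-compatible convention, but the prompt wording and response handling are illustrative assumptions, not QuantClaw's actual protocol:

```python
import json

def build_judge_request(prompt, model, task_ids):
    """Build an OpenAI-compatible /v1/chat/completions request body
    asking the judge model to pick one task id. Illustrative only."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Classify the user request into exactly one of: "
                        + ", ".join(task_ids)},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0,  # deterministic labels suit routing
    }

body = build_judge_request("fix this CUDA kernel",
                           "my-judge-model", ["coding", "standard"])
print(json.dumps(body, indent=2))
```

POSTing this body to quant.judge.endpoint + "/v1/chat/completions" with the configured apiKey is all an OpenAI-compatible judge needs.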
🙏 Acknowledgements
We especially acknowledge:
👥 Core Contributors
Manyi Zhang, Ji-Fu Li*, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai (Project Lead), Xiaobo Xia
📖 Citation
If QuantClaw helps your research, engineering work, or benchmark studies, please cite:
@misc{QuantClawBlog,
title = {QuantClaw: Precision Where It Matters for OpenClaw},
url = {https://sparkengineai.github.io/QuantClaw/},
author = {SparkEngineAI Team},
month = {April},
year = {2026}
}
Capabilities
- configSchema: Yes
- Executes code: Yes
- HTTP routes: 0
- Runtime ID: quantclaw
Compatibility
- Built with OpenClaw version: >=2026.3.22
- Plugin API range: >=2026.3.22
Verification
- Tier: source linked
- Scope: artifact only
- Summary: Validated package structure and linked the release to source metadata.
- Commit: ce3bd4c14dcb
- Tag: ce3bd4c14dcbdbfbcbce6c04de6b0d24bfe25fba
- Provenance: No
- Scan status: suspicious
Tags
- latest
- 2026.4.11
