Install
openclaw skills install turboquant-plusTurboQuant+ compresses llama.cpp KV caches on Apple Silicon up to 6.4x with minimal quality loss, enabling larger models and longer contexts efficiently.
openclaw skills install turboquant-plusAccelerate local LLM inference on Apple Silicon with 3.8-6.4x KV cache compression via PolarQuant + Walsh-Hadamard rotation.
量化, KV压缩, 本地推理, llama.cpp, turboquant, KV cache, compression, Apple Silicon, Metal, turbo2, turbo3, turbo4
TurboQuant+ implements TurboQuant (ICLR 2026) for llama.cpp with Metal GPU kernels. It compresses the transformer KV cache to squeeze larger models and longer contexts into limited Apple Silicon memory — with minimal quality loss.
None. Works with the llama.cpp TurboQuant fork.
# Recommended default — turbo4 symmetric
llama-server -m model.gguf --cache-type-k turbo4 --cache-type-v turbo4 -fa 1
# Maximum compression — turbo3 symmetric
llama-server -m model.gguf --cache-type-k turbo3 --cache-type-v turbo3 -fa 1
# Extreme compression — turbo2 (best with asymmetric)
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v turbo2 -fa 1
Some low-bit weight models degrade with symmetric turbo. Use asymmetric K/V:
# K stays at q8_0, V compressed with turbo
llama-server -m model-Q4_K_M.gguf --cache-type-k q8_0 --cache-type-v turbo4 -fa 1
# Even more V compression
llama-server -m model-Q4_K_M.gguf --cache-type-k q8_0 --cache-type-v turbo3 -fa 1
Note: Larger models (70B, 104B) handle symmetric turbo fine. Asymmetric mainly benefits smaller Q4_K_M models.
For 70B+ models at 32K+ context on 128GB Macs, raise the GPU memory cap:
# Set to 90% of 128GB
sudo sysctl iogpu.wired_limit_mb=117964
# Then run with turbo3 for maximum context
llama-server -m Llama-70B-Q4_K_M.gguf --cache-type-k turbo3 --cache-type-v turbo3 -c 65536 -fa 1
| Scenario | K cache | V cache | Compression | PPL impact |
|---|---|---|---|---|
| Best quality | turbo4 | turbo4 | 3.8x | +0.23% |
| Balanced | turbo3 | turbo3 | 4.6-5.1x | +1.06% |
| Max compression | turbo2 | turbo2 | 6.4x | +6.48% |
| Q4_K_M safe | q8_0 | turbo4 | ~3.8x V | +1.0% |
| Boundary V | q8_0 | turbo2 | ~6x V | 37-91% quality recovered |
| Cache | Compression | PPL | vs q8_0 |
|---|---|---|---|
| q8_0 | 1.9x | 6.111 | baseline |
| turbo4 | 3.8x | 6.125 | +0.23% |
| turbo3 | 4.6x | 6.176 | +1.06% |
| turbo2 | 6.4x | 6.507 | +6.48% |
| Model | Config | PPL | Context | NIAH |
|---|---|---|---|---|
| Llama-70B Q4_K_M | turbo4/turbo4 | 3.461 | 48K | 30/30 |
| Command-R+ 104B Q4_K_M | turbo3/turbo3 | 6.415 | 128K | 10/10 |
| KV | Prefill t/s | Decode t/s | vs q8_0 |
|---|---|---|---|
| q8_0 | 399.0 | 12.4 | — |
| turbo4 | 365.0 | 16.6 | +33.9% |