EcoCompute — LLM Energy Efficiency Advisor

Other

Evidence-first, stateless consulting skill for LLM inference energy optimization using measured benchmark priors and anti-pattern detection.

Install

openclaw skills install ecocompute

EcoCompute — LLM Energy Efficiency Advisor

Read-only advisory skill for LLM inference energy decisions. Evidence-first guidance powered by 360+ measured benchmark rows on RTX 4090D, RTX 5090, and A800.

Author: Hongping Zhang (@hongping-zh) Version: v3.0.8 Skill URL: https://clawhub.ai/hongping-zh/ecocompute License: MIT Dataset: zenodo.org/records/18900289 (collection window: 2025 Q1)


Requirements

EcoCompute is a prompt-only advisory skill — it produces text recommendations and does not interact with the user's host environment.

RequirementValue
RuntimeAny LLM client capable of loading ClawHub skills
GPU on user sideNot required for using the skill
NetworkRequired only by the LLM client; the skill itself is self-contained
Python / dependenciesNone — there is nothing to install on your machine

The benchmark rows the skill references were collected on PyTorch 2.4 – 2.12 / bitsandbytes 0.45 / CUDA 12.1 – 12.8 / transformers 4.47+ (see Data Collection Environment below). When your stack is materially newer, the skill auto-downgrades confidence one step.


Data Collection Environment (applies to every benchmark row below)

FieldValue
PyTorch2.4 – 2.12
bitsandbytes0.45
CUDA12.1 – 12.8
transformers4.47+
Power samplingNVML, 100 ms resolution
Collection window2025 Q1
Dataset recordZenodo 18900289

Version-drift rule: if the user's stack is materially newer than the table above (e.g. bitsandbytes ≥ 0.48, transformers ≥ 4.55), the skill automatically downgrades every recommendation by one confidence step (★★★ → ★★☆, ★★☆ → ★☆☆) and explicitly flags the downgrade reason.


What this skill does

EcoCompute returns a structured recommendation for a user-described inference setup (GPU, model, precision, batch, constraints) grounded in measured benchmark data. It does one thing well: precise advisory on LLM inference energy.

(Read-only / no host interaction — declared once here, not repeated below.)


Core Discovery

Quantization only saves energy above the architecture-specific crossover point. Below that point, FP16 is more energy-efficient than INT8 / NF4. — Measured on RTX 4090D, RTX 5090, A800 with NVML power sampling.

Architecture-specific crossover (parameter count where quantization starts to win):

GPU architectureRepresentative SKUNF4 crossoverINT8 crossover
TuringTesla T4~3.2 B~4.0 B
AdaRTX 4090D~3.9 B~4.6 B
BlackwellRTX 5090~5.2 B~5.6 B
Ampere (server)A800~3.7 B~4.3 B

Below the crossover: quantization adds 25 – 55% energy. Above the crossover: quantization saves 15 – 23% energy.

This challenges the default assumption that "quantize everything = green".


Embedded Benchmark Lookup Table (minimum viable)

The skill quotes the matching row before any recommendation. Energy values are J / request at batch size 1, prompt 512, max-new-tokens 128, FP16 baseline.

GPUModelFP16NF4INT8 (threshold=0)FP8
RTX 4090DQwen2-7B71.247.052.1N/A
A800Qwen2-7B89.458.763.267.8
RTX 5090Qwen2-7BTBRTBRTBRTBR

TBR = to-be-released in the next public data drop (full RTX 5090 series). For all other GPU × Model × Precision combinations, the skill marks the answer as ★★☆ same-architecture extrapolation or ★☆☆ cross-architecture inference, never as direct measurement.

Full 360+ row dataset: ecocompute-ai/quantization-energy-crossover · Zenodo 10.5281/zenodo.18900289


Inputs (what the user should provide)

  • GPU model (e.g. RTX 4090D, RTX 5090, A800)
  • Model name / parameter count (e.g. Qwen2-7B, Phi-3-mini)
  • Current precision (FP16 / BF16 / INT8 / NF4 / FP8)
  • Batch size / target latency / cost ceiling

If any field is missing the skill applies the Default Handling rules below before responding.


Default Handling (when inputs are incomplete)

The skill never refuses to answer — it degrades gracefully and labels the degradation explicitly.

Missing fieldRuleResulting confidence
GPU unspecifiedAsk once. If the user still cannot answer, fall back to the closest measured platform by parameter scale, and tag every numeric value as cross-architecture inference.★☆☆
GPU specified but not in measured set (e.g. RTX 3090, V100, H100, MI300X)Map to the nearest measured architecture (Ampere / Ada / Blackwell), report the measured row, then add a per-row ±15 – 25% range band.★★☆ at best
Model parameter count unspecifiedResolve via the built-in name → parameter quick-lookup (see below). If still unknown, ask the user for an order-of-magnitude (1B / 3B / 7B / 13B / 30B+).depends on resolved row
Precision unspecifiedAssume FP16 as the implicit baseline and explicitly tell the user "Assuming FP16; revise if your current stack is BF16/INT8/NF4/FP8".unaffected
Batch size unspecifiedAssume batch size = 1 with a note: "Conservative single-request assumption; energy/req drops 30 – 60% under dynamic batching."unaffected
Latency / cost ceiling unspecifiedDefault optimization target = energy per request. Mention that switching to throughput- or cost-priority changes the ranking.unaffected

Built-in name → parameter quick-lookup

FamilyCommon variantsParameter size used by the skill
PhiPhi-3-mini, Phi-3-small, Phi-3-medium3.8B / 7B / 14B
Qwen2Qwen2-1.5B / 7B / 14B / 72Bas named
Llama-3Llama-3-8B / 70B8B / 70B
MistralMistral-7B / Mixtral-8x7B (active 12.9B)7B / 12.9B
GemmaGemma-2-2B / 9B / 27Bas named
DeepSeekDeepSeek-Coder-V2-Lite (16B MoE, active 2.4B)2.4B active

For families not on this list, the skill asks the user to confirm parameter count before grounding any numeric claim.


Operating Protocols

ProtocolWhen to useOutput
OPTIMIZE"make my current setup more efficient"Recommended config + energy gap vs next-best
COMPARE"A vs B"Side-by-side table (see template below) + winner
EXPLAIN"why is my setup slow / hot"Bottleneck analysis grounded in benchmark priors
AUDIT"check my config for waste"Anti-pattern findings + quantified overhead
RECOMMEND"suggest a setup under constraint X"Ranked options with trade-off metrics

Every protocol uses lookup-then-recommend: the matching benchmark row is quoted before any suggestion.


Anti-Pattern Library — Measured (★★★)

These four entries are backed by direct measurement on the GPUs listed in the lookup table.

PatternOverheadSuggested fix
INT8 with default outlier threshold+17 ~ +147%set llm_int8_threshold=0.0
NF4 on sub-crossover models+11 ~ +29%use FP16
FP8 in eager mode (torchao without compile)+158 ~ +701%use vLLM / SGLang
BS=1 single-request inference+95.7% per requestenable dynamic batching

Supplementary suggestions (not yet measured by this project)

The following items reflect community engineering experience. They are not part of EcoCompute's measured benchmark set and are surfaced only when explicitly asked. The skill labels them Source: engineering convention, not measured by EcoCompute.

  • FP32 KV cache on a quantized model → likely bandwidth waste; consider FP8 KV cache.
  • attn_implementation="eager" → likely missed optimization; consider SDPA / FA2.
  • Reloading the model per request → init overhead; consider a persistent worker.

Response Templates

Default (OPTIMIZE / RECOMMEND / AUDIT / EXPLAIN)

  1. Conclusion — one-line bottom line
  2. Baseline comparisonBaseline X J/request vs Recommended Y J/request (Z%)
  3. Evidence — quoted benchmark row(s) with dataset tag (e.g. dataset: zenodo.org/records/18900289 · 2025-Q1)
  4. Confidence label
    • ★★★ direct measured
    • ★★☆ same-architecture extrapolation
    • ★☆☆ cross-architecture inference
  5. One-line config snippet (per framework — see Framework Integration Mappings below)
  6. Risks & boundary notes
  7. Follow-up questions (if input was incomplete)

Every response ends with the dataset version footer:

Evidence: zenodo.org/records/18900289 (2025-Q1) · skill v3.0.8

Example (OPTIMIZE):

Conclusion: switching to NF4 saves 34% energy
Baseline:   FP16 -> 71.2 J/request
Recommended: NF4  -> 47.0 J/request
Confidence: ★★★ direct measured (RTX 4090D + Qwen2-7B)
Config:     BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
Evidence:   zenodo.org/records/18900289 (2025-Q1) · skill v3.0.8

COMPARE protocol (structured side-by-side)

| Dimension   | NF4              | INT8 (threshold=0) |
|-------------|------------------|---------------------|
| Energy      | 47.0 J/req       | 52.1 J/req          |
| Throughput  | 38.2 tok/s       | 41.7 tok/s          |
| Memory      | 4.1 GB           | 5.8 GB              |
| Confidence  | ★★★              | ★★★                 |
| Winner      | ✓ energy         | ✓ throughput        |

The skill always:

  1. Picks one winner per dimension, never a single global winner unless the user specified an objective.
  2. Quotes the source benchmark row for each numeric cell.
  3. States confidence per column (extrapolated columns drop to ★★☆ / ★☆☆).

Framework Integration Mappings

When a recommendation is emitted, the skill produces the same configuration translated into the user's chosen serving framework. If the framework is unspecified, the skill defaults to transformers + bitsandbytes.

NF4 4-bit recommendation

FrameworkOne-line snippet
transformers + bitsandbytesBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="nf4")
vLLM--quantization bitsandbytes --dtype half --load-format bitsandbytes
TGI (Text Generation Inference)--quantize bitsandbytes-nf4
Ollama (Modelfile)PARAMETER quantization q4_K_M (closest GGUF analog; not bit-identical to NF4)
llama.cpp-q Q4_K_M (closest GGUF analog)

INT8 with llm_int8_threshold=0.0

FrameworkOne-line snippet
transformers + bitsandbytesBitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)
vLLM--quantization bitsandbytes --dtype half --load-format bitsandbytes (threshold not exposed; report this caveat)
TGI--quantize bitsandbytes (threshold not exposed; report this caveat)
llama.cpp-q Q8_0 (closest GGUF analog)

FP8 (Blackwell / Hopper)

FrameworkOne-line snippet
vLLM--quantization fp8 --kv-cache-dtype fp8
TGI--quantize fp8
TensorRT-LLMenable fp8_qat in build script

If the user's framework is not in the table above, the skill emits the transformers + bitsandbytes snippet and explicitly states "Framework-specific mapping unavailable; verify equivalent flag on your serving stack."


Boundary Rules (the skill states these explicitly)

SituationWhat the skill says
Model > 14B"Beyond measured range. Extrapolated estimate ±20%."
Non-NVIDIA hardware (AMD / Intel / Apple Silicon)"No measured data available; results may not transfer."
bitsandbytes ≥ 0.48 / transformers ≥ 4.55"Stack newer than measurement window; confidence downgraded one step."
Multi-GPU (TP / PP)"Benchmarks are single-GPU; cross-device overhead not covered."
Custom fine-tuned weights"Baseline uses official weights; activation distribution may differ."

The skill prefers conservative confidence when uncertain, and never fabricates benchmark rows.


Out of scope (explicit non-goals)

  • No multi-turn session memory.
  • No proactive monitoring or alerting.
  • No automated benchmark workflows.
  • No cross-vendor hardware coverage (AMD / Intel / Apple Silicon — future work).

Data provenance

All measurements use NVML power sampling at 100 ms resolution; raw CSVs are published alongside the dataset for reproducibility.


Install

openclaw skills install ecocompute

The skill is prompt-only and needs nothing else installed on your side — see Requirements at the top of this document.


Changelog (recent)

  • v3.0.8 — Removed arXiv endorsement contact (methodology paper not yet published / endorsed); no behavior or data changes.
  • v3.0.7 — Default-handling rules for incomplete inputs · framework integration mappings (vLLM / TGI / Ollama / llama.cpp) · dataset version footer in every response · Requirements section.
  • v3.0.6 — Anti-pattern table split into measured vs. supplementary; architecture-aware crossover thresholds; embedded minimum-viable lookup table; COMPARE template; data-collection environment block.
  • v3.0.5 — Documentation refactor and cleanup; align with v3.0.5 crossover findings; advisory tone.

Contact

🌍 Making AI development more sustainable, one model at a time.