EcoCompute — LLM Energy Efficiency Advisor

Evidence-first, stateless consulting skill for LLM inference energy optimization using measured benchmark priors and anti-pattern detection.

hongping-zh@hongping-zh

Install

openclaw skills install @hongping-zh/ecocompute

Read-only advisory skill for LLM inference energy decisions. Evidence-first guidance powered by 360+ measured benchmark rows on RTX 4090D, RTX 5090, and A800.

Author: Hongping Zhang (@hongping-zh)
Version: v3.0.8
Skill URL: https://clawhub.ai/hongping-zh/ecocompute
License: MIT
Dataset: zenodo.org/records/18900289 (collection window: 2025 Q1)

Requirements

EcoCompute is a prompt-only advisory skill — it produces text recommendations and does not interact with the user's host environment.

Requirement	Value
Runtime	Any LLM client capable of loading ClawHub skills
GPU on user side	Not required for using the skill
Network	Required only by the LLM client; the skill itself is self-contained
Python / dependencies	None — there is nothing to install on your machine

The benchmark rows the skill references were collected on PyTorch 2.4 – 2.12 / bitsandbytes 0.45 / CUDA 12.1 – 12.8 / transformers 4.47+ (see Data Collection Environment below). When your stack is materially newer, the skill auto-downgrades confidence one step.

Data Collection Environment (applies to every benchmark row below)

Field	Value
PyTorch	2.4 – 2.12
bitsandbytes	0.45
CUDA	12.1 – 12.8
transformers	4.47+
Power sampling	NVML, 100 ms resolution
Collection window	2025 Q1
Dataset record	Zenodo 18900289

Version-drift rule: if the user's stack is materially newer than the table above (e.g. bitsandbytes ≥ 0.48, transformers ≥ 4.55), the skill automatically downgrades every recommendation by one confidence step (★★★ → ★★☆, ★★☆ → ★☆☆) and explicitly flags the downgrade reason.
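The version-drift rule can be sketched as a small function. This is an illustrative sketch, not the skill's actual implementation: the names `DRIFT_THRESHOLDS`, `parse_version`, and `apply_version_drift` are hypothetical, confidence is modelled as an integer from 1 to 3, and stopping at ★☆☆ is an assumption because the rule does not say what happens below that level.

```python
# Thresholds quoted in the version-drift rule, compared on (major, minor).
DRIFT_THRESHOLDS = {"bitsandbytes": (0, 48), "transformers": (4, 55)}
STARS = {3: "★★★", 2: "★★☆", 1: "★☆☆"}


def parse_version(text: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '0.48.1'."""
    parts = text.split(".")
    return int(parts[0]), int(parts[1])


def apply_version_drift(confidence: int, stack: dict[str, str]) -> tuple[int, list[str]]:
    """Downgrade confidence by one step if any package meets its threshold.

    Assumption: confidence never drops below 1 (★☆☆).
    """
    reasons = [
        f"{name} {version} >= {'.'.join(map(str, DRIFT_THRESHOLDS[name]))}"
        for name, version in stack.items()
        if name in DRIFT_THRESHOLDS and parse_version(version) >= DRIFT_THRESHOLDS[name]
    ]
    if reasons:
        confidence = max(1, confidence - 1)
    return confidence, reasons


level, why = apply_version_drift(3, {"bitsandbytes": "0.48.1", "transformers": "4.47.0"})
print(STARS[level], why)
```

Returning the reasons alongside the new level mirrors the requirement that the downgrade be flagged explicitly rather than applied silently.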

What this skill does

EcoCompute returns a structured recommendation for a user-described inference setup (GPU, model, precision, batch, constraints) grounded in measured benchmark data. It does one thing well: precise advisory on LLM inference energy.

(Read-only / no host interaction — declared once here, not repeated below.)

Core Discovery

Quantization only saves energy above the architecture-specific crossover point. Below that point, FP16 is more energy-efficient than INT8 / NF4. — Measured on RTX 4090D, RTX 5090, A800 with NVML power sampling.

Architecture-specific crossover (parameter count where quantization starts to win):

GPU architecture	Representative SKU	NF4 crossover	INT8 crossover
Turing	Tesla T4	~3.2 B	~4.0 B
Ada	RTX 4090D	~3.9 B	~4.6 B
Blackwell	RTX 5090	~5.2 B	~5.6 B
Ampere (server)	A800	~3.7 B	~4.3 B

Below the crossover: quantization adds 25 – 55% energy. Above the crossover: quantization saves 15 – 23% energy.

This challenges the default assumption that "quantize everything = green".
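As a rough illustration, the crossover table can be turned into a lookup. The dictionary and function names are hypothetical, the values are the approximate figures from the table, and treating a model that sits exactly on the crossover as "not saving" is an assumption.

```python
# Approximate crossover points in billions of parameters, copied from the table above.
CROSSOVER_B = {
    "turing": {"nf4": 3.2, "int8": 4.0},
    "ada": {"nf4": 3.9, "int8": 4.6},
    "blackwell": {"nf4": 5.2, "int8": 5.6},
    "ampere": {"nf4": 3.7, "int8": 4.3},
}


def quantization_expected_to_save(arch: str, precision: str, params_b: float) -> bool:
    """True when the model is strictly above the crossover for this architecture."""
    return params_b > CROSSOVER_B[arch.lower()][precision.lower()]


print(quantization_expected_to_save("ada", "nf4", 7.0))        # 7B on Ada, above ~3.9B
print(quantization_expected_to_save("blackwell", "nf4", 3.8))  # 3.8B on Blackwell, below ~5.2B
```

Because the crossover values are approximate, results for models close to the threshold should be read as indicative only.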

Embedded Benchmark Lookup Table (minimum viable)

The skill quotes the matching row before any recommendation. Energy values are in J/request at batch size 1, with a 512-token prompt and max-new-tokens 128; FP16 is the baseline column.

GPU	Model	FP16	NF4	INT8 (threshold=0)	FP8
RTX 4090D	Qwen2-7B	71.2	47.0	52.1	N/A
A800	Qwen2-7B	89.4	58.7	63.2	67.8
RTX 5090	Qwen2-7B	TBR	TBR	TBR	TBR

TBR = to-be-released in the next public data drop (full RTX 5090 series). For all other GPU × Model × Precision combinations, the skill marks the answer as ★★☆ same-architecture extrapolation or ★☆☆ cross-architecture inference, never as direct measurement.

Full 360+ row dataset: ecocompute-ai/quantization-energy-crossover · Zenodo 10.5281/zenodo.18900289
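To show how a row from the lookup table becomes a percentage, here is a minimal sketch. `ENERGY_J` and `saving_vs_fp16` are hypothetical names; the numbers are the table's own, and `None` stands in for N/A and TBR cells so that no value is ever invented for them.

```python
from typing import Optional

# Measured J/request rows from the lookup table; None marks N/A or TBR cells.
ENERGY_J = {
    ("RTX 4090D", "Qwen2-7B"): {"FP16": 71.2, "NF4": 47.0, "INT8": 52.1, "FP8": None},
    ("A800", "Qwen2-7B"): {"FP16": 89.4, "NF4": 58.7, "INT8": 63.2, "FP8": 67.8},
    ("RTX 5090", "Qwen2-7B"): {"FP16": None, "NF4": None, "INT8": None, "FP8": None},
}


def saving_vs_fp16(gpu: str, model: str, precision: str) -> Optional[float]:
    """Percent energy saved relative to FP16, or None when no measured value exists."""
    row = ENERGY_J.get((gpu, model))
    if row is None or row["FP16"] is None or row.get(precision) is None:
        return None
    return round(100 * (row["FP16"] - row[precision]) / row["FP16"], 1)


print(saving_vs_fp16("RTX 4090D", "Qwen2-7B", "NF4"))
print(saving_vs_fp16("RTX 5090", "Qwen2-7B", "NF4"))
```

Returning `None` instead of a guess keeps the sketch in line with the rule that unmeasured combinations are never reported as direct measurement.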

Inputs (what the user should provide)

GPU model (e.g. RTX 4090D, RTX 5090, A800)
Model name / parameter count (e.g. Qwen2-7B, Phi-3-mini)
Current precision (FP16 / BF16 / INT8 / NF4 / FP8)
Batch size / target latency / cost ceiling

If any field is missing the skill applies the Default Handling rules below before responding.

Default Handling (when inputs are incomplete)

The skill never refuses to answer — it degrades gracefully and labels the degradation explicitly.

Missing field	Rule	Resulting confidence
GPU unspecified	Ask once. If the user still cannot answer, fall back to the closest measured platform by parameter scale, and tag every numeric value as cross-architecture inference.	★☆☆
GPU specified but not in measured set (e.g. RTX 3090, V100, H100, MI300X)	Map to the nearest measured architecture (Ampere / Ada / Blackwell), report the measured row, then add a per-row ±15 – 25% range band.	★★☆ at best
Model parameter count unspecified	Resolve via the built-in name → parameter quick-lookup (see below). If still unknown, ask the user for an order-of-magnitude (1B / 3B / 7B / 13B / 30B+).	depends on resolved row
Precision unspecified	Assume FP16 as the implicit baseline and explicitly tell the user "Assuming FP16; revise if your current stack is BF16/INT8/NF4/FP8".	unaffected
Batch size unspecified	Assume batch size = 1 with a note: "Conservative single-request assumption; energy/req drops 30 – 60% under dynamic batching."	unaffected
Latency / cost ceiling unspecified	Default optimization target = energy per request. Mention that switching to throughput- or cost-priority changes the ranking.	unaffected
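The three rows that carry a fixed default (precision, batch size, optimization target) could be applied as in the sketch below. The `Request` dataclass and `apply_defaults` are hypothetical illustrations; the GPU and parameter-count rows are left out because they require asking the user rather than filling in a value.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Request:
    gpu: Optional[str] = None
    params_b: Optional[float] = None
    precision: Optional[str] = None
    batch_size: Optional[int] = None
    objective: Optional[str] = None
    notes: list = field(default_factory=list)


def apply_defaults(req: Request) -> Request:
    """Fill fields that have a stated default and record a note for each one."""
    if req.precision is None:
        req.precision = "FP16"
        req.notes.append("Assuming FP16; revise if your current stack is BF16/INT8/NF4/FP8")
    if req.batch_size is None:
        req.batch_size = 1
        req.notes.append("Conservative single-request assumption; "
                         "energy/req drops 30 - 60% under dynamic batching.")
    if req.objective is None:
        req.objective = "energy_per_request"
        req.notes.append("Switching to throughput- or cost-priority changes the ranking.")
    return req


filled = apply_defaults(Request(gpu="A800", params_b=7.0))
print(filled.precision, filled.batch_size, filled.objective, len(filled.notes))
```

Collecting the notes on the request makes every assumed value visible in the final answer, which is the "labels the degradation explicitly" part of the rule.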

Built-in name → parameter quick-lookup

Family	Common variants	Parameter size used by the skill
Phi	Phi-3-mini, Phi-3-small, Phi-3-medium	3.8B / 7B / 14B
Qwen2	Qwen2-1.5B / 7B / 72B	as named
Llama-3	Llama-3-8B / 70B	8B / 70B
Mistral	Mistral-7B / Mixtral-8x7B (active 12.9B)	7B / 12.9B
Gemma	Gemma-2-2B / 9B / 27B	as named
DeepSeek	DeepSeek-Coder-V2-Lite (16B MoE, active 2.4B)	2.4B active

For families not on this list, the skill asks the user to confirm parameter count before grounding any numeric claim.
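A name-to-parameter resolver might look like the following sketch. `PARAMS_B` holds a subset of the table's entries and `resolve_params_b` is a hypothetical helper; matching on a lower-cased exact name is an assumption, and a real resolver would likely need fuzzier matching.

```python
from typing import Optional

# Parameter sizes in billions for a subset of the quick-lookup table.
# MoE entries use the active parameter count, as the table does.
PARAMS_B = {
    "phi-3-mini": 3.8, "phi-3-small": 7.0, "phi-3-medium": 14.0,
    "qwen2-1.5b": 1.5, "qwen2-7b": 7.0, "qwen2-72b": 72.0,
    "llama-3-8b": 8.0, "llama-3-70b": 70.0,
    "mistral-7b": 7.0, "mixtral-8x7b": 12.9,
    "gemma-2-2b": 2.0, "gemma-2-9b": 9.0, "gemma-2-27b": 27.0,
    "deepseek-coder-v2-lite": 2.4,
}


def resolve_params_b(model_name: str) -> Optional[float]:
    """Return the parameter count in billions, or None if the user must confirm it."""
    return PARAMS_B.get(model_name.strip().lower())


print(resolve_params_b("Phi-3-mini"))
print(resolve_params_b("SomeUnknownModel"))
```

A `None` result corresponds to the case where the skill asks the user to confirm the parameter count before grounding any numeric claim.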

Operating Protocols

Protocol	When to use	Output
OPTIMIZE	"make my current setup more efficient"	Recommended config + energy gap vs next-best
COMPARE	"A vs B"	Side-by-side table (see template below) + winner
EXPLAIN	"why is my setup slow / hot"	Bottleneck analysis grounded in benchmark priors
AUDIT	"check my config for waste"	Anti-pattern findings + quantified overhead
RECOMMEND	"suggest a setup under constraint X"	Ranked options with trade-off metrics

Every protocol uses lookup-then-recommend: the matching benchmark row is quoted before any suggestion.

Anti-Pattern Library — Measured (★★★)

These four entries are backed by direct measurement on the GPUs listed in the lookup table.

Pattern	Overhead	Suggested fix
INT8 with default outlier threshold	+17 ~ +147%	set `llm_int8_threshold=0.0`
NF4 on sub-crossover models	+11 ~ +29%	use FP16
FP8 in eager mode (torchao without compile)	+158 ~ +701%	use vLLM / SGLang
BS=1 single-request inference	+95.7% per request	enable dynamic batching
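The four measured patterns lend themselves to a simple rule check. The `audit` function below is a hypothetical sketch, not the skill's logic: it flags any non-zero INT8 threshold (an assumption; 6.0 is the default in the transformers `BitsAndBytesConfig`), and it expects the caller to pass the crossover for their architecture.

```python
def audit(precision, params_b, crossover_b, batch_size=1,
          int8_threshold=6.0, fp8_eager=False):
    """Return (pattern, measured overhead, suggested fix) tuples for a config."""
    findings = []
    # Assumption: any non-zero threshold counts as the "default threshold" pattern.
    if precision == "INT8" and int8_threshold != 0.0:
        findings.append(("INT8 with default outlier threshold", "+17 ~ +147%",
                         "set llm_int8_threshold=0.0"))
    if precision == "NF4" and params_b < crossover_b:
        findings.append(("NF4 on sub-crossover models", "+11 ~ +29%", "use FP16"))
    if precision == "FP8" and fp8_eager:
        findings.append(("FP8 in eager mode", "+158 ~ +701%", "use vLLM / SGLang"))
    if batch_size == 1:
        findings.append(("BS=1 single-request inference", "+95.7% per request",
                         "enable dynamic batching"))
    return findings


for pattern, overhead, fix in audit("INT8", params_b=7.0, crossover_b=4.6):
    print(f"{pattern}: {overhead} -> {fix}")
```

The overhead strings are quoted verbatim from the table rather than recomputed, so the check reports measured ranges and does not estimate new ones.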

Supplementary suggestions (not yet measured by this project)

The following items reflect community engineering experience. They are not part of EcoCompute's measured benchmark set and are surfaced only when explicitly asked. The skill labels them Source: engineering convention, not measured by EcoCompute.

FP32 KV cache on a quantized model → likely bandwidth waste; consider FP8 KV cache.
attn_implementation="eager" → likely missed optimization; consider SDPA / FA2.
Reloading the model per request → init overhead; consider a persistent worker.

Response Templates

Default (OPTIMIZE / RECOMMEND / AUDIT / EXPLAIN)

Conclusion — one-line bottom line
Baseline comparison — Baseline X J/request vs Recommended Y J/request (Z%)
Evidence — quoted benchmark row(s) with dataset tag (e.g. dataset: zenodo.org/records/18900289 · 2025-Q1)
Confidence label
- ★★★ direct measured
- ★★☆ same-architecture extrapolation
- ★☆☆ cross-architecture inference
One-line config snippet (per framework — see Framework Integration Mappings below)
Risks & boundary notes
Follow-up questions (if input was incomplete)

Every response ends with the dataset version footer:

```text
Evidence: zenodo.org/records/18900289 (2025-Q1) · skill v3.0.8
```

Example (OPTIMIZE):

```text
Conclusion: switching to NF4 saves 34% energy
Baseline:   FP16 -> 71.2 J/request
Recommended: NF4  -> 47.0 J/request
Confidence: ★★★ direct measured (RTX 4090D + Qwen2-7B)
Config:     BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="nf4")
Evidence:   zenodo.org/records/18900289 (2025-Q1) · skill v3.0.8
```
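For illustration, the OPTIMIZE example can be produced from a measured row by a short formatter. `render_optimize` is a hypothetical helper; the dataset tag and skill version defaults are the ones quoted in this document, and rounding the saving to a whole percent is an assumption based on the example.

```python
def render_optimize(gpu, model, baseline, recommended, confidence, config,
                    dataset="zenodo.org/records/18900289 (2025-Q1)", skill="v3.0.8"):
    """Format an OPTIMIZE response.

    baseline and recommended are (precision, joules_per_request) pairs.
    """
    saving = round(100 * (baseline[1] - recommended[1]) / baseline[1])
    return "\n".join([
        f"Conclusion: switching to {recommended[0]} saves {saving}% energy",
        f"Baseline:   {baseline[0]} -> {baseline[1]} J/request",
        f"Recommended: {recommended[0]} -> {recommended[1]} J/request",
        f"Confidence: {confidence} ({gpu} + {model})",
        f"Config:     {config}",
        f"Evidence:   {dataset} · skill {skill}",
    ])


report = render_optimize(
    "RTX 4090D", "Qwen2-7B", ("FP16", 71.2), ("NF4", 47.0),
    "★★★ direct measured",
    'BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")',
)
print(report)
```

Deriving the percentage from the two quoted energy values keeps the conclusion line consistent with the evidence it cites.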

COMPARE protocol (structured side-by-side)

```text
| Dimension   | NF4              | INT8 (threshold=0) |
|-------------|------------------|---------------------|
| Energy      | 47.0 J/req       | 52.1 J/req          |
| Throughput  | 38.2 tok/s       | 41.7 tok/s          |
| Memory      | 4.1 GB           | 5.8 GB              |
| Confidence  | ★★★              | ★★★                 |
| Winner      | ✓ energy         | ✓ throughput        |
```

The skill always:

Picks one winner per dimension, never a single global winner unless the user specified an objective.
Quotes the source benchmark row for each numeric cell.
States confidence per column (extrapolated columns drop to ★★☆ / ★☆☆).
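The "one winner per dimension" rule can be sketched as follows. `LOWER_IS_BETTER` and `winners` are hypothetical names, the figures come from the COMPARE table above, and the direction of "better" for each dimension is an assumption (lower energy and memory, higher throughput).

```python
# Direction of "better" per dimension (assumed): lower energy and memory, higher throughput.
LOWER_IS_BETTER = {"energy_j_per_req": True, "throughput_tok_s": False, "memory_gb": True}


def winners(options: dict) -> dict:
    """Return the winning option name for each dimension, with no global winner."""
    result = {}
    for dim, lower in LOWER_IS_BETTER.items():
        pick = min if lower else max
        result[dim] = pick(options, key=lambda name: options[name][dim])
    return result


table = {
    "NF4": {"energy_j_per_req": 47.0, "throughput_tok_s": 38.2, "memory_gb": 4.1},
    "INT8 (threshold=0)": {"energy_j_per_req": 52.1, "throughput_tok_s": 41.7, "memory_gb": 5.8},
}
print(winners(table))
```

Keeping the result as a per-dimension mapping avoids collapsing the comparison into a single verdict when the user has not stated an objective.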

Framework Integration Mappings

When a recommendation is emitted, the skill produces the same configuration translated into the user's chosen serving framework. If the framework is unspecified, the skill defaults to transformers + bitsandbytes.

NF4 4-bit recommendation

Framework	One-line snippet
transformers + bitsandbytes	`BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="nf4")`
vLLM	`--quantization bitsandbytes --dtype half --load-format bitsandbytes`
TGI (Text Generation Inference)	`--quantize bitsandbytes-nf4`
Ollama	`ollama create <model-name> -f Modelfile --quantize q4_K_M` (closest GGUF analog; not bit-identical to NF4)
llama.cpp	`llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M` (closest GGUF analog; file names are placeholders)

INT8 with `llm_int8_threshold=0.0`

Framework	One-line snippet
transformers + bitsandbytes	`BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)`
vLLM	`--quantization bitsandbytes --dtype half --load-format bitsandbytes` (threshold not exposed; report this caveat)
TGI	`--quantize bitsandbytes` (threshold not exposed; report this caveat)
llama.cpp	`llama-quantize model-f16.gguf model-Q8_0.gguf Q8_0` (closest GGUF analog; file names are placeholders)

FP8 (Blackwell / Hopper)

Framework	One-line snippet
vLLM	`--quantization fp8 --kv-cache-dtype fp8`
TGI	`--quantize fp8`
TensorRT-LLM	enable `fp8_qat` in build script

If the user's framework is not in the table above, the skill emits the transformers + bitsandbytes snippet and explicitly states "Framework-specific mapping unavailable; verify equivalent flag on your serving stack."
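The default-and-fallback behaviour described here can be sketched as a dictionary lookup. `NF4_SNIPPETS` and `nf4_snippet` are hypothetical names, and only three of the NF4 table rows are included to keep the example short.

```python
# Three of the NF4 mappings from the table above, keyed by lower-cased framework name.
NF4_SNIPPETS = {
    "transformers": ('BitsAndBytesConfig(load_in_4bit=True, '
                     'bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="nf4")'),
    "vllm": "--quantization bitsandbytes --dtype half --load-format bitsandbytes",
    "tgi": "--quantize bitsandbytes-nf4",
}
FALLBACK_NOTE = ("Framework-specific mapping unavailable; "
                 "verify equivalent flag on your serving stack.")


def nf4_snippet(framework=None):
    """Return (snippet, note); note is None when a direct mapping exists."""
    if framework is None:
        # Unspecified framework: default to transformers + bitsandbytes.
        return NF4_SNIPPETS["transformers"], None
    key = framework.strip().lower()
    if key in NF4_SNIPPETS:
        return NF4_SNIPPETS[key], None
    # Unknown framework: emit the default snippet plus the explicit caveat.
    return NF4_SNIPPETS["transformers"], FALLBACK_NOTE


print(nf4_snippet("vLLM"))
print(nf4_snippet("SGLang"))
```

Returning the caveat as a separate value makes it hard to emit the fallback snippet without also telling the user that it is a fallback.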

Boundary Rules (the skill states these explicitly)

Situation	What the skill says
Model > 14B	"Beyond measured range. Extrapolated estimate ±20%."
Non-NVIDIA hardware (AMD / Intel / Apple Silicon)	"No measured data available; results may not transfer."
bitsandbytes ≥ 0.48 / transformers ≥ 4.55	"Stack newer than measurement window; confidence downgraded one step."
Multi-GPU (TP / PP)	"Benchmarks are single-GPU; cross-device overhead not covered."
Custom fine-tuned weights	"Baseline uses official weights; activation distribution may differ."

The skill prefers conservative confidence when uncertain, and never fabricates benchmark rows.
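A minimal sketch of how the boundary table might be checked against a request is shown below. `boundary_notes` and its parameters are hypothetical; the messages are quoted from the table, and the stack-version row is left out for brevity.

```python
def boundary_notes(params_b, vendor="nvidia", num_gpus=1, fine_tuned=False):
    """Collect the boundary statements that apply to a described setup."""
    notes = []
    if params_b > 14:
        notes.append("Beyond measured range. Extrapolated estimate ±20%.")
    if vendor.lower() != "nvidia":
        notes.append("No measured data available; results may not transfer.")
    if num_gpus > 1:
        notes.append("Benchmarks are single-GPU; cross-device overhead not covered.")
    if fine_tuned:
        notes.append("Baseline uses official weights; activation distribution may differ.")
    return notes


for note in boundary_notes(70.0, num_gpus=2):
    print(note)
```

Each condition is independent, so a setup that crosses several boundaries receives every applicable statement rather than only the first match.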

Out of scope (explicit non-goals)

No multi-turn session memory.
No proactive monitoring or alerting.
No automated benchmark workflows.
No cross-vendor hardware coverage (AMD / Intel / Apple Silicon — future work).

Data provenance

Benchmark dataset: ecocompute-ai/quantization-energy-crossover
Archived release: Zenodo 10.5281/zenodo.18900289
Live dashboard: https://hongping-zh.github.io/
Reference implementation: hongping-zh/ecocompute-dynamic-eval
Methodology paper: When Does Quantization Save Energy? Empirical Analysis of the Energy-Efficiency Crossover Effect Across GPU Generations
External review: HuggingFace Optimum docs · MLCommons Power Working Group (Issue #2558)

All measurements use NVML power sampling at 100 ms resolution; raw CSVs are published alongside the dataset for reproducibility.
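To make the link between 100 ms power samples and J/request concrete, here is a sketch that integrates evenly spaced samples into energy. NVML reports power in milliwatts (for example via `nvmlDeviceGetPowerUsage`), so the input is assumed to be in mW. The function name and the choice of the trapezoidal rule are assumptions; the document does not state how the project integrates its samples.

```python
def energy_joules(power_mw, interval_s=0.1):
    """Integrate evenly spaced power samples (milliwatts) with the trapezoidal rule.

    interval_s defaults to 0.1 s to match the 100 ms sampling resolution.
    """
    if len(power_mw) < 2:
        return 0.0  # a single sample spans no time, so no energy can be attributed
    watts = [p / 1000.0 for p in power_mw]
    return sum((a + b) / 2 * interval_s for a, b in zip(watts, watts[1:]))


# Eleven samples at a constant 300 W cover ten 100 ms intervals, i.e. one second.
samples = [300_000] * 11
print(energy_joules(samples))
```

Dividing the integrated energy by the number of requests served in the sampling window would then give a per-request figure comparable to the table values.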

Install

```text
openclaw skills install ecocompute
```

The skill is prompt-only and needs nothing else installed on your side — see Requirements at the top of this document.

Changelog (recent)

v3.0.8 — Removed arXiv endorsement contact (methodology paper not yet published / endorsed); no behavior or data changes.
v3.0.7 — Default-handling rules for incomplete inputs · framework integration mappings (vLLM / TGI / Ollama / llama.cpp) · dataset version footer in every response · Requirements section.
v3.0.6 — Anti-pattern table split into measured vs. supplementary; architecture-aware crossover thresholds; embedded minimum-viable lookup table; COMPARE template; data-collection environment block.
v3.0.5 — Documentation refactor and cleanup; align with v3.0.5 crossover findings; advisory tone.

Contact

Design partners / pilots: zhanghongping1982@gmail.com

🌍 Making AI development more sustainable, one model at a time.
