LLM Inference Performance Estimator

v1.0.0

Estimate LLM inference performance metrics including TTFT, decode speed, and VRAM requirements based on model architecture, GPU specs, and quantization format.

Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The skill's name/description (LLM inference performance estimator) matches the actions described in SKILL.md: parsing model configs, accepting GPU specs/quant formats, and computing TTFT/throughput/VRAM. It does not request unrelated binaries, credentials, or system config paths.
Instruction Scope
Runtime instructions stay within the stated purpose: they ask for a preset model name or a model config.json (user-pasted content or a local file path) and GPU specs. The only noteworthy behavior is that, if given a local file path, the agent is instructed to read that file to extract fields — which is necessary for the estimator but means the agent will access whatever file path the user supplies. The SKILL.md does not instruct the agent to fetch remote URLs itself (it suggests the user open HF/ModelScope links in a browser and paste the config).
Install Mechanism
No install spec or code files — instruction-only skill. This minimizes risk because nothing is downloaded or written to disk by the skill itself.
Credentials
The skill declares no required environment variables, credentials, or special config paths. The only inputs are user-supplied model config data and GPU specs, which are proportionate to estimation functionality.
Persistence & Privilege
always is false and the skill is user-invocable. disable-model-invocation is default (agent may invoke autonomously), which is the platform default and not excessive here. The skill does not request persistent system-wide changes or access to other skills' configs.
Assessment
This skill appears to do exactly what it says: estimate TTFT, decode speed, and VRAM from model and GPU specs. It asks for no credentials and performs no installs. A few practical cautions before use:

  • If you provide a local file path, the agent will read that file. Do not point it at unrelated sensitive files (e.g., ~/.ssh, credentials files, or system configs).
  • Prefer copying and pasting only the model config.json contents (or sanitizing it) rather than giving a broad directory path. Configs typically do not contain secrets, but double-check before pasting.
  • The skill suggests visiting HF/ModelScope URLs in your browser and pasting config text; it does not fetch those URLs itself. If you prefer, provide model parameters manually instead of providing a file.
  • No environment variables or cloud credentials are requested, and there is no install step. If you later see prompts asking for secrets or for the skill to fetch remote resources, stop and verify why.

Overall this skill is internally consistent and low-risk for the stated task; follow the above precautions about local file paths and pasted content.


Latest: vk979qre0523kbz94056qqbh31583gn0m
107 downloads · 1 star · 1 version · Updated 3w ago
v1.0.0 · MIT-0

LLM Inference Performance Estimator

Estimate TTFT (Time To First Token), decode speed (tokens/s), and VRAM usage for a given LLM on a specific GPU.

How to Use

The user may invoke this skill in several ways:

  1. Named model: /llm-perf-estimator Qwen2.5-7B RTX4090 2048 512 fp16
  2. With config file: /llm-perf-estimator config.json RTX4090 2048 512 int4
  3. Interactive: /llm-perf-estimator — ask the user step by step

Arguments (all optional, prompt for missing ones):

  • model — model name from preset list, or path to a HuggingFace config.json
  • gpu — GPU name from preset list, or custom specs
  • input_tokens — prefill sequence length (default: 1024)
  • output_tokens — number of tokens to generate (default: 256)
  • quant — quantization format: fp16, bf16, fp8, int8, int4 (default: fp16)

Step 1 — Resolve Model Architecture

Preset Models

If the user provides a known model name, use the following presets:

| Model | Type | Total Params | Activated Params | Layers | Hidden | Heads (Q) | Heads (KV) | FFN Type | Intermediate | Vocab |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5-4B | Hybrid Dense | 4B | 4B | 32 (8 full + 24 linear) | 2560 | 16 (full) / 16 (linear) | 4 (full) | SwiGLU | 9216 | 248320 |
| Qwen3.5-35B-A3B | Hybrid MoE | 35B | 3B | 40 (10 full + 30 linear) | 2048 | 16 (full) / 16 (linear) | 2 (full) | SwiGLU + MoE | 8×512 per tok | 248320 |

If the model is not in the preset list and no config file is provided, ask the user to provide a config.json. They can get it without downloading the full model:

# ModelScope (browser)
https://modelscope.cn/models/{org}/{model}/file/view/master/config.json

# HuggingFace (browser)
https://huggingface.co/{org}/{model}/blob/main/config.json

Open the URL, copy the content, and paste it directly into the conversation. Alternatively, provide the local file path if the model is already downloaded.

If the user cannot provide a config, ask them to manually input:

  • num_hidden_layers, hidden_size, num_attention_heads, num_key_value_heads
  • intermediate_size, vocab_size
  • For MoE: num_experts, num_experts_per_tok, moe_intermediate_size

Parsing config.json

If the user provides a config.json path, read the file and extract:

num_hidden_layers, hidden_size, num_attention_heads, num_key_value_heads,
intermediate_size, vocab_size, model_type,
# MoE fields (if present):
num_experts / num_local_experts, num_experts_per_tok, moe_intermediate_size
# Hybrid attention (if present):
layer_types  ← list of strings, e.g. ["linear_attention", ..., "full_attention", ...]
head_dim     ← if explicitly provided, use it; otherwise head_dim = hidden_size / num_attention_heads

Determine num_full_attn_layers:

  • If layer_types exists: num_full_attn_layers = count of "full_attention" in layer_types
  • If layer_types is absent (standard transformer): num_full_attn_layers = num_hidden_layers

Note on nested configs (e.g. Qwen3.5-35B-A3B has a text_config wrapper):

  • If the top-level JSON has a text_config key, read all text model fields from inside it.
  • head_dim may be explicitly set (e.g. 256); prefer that over computing from hidden_size / num_attention_heads.

Note on tie_word_embeddings: if true, the embedding table and lm_head share the same weights. Do not count them twice in VRAM — the embedding contributes vocab_size × hidden_size × bytes_per_param only once.

Note on attn_output_gate: recognized but ignored in calculations — its contribution to FLOPs and VRAM is <1% and within the MFU uncertainty margin.
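The parsing rules above can be sketched as a small helper. This is a minimal sketch, not part of the skill itself: field names follow HuggingFace conventions, and the output keys are illustrative. Parsed values should still be confirmed with the user.

```python
import json

def parse_model_config(raw: str) -> dict:
    """Extract the estimator's required fields from a config.json string (a sketch)."""
    cfg = json.loads(raw)
    # Nested configs (e.g. a text_config wrapper): read text-model fields from inside it.
    cfg = cfg.get("text_config", cfg)

    out = {k: cfg[k] for k in (
        "num_hidden_layers", "hidden_size", "num_attention_heads",
        "num_key_value_heads", "intermediate_size", "vocab_size",
    )}
    # Prefer an explicit head_dim over the computed default.
    out["head_dim"] = cfg.get(
        "head_dim", cfg["hidden_size"] // cfg["num_attention_heads"])
    # Hybrid attention: only "full_attention" layers keep a growing KV cache.
    layer_types = cfg.get("layer_types")
    out["num_full_attn_layers"] = (
        layer_types.count("full_attention") if layer_types
        else cfg["num_hidden_layers"])
    # If true, embedding and lm_head share weights: count them once in VRAM.
    out["tie_word_embeddings"] = cfg.get("tie_word_embeddings", False)
    # MoE fields, if present.
    for k in ("num_experts_per_tok", "moe_intermediate_size"):
        if k in cfg:
            out[k] = cfg[k]
    if "num_experts" in cfg or "num_local_experts" in cfg:
        out["num_experts"] = cfg.get("num_experts", cfg.get("num_local_experts"))
    return out
```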


Step 2 — Resolve GPU Specs

Preset GPUs

| GPU | VRAM (GB) | BF16 TFLOPS | FP8 TFLOPS | INT8 TOPS | HBM BW (GB/s) |
|---|---|---|---|---|---|
| RTX 4060 | 8 | 15.1 | 30.2 | | 272 |
| RTX 4060 Ti | 16 | 22.1 | 44.2 | | 288 |
| RTX 4070 | 12 | 29.1 | 58.2 | | 504 |
| RTX 4070 Ti | 12 | 40.1 | 80.2 | | 504 |
| RTX 4070 Ti Super | 16 | 40.1 | 80.2 | | 672 |
| RTX 4080 | 16 | 48.7 | 97.4 | | 717 |
| RTX 4080 Super | 16 | 52.2 | 104.4 | | 736 |
| RTX 4090 | 24 | 82.6 | 165.2 | | 1008 |
| RTX 5070 Ti | 16 | 176.0 | 352.0 | 352.0 | 896 |
| RTX 5080 | 16 | 225.0 | 450.0 | 450.0 | 960 |
| RTX 5090 | 32 | 419.0 | 838.0 | 838.0 | 1792 |
| A10G | 24 | 31.2 | | 62.5 | 600 |
| A100-40G | 40 | 77.97 | | 311.9 | 1555 |
| A100-80G | 80 | 77.97 | | 311.9 | 2000 |
| H100-SXM | 80 | 989.4 | 1978.9 | 3958.0 | 3350 |
| H100-PCIe | 80 | 756.0 | 1513.0 | 3026.0 | 2000 |
| H200-SXM | 141 | 989.4 | 1978.9 | 3958.0 | 4800 |
| L4 | 24 | 30.3 | 60.6 | 121.2 | 300 |
| L40S | 48 | 91.6 | 183.2 | 366.4 | 864 |
| MI300X | 192 | 1307.4 | 2614.9 | 5229.8 | 5300 |
| Apple M4 (16GB) | 16 | 4.6 | | | 120 |
| Apple M4 Pro (48GB) | 48 | 9.2 | | | 273 |
| Apple M4 Max (128GB) | 128 | 18.4 | | | 546 |

If the GPU is not listed, ask the user to provide:

  • VRAM (GB)
  • BF16/FP16 TFLOPS
  • HBM bandwidth (GB/s)

Step 3 — Quantization Bytes Per Parameter

| Format | Bytes/param | Compute dtype | Notes |
|---|---|---|---|
| fp32 | 4.0 | fp32 | Rarely used for inference |
| bf16 / fp16 | 2.0 | bf16/fp16 | Baseline |
| fp8 | 1.0 | fp8 | Requires H100/H200/RTX50xx |
| int8 | 1.0 | int8 | W8A8 or W8A16 |
| int4 | 0.5 | int4/fp16 | GPTQ/AWQ/bitsandbytes |

Select the GPU TFLOPS column matching the compute dtype:

  • fp16/bf16 → BF16 TFLOPS
  • fp8 → FP8 TFLOPS (fall back to BF16 if not supported, with a warning)
  • int8 → INT8 TOPS
  • int4 → BF16 TFLOPS (dequant to fp16 for matmul in most frameworks)
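The column-selection rules above can be sketched as a helper. The dict keys (`bf16_tflops`, `fp8_tflops`, `int8_tops`) are illustrative names for the table columns, not a fixed API.

```python
def select_gpu_tflops(quant, gpu):
    """Pick the GPU throughput column matching the compute dtype.

    gpu is a dict with "bf16_tflops" and, where supported, "fp8_tflops"
    and "int8_tops". Returns (throughput, warning_or_None).
    """
    bf16 = gpu["bf16_tflops"]
    if quant in ("fp16", "bf16"):
        return bf16, None
    if quant == "fp8":
        fp8 = gpu.get("fp8_tflops")
        if fp8 is None:
            return bf16, "GPU has no FP8 support; falling back to BF16 throughput"
        return fp8, None
    if quant == "int8":
        int8 = gpu.get("int8_tops")
        if int8 is None:
            return bf16, "No INT8 TOPS listed; falling back to BF16 throughput"
        return int8, None
    if quant == "int4":
        # Most frameworks dequantize int4 weights to fp16 for the matmul.
        return bf16, None
    raise ValueError(f"unknown quant format: {quant}")
```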

Step 4 — Compute VRAM Requirements

4.1 Weight Memory

weight_bytes = total_params × bytes_per_param
weight_GB = weight_bytes / 1e9

For MoE models, total_params includes all expert weights (not just activated).
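As a sketch of this step (the bytes-per-param values come from the Step 3 table):

```python
# Bytes per parameter by quantization format (Step 3 table).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0,
                   "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_gb(total_params, quant):
    """Weight memory in GB; total_params counts ALL experts for MoE models."""
    return total_params * BYTES_PER_PARAM[quant] / 1e9

# e.g. a 7B model in int4: 7e9 × 0.5 / 1e9 = 3.5 GB
```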

4.2 KV Cache Memory

Only full attention layers maintain a KV cache. Linear attention layers use a fixed-size recurrent state (negligible, ~tens of MB) that does not grow with sequence length.

kv_heads = num_key_value_heads          # from the full attention config
kv_bytes = 2.0                          # KV cache is usually kept in fp16/bf16 even when weights are quantized
kv_bytes_per_token = 2 × num_full_attn_layers × kv_heads × head_dim × kv_bytes
kv_cache_GB = kv_bytes_per_token × (input_tokens + output_tokens) / 1e9

If num_full_attn_layers = num_hidden_layers (standard transformer), this reduces to the standard formula.
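A minimal sketch of the KV cache formula. The factor of 2 covers the K and V tensors; `kv_bytes` defaults to 2.0 because the KV cache typically stays in fp16/bf16 even when weights are quantized (KV cache quantization is out of scope per the caveats).

```python
def kv_cache_gb(num_full_attn_layers, kv_heads, head_dim,
                input_tokens, output_tokens, kv_bytes=2.0):
    """KV cache size in GB for the full sequence (batch=1)."""
    # 2 = one K tensor + one V tensor per full-attention layer.
    per_token = 2 * num_full_attn_layers * kv_heads * head_dim * kv_bytes
    return per_token * (input_tokens + output_tokens) / 1e9

# A standard 32-layer model with 8 KV heads and head_dim 128:
# 2 × 32 × 8 × 128 × 2 = 131072 bytes/token ≈ 0.13 MB/token
```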

4.3 Activation Memory (prefill peak)

activation_GB ≈ num_layers × hidden_size × input_tokens × bytes_per_param × 2 / 1e9

This is an approximation; actual peak depends on framework and attention implementation.

4.4 Total VRAM

total_VRAM_GB = weight_GB + kv_cache_GB + activation_GB

Add a 15% overhead for framework buffers, CUDA context, etc.:

total_VRAM_GB_with_overhead = total_VRAM_GB × 1.15
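Putting the three components together with the overhead factor:

```python
def total_vram_gb(weight_gb, kv_cache_gb, activation_gb, overhead=0.15):
    """Total VRAM in GB with the framework/CUDA-context overhead applied."""
    return (weight_gb + kv_cache_gb + activation_gb) * (1 + overhead)
```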

Step 5 — Estimate TTFT (Prefill Latency)

Prefill is compute-bound for long sequences.

5.1 Attention FLOPs (prefill)

Only full attention layers have O(n²) attention compute. Linear attention layers are O(n) and their attention FLOPs are already captured in the projection FLOPs (Step 5.3).

attn_flops = 4 × num_full_attn_layers × input_tokens² × hidden_size

(factor of 4 = QK matmul + softmax + AV matmul, forward pass)

If num_full_attn_layers = num_hidden_layers, this is the standard transformer formula.

5.2 FFN FLOPs (prefill)

For SwiGLU/GeGLU (3 projections: gate, up, down):

ffn_flops = 3 × 2 × num_layers × input_tokens × hidden_size × intermediate_size

For MoE, replace intermediate_size with num_experts_per_tok × moe_intermediate_size.

5.3 QKV + Output Projection FLOPs

For full attention layers (standard QKV projections):

full_proj_flops = 2 × num_full_attn_layers × input_tokens × hidden_size
                  × (num_attention_heads × head_dim + 2 × kv_heads × head_dim + hidden_size)

For linear attention layers (also have Q/K/V-equivalent projections, but different dims):

linear_proj_flops = 2 × num_linear_attn_layers × input_tokens × hidden_size
                    × (linear_num_key_heads × linear_key_head_dim      # Q projection
                       + linear_num_key_heads × linear_key_head_dim    # K projection (same dims as Q)
                       + linear_num_value_heads × linear_value_head_dim  # V projection
                       + hidden_size)                                  # output projection

If layer_types is absent (standard transformer), only full_proj_flops applies and num_linear_attn_layers = 0.

5.4 Total Prefill FLOPs

total_prefill_flops = attn_flops + ffn_flops + full_proj_flops + linear_proj_flops

5.5 TTFT

Apply MFU (Model FLOP Utilization) efficiency factor:

| Scenario | MFU |
|---|---|
| Long prompt (>512 tokens), data center GPU | 0.45 |
| Long prompt, consumer GPU | 0.35 |
| Short prompt (<128 tokens) | 0.25 |

effective_tflops = gpu_tflops × MFU
TTFT_seconds = total_prefill_flops / (effective_tflops × 1e12)
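Steps 5.1–5.5 can be sketched for the standard (non-hybrid) case; the linear-attention projection terms from Step 5.3 are omitted here for brevity.

```python
def ttft_seconds(num_layers, num_full_attn_layers, hidden_size,
                 intermediate_size, n_heads, kv_heads, head_dim,
                 input_tokens, gpu_tflops, mfu):
    """Prefill latency sketch: attention + FFN + projection FLOPs over effective TFLOPS."""
    # 5.1: O(n²) attention compute, full-attention layers only.
    attn = 4 * num_full_attn_layers * input_tokens**2 * hidden_size
    # 5.2: SwiGLU FFN (3 projections, 2 FLOPs per MAC).
    ffn = 3 * 2 * num_layers * input_tokens * hidden_size * intermediate_size
    # 5.3: QKV + output projections for full-attention layers.
    proj = (2 * num_full_attn_layers * input_tokens * hidden_size
            * (n_heads * head_dim + 2 * kv_heads * head_dim + hidden_size))
    total_prefill_flops = attn + ffn + proj
    # 5.5: apply MFU.
    return total_prefill_flops / (gpu_tflops * mfu * 1e12)
```

Because of the n² attention term, doubling the prompt length more than doubles TTFT.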

Step 6 — Estimate Decode Speed

Decode is memory-bandwidth-bound at batch=1.

6.1 Bytes Read Per Decode Step

Each decode step reads:

  • All activated model weights once
  • KV cache for all previous tokens (full attention layers only; linear attention state is fixed-size and already loaded with weights)
activated_weight_bytes = activated_params × bytes_per_param
kv_cache_bytes_at_step = kv_bytes_per_token × (input_tokens + current_output_tokens)
bytes_per_step = activated_weight_bytes + kv_cache_bytes_at_step

For the average decode step, use current_output_tokens ≈ output_tokens / 2.

6.2 Decode Speed

Apply bandwidth utilization efficiency factor:

| Scenario | BW Utilization |
|---|---|
| Data center GPU (HBM2e/HBM3) | 0.85 |
| Consumer GPU (GDDR6X) | 0.75 |
| Apple Silicon (unified memory) | 0.80 |

effective_bandwidth = gpu_bandwidth_GBs × bw_utilization
decode_speed_tps = effective_bandwidth × 1e9 / bytes_per_step
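A sketch of Steps 6.1–6.2, using the average-step approximation (`output_tokens / 2`) for the KV cache length:

```python
def decode_speed_tps(activated_params, bytes_per_param, kv_bytes_per_token,
                     input_tokens, output_tokens, bandwidth_gbs, bw_util):
    """Tokens/s for the average decode step (batch=1, bandwidth-bound)."""
    # Average KV cache length over the generation.
    avg_kv_tokens = input_tokens + output_tokens / 2
    # Each step reads all activated weights plus the KV cache once.
    bytes_per_step = (activated_params * bytes_per_param
                      + kv_bytes_per_token * avg_kv_tokens)
    return bandwidth_gbs * bw_util * 1e9 / bytes_per_step
```

For example, a dense 7B model in fp16 on a 1008 GB/s consumer GPU is dominated by the 14 GB weight read per step, giving roughly 50 tokens/s.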

Step 7 — Output Report

Present results as a Markdown report with the following sections:

Section 1: Configuration Summary

| Parameter | Value |
|---|---|
| Model | {model_name} |
| Type | Dense / MoE / Hybrid MoE |
| Total Params | {X}B |
| Activated Params | {X}B |
| Total Layers | {N} |
| Full Attention Layers | {N} ({N} linear attention) |
| GPU | {gpu_name} |
| VRAM Available | {X} GB |
| Quantization | {quant} |
| Input Tokens | {N} |
| Output Tokens | {N} |

Section 2: VRAM Breakdown

| Component | Size (GB) |
|---|---|
| Model Weights | {X} |
| KV Cache | {X} |
| Activations (peak) | {X} |
| Framework Overhead (15%) | {X} |
| Total Required | {X} |
| GPU Available | {X} |
| Fits in VRAM? | ✅ Yes / ❌ No |

If it doesn't fit, suggest:

  • A lower quantization format
  • Offloading options (CPU offload, disk offload)

Section 3: Performance Estimates

| Metric | Estimate |
|---|---|
| TTFT (Time to First Token) | {X} ms |
| Decode Speed | {X} tokens/s |
| Time to Generate {N} tokens | {X} s |
| Total End-to-End Latency | {X} s |
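The last two rows follow directly from the earlier estimates:

```python
def end_to_end_seconds(ttft_s, decode_tps, output_tokens):
    """Total latency: prefill time plus time to generate all output tokens."""
    return ttft_s + output_tokens / decode_tps
```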

Section 4: Assumptions & Caveats

List the MFU and bandwidth utilization values used, and note:

  • Estimates assume batch_size=1, single GPU
  • Actual performance varies by framework (vLLM, llama.cpp, Ollama, etc.)
  • FlashAttention / FlashAttention-2 is assumed for prefill
  • KV cache quantization not considered
  • Speculative decoding not considered

Notes for the Agent

  • Always show intermediate calculations in a collapsible section or footnote if the user asks "how did you calculate this"
  • If VRAM is insufficient, proactively suggest the minimum quantization that would fit
  • If the user provides a config.json, confirm the parsed values before computing
  • Round all results to 2 significant figures for readability
  • For MoE models, clearly distinguish total vs activated parameters in all calculations
