Aiter Ck Gemm Tune

Tune AITER's CK GEMM and fused MoE kernels for specific model shapes on AMD GPUs. Covers shape discovery from inference logs, baseline benchmarking, kernel tuning, and before/after performance comparison.

Install

openclaw skills install aiter-ck-gemm-tune

AITER CK GEMM & MoE Tune

This skill tunes AITER's Composable Kernel (CK) GEMM and fused MoE kernels to achieve better performance for specific model shapes. The workflow has six steps: discover the environment, capture shapes, run a baseline benchmark, tune kernels, rerun and compare, and generate a report. It supports both regular GEMM variants (a8w8, bf16, etc.) and the moe_2stages variant for fused MoE kernels used in Mixture-of-Experts models.

Background

AITER (AI Tensor Engine for ROCm) is AMD's high-performance operator library for LLM inference on ROCm/AMD GPUs. It provides optimized kernels for common operations in transformer models — most critically, GEMM (General Matrix Multiply), which dominates the compute in LLM inference (linear projections, attention, MLP/FFN layers, MoE expert computations).

Composable Kernel (CK) is AMD's open-source library of GPU kernel primitives. CK provides templated, composable building blocks for writing high-performance GPU kernels. AITER uses CK to implement its GEMM kernels, with many kernel variants optimized for different quantization schemes (8-bit such as INT8 or FP8, 4-bit FP4, and BF16) and memory layouts (blockscale, pre-shuffled B matrix as in the bpreshuffle variants, batched, MoE).

Why tuning matters: Each CK GEMM kernel has many implementation variants (tile sizes, pipeline configurations, split-K strategies). The optimal variant depends on the specific GEMM shape (M, N, K) and the GPU hardware (number of compute units). AITER's tuning process benchmarks all candidate kernel configurations for each shape and selects the fastest one. Shapes come from specific model architectures — for example, a Llama 70B model produces different (N, K) pairs than a DeepSeek V3 model. The M dimension corresponds to the batch/token count and varies at runtime, so tuning sweeps M as powers of 2 to cover all realistic batch sizes.

How it fits into the inference stack: Inference frameworks like sglang and vllm call into AITER for their GEMM operations. When AITER encounters a shape that hasn't been tuned, it falls back to a default kernel configuration and logs a warning. The tuning workflow in this skill captures those untuned shapes and finds optimal kernel configurations for them.

Supported Kernel Variants

Each variant follows the same tuning workflow pattern. The table below maps each variant to its key files (all paths relative to the aiter root):

Variant	Tune Script	Untuned CSV	Tuned CSV	Test File	README
`a8w8`	`csrc/ck_gemm_a8w8/gemm_a8w8_tune.py`	`aiter/configs/a8w8_untuned_gemm.csv`	`aiter/configs/a8w8_tuned_gemm.csv`	`op_tests/test_gemm_a8w8.py`	`csrc/ck_gemm_a8w8/README.md`
`a8w8_blockscale`	`csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale_tune.py`	`aiter/configs/a8w8_blockscale_untuned_gemm.csv`	`aiter/configs/a8w8_blockscale_tuned_gemm.csv`	`op_tests/test_gemm_a8w8_blockscale.py`	`csrc/ck_gemm_a8w8_blockscale/README.md`
`a8w8_bpreshuffle`	`csrc/ck_gemm_a8w8_bpreshuffle/gemm_a8w8_bpreshuffle_tune.py`	`aiter/configs/a8w8_bpreshuffle_untuned_gemm.csv`	`aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv`	(none)	`csrc/ck_gemm_a8w8_bpreshuffle/README.md`
`a8w8_blockscale_bpreshuffle`	`csrc/ck_gemm_a8w8_blockscale_bpreshuffle/gemm_a8w8_blockscale_bpreshuffle_tune.py`	`aiter/configs/a8w8_blockscale_bpreshuffle_untuned_gemm.csv`	`aiter/configs/a8w8_blockscale_bpreshuffle_tuned_gemm.csv`	(none)	`csrc/ck_gemm_a8w8_blockscale_bpreshuffle/README.md`
`a4w4_blockscale`	`csrc/ck_gemm_a4w4_blockscale/gemm_a4w4_blockscale_tune.py`	`aiter/configs/a4w4_blockscale_untuned_gemm.csv`	`aiter/configs/a4w4_blockscale_tuned_gemm.csv`	`op_tests/test_gemm_a4w4.py`	`csrc/ck_gemm_a4w4_blockscale/README.md`
`batched_a8w8`	`csrc/ck_batched_gemm_a8w8/batched_gemm_a8w8_tune.py`	`aiter/configs/a8w8_untuned_batched_gemm.csv`	`aiter/configs/a8w8_tuned_batched_gemm.csv`	`op_tests/test_batched_gemm_a8w8.py`	`csrc/ck_batched_gemm_a8w8/README.md`
`batched_bf16`	`csrc/ck_batched_gemm_bf16/batched_gemm_bf16_tune.py`	`aiter/configs/bf16_untuned_batched_gemm.csv`	`aiter/configs/bf16_tuned_batched_gemm.csv`	`op_tests/test_batched_gemm_bf16.py`	`csrc/ck_batched_gemm_bf16/README.md`
`moe_2stages`	`csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py`	`aiter/configs/untuned_fmoe.csv`	`aiter/configs/tuned_fmoe.csv`	`op_tests/test_moe_2stage.py`	`csrc/ck_gemm_moe_2stages_codegen/README.md`

Log Files

The skill records outputs from Steps 2, 3, and 4 to log files under $AITER_PATH/tune_logs/. Use this naming convention:

$AITER_PATH/tune_logs/<variant>_bench_before_<YYYYMMDD_HHMMSS>.log  # Step 2: baseline benchmark
$AITER_PATH/tune_logs/<variant>_tuning_<YYYYMMDD_HHMMSS>.log        # Step 3: tuning process
$AITER_PATH/tune_logs/<variant>_bench_after_<YYYYMMDD_HHMMSS>.log   # Step 4: post-tune benchmark

For example:

tune_logs/a8w8_blockscale_bench_before_20260321_143022.log
tune_logs/a8w8_blockscale_tuning_20260321_150000.log
tune_logs/a8w8_blockscale_bench_after_20260321_160515.log

Create the tune_logs/ directory if it doesn't exist. For interactive commands (Steps 2 and 4), use 2>&1 | tee <log> to show output in real time while logging. For long-running background jobs (Step 3), redirect output to file directly (> <log> 2>&1).
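The naming convention above can be captured in a small helper so that all three log files for a run follow the same pattern. This is a minimal sketch; the function name `tune_log_path` and its parameters are hypothetical and not part of the skill's bundled scripts.

```python
from datetime import datetime
from pathlib import Path

# Stage names from the naming convention above.
STAGES = ("bench_before", "tuning", "bench_after")


def tune_log_path(aiter_path, variant, stage, now=None):
    """Build <aiter_path>/tune_logs/<variant>_<stage>_<YYYYMMDD_HHMMSS>.log."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(aiter_path) / "tune_logs" / f"{variant}_{stage}_{stamp}.log"


if __name__ == "__main__":
    # Fixed timestamp so the file name matches the tuning example above.
    path = tune_log_path("/path/to/aiter", "a8w8_blockscale", "tuning",
                         now=datetime(2026, 3, 21, 15, 0, 0))
    print(path.name)
```

The helper only builds the path; creating the directory (for example with `path.parent.mkdir(parents=True, exist_ok=True)`) is left to the caller.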

Workflow

Follow these steps in order. At each step, communicate clearly with the user about what is happening, what you found, and what you plan to do next.

Step 0: Environment Discovery

Before anything else, establish the working environment. Tuning typically runs inside a Docker container on a remote node with AMD GPUs. Ask the user to provide access details upfront:

Target environment access: Ask the user how to reach the tuning environment:
- Node access: How to SSH into the node (e.g., ssh user@node-hostname)
- Docker container: The container name or ID to exec into (e.g., docker exec -it <container_name> bash)
- If the user is already inside the target environment (local or already SSH'd in), that's fine too — just confirm.
- All subsequent commands (Steps 1–4) should be run inside this environment.
Locate aiter: The pip package may be named aiter or amd-aiter, so use pip list | grep -i aiter to find the exact package name, then pip show <package_name> | grep Location to get its installed path. Do not guess common locations — there may be multiple aiter copies on the system, and only the one registered in pip is the active installation. Verify by checking that csrc/ and aiter/configs/ exist under that path.
Log location: Ask the user where the inference logs are. These could be from sglang, vllm, or another framework. Logs could also be provided directly. Logs may be on the node, inside the container, or on the user's local machine.
Verify aiter installation: Check if aiter is installed in dev mode. If not, warn the user that python3 setup.py develop from the aiter root may be needed.
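The directory check described for locating aiter can be sketched as follows. The function name `looks_like_aiter_root` is hypothetical, and the check only confirms that `csrc/` and `aiter/configs/` exist under the candidate path; it does not confirm that the path is the one registered in pip.

```python
import tempfile
from pathlib import Path


def looks_like_aiter_root(path):
    """Return True if both csrc/ and aiter/configs/ exist under `path`."""
    root = Path(path)
    return (root / "csrc").is_dir() and (root / "aiter" / "configs").is_dir()


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real install.
    with tempfile.TemporaryDirectory() as tmp:
        print(looks_like_aiter_root(tmp))          # empty directory
        (Path(tmp) / "csrc").mkdir()
        (Path(tmp) / "aiter" / "configs").mkdir(parents=True)
        print(looks_like_aiter_root(tmp))          # both directories present
```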

Step 1: Capture Shapes & Identify Kernel Type

The goal is to extract the shapes that need tuning and determine which kernel variant to tune.

Option A: Parse from aiter logs (preferred)

AITER logs untuned shapes in two different patterns depending on the kernel type. The bundled script scripts/parse_untuned_shapes.py auto-detects both patterns in a single pass.

Regular GEMM pattern:

shape is M:<value>, N:<value>, K:<value> ... not found tuned config in /tmp/aiter_configs/<variant>_tuned_gemm.csv, will use default config!

Fused MoE pattern (the moe_2stages variant):

[fused_moe] using 1stage default for (cu_num, token, model_dim, inter_dim, expert, topk, 'ActivationType.X', 'torch.dtype', 'torch.dtype', 'torch.dtype', 'QuantType.X', use_g1u1, doweight_stage1)

The key word in the MoE pattern is "default" — it means no tuned config was found and the kernel falls back to heuristics. When a tuned config IS found, the log shows kernel names instead of "default".
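To illustrate how the regular GEMM pattern can be matched, here is a minimal sketch using a regular expression. It is not the implementation of the bundled scripts/parse_untuned_shapes.py; the names `GEMM_UNTUNED` and `unique_nk_pairs` are hypothetical, the sample line is synthetic (built from the pattern above, with nothing in the elided part), and the regex only covers config files named `<variant>_tuned_gemm.csv`.

```python
import re

# Matches the regular GEMM warning and captures M, N, K and the variant name.
GEMM_UNTUNED = re.compile(
    r"shape is M:(?P<M>\d+), N:(?P<N>\d+), K:(?P<K>\d+)"
    r".*?not found tuned config in \S*/(?P<variant>\w+)_tuned_gemm\.csv"
)


def unique_nk_pairs(lines):
    """Group unique (N, K) pairs by variant; M is dropped because it is swept later."""
    found = {}
    for line in lines:
        match = GEMM_UNTUNED.search(line)
        if match:
            pair = (int(match["N"]), int(match["K"]))
            found.setdefault(match["variant"], set()).add(pair)
    return {variant: sorted(pairs) for variant, pairs in found.items()}


if __name__ == "__main__":
    # Synthetic line assembled from the documented pattern, not a captured log.
    sample = ("shape is M:32, N:12288, K:4096, not found tuned config in "
              "/tmp/aiter_configs/a8w8_blockscale_tuned_gemm.csv, "
              "will use default config!")
    print(unique_nk_pairs([sample]))
```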

Step 1a: Run the parser to see what's in the log:

python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file>

This prints all variants found. For regular GEMM, it shows unique (N, K) pairs. For moe_2stages, it shows unique MoE configs (model_dim, inter_dim, expert, topk, quant type, etc.) and the token counts seen in the log.

Step 1b: If multiple variants are found, ask the user which to tune. Each variant must be tuned separately (different tune scripts, CSVs, and test files). GEMM and MoE cannot be combined in one CSV — they have entirely different formats.

Step 1c: Generate the untuned CSV for the chosen variant(s):

# Regular GEMM variant with M sweep:
python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file> --variant a8w8_blockscale --csv <output.csv> --m-sweep

# Fused MoE — use actual token values from the log:
python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file> --variant moe_2stages --csv <output.csv>

# Fused MoE — sweep token as powers of 2 (more thorough):
python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file> --variant moe_2stages --csv <output.csv> --token-sweep

Present the results to the user for confirmation before proceeding. If tuning multiple variants, repeat Steps 2–4 for each variant separately.

Option B: Direct user input

The user provides shapes and specifies the kernel variant directly.

Generating sweep values for tuning

Regular GEMM: For each unique (N, K) pair, generate tuning rows by sweeping M as powers of 2:

M = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768

This produces 16 × number_of_unique_NK_pairs rows for the untuned CSV.
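The row count follows directly from the sweep: 16 powers of 2 (2^0 through 2^15) per (N, K) pair. A minimal sketch, with hypothetical names `M_SWEEP` and `gemm_untuned_rows`:

```python
# Powers of 2 from 1 to 32768, matching the M list above.
M_SWEEP = [2 ** i for i in range(16)]


def gemm_untuned_rows(nk_pairs):
    """Return (M, N, K) rows: every swept M for each unique (N, K) pair."""
    return [(m, n, k) for (n, k) in nk_pairs for m in M_SWEEP]


if __name__ == "__main__":
    rows = gemm_untuned_rows([(12288, 4096), (24576, 1536)])
    print(len(rows))   # 16 rows per pair
    print(rows[0], rows[-1])
```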

Fused MoE: For each unique MoE config, either use the actual token values from the log (realistic) or sweep token as powers of 2 with --token-sweep (more thorough). There is no separate M dimension — the token count IS the batch dimension.

Note: The sweep for tuning (powers of 2) is separate from the values used for benchmarking in Steps 2/4. Benchmarking typically uses the test script's default list, which may include non-power-of-2 values. This is normal — we tune with powers of 2 to cover the key points.

Write the untuned CSV

The CSV format depends on the variant type:

Regular GEMM (e.g., a8w8_blockscale):

M,N,K
1,12288,4096
2,12288,4096
...
32768,12288,4096

Fused MoE (moe_2stages):

token,model_dim,inter_dim,expert,topk,act_type,dtype,q_dtype_a,q_dtype_w,q_type,use_g1u1,doweight_stage1
1,4096,512,512,10,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_1x128,1,0
4,4096,512,512,10,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_1x128,1,0
...

Write the CSV into the variant's untuned CSV file path (see the variant table above). Present the full shape list to the user before writing.
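As an illustration of the MoE format, the sketch below renders CSV text for one MoE config across several token counts using the header shown above. The names `MOE_FIELDS` and `moe_untuned_csv` are hypothetical; the bundled parser may build the file differently.

```python
import csv
import io

# Column order copied from the moe_2stages header above.
MOE_FIELDS = [
    "token", "model_dim", "inter_dim", "expert", "topk", "act_type", "dtype",
    "q_dtype_a", "q_dtype_w", "q_type", "use_g1u1", "doweight_stage1",
]


def moe_untuned_csv(config, tokens):
    """Render CSV text with one row per token count; `config` holds every field except token."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=MOE_FIELDS, lineterminator="\n")
    writer.writeheader()
    for token in tokens:
        writer.writerow({"token": token, **config})
    return buf.getvalue()


if __name__ == "__main__":
    # Config values taken from the example rows above.
    config = {
        "model_dim": 4096, "inter_dim": 512, "expert": 512, "topk": 10,
        "act_type": "ActivationType.Silu", "dtype": "torch.bfloat16",
        "q_dtype_a": "torch.float8_e4m3fn", "q_dtype_w": "torch.float8_e4m3fn",
        "q_type": "QuantType.per_1x128", "use_g1u1": 1, "doweight_stage1": 0,
    }
    print(moe_untuned_csv(config, [1, 4]), end="")
```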

Step 2: Baseline Benchmark

Run the unit test for the target kernel variant with the specific shapes from Step 1 to establish baseline performance before tuning. No rebuild is needed at this point.

Pre-benchmark checklist: ck_preshuffle

Some test scripts have a --ck_preshuffle or --preshuffle flag (currently only a8w8_blockscale and moe_2stages). The correct setting can be inferred from the kernel variant detected in Step 1:

If the log shows a8w8_blockscale_tuned_gemm.csv (no "bpreshuffle" in the name) → use --ck_preshuffle False
If the log shows a8w8_blockscale_bpreshuffle_tuned_gemm.csv → use --ck_preshuffle True

Mention the inferred setting in your response to the user for confirmation, but no need to ask them to specify — the log already tells you.

Other variants (e.g., a8w8, a4w4_blockscale, batched variants) do not have this flag — skip this for them.
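The inference rule for --ck_preshuffle reduces to a lookup on the tuned CSV file name seen in the log. A minimal sketch, with the hypothetical helper name `infer_ck_preshuffle`; it returns None for any file name the rule above does not cover.

```python
from pathlib import PurePosixPath


def infer_ck_preshuffle(tuned_csv_path):
    """Map the tuned CSV named in the log to a --ck_preshuffle value (or None)."""
    name = PurePosixPath(tuned_csv_path).name
    if name == "a8w8_blockscale_bpreshuffle_tuned_gemm.csv":
        return True
    if name == "a8w8_blockscale_tuned_gemm.csv":
        return False
    return None  # variant has no such flag, or the name is not recognized


if __name__ == "__main__":
    print(infer_ck_preshuffle("/tmp/aiter_configs/a8w8_blockscale_tuned_gemm.csv"))
```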

Handling test script `choices` constraints

Test scripts may have argparse choices restrictions on -m and/or -nk that reject values not in their hardcoded lists. Before running, read the argparse section at the bottom of the test file to check for choices constraints. If the shapes or M values you need are not in the choices list, you must modify the test script:

For -m: add missing M values (e.g., 16384, 32768) to both the choices and default lists.
For -nk: remove the choices parameter entirely (keep default) so any (N,K) pair can be passed.

CLI argument formats by variant

Variant	Test File	Shape Args	Example
`a8w8_blockscale`	`test_gemm_a8w8_blockscale.py`	`-m M1 M2 ... -nk N1,K1 N2,K2 ...`	`-m 1 2 4 ... 32768 -nk 12288,4096 24576,1536`
`a8w8`	`test_gemm_a8w8.py`	`-mnk M1,N1,K1 M2,N2,K2 ...`	`-mnk 1,12288,4096 2,12288,4096 4,12288,4096`
`a4w4_blockscale`	`test_gemm_a4w4.py`	`-mnk M1,N1,K1 M2,N2,K2 ...`	`-mnk 1,12288,4096 2,12288,4096 4,12288,4096`
`batched_a8w8`	`test_batched_gemm_a8w8.py`	`-s M1,N1,K1 M2,N2,K2 ...`	`-s 1,12288,4096 2,12288,4096 4,12288,4096`
`batched_bf16`	`test_batched_gemm_bf16.py`	`-s M1,N1,K1 M2,N2,K2 ...`	`-s 1,12288,4096 2,12288,4096 4,12288,4096`
`moe_2stages`	`test_moe_2stage.py`	`-t T1 T2 ... -dim M,I -e E -k K -q Q -a ACT -s DW -p PS`	See MoE example below

For regular GEMM variants that use -mnk or -s (combined M,N,K tuples), generate all combinations of the M sweep with each (N,K) pair. For a8w8_blockscale which takes -m and -nk separately, pass all M values once and all (N,K) pairs once — the test script handles the cross product internally.
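For the variants that take combined tuples, the cross product has to be spelled out on the command line. The sketch below generates those arguments; `mnk_args` and `combined_shape_command` are hypothetical helper names, and the flag (`-mnk` or `-s`) is passed in according to the table above.

```python
import shlex


def mnk_args(m_values, nk_pairs):
    """Return "M,N,K" strings: every M combined with each (N, K) pair."""
    return [f"{m},{n},{k}" for (n, k) in nk_pairs for m in m_values]


def combined_shape_command(test_file, flag, m_values, nk_pairs):
    """Build the benchmark command line for tests that take combined tuples."""
    argv = ["python3", test_file, flag, *mnk_args(m_values, nk_pairs)]
    return shlex.join(argv)


if __name__ == "__main__":
    print(combined_shape_command("op_tests/test_gemm_a8w8.py", "-mnk",
                                 [1, 2, 4], [(12288, 4096)]))
```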

Example for a8w8_blockscale with (N,K) pairs (12288,4096) and (24576,1536):

cd $AITER_PATH
mkdir -p tune_logs
python3 op_tests/test_gemm_a8w8_blockscale.py \
  -m 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
  -nk 12288,4096 24576,1536 \
  --ck_preshuffle False \
  2>&1 | tee tune_logs/a8w8_blockscale_bench_before_$(date +%Y%m%d_%H%M%S).log

Example for a8w8 with (N,K) pair (12288,4096):

cd $AITER_PATH
mkdir -p tune_logs
python3 op_tests/test_gemm_a8w8.py \
  -mnk 1,12288,4096 2,12288,4096 4,12288,4096 8,12288,4096 \
  16,12288,4096 32,12288,4096 64,12288,4096 128,12288,4096 \
  256,12288,4096 512,12288,4096 1024,12288,4096 2048,12288,4096 \
  4096,12288,4096 8192,12288,4096 16384,12288,4096 32768,12288,4096 \
  2>&1 | tee tune_logs/a8w8_bench_before_$(date +%Y%m%d_%H%M%S).log

MoE-specific benchmark (`moe_2stages`)

The test_moe_2stage.py script has a completely different CLI from the regular GEMM tests. Map the MoE config fields from Step 1 to CLI args:

MoE Config Field	CLI Arg	Notes
token	`-t T1 T2 ...`	Space-separated list of token counts
model_dim, inter_dim	`-dim M,I`	Comma-separated pair
expert	`-e E`	Number of experts
topk	`-k K`	Top-K experts
act_type	`-a silu` or `-a gelu`	Activation function
q_type	`-q N`	Quant index (see mapping below)
doweight_stage1	`-s f` or `-s t`	f=False, t=True
preshuffle	`-p f` or `-p t`	f=False, t=True

Quant index (-q) mapping — the -q value maps to (QuantType, q_dtype_a, q_dtype_w):

`-q`	QuantType	q_dtype_a	q_dtype_w	Common Name
0	No	None	None	a16w16 (no quant)
1	per_Tensor	fp8	fp8	a8w8 per-tensor
2	per_Token	fp8	fp8	a8w8 per-token
3	per_Token	fp8	int4	a8w4
4	per_1x32	fp4x2	fp4x2	a4w4
5	per_128x128	fp8	fp8	a8w8 blockscale
6	per_1x32	bf16	fp4x2	a16w4
7	per_1x32	fp8	fp4x2	a8w4

To determine the correct -q value, match the q_type, q_dtype_a, and q_dtype_w from the log against this table. For example, QuantType.per_1x128 with fp8/fp8 maps to -q 5. The log prints this quant type as per_1x128 while the test script calls it per_128x128; this naming difference is a known inconsistency, and both names refer to the same blockscale FP8 quantization.
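The table lookup, including the per_1x128 naming difference, can be sketched as a dictionary. The names `Q_INDEX`, `LOG_NAME_ALIASES`, and `quant_index` are hypothetical. The sketch assumes the caller has already shortened the dtype names from the log (for example torch.float8_e4m3fn to fp8) to the short names used in the table; that normalization is not shown.

```python
# (QuantType, q_dtype_a, q_dtype_w) -> -q index, copied from the table above.
Q_INDEX = {
    ("No", None, None): 0,
    ("per_Tensor", "fp8", "fp8"): 1,
    ("per_Token", "fp8", "fp8"): 2,
    ("per_Token", "fp8", "int4"): 3,
    ("per_1x32", "fp4x2", "fp4x2"): 4,
    ("per_128x128", "fp8", "fp8"): 5,
    ("per_1x32", "bf16", "fp4x2"): 6,
    ("per_1x32", "fp8", "fp4x2"): 7,
}

# The log says per_1x128 where the test script says per_128x128.
LOG_NAME_ALIASES = {"per_1x128": "per_128x128"}


def quant_index(q_type, q_dtype_a, q_dtype_w):
    """Return the -q value for a quant config, accepting log-style names."""
    name = q_type.split(".")[-1]              # "QuantType.per_1x128" -> "per_1x128"
    name = LOG_NAME_ALIASES.get(name, name)
    key = (name, q_dtype_a, q_dtype_w)
    if key not in Q_INDEX:
        raise KeyError(f"no -q index for {key}")
    return Q_INDEX[key]


if __name__ == "__main__":
    print(quant_index("QuantType.per_1x128", "fp8", "fp8"))
```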

Example for MoE with Qwen3.5 shapes (fp8 blockscale, 512 experts, topk=10):

cd $AITER_PATH
mkdir -p tune_logs
python3 op_tests/test_moe_2stage.py \
  -t 1 4 8 32 64 128 1024 2048 16384 \
  -dim 4096,512 -e 512 -k 10 -q 5 -a silu -s f -p f \
  2>&1 | tee tune_logs/moe_2stages_bench_before_$(date +%Y%m%d_%H%M%S).log

MoE bypass caveat

For moe_2stages with blockscale FP8 (QuantType.per_1x128), there is a bypass in fused_moe.py that skips tuned configs when token * topk <= 128. This means:

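For topk=10 (e.g., Qwen3.5): tokens 1–12 always use default heuristics regardless of tuning (12 × 10 = 120 ≤ 128, while 13 × 10 = 130 > 128)
For topk=8 (e.g., DeepSeek V3, which routes each token to 8 experts): tokens 1–16 always use default heuristics (16 × 8 = 128)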

Tuning these small token counts still produces valid configs, but they won't be used at inference time for this specific quant type. This is by design — the assembly kernel heuristics perform well enough at very small batch sizes. Focus benchmark attention on token counts above the bypass threshold.
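The bypass threshold is simple arithmetic on token * topk, so the largest bypassed token count for any topk is floor(128 / topk). A minimal sketch with hypothetical names (`BYPASS_LIMIT`, `is_bypassed`, `max_bypassed_token`); it restates the condition described above and is not code from fused_moe.py.

```python
# Tuned configs are skipped when token * topk <= 128 (blockscale FP8 only).
BYPASS_LIMIT = 128


def is_bypassed(token, topk):
    """True if this token count falls back to default heuristics."""
    return token * topk <= BYPASS_LIMIT


def max_bypassed_token(topk):
    """Largest token count that still hits the bypass for a given topk."""
    return BYPASS_LIMIT // topk


if __name__ == "__main__":
    tokens = [1, 4, 8, 32, 64, 128, 1024, 2048, 16384]
    # Token counts from the benchmark example that can actually use tuned configs.
    print([t for t in tokens if not is_bypassed(t, topk=10)])
```

Filtering the benchmark token list this way makes it easier to focus the before/after comparison on shapes where tuning can take effect.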

Record the baseline log file path — you will need it in Step 4 for comparison.

If the variant has no test file (e.g., a8w8_bpreshuffle), inform the user and ask how they'd like to benchmark.

Step 3: Tune

Check available GPUs

Before tuning, run rocm-smi to check how many GPUs are idle. Pass --mp <num_free_gpus> to parallelize tuning across them; tuning time drops roughly in proportion to the GPU count (for example, close to 8x faster with 8 GPUs than with 1).

rocm-smi --showuse | grep "GPU use"

Run the tuning script

The general command pattern is:

cd $AITER_PATH
python3 <tune_script> -i <untuned_csv> -o <tuned_csv> [options]

Tuning is a long-running job (potentially hours). Run it in the background with output redirected to a log file. Use nohup to ensure the process survives if the SSH session disconnects:

Example for a8w8_blockscale with 8 free GPUs:

cd $AITER_PATH
mkdir -p tune_logs
nohup python3 csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale_tune.py \
  -i aiter/configs/a8w8_blockscale_untuned_gemm.csv \
  -o aiter/configs/a8w8_blockscale_tuned_gemm.csv \
  --libtype both --mp 8 --timeout 600 \
  > tune_logs/a8w8_blockscale_tuning_$(date +%Y%m%d_%H%M%S).log 2>&1 &

Example for moe_2stages with 8 free GPUs:

cd $AITER_PATH
mkdir -p tune_logs
nohup python3 csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py \
  -i aiter/configs/untuned_fmoe.csv \
  -o aiter/configs/tuned_fmoe.csv \
  --mp 8 --timeout 120 \
  > tune_logs/moe_2stages_tuning_$(date +%Y%m%d_%H%M%S).log 2>&1 &

Note: the MoE tuner does not have --libtype. Use --timeout 120 (shorter than GEMM since MoE shapes tune faster).

After launching, verify the process is running and monitor progress:

# Verify tuning process started (do NOT rely on $! — it doesn't work reliably through docker exec layers)
ps aux | grep tune.py | grep -v grep

# Monitor progress by tailing the log file
tail -f tune_logs/a8w8_blockscale_tuning_*.log

Key flags to consider

Flag	Default	Description
`--libtype`	—	`ck`, `cktile`, or `both` (recommend `both` for best results); GEMM tuners only, the MoE tuner does not accept it
`--mp N`	all GPUs	Number of parallel GPU processes — set to number of free GPUs
`--batch N`	100	Shapes per tuning batch
`--errRatio`	0.05	Error tolerance threshold
`-k` / `--splitK`	off	Enable split-K optimization
`--warmup N`	5	Warmup iterations before profiling
`--iters N`	101	Profiling iterations
`--timeout N`	none	Timeout in seconds per task group (recommend `600` for GEMM, `120` for MoE)
`-v`	off	Verbose output
`--all`	off	Retune all shapes

Important warnings to communicate to the user:

Tuning can take a very long time (potentially hours) depending on the number of shapes and options
Using --libtype both is slower but produces better results
Use --mp with all available GPUs to maximize parallelism
--timeout is recommended to prevent individual shapes from hanging
The first run includes a JIT compilation step that can take several minutes before actual tuning begins
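The two example tuning commands differ only in the script, the CSV paths, the timeout, and whether --libtype is passed. The sketch below builds the argument list for those two variants; `TUNERS` and `tune_argv` are hypothetical names, and only the two variants shown in the examples are included (paths copied from the variant table).

```python
import shlex

# variant -> (tune script, untuned CSV, tuned CSV), relative to the aiter root.
TUNERS = {
    "a8w8_blockscale": (
        "csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale_tune.py",
        "aiter/configs/a8w8_blockscale_untuned_gemm.csv",
        "aiter/configs/a8w8_blockscale_tuned_gemm.csv",
    ),
    "moe_2stages": (
        "csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py",
        "aiter/configs/untuned_fmoe.csv",
        "aiter/configs/tuned_fmoe.csv",
    ),
}


def tune_argv(variant, mp, timeout, libtype=None):
    """Build the tuning argv; --libtype is rejected for the MoE tuner."""
    script, untuned, tuned = TUNERS[variant]
    argv = ["python3", script, "-i", untuned, "-o", tuned]
    if libtype is not None:
        if variant == "moe_2stages":
            raise ValueError("the MoE tuner has no --libtype option")
        argv += ["--libtype", libtype]
    argv += ["--mp", str(mp), "--timeout", str(timeout)]
    return argv


if __name__ == "__main__":
    print(shlex.join(tune_argv("a8w8_blockscale", mp=8, timeout=600, libtype="both")))
    print(shlex.join(tune_argv("moe_2stages", mp=8, timeout=120)))
```

The nohup wrapper and log redirection from the examples are deliberately left out; they belong to the shell invocation rather than to the tuner's arguments.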

Step 4: Rerun & Compare

After tuning completes, rerun the benchmark to measure improvement. Reuse the exact same command from Step 2 with these changes:

For regular GEMM variants:

Prepend AITER_REBUILD=1 to force aiter to rebuild kernels using the newly tuned CSV
Change the log filename from bench_before to bench_after

For moe_2stages:

Prepend AITER_REBUILD=1 (same as GEMM)
Optionally set AITER_CONFIG_FMOE=<path_to_tuned_csv> if the tuned CSV is in a non-default location
Change the log filename from bench_before to bench_after

This ensures the same shapes and flags are used for an apples-to-apples comparison. Do not re-type the command manually — copy the Step 2 command and apply the changes above.

Example — GEMM, if Step 2 command was:

python3 op_tests/test_gemm_a8w8_blockscale.py \
  -m 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
  -nk 512,4096 4096,256 8192,4096 12288,4096 17408,4096 \
  --ck_preshuffle False \
  2>&1 | tee tune_logs/a8w8_blockscale_bench_before_20260321_143022.log

Then Step 4 command is:

AITER_REBUILD=1 python3 op_tests/test_gemm_a8w8_blockscale.py \
  -m 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
  -nk 512,4096 4096,256 8192,4096 12288,4096 17408,4096 \
  --ck_preshuffle False \
  2>&1 | tee tune_logs/a8w8_blockscale_bench_after_$(date +%Y%m%d_%H%M%S).log

Example — MoE after-benchmark:

AITER_REBUILD=1 AITER_CONFIG_FMOE=aiter/configs/tuned_fmoe.csv \
  python3 op_tests/test_moe_2stage.py \
  -t 1 4 8 32 64 128 1024 2048 16384 \
  -dim 4096,512 -e 512 -k 10 -q 5 -a silu -s f -p f \
  2>&1 | tee tune_logs/moe_2stages_bench_after_$(date +%Y%m%d_%H%M%S).log

Setting the AITER_REBUILD=1 environment variable is essential: without it, aiter reuses old cached kernels and the benchmark shows no improvement. The first run after tuning takes extra time for the JIT rebuild.
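The advice to copy the Step 2 command rather than retype it can be mechanized as a string transformation. This is a minimal sketch with a hypothetical function name, `after_command`; it prepends AITER_REBUILD=1, renames the log from bench_before to bench_after, and swaps a literal timestamp for a fresh `$(date ...)` expression. It does not add AITER_CONFIG_FMOE, which is only needed when the tuned MoE CSV is in a non-default location.

```python
import re


def after_command(step2_cmd, timestamp_expr="$(date +%Y%m%d_%H%M%S)"):
    """Derive the Step 4 command from the exact Step 2 command string."""
    if "bench_before" not in step2_cmd:
        raise ValueError("expected a Step 2 command that logs to *_bench_before_*.log")
    cmd = step2_cmd.replace("bench_before", "bench_after")
    # Replace a hard-coded YYYYMMDD_HHMMSS stamp, if any, with a fresh one.
    cmd = re.sub(r"(bench_after_)\d{8}_\d{6}",
                 lambda m: m.group(1) + timestamp_expr, cmd)
    return "AITER_REBUILD=1 " + cmd


if __name__ == "__main__":
    before = ("python3 op_tests/test_gemm_a8w8_blockscale.py -m 1 2 4 "
              "-nk 512,4096 --ck_preshuffle False 2>&1 | "
              "tee tune_logs/a8w8_blockscale_bench_before_20260321_143022.log")
    print(after_command(before))
```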

Compare results using the bundled comparison script:

python3 <skill_path>/scripts/compare_results.py \
  tune_logs/<variant>_bench_before_<timestamp>.log \
  tune_logs/<variant>_bench_after_<timestamp>.log

The script auto-detects the log format (GEMM vs MoE) and selects the appropriate comparison mode:

GEMM logs: matches shapes by (M, N, K), default metric is ck TFLOPS (higher is better)
MoE logs: matches shapes by (token, model_dim, inter_dim, E, topk), default metric is us (latency in microseconds, lower is better)

Both modes produce:

A per-shape comparison table with before/after values and speedup %
A summary with average/min/max speedup and improved/regressed counts
A per-config breakdown grouped by size category (small/medium/large)

You can override the metric with --metric "ck us" (latency) or --metric "asm TFLOPS".

Present the comparison results to the user and tell them where both log files are stored.
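Because the two default metrics point in opposite directions (TFLOPS is higher-is-better, latency in microseconds is lower-is-better), a speedup percentage has to be oriented per metric. The sketch below shows one plausible definition; it is an assumption for illustration, and the bundled compare_results.py may compute its percentages differently. The function name `speedup_percent` is hypothetical.

```python
def speedup_percent(before, after, higher_is_better):
    """Positive result means the tuned run is better, for either metric direction."""
    if before <= 0 or after <= 0:
        raise ValueError("metric values must be positive")
    ratio = after / before if higher_is_better else before / after
    return (ratio - 1.0) * 100.0


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(speedup_percent(100.0, 125.0, higher_is_better=True))   # TFLOPS went up
    print(speedup_percent(50.0, 40.0, higher_is_better=False))    # latency went down
```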

Step 5: Generate Report

After completing the comparison, generate a tuning report and save it to $AITER_PATH/tune_logs/<variant>_report_<YYYYMMDD_HHMMSS>.md. The report should contain:

Environment summary: GPU model, aiter version, aiter path
Shapes tuned: the (N, K) pairs or MoE configs, and kernel variant
Tuning configuration: flags used (--libtype, --mp, --timeout, etc.)
Full comparison table: the complete output from compare_results.py — include every shape, not a summary. This is the primary content of the report.
Summary statistics: average/min/max speedup, improved/regressed counts, grouped by size category:
- GEMM: per-(N,K) breakdown grouped by M category (Small M 1-63 decode, Medium M 64-512, Large M >512 prefill)
- MoE: per-config breakdown grouped by token category (Small token 1-63 decode, Medium token 64-512, Large token >512 prefill)
Log file locations: paths to all log files (bench_before, tuning, bench_after)
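The size categories used in the summary statistics (1-63, 64-512, above 512) apply equally to M for GEMM and to token for MoE. A minimal sketch of the bucketing and a per-category average, using hypothetical names `size_category` and `mean_speedup_by_category`:

```python
from collections import defaultdict
from statistics import mean


def size_category(size):
    """Bucket an M or token value: small (1-63), medium (64-512), large (>512)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    if size <= 63:
        return "small"
    if size <= 512:
        return "medium"
    return "large"


def mean_speedup_by_category(speedups):
    """Average speedup per category from (size, speedup_percent) pairs."""
    groups = defaultdict(list)
    for size, pct in speedups:
        groups[size_category(size)].append(pct)
    return {category: mean(values) for category, values in groups.items()}


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(mean_speedup_by_category([(1, 10.0), (32, 20.0), (256, 8.0), (1024, 5.0)]))
```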

Generate the report by running the comparison script and capturing its output:

python3 <skill_path>/scripts/compare_results.py \
  tune_logs/<variant>_bench_before_<timestamp>.log \
  tune_logs/<variant>_bench_after_<timestamp>.log \
  > /tmp/compare_output.txt

Then assemble the full report as a markdown file. Save the report in two locations:

Remote: $AITER_PATH/tune_logs/<variant>_report_<YYYYMMDD_HHMMSS>.md (inside the tuning environment, alongside the log files)
Local: a copy in the user's current working directory or a location they specify

Present the report to the user and tell them where both copies are saved.

Troubleshooting

If anything fails at any step, check the variant's README at $AITER_PATH/csrc/<kernel_dir>/README.md — it contains variant-specific guidance, known issues, and examples.

Common issues:

JIT build seems slow or stuck: The first run may take several minutes while kernels are built via JIT. Wait for it to finish before treating it as a failure.
AITER_REBUILD=1 forgotten in Step 4: Without this environment variable, old cached kernels will be used, and you won't see tuning improvements.
Stale builds with PREBUILD_KERNELS=1: If aiter was installed with PREBUILD_KERNELS=1, you may need to remove build/ and *.so in aiter/jit/ and reinstall aiter to pick up new tuned kernels.
Tuning hangs on certain shapes: Use --timeout to skip shapes that take too long.
Low accuracy (results with a high error ratio): Lower --errRatio from its default of 0.05 (e.g., to 0.01) to filter out inaccurate kernel candidates.