Install
openclaw skills install aiter-ck-gemm-tune
Tune AITER's CK GEMM and fused MoE kernels for specific model shapes on AMD GPUs. Covers shape discovery from inference logs, baseline benchmarking, kernel tuning, and before/after performance comparison.
A skill for tuning AITER's Composable Kernel (CK) GEMM and fused MoE kernels to achieve better performance for specific model shapes. The tuning workflow is a multi-step process: discover the environment, capture shapes, run baseline benchmarks, tune kernels, and compare results. The workflow supports both regular GEMM variants (a8w8, bf16, etc.) and the moe_2stages variant for fused MoE kernels used in Mixture-of-Experts models.
AITER (AI Tensor Engine for ROCm) is AMD's high-performance operator library for LLM inference on ROCm/AMD GPUs. It provides optimized kernels for common operations in transformer models — most critically, GEMM (General Matrix Multiply), which dominates the compute in LLM inference (linear projections, attention, MLP/FFN layers, MoE expert computations).
Composable Kernel (CK) is AMD's open-source library of GPU kernel primitives. CK provides templated, composable building blocks for writing high-performance GPU kernels. AITER uses CK to implement its GEMM kernels, with many kernel variants optimized for different quantization schemes (INT8, FP4, BF16) and memory layouts (blockscale, B-matrix preshuffle, batched, MoE).
Why tuning matters: Each CK GEMM kernel has many implementation variants (tile sizes, pipeline configurations, split-K strategies). The optimal variant depends on the specific GEMM shape (M, N, K) and the GPU hardware (number of compute units). AITER's tuning process benchmarks all candidate kernel configurations for each shape and selects the fastest one. Shapes come from specific model architectures — for example, a Llama 70B model produces different (N, K) pairs than a DeepSeek V3 model. The M dimension corresponds to the batch/token count and varies at runtime, so tuning sweeps M as powers of 2 to cover all realistic batch sizes.
How it fits into the inference stack: Inference frameworks like sglang and vllm call into AITER for their GEMM operations. When AITER encounters a shape that hasn't been tuned, it falls back to a default kernel configuration and logs a warning. The tuning workflow in this skill captures those untuned shapes and finds optimal kernel configurations for them.
Each variant follows the same tuning workflow pattern. The table below maps each variant to its key files (all paths relative to the aiter root):
| Variant | Tune Script | Untuned CSV | Tuned CSV | Test File | README |
|---|---|---|---|---|---|
| a8w8 | csrc/ck_gemm_a8w8/gemm_a8w8_tune.py | aiter/configs/a8w8_untuned_gemm.csv | aiter/configs/a8w8_tuned_gemm.csv | op_tests/test_gemm_a8w8.py | csrc/ck_gemm_a8w8/README.md |
| a8w8_blockscale | csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale_tune.py | aiter/configs/a8w8_blockscale_untuned_gemm.csv | aiter/configs/a8w8_blockscale_tuned_gemm.csv | op_tests/test_gemm_a8w8_blockscale.py | csrc/ck_gemm_a8w8_blockscale/README.md |
| a8w8_bpreshuffle | csrc/ck_gemm_a8w8_bpreshuffle/gemm_a8w8_bpreshuffle_tune.py | aiter/configs/a8w8_bpreshuffle_untuned_gemm.csv | aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv | (none) | csrc/ck_gemm_a8w8_bpreshuffle/README.md |
| a8w8_blockscale_bpreshuffle | csrc/ck_gemm_a8w8_blockscale_bpreshuffle/gemm_a8w8_blockscale_bpreshuffle_tune.py | aiter/configs/a8w8_blockscale_bpreshuffle_untuned_gemm.csv | aiter/configs/a8w8_blockscale_bpreshuffle_tuned_gemm.csv | (none) | csrc/ck_gemm_a8w8_blockscale_bpreshuffle/README.md |
| a4w4_blockscale | csrc/ck_gemm_a4w4_blockscale/gemm_a4w4_blockscale_tune.py | aiter/configs/a4w4_blockscale_untuned_gemm.csv | aiter/configs/a4w4_blockscale_tuned_gemm.csv | op_tests/test_gemm_a4w4.py | csrc/ck_gemm_a4w4_blockscale/README.md |
| batched_a8w8 | csrc/ck_batched_gemm_a8w8/batched_gemm_a8w8_tune.py | aiter/configs/a8w8_untuned_batched_gemm.csv | aiter/configs/a8w8_tuned_batched_gemm.csv | op_tests/test_batched_gemm_a8w8.py | csrc/ck_batched_gemm_a8w8/README.md |
| batched_bf16 | csrc/ck_batched_gemm_bf16/batched_gemm_bf16_tune.py | aiter/configs/bf16_untuned_batched_gemm.csv | aiter/configs/bf16_tuned_batched_gemm.csv | op_tests/test_batched_gemm_bf16.py | csrc/ck_batched_gemm_bf16/README.md |
| moe_2stages | csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py | aiter/configs/untuned_fmoe.csv | aiter/configs/tuned_fmoe.csv | op_tests/test_moe_2stage.py | csrc/ck_gemm_moe_2stages_codegen/README.md |
The skill records outputs from Steps 2, 3, and 4 to log files under $AITER_PATH/tune_logs/. Use this naming convention:
$AITER_PATH/tune_logs/<variant>_bench_before_<YYYYMMDD_HHMMSS>.log # Step 2: baseline benchmark
$AITER_PATH/tune_logs/<variant>_tuning_<YYYYMMDD_HHMMSS>.log # Step 3: tuning process
$AITER_PATH/tune_logs/<variant>_bench_after_<YYYYMMDD_HHMMSS>.log # Step 4: post-tune benchmark
For example:
tune_logs/a8w8_blockscale_bench_before_20260321_143022.log
tune_logs/a8w8_blockscale_tuning_20260321_150000.log
tune_logs/a8w8_blockscale_bench_after_20260321_160515.log
Create the tune_logs/ directory if it doesn't exist. For interactive commands (Steps 2 and 4), use 2>&1 | tee <log> to show output in real time while logging. For long-running background jobs (Step 3), redirect output to file directly (> <log> 2>&1).
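The naming convention above can also be produced programmatically. The following is a minimal sketch, not part of the skill's bundled scripts; the helper name log_path and the example root /opt/aiter are hypothetical.

```python
from datetime import datetime
from pathlib import Path

# Stage names taken from the naming convention above.
STAGES = {"bench_before", "tuning", "bench_after"}


def log_path(aiter_path, variant, stage, now=None):
    """Build <aiter_path>/tune_logs/<variant>_<stage>_<YYYYMMDD_HHMMSS>.log."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(aiter_path) / "tune_logs" / f"{variant}_{stage}_{stamp}.log"


# "/opt/aiter" is a placeholder for $AITER_PATH.
p = log_path("/opt/aiter", "a8w8_blockscale", "bench_before",
             now=datetime(2026, 3, 21, 14, 30, 22))
print(p.name)  # a8w8_blockscale_bench_before_20260321_143022.log
```

Passing an explicit timestamp keeps the before/after file names reproducible; omit now to use the current time.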
Follow these steps in order. At each step, communicate clearly with the user about what is happening, what you found, and what you plan to do next.
Before anything else, establish the working environment. Tuning typically runs inside a Docker container on a remote node with AMD GPUs. Ask the user to provide access details upfront:
- SSH access to the node (e.g., ssh user@node-hostname)
- Docker container access (e.g., docker exec -it <container_name> bash)
- AITER installation path: the package may be named aiter or amd-aiter, so use pip list | grep -i aiter to find the exact package name, then pip show <package_name> | grep Location to get its installed path. Do not guess common locations: there may be multiple aiter copies on the system, and only the one registered in pip is the active installation. Verify by checking that csrc/ and aiter/configs/ exist under that path.
- Running python3 setup.py develop from the aiter root may be needed.

The goal is to extract the shapes that need tuning and determine which kernel variant to tune.
AITER logs untuned shapes in two different patterns depending on the kernel type. The bundled script scripts/parse_untuned_shapes.py auto-detects both patterns in a single pass.
Regular GEMM pattern:
shape is M:<value>, N:<value>, K:<value> ... not found tuned config in /tmp/aiter_configs/<variant>_tuned_gemm.csv, will use default config!
Fused MoE pattern (the moe_2stages variant):
[fused_moe] using 1stage default for (cu_num, token, model_dim, inter_dim, expert, topk, 'ActivationType.X', 'torch.dtype', 'torch.dtype', 'torch.dtype', 'QuantType.X', use_g1u1, doweight_stage1)
The key word in the MoE pattern is "default" — it means no tuned config was found and the kernel falls back to heuristics. When a tuned config IS found, the log shows kernel names instead of "default".
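To illustrate how the regular GEMM pattern can be matched, here is a minimal sketch. It is not the code of the bundled parse_untuned_shapes.py; the regex, the function name find_untuned_gemm_shapes, and the sample log lines are assumptions written to follow the pattern quoted above.

```python
import re

# Assumed regex for the regular GEMM warning. It only covers file names of the
# form <variant>_tuned_gemm.csv; the batched variants name their CSVs differently.
GEMM_RE = re.compile(
    r"shape is M:(\d+), N:(\d+), K:(\d+).*?"
    r"not found tuned config in \S*?/(\w+)_tuned_gemm\.csv"
)


def find_untuned_gemm_shapes(lines):
    """Return {variant: set of (N, K)} for every untuned-shape warning."""
    shapes = {}
    for line in lines:
        match = GEMM_RE.search(line)
        if match:
            _m, n, k, variant = match.groups()
            shapes.setdefault(variant, set()).add((int(n), int(k)))
    return shapes


# Synthetic lines shaped like the pattern above (not real log output).
sample = [
    "shape is M:16, N:12288, K:4096, not found tuned config in "
    "/tmp/aiter_configs/a8w8_blockscale_tuned_gemm.csv, will use default config!",
    "shape is M:32, N:12288, K:4096, not found tuned config in "
    "/tmp/aiter_configs/a8w8_blockscale_tuned_gemm.csv, will use default config!",
]
print(find_untuned_gemm_shapes(sample))
# {'a8w8_blockscale': {(12288, 4096)}}
```

M is discarded on purpose: only the unique (N, K) pairs matter, because M is swept separately during tuning.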
Step 1a: Run the parser to see what's in the log:
python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file>
This prints all variants found. For regular GEMM, it shows unique (N, K) pairs. For moe_2stages, it shows unique MoE configs (model_dim, inter_dim, expert, topk, quant type, etc.) and the token counts seen in the log.
Step 1b: If multiple variants are found, ask the user which to tune. Each variant must be tuned separately (different tune scripts, CSVs, and test files). GEMM and MoE cannot be combined in one CSV — they have entirely different formats.
Step 1c: Generate the untuned CSV for the chosen variant(s):
# Regular GEMM variant with M sweep:
python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file> --variant a8w8_blockscale --csv <output.csv> --m-sweep
# Fused MoE — use actual token values from the log:
python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file> --variant moe_2stages --csv <output.csv>
# Fused MoE — sweep token as powers of 2 (more thorough):
python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file> --variant moe_2stages --csv <output.csv> --token-sweep
Present the results to the user for confirmation before proceeding. If tuning multiple variants, repeat Steps 2–4 for each variant separately.
The user provides shapes and specifies the kernel variant directly.
Regular GEMM: For each unique (N, K) pair, generate tuning rows by sweeping M as powers of 2:
M = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768
This produces 16 × number_of_unique_NK_pairs rows for the untuned CSV.
Fused MoE: For each unique MoE config, either use the actual token values from the log (realistic) or sweep token as powers of 2 with --token-sweep (more thorough). There is no separate M dimension — the token count IS the batch dimension.
Note: The sweep for tuning (powers of 2) is separate from the values used for benchmarking in Steps 2/4. Benchmarking typically uses the test script's default list, which may include non-power-of-2 values. This is normal — we tune with powers of 2 to cover the key points.
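The regular GEMM sweep described above can be sketched as follows. This is an illustrative helper rather than the bundled script's implementation; the names build_untuned_rows and rows_to_csv are hypothetical.

```python
import csv
import io

# Powers of 2 from 1 to 32768, matching the M list above.
M_SWEEP = [2 ** i for i in range(16)]


def build_untuned_rows(nk_pairs):
    """Return one (M, N, K) row per M value for each unique (N, K) pair."""
    rows = []
    for n, k in sorted(set(nk_pairs)):
        for m in M_SWEEP:
            rows.append((m, n, k))
    return rows


def rows_to_csv(rows):
    """Serialize rows with the M,N,K header used by regular GEMM variants."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["M", "N", "K"])
    writer.writerows(rows)
    return buf.getvalue()


rows = build_untuned_rows([(12288, 4096), (24576, 1536)])
print(len(rows))  # 32, i.e. 16 M values x 2 (N, K) pairs
print(rows_to_csv(rows[:2]), end="")
```

Deduplicating the (N, K) pairs first keeps the row count at exactly 16 per unique pair, as stated above.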
The CSV format depends on the variant type:
Regular GEMM (e.g., a8w8_blockscale):
M,N,K
1,12288,4096
2,12288,4096
...
32768,12288,4096
Fused MoE (moe_2stages):
token,model_dim,inter_dim,expert,topk,act_type,dtype,q_dtype_a,q_dtype_w,q_type,use_g1u1,doweight_stage1
1,4096,512,512,10,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_1x128,1,0
4,4096,512,512,10,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_1x128,1,0
...
Write the CSV into the variant's untuned CSV file path (see the variant table above). Present the full shape list to the user before writing.
Run the unit test for the target kernel variant with the specific shapes from Step 1 to establish baseline performance before tuning. No rebuild is needed at this point.
Some test scripts have a --ck_preshuffle or --preshuffle flag (currently only a8w8_blockscale and moe_2stages). The correct setting can be inferred from the kernel variant detected in Step 1:
- a8w8_blockscale_tuned_gemm.csv (no "bpreshuffle" in the name) → use --ck_preshuffle False
- a8w8_blockscale_bpreshuffle_tuned_gemm.csv → use --ck_preshuffle True

Mention the inferred setting in your response to the user for confirmation, but there is no need to ask them to specify it, because the log already tells you.
Other variants (e.g., a8w8, a4w4_blockscale, batched variants) do not have this flag — skip this for them.
argparse choices constraints
Test scripts may have argparse choices restrictions on -m and/or -nk that reject values not in their hardcoded lists. Before running, read the argparse section at the bottom of the test file to check for choices constraints. If the shapes or M values you need are not in the choices list, you must modify the test script:
- -m: add missing M values (e.g., 16384, 32768) to both the choices and default lists.
- -nk: remove the choices parameter entirely (keep default) so any (N,K) pair can be passed.

| Variant | Test File | Shape Args | Example |
|---|---|---|---|
| a8w8_blockscale | test_gemm_a8w8_blockscale.py | -m M1 M2 ... -nk N1,K1 N2,K2 ... | -m 1 2 4 ... 32768 -nk 12288,4096 24576,1536 |
| a8w8 | test_gemm_a8w8.py | -mnk M1,N1,K1 M2,N2,K2 ... | -mnk 1,12288,4096 2,12288,4096 4,12288,4096 |
| a4w4_blockscale | test_gemm_a4w4.py | -mnk M1,N1,K1 M2,N2,K2 ... | -mnk 1,12288,4096 2,12288,4096 4,12288,4096 |
| batched_a8w8 | test_batched_gemm_a8w8.py | -s M1,N1,K1 M2,N2,K2 ... | -s 1,12288,4096 2,12288,4096 4,12288,4096 |
| batched_bf16 | test_batched_gemm_bf16.py | -s M1,N1,K1 M2,N2,K2 ... | -s 1,12288,4096 2,12288,4096 4,12288,4096 |
| moe_2stages | test_moe_2stage.py | -t T1 T2 ... -dim M,I -e E -k K -q Q -a ACT -s DW -p PS | See MoE example below |
For regular GEMM variants that use -mnk or -s (combined M,N,K tuples), generate all combinations of the M sweep with each (N,K) pair. For a8w8_blockscale which takes -m and -nk separately, pass all M values once and all (N,K) pairs once — the test script handles the cross product internally.
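For the variants that take combined tuples, the cross product can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical helper named mnk_args:

```python
from itertools import product


def mnk_args(m_values, nk_pairs):
    """Expand M values x (N, K) pairs into 'M,N,K' tokens for -mnk or -s."""
    return [f"{m},{n},{k}" for (n, k), m in product(nk_pairs, m_values)]


args = mnk_args([1, 2, 4], [(12288, 4096)])
print("-mnk " + " ".join(args))
# -mnk 1,12288,4096 2,12288,4096 4,12288,4096
```

The tuples are grouped by (N, K) pair, so each shape's M sweep stays contiguous in the benchmark output.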
Example for a8w8_blockscale with (N,K) pairs (12288,4096) and (24576,1536):
cd $AITER_PATH
mkdir -p tune_logs
python3 op_tests/test_gemm_a8w8_blockscale.py \
-m 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
-nk 12288,4096 24576,1536 \
--ck_preshuffle False \
2>&1 | tee tune_logs/a8w8_blockscale_bench_before_$(date +%Y%m%d_%H%M%S).log
Example for a8w8 with (N,K) pair (12288,4096):
cd $AITER_PATH
mkdir -p tune_logs
python3 op_tests/test_gemm_a8w8.py \
-mnk 1,12288,4096 2,12288,4096 4,12288,4096 8,12288,4096 \
16,12288,4096 32,12288,4096 64,12288,4096 128,12288,4096 \
256,12288,4096 512,12288,4096 1024,12288,4096 2048,12288,4096 \
4096,12288,4096 8192,12288,4096 16384,12288,4096 32768,12288,4096 \
2>&1 | tee tune_logs/a8w8_bench_before_$(date +%Y%m%d_%H%M%S).log
Fused MoE (moe_2stages)
The test_moe_2stage.py script has a completely different CLI from the regular GEMM tests. Map the MoE config fields from Step 1 to CLI args:
| MoE Config Field | CLI Arg | Notes |
|---|---|---|
| token | -t T1 T2 ... | Space-separated list of token counts |
| model_dim, inter_dim | -dim M,I | Comma-separated pair |
| expert | -e E | Number of experts |
| topk | -k K | Top-K experts |
| act_type | -a silu or -a gelu | Activation function |
| q_type | -q N | Quant index (see mapping below) |
| doweight_stage1 | -s f or -s t | f=False, t=True |
| preshuffle | -p f or -p t | f=False, t=True |
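The field-to-flag mapping in this table can be sketched as a small argument builder. This is an illustration only: the function moe_test_args and the dict layout are hypothetical, and the quant index is passed in by the caller because it has to be chosen separately.

```python
def moe_test_args(cfg, tokens, q_index, preshuffle=False):
    """Build the argument list for op_tests/test_moe_2stage.py from one MoE config."""
    def flag(value):
        # The test script expects t / f for boolean options.
        return "t" if value else "f"

    return [
        "-t", *[str(t) for t in tokens],
        "-dim", f"{cfg['model_dim']},{cfg['inter_dim']}",
        "-e", str(cfg["expert"]),
        "-k", str(cfg["topk"]),
        "-q", str(q_index),
        # "ActivationType.Silu" becomes "silu"
        "-a", cfg["act_type"].split(".")[-1].lower(),
        "-s", flag(cfg["doweight_stage1"]),
        "-p", flag(preshuffle),
    ]


# Hypothetical dict keyed by the MoE config field names; booleans, not "0"/"1" strings.
cfg = {
    "model_dim": 4096, "inter_dim": 512, "expert": 512, "topk": 10,
    "act_type": "ActivationType.Silu", "doweight_stage1": False,
}
print(" ".join(moe_test_args(cfg, tokens=[1, 4, 8], q_index=5)))
# -t 1 4 8 -dim 4096,512 -e 512 -k 10 -q 5 -a silu -s f -p f
```

Returning a list rather than a string makes it safe to hand the result to subprocess.run without shell quoting.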
Quant index (-q) mapping — the -q value maps to (QuantType, q_dtype_a, q_dtype_w):
| -q | QuantType | q_dtype_a | q_dtype_w | Common Name |
|---|---|---|---|---|
| 0 | No | None | None | a16w16 (no quant) |
| 1 | per_Tensor | fp8 | fp8 | a8w8 per-tensor |
| 2 | per_Token | fp8 | fp8 | a8w8 per-token |
| 3 | per_Token | fp8 | int4 | a8w4 |
| 4 | per_1x32 | fp4x2 | fp4x2 | a4w4 |
| 5 | per_128x128 | fp8 | fp8 | a8w8 blockscale |
| 6 | per_1x32 | bf16 | fp4x2 | a16w4 |
| 7 | per_1x32 | fp8 | fp4x2 | a8w4 |
To determine the correct -q value, match the q_type, q_dtype_a, and q_dtype_w from the log against this table. For example, QuantType.per_1x128 with fp8/fp8 maps to -q 5.
Note: QuantType.per_1x128 in the log corresponds to -q 5 (per_128x128 in the test). The name difference (per_1x128 vs per_128x128) is a known inconsistency between the log and the test script; they refer to the same blockscale FP8 quantization.
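The lookup can be expressed as a small table in code. The sketch below copies the mapping from the table above; the names Q_INDEX, LOG_ALIASES, and q_index are hypothetical, and it expects the short dtype names used in the table (fp8, int4, and so on), so translating torch dtype strings from the log is left to the caller.

```python
# (QuantType, q_dtype_a, q_dtype_w) -> -q value, copied from the table above.
Q_INDEX = {
    ("No", None, None): 0,
    ("per_Tensor", "fp8", "fp8"): 1,
    ("per_Token", "fp8", "fp8"): 2,
    ("per_Token", "fp8", "int4"): 3,
    ("per_1x32", "fp4x2", "fp4x2"): 4,
    ("per_128x128", "fp8", "fp8"): 5,
    ("per_1x32", "bf16", "fp4x2"): 6,
    ("per_1x32", "fp8", "fp4x2"): 7,
}

# The log prints per_1x128 where the test script uses per_128x128.
LOG_ALIASES = {"per_1x128": "per_128x128"}


def q_index(quant_type, q_dtype_a, q_dtype_w):
    """Return the -q value for a quant type such as 'QuantType.per_1x128'."""
    name = quant_type.split(".")[-1]
    name = LOG_ALIASES.get(name, name)
    return Q_INDEX[(name, q_dtype_a, q_dtype_w)]


print(q_index("QuantType.per_1x128", "fp8", "fp8"))  # 5
```

An unknown combination raises KeyError, which is preferable to silently benchmarking the wrong quantization.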
Example for MoE with Qwen3.5 shapes (fp8 blockscale, 512 experts, topk=10):
cd $AITER_PATH
mkdir -p tune_logs
python3 op_tests/test_moe_2stage.py \
-t 1 4 8 32 64 128 1024 2048 16384 \
-dim 4096,512 -e 512 -k 10 -q 5 -a silu -s f -p f \
2>&1 | tee tune_logs/moe_2stages_bench_before_$(date +%Y%m%d_%H%M%S).log
For moe_2stages with blockscale FP8 (QuantType.per_1x128), there is a bypass in fused_moe.py that skips tuned configs when token * topk <= 128. With topk=10, as in the example above, this means token counts of 12 or fewer take the bypass.
Tuning these small token counts still produces valid configs, but they won't be used at inference time for this specific quant type. This is by design: the assembly kernel heuristics perform well enough at very small batch sizes. Focus benchmark attention on token counts above the bypass threshold.
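To see which benchmarked token counts fall under the bypass, the condition can be checked directly. This sketch only restates the token * topk <= 128 rule quoted above; it does not reproduce the logic in fused_moe.py, and the names are hypothetical.

```python
BYPASS_LIMIT = 128  # threshold quoted above for blockscale FP8


def bypasses_tuned_config(token, topk):
    """True when the tuned config would be skipped for this token count."""
    return token * topk <= BYPASS_LIMIT


tokens = [1, 4, 8, 32, 64, 128, 1024, 2048, 16384]
bypassed = [t for t in tokens if bypasses_tuned_config(t, topk=10)]
print(bypassed)  # [1, 4, 8]
```

For the example benchmark, only token counts of 32 and above can show a difference after tuning.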
Record the baseline log file path — you will need it in Step 4 for comparison.
If the variant has no test file (e.g., a8w8_bpreshuffle), inform the user and ask how they'd like to benchmark.
Before tuning, run rocm-smi to check how many GPUs are free. Use --mp <num_free_gpus> to parallelize tuning across all available GPUs — this can dramatically reduce tuning time (e.g., 8x faster with 8 GPUs vs 1).
rocm-smi --showuse | grep "GPU use"
The general command pattern is:
cd $AITER_PATH
python3 <tune_script> -i <untuned_csv> -o <tuned_csv> [options]
Tuning is a long-running job (potentially hours). Run it in the background with output redirected to a log file. Use nohup to ensure the process survives if the SSH session disconnects:
Example for a8w8_blockscale with 8 free GPUs:
cd $AITER_PATH
mkdir -p tune_logs
nohup python3 csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale_tune.py \
-i aiter/configs/a8w8_blockscale_untuned_gemm.csv \
-o aiter/configs/a8w8_blockscale_tuned_gemm.csv \
--libtype both --mp 8 --timeout 600 \
> tune_logs/a8w8_blockscale_tuning_$(date +%Y%m%d_%H%M%S).log 2>&1 &
Example for moe_2stages with 8 free GPUs:
cd $AITER_PATH
mkdir -p tune_logs
nohup python3 csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py \
-i aiter/configs/untuned_fmoe.csv \
-o aiter/configs/tuned_fmoe.csv \
--mp 8 --timeout 120 \
> tune_logs/moe_2stages_tuning_$(date +%Y%m%d_%H%M%S).log 2>&1 &
Note: the MoE tuner does not have --libtype. Use --timeout 120 (shorter than GEMM since MoE shapes tune faster).
After launching, verify the process is running and monitor progress:
# Verify tuning process started (do NOT rely on $! — it doesn't work reliably through docker exec layers)
ps aux | grep tune.py | grep -v grep
# Monitor progress by tailing the log file
tail -f tune_logs/a8w8_blockscale_tuning_*.log
| Flag | Default | Description |
|---|---|---|
| --libtype | - | ck, cktile, or both (recommend both for best results) |
| --mp N | all GPUs | Number of parallel GPU processes; set to the number of free GPUs |
| --batch N | 100 | Shapes per tuning batch |
| --errRatio | 0.05 | Error tolerance threshold |
| -k / --splitK | off | Enable split-K optimization |
| --warmup N | 5 | Warmup iterations before profiling |
| --iters N | 101 | Profiling iterations |
| --timeout N | none | Timeout in seconds per task group (recommend 600) |
| -v | off | Verbose output |
| --all | off | Retune all shapes |
Important warnings to communicate to the user:
- --libtype both is slower but produces better results
- Use --mp with all available GPUs to maximize parallelism
- --timeout is recommended to prevent individual shapes from hanging

After tuning completes, rerun the benchmark to measure improvement. Reuse the exact same command from Step 2 with these changes:
For regular GEMM variants:
- Add AITER_REBUILD=1 to force aiter to rebuild kernels using the newly tuned CSV
- Change the log file name from bench_before to bench_after

For moe_2stages:
- Add AITER_REBUILD=1 (same as GEMM)
- Add AITER_CONFIG_FMOE=<path_to_tuned_csv> if the tuned CSV is in a non-default location
- Change the log file name from bench_before to bench_after

This ensures the same shapes and flags are used for an apples-to-apples comparison. Do not re-type the command manually; copy the Step 2 command and apply the changes above.
Example — GEMM, if Step 2 command was:
python3 op_tests/test_gemm_a8w8_blockscale.py \
-m 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
-nk 512,4096 4096,256 8192,4096 12288,4096 17408,4096 \
--ck_preshuffle False \
2>&1 | tee tune_logs/a8w8_blockscale_bench_before_20260321_143022.log
Then Step 4 command is:
AITER_REBUILD=1 python3 op_tests/test_gemm_a8w8_blockscale.py \
-m 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
-nk 512,4096 4096,256 8192,4096 12288,4096 17408,4096 \
--ck_preshuffle False \
2>&1 | tee tune_logs/a8w8_blockscale_bench_after_$(date +%Y%m%d_%H%M%S).log
Example — MoE after-benchmark:
AITER_REBUILD=1 AITER_CONFIG_FMOE=aiter/configs/tuned_fmoe.csv \
python3 op_tests/test_moe_2stage.py \
-t 1 4 8 32 64 128 1024 2048 16384 \
-dim 4096,512 -e 512 -k 10 -q 5 -a silu -s f -p f \
2>&1 | tee tune_logs/moe_2stages_bench_after_$(date +%Y%m%d_%H%M%S).log
The AITER_REBUILD=1 flag is essential — without it, old cached kernels will be used and you won't see improvements. The first run after tuning will take extra time for JIT rebuilding.
Compare results using the bundled comparison script:
python3 <skill_path>/scripts/compare_results.py \
tune_logs/<variant>_bench_before_<timestamp>.log \
tune_logs/<variant>_bench_after_<timestamp>.log
The script auto-detects the log format (GEMM vs MoE) and selects the appropriate comparison mode:
- GEMM logs: ck TFLOPS (higher is better)
- MoE logs: us (latency in microseconds, lower is better)

Both modes produce:
You can override the metric with --metric "ck us" (latency) or --metric "asm TFLOPS".
Present the comparison results to the user and tell them where both log files are stored.
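The direction of each metric matters when turning before/after numbers into a speedup. The following sketch shows one way to normalize both cases; it is not the implementation of compare_results.py, and the function name and sample numbers are made up for illustration.

```python
def speedup(before, after, higher_is_better):
    """Return the improvement factor; values above 1.0 mean tuning helped."""
    if before <= 0 or after <= 0:
        raise ValueError("metrics must be positive")
    return after / before if higher_is_better else before / after


# Throughput-style metric (TFLOPS): higher is better.
print(round(speedup(100.0, 125.0, higher_is_better=True), 2))   # 1.25
# Latency-style metric (us): lower is better.
print(round(speedup(80.0, 64.0, higher_is_better=False), 2))    # 1.25
```

Expressing both metrics as a factor above or below 1.0 lets GEMM and MoE results be read the same way in the report.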
After completing the comparison, generate a tuning report and save it to $AITER_PATH/tune_logs/<variant>_report_<YYYYMMDD_HHMMSS>.md. The report should contain:
- Tuning parameters (--libtype, --mp, --timeout, etc.)
- The full output of compare_results.py: include every shape, not a summary. This is the primary content of the report.

Generate the report by running the comparison script and capturing its output:
python3 <skill_path>/scripts/compare_results.py \
tune_logs/<variant>_bench_before_<timestamp>.log \
tune_logs/<variant>_bench_after_<timestamp>.log \
> /tmp/compare_output.txt
Then assemble the full report as a markdown file. Save the report in two locations:
- $AITER_PATH/tune_logs/<variant>_report_<YYYYMMDD_HHMMSS>.md (inside the tuning environment, alongside the log files)

Present the report to the user and tell them where both copies are saved.
If anything fails at any step, check the variant's README at $AITER_PATH/csrc/<kernel_dir>/README.md — it contains variant-specific guidance, known issues, and examples.
Common issues:
- AITER_REBUILD=1 forgotten in Step 4: Without this flag, old cached kernels will be used, and you won't see tuning improvements.
- PREBUILD_KERNELS=1: If aiter was installed with PREBUILD_KERNELS=1, you may need to remove build/ and *.so in aiter/jit/ and reinstall aiter to pick up new tuned kernels.
- Use --timeout to skip shapes that take too long.
- Set --errRatio (e.g., 0.01) to filter out inaccurate kernel candidates.