{"skill":{"slug":"sglang-amd-bench","displayName":"Sglang Amd Bench","summary":"Benchmark sglang serving performance on AMD Instinct GPUs (MI355X, MI300X, MI308X) with various parallel configurations (TP, DP, EP). Covers throughput/laten...","description":"---\nname: sglang-amd-bench\ndescription: >\n  Benchmark sglang serving performance on AMD Instinct GPUs (MI355X, MI300X, MI308X)\n  with various parallel configurations (TP, DP, EP). Covers throughput/latency sweeps\n  (ISL, OSL, concurrency), TTFT/TPOT measurement, and config comparison. Mix mode only.\n---\n\n# SGLang AMD Benchmark\n\nBenchmark sglang LLM serving on AMD Instinct GPUs across parallel configurations (TP/DP/EP) and workload shapes (ISL/OSL/Concurrency). This skill runs in **mix mode** (non-disaggregated) — prefill and decode happen on the same GPUs. It produces a performance baseline and suggests config-level optimizations.\n\n## Run Rules (non-negotiable)\n\nThese rules apply to every benchmark run in this skill. (A profiling-stage-separation rule exists in the broader sglang-run guidance but is intentionally omitted here, since this skill does not profile.)\n\n### Rule 1 — Do NOT modify the sglang/aiter/mori environment\n\n**Never** run `pip install`, `pip uninstall`, `pip install --upgrade`, or any equivalent reinstall command for `sglang`, `aiter`, `mori`, `flydsl`, or any related kernel/runtime package — even if a workload fails or imports look broken. The user's environments are hand-tuned dev installs (typically `pip install -e .`); a naive reinstall will silently overwrite local patches and destroy hours of work.\n\nIf the environment looks broken (missing module, version mismatch, ABI error, import crash), **STOP** and report the symptom to the user. 
Let the user decide whether to reinstall.\n\nWhat you CAN do without asking:\n- Inspect versions: `pip show sglang`, `python -c \"import sglang; print(sglang.__file__)\"`\n- Read source files in the editable install\n- Set environment variables for the run\n\nWhat you MUST ask before doing:\n- `pip install` / `pip uninstall` / `pip install -U` for any package above\n- `git checkout` / `git pull` inside the editable source directories\n- Modifying files inside `sglang/`, `aiter/`, `mori/` source trees\n\n### Rule 2 — Always preserve server logs when launching an sglang server\n\nWhenever you start an sglang server, redirect stdout+stderr to a real file. Never let server output go only to the terminal or to `/dev/null`. The Bash tool's `run_in_background: true` buffer is **not** a substitute — still redirect to a file.\n\nIn this skill, `serve.sh` writes to `$LOG_DIR/server_<LABEL>.log` automatically — that's what satisfies this rule, and what `wait_for_server.py` (Rule 3) reads.\n\n### Rule 3 — Wait for the server with the bundled monitor, don't blind-sleep\n\nAfter launching an sglang server, startup typically takes a few minutes (model load, weight shard, kernel warmup, graph capture; AITER may JIT-compile CK kernels for several minutes on first launch). Do **not** `sleep 300` and hope. 
Use the bundled monitor — it polls the log and returns the moment the outcome is known:\n\n```bash\n# After 3-0 deploys it, the script lives at /sgl-workspace/wait_for_server.py inside the container.\npython3 /sgl-workspace/wait_for_server.py \"$SERVER_LOG\"\n# exit codes:\n#   0 READY    — saw \"The server is fired up and ready to roll\"\n#   1 CRASHED  — saw \"Traceback\"\n#   2 HUNG     — log's last line + line count unchanged for ≥5 min (--stall-seconds)\n#   3 TIMEOUT  — overall timeout (default 30 min) exceeded\n#   4 ERROR    — log file unreadable / never appeared\n```\n\nSource lives at `scripts/wait_for_server.py` in this skill's directory; 3-0 copies it to `/sgl-workspace/` alongside `serve.sh` / `bench.sh`. Detection logic:\n- **Success**: substring `The server is fired up and ready to roll` appears.\n- **Crash**: substring `Traceback` appears.\n- **Hang**: each poll records `(line_count, last_non_empty_line)` of the log; unchanged for ≥5 minutes (`--stall-seconds`) → treated as failed.\n\nTunable flags: `--success`, `--failure`, `--stall-seconds`, `--overall-timeout`, `--poll-seconds`. Bump `--stall-seconds` consciously if a specific config genuinely has long quiet periods (e.g. very large weight loads, prolonged AITER JIT).\n\nOn `CRASHED` / `HUNG` / `TIMEOUT` / `ERROR`: stop and report the log tail to the user; do NOT silently restart.\n\n## Important Notes\n\n- This skill covers **mix mode only** (no PD-disaggregation). Prefill and decode run on the same GPUs.\n- `serve.sh` sets `SGLANG_USE_AITER=1` automatically. `bench.sh` sets `PYTHONPATH` for sglang's benchmark module automatically. No need to set these manually.\n- **Use dummy weights by default** (`LOAD_DUMMY=1`). Dummy weights are sufficient for benchmarking throughput, latency, and parallel config comparison — real weights produce the same performance characteristics. Only use `LOAD_DUMMY=0` if the user explicitly asks for real weights. 
Real weights take much longer to load (10+ minutes for large models) and are rarely needed for config benchmarking.\n- `--random-range-ratio 1.0` ensures exact ISL/OSL lengths (no variation) for reproducible benchmarks.\n- `bench.sh` uses `num_prompts = concurrency * 2` — this is handled by the script automatically.\n- Between configs, fully kill the sglang server and wait for GPU memory to be freed before relaunching.\n- If a benchmark run fails or hangs, check GPU memory usage with `rocm-smi` and server health with the `/health` endpoint.\n\n## Key Metrics\n\nEvery benchmark collects these metrics per (ISL, OSL, Concurrency) combination:\n\n\n| Metric             | Unit  | Description                                               |\n| ------------------ | ----- | --------------------------------------------------------- |\n| TTFT               | ms    | Time To First Token — latency from request to first token |\n| TPOT               | ms    | Time Per Output Token — average inter-token latency       |\n| Input throughput   | tok/s | Input tokens processed per second across all requests     |\n| Output throughput  | tok/s | Output tokens generated per second across all requests    |\n| Total throughput   | tok/s | Input + Output token throughput combined                  |\n| Per-GPU throughput | tok/s | Total throughput / number of GPUs                         |\n\n\nPer-GPU throughput is the most important efficiency metric — it shows how well each GPU is utilized. 
Two configs might have similar total throughput, but the one using fewer GPUs has better per-GPU throughput and is more cost-efficient.\n\n## Common Workspace Layout\n\nThe standard development environment uses `/sgl-workspace` as the root workspace inside Docker containers:\n\n```\n/sgl-workspace/\n├── sglang/                    # sglang source (installed via pip install -e, dev mode)\n├── aiter/                     # AITER source (AMD AI Tensor Engine)\n├── mori/                      # Mori (communication library)\n└── <model_short>_<YYYYMMDD>/  # benchmark output directories (created by this skill)\n```\n\nAll benchmark artifacts (logs, reports) are saved under `/sgl-workspace/` by default. If the user specifies a different workspace, use that instead.\n\n## Core Principle: Ask First, Execute Later\n\n**Do NOT guess or assume any configuration.** Every detail must be explicitly confirmed by the user before execution begins. The workflow has two phases separated by a confirmation gate:\n\n1. **Planning phase** (Steps 0–1): Gather ALL information through conversation. Ask questions, wait for answers. Do not proceed to the next question until the current one is answered.\n2. **Confirmation gate** (Step 2): Present the complete plan as a summary. Get explicit \"go ahead\" from the user.\n3. **Execution phase** (Steps 3–4): Only after full confirmation, run the benchmarks.\n\nIf at any point you're unsure about a parameter, **ask**. Never fill in a value the user hasn't confirmed.\n\n## Workflow\n\n### Step 0: Model & Environment Discovery\n\n**Ask the user these questions one by one. Wait for each answer before asking the next.**\n\n#### 0a. 
Model selection — ask this FIRST\n\n**\"Which model do you want to benchmark?\"**\n\nThe user may respond with:\n\n- A full HuggingFace model ID (e.g., `deepseek-ai/DeepSeek-R1-0528`)\n- A short name (e.g., \"DeepSeek R1\", \"Llama 70B\", \"Qwen 235B\")\n- A local path to the model weights\n\nIf the user gives a short name, confirm the exact model ID (e.g., \"Do you mean `deepseek-ai/DeepSeek-R1-0528`?\").\n\n#### 0b. Single-node or multi-node?\n\n**\"Is this single-node or multi-node?\"**\n\n- Single-node: 1 node, typically 8 GPUs\n- Multi-node: ask how many nodes and GPUs per node\n\nIf multi-node, also ask for:\n\n- Network interface (`GLOO_SOCKET_IFNAME`)\n- InfiniBand HCAs (`NCCL_IB_HCA`)\n- Head node IP (`SGLANG_HOST_IP`)\n\n#### 0c. Access the GPU node\n\n**\"How do I access the GPU node?\"**\n\n- SSH command? (e.g., `ssh user@gpu-node`)\n- Docker container? (e.g., `docker exec -it <container> bash`)\n- Already on the machine?\n- For multi-node: ask about access to each node\n\n#### 0d. Probe the environment\n\nOnce connected, probe automatically (no need to ask — just run and report back):\n\n- Run `rocm-smi --showid` → report GPU count, model (MI355X, MI300X, MI308X), architecture\n- Run `pip show sgl-kernel 2>/dev/null; python3 -c \"import sglang; print('sglang version:', sglang.__version__)\"` → report sglang version (the `;` keeps the version check running even when `sgl-kernel` is not installed)\n- Run `pip list | grep -i aiter` → report AITER status\n- Check common paths: `/sgl-workspace/sglang`, `/sgl-workspace/aiter`, `/sgl-workspace/mori`\n\n**PYTHONPATH probe (important for Docker environments):** When running inside Docker containers via `docker exec -d` (non-interactive), `.bashrc` is often not sourced due to `[ -z \"$PS1\" ] && return` guards. This can cause `PYTHONPATH` to be missing paths for editable installs (aiter, mori, sglang), leading to import errors like `ImportError: aiter is required when SGLANG_USE_AITER is set to True`. 
The `serve.sh` script auto-detects and adds common workspace paths (`/sgl-workspace/aiter`, `/sgl-workspace/mori`, `/sgl-workspace/sglang/python`) to `PYTHONPATH` if they exist but are missing. However, if you encounter import errors, compare the environments:\n\n```bash\n# Non-interactive PYTHONPATH (what docker exec -d sees)\ndocker exec <container> bash -c 'echo $PYTHONPATH'\n# Interactive PYTHONPATH (what the user sees)\ndocker exec <container> bash -ic 'echo $PYTHONPATH' 2>/dev/null\n```\n\nIf they differ, ensure the missing paths are exported before running `serve.sh`.\n\n**If any probe reveals a broken package or missing dependency, follow Rule 1 above: report and stop. Do NOT `pip install/uninstall` sglang/aiter/mori or otherwise modify the environment yourself.**\n\n#### 0e. Locate model weights\n\nThe user may or may not have specified where the model weights are stored. If they haven't provided a path, do a quick search — but don't waste time on this:\n\nQuick places to check:\n\n- `$HUGGINGFACE_HUB_CACHE` env var\n- `~/.cache/huggingface/hub/`\n- Common mount points: `/mnt`, `/raid`, `/data`\n\nNote: HuggingFace cache stores models as `models--<Org>--<Name>/snapshots/<hash>/`. For example, `Qwen/Qwen3.5-397B-A17B-FP8` would be at `models--Qwen--Qwen3.5-397B-A17B-FP8/snapshots/<hash>/`. Look for this pattern.\n\nIf you find a match, confirm with the user:\n\n> \"I found what looks like the model weights at `/data/models/DeepSeek-R1-0528/`. Is this the right location?\"\n\nIf nothing turns up quickly, ask:\n\n> \"I couldn't find the model weights on this machine. Where are they stored?\"\n\nThe `--model-path` can be either:\n- A **local path** directly to the weights (e.g., `/data/models/DeepSeek-R1/`)\n- A **HuggingFace model ID** (e.g., `Qwen/Qwen3.5-397B-A17B-FP8`) — but only if the weights already exist in `$HUGGINGFACE_HUB_CACHE`. If the weights are at `$HUGGINGFACE_HUB_CACHE/models--<Org>--<Name>`, using the HF model ID is preferred. 
You can also `export HUGGINGFACE_HUB_CACHE=<path>` to point to the right cache dir.\n\nDo NOT let sglang trigger a model download — the weights must already be on disk.\n\n#### 0f. Report findings and confirm\n\nPresent everything you found to the user:\n\n> \"Here's what I have so far:\n>\n> - **Model**: deepseek-ai/DeepSeek-R1-0528\n> - **Weights**: /data/models/DeepSeek-R1-0528/\n> - **GPUs**: 8x MI355X (gfx950)\n> - **sglang**: v0.5.x at /sgl-workspace/sglang\n> - **AITER**: installed\n> - **Setup**: single-node\n>\n> Does this look right? Anything I should know about this environment?\"\n\n### Step 1: Configuration Planning\n\n**Ask each of these questions explicitly. Do not move forward until you have clear answers for ALL of them.**\n\n#### 1a. MTP decision (if applicable)\n\nIf the model is MTP-capable (detected via `mtp_num_hidden_layers` in config.json, or known models like DeepSeek-R1/V3, Qwen3.5), ask:\n\n**\"This model supports Multi-Token Prediction (MTP), which can improve decode throughput. MTP is configured by a step count `N` (`MTP=0` disables it; `MTP=N` for `N>0` enables N speculative steps). By default we run with `MTP=0` for a clean baseline. What would you like to do?\"**\n\n1. Run with `MTP=0` only (baseline)\n2. Run with `MTP=N` for a chosen `N` (ask the user for `N`)\n3. Run both `MTP=0` and `MTP=N`, and compare\n\nIf the user wants MTP enabled, determine:\n- **MTP steps** (`MTP=N`, where `N` is an integer ≥ 1, NOT a 0/1 toggle). If unsure, ask the user.\n- **MTP algorithm** (`MTP_ALGO`): model-dependent — see `references/server_config.md` for the per-model table\n\n`serve.sh` handles all speculative decoding flags (`--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens`) automatically from `MTP` and `MTP_ALGO`.\n\n#### 1b. 
Server setup\n\nCheck if a sglang server is already running — don't ask the user, just probe:\n\n```bash\ncurl -s http://localhost:30000/health && echo \"Server is running\" || echo \"No server running\"\npgrep -fa \"sglang.launch_server\" || true\n```\n\n- If a server is running: inform the user and ask whether to shut it down or use it as-is. By default, shut it down so the skill controls the server lifecycle for each config.\n- If no server is running: good — the skill will launch one for each config.\n\nAsk: **\"Any additional sglang launch flags you want to use?\"** (e.g., `--quantization fp8`, `--chunked-prefill-size`, `--schedule-policy`, etc.)\n\nNote: `--disable-radix-cache` is enabled by default in `serve.sh` for benchmarking. User can opt out with `DISABLE_RADIX_CACHE=0`.\n\n#### 1c. Parallel configurations\n\nThis is the most important decision in the benchmark. Read `references/server_config.md` for the full reference on parallelism types, naming conventions, EP modes, and how to reason about config choices.\n\n**Before asking the user**, do the following:\n\n1. **Read the model's `config.json`** from the weights directory directly (it's short). Look for KV heads, Q heads, expert count, and detect attention type (MLA/GQA/MHA). See `references/server_config.md` for the key fields to look for — but note that field names vary across models, so read carefully.\n2. **Analyze** the 4 factors described in `references/server_config.md` → \"How to Reason About Parallel Config\":\n  - Weight size vs GPU HBM → which TP values fit?\n  - Attention type + KV heads → TP or DP-attention?\n  - MoE vs Dense → EP applicable?\n  - EP mode → all-to-all or all-reduce?\n3. **Present your analysis to the user** — show your reasoning (weight size calc, KV head implications, why certain configs are better). Then present a suggested config table and **ask the user to pick**.\n4. 
**If EP is involved**, ask which EP mode (all-to-all or all-reduce), or suggest benchmarking both.\n\nWait for the user to respond. If they say \"try all of them\" or \"you decide\", confirm your suggested set before proceeding.\n\n#### 1d. Benchmark sweep parameters\n\n**\"What ISL (input sequence length), OSL (output sequence length), and concurrency levels do you want to sweep?\"**\n\nIf the user isn't sure, offer options but still ask them to pick:\n\n> \"Some common approaches:\n>\n> 1. **Specific pairs** — e.g., (ISL=512, OSL=256), (ISL=1024, OSL=512) — good for simulating real workloads\n> 2. **Full sweep** — provide separate ISL, OSL, and CON lists, benchmark all combinations\n>\n> Which approach? And what values?\"\n\nIf the user says \"you pick\" or \"whatever makes sense\", then suggest values and **ask for confirmation before proceeding**:\n\n> \"Here's what I'd suggest:\n>\n> - ISL: 128, 512, 1024, 2048, 4096\n> - OSL: 128, 512, 1024, 2048\n> - Concurrency: 1, 16, 64, 128, 256\n>\n> That's 5 × 4 × 5 = 100 runs per config, times 2 configs = 200 total runs.\n> Estimated ~3+ hours. Want to proceed with these, or adjust?\"\n\n### Step 2: Confirmation Gate\n\n**Do NOT start any benchmark until this step is complete.**\n\n#### Naming convention\n\nUse this pattern for directories:\n\n```\nBENCH_DIR=/sgl-workspace/<model_short>_<YYYYMMDD>\n```\n\nPer-config dirs: `<CONFIG>_mtp<N>` where `N` is the MTP step count (`0` = off, e.g. 
`DP8EP8_mtp0`, `TP8_mtp0`, `DP8EP8_mtp3`)\n\n#### Present the plan summary\n\n> **Benchmark Plan Summary**\n>\n>\n> | Item      | Value                                  |\n> | --------- | -------------------------------------- |\n> | Model     | deepseek-ai/DeepSeek-R1-0528           |\n> | GPU       | 8x MI355X                              |\n> | Mode      | Mix (non-disaggregated)                |\n> | Bench dir | `/sgl-workspace/DeepSeek-R1_20260322/` |\n>\n>\n> **Sweep:** ISL=[128, 512, 1024, 2048], OSL=[128, 512, 1024], CON=[1, 16, 64, 128, 256]\n\n#### Confirm configs with dry-run\n\nFor each parallel config, **actually run `scripts/serve.sh` with `DRY_RUN=1`** on the GPU node — do NOT construct the launch command manually. The dry-run output shows the exact command that will be executed, ensuring consistency between what the user confirms and what actually runs.\n\nFor a small number of configs (2-3), present all dry-run outputs at once. For many configs, present them one by one. Get confirmation before proceeding to execution.\n\n```bash\nBENCH_DIR=/sgl-workspace/<model_short>_$(date +%Y%m%d)\n\n# Config 1 — dry run\nMODEL_PATH=<MODEL_PATH> CONFIG=DP8EP8_A2A MTP=0 \\\nLOG_DIR=$BENCH_DIR/DP8EP8_A2A_mtp0 DRY_RUN=1 bash serve.sh\n\n# Config 2 — dry run\nMODEL_PATH=<MODEL_PATH> CONFIG=TP8 MTP=0 \\\nLOG_DIR=$BENCH_DIR/TP8_mtp0 DRY_RUN=1 bash serve.sh\n```\n\nShow the **full dry-run output** (including the complete formatted sglang launch command with all flags) to the user and ask: **\"Do these configs look right?\"**\n\nIf the user wants changes, adjust and re-run the dry run. 
Once confirmed, proceed to Step 3.\n\n### Step 3: Benchmark Execution\n\nOnly proceed here after the user has confirmed ALL configs in Step 2.\n\n**Always use `serve.sh` and `bench.sh` to launch the server and run benchmarks.** Do NOT construct sglang commands manually — the scripts handle critical flags (`--enable-dp-attention`, `--enable-dp-lm-head`, `SGLANG_USE_AITER`, `PYTHONPATH`, etc.) that are easy to miss.\n\n#### 3-0. Deploy benchmark scripts to the remote node\n\nThe `scripts/serve.sh`, `scripts/bench.sh`, `scripts/stop.sh`, `scripts/verify_stop.sh`, and `scripts/wait_for_server.py` files live in the skill directory on the local machine. `serve.sh`/`bench.sh`/`stop.sh`/`wait_for_server.py` run inside the container; `verify_stop.sh` MUST run on the host (so it can see PIDs from sibling containers).\n\n```bash\n# From local: scripts → remote node → into container (verify_stop.sh stays on the host)\nscp scripts/serve.sh scripts/bench.sh scripts/stop.sh scripts/verify_stop.sh scripts/wait_for_server.py <SSH_HOST>:/tmp/\nssh <SSH_HOST> \"docker cp /tmp/serve.sh <CONTAINER>:/sgl-workspace/ && docker cp /tmp/bench.sh <CONTAINER>:/sgl-workspace/ && docker cp /tmp/stop.sh <CONTAINER>:/sgl-workspace/ && docker cp /tmp/wait_for_server.py <CONTAINER>:/sgl-workspace/\"\n```\n\nAlternatively, if you're already inside the container, write the script content directly using `cat > /sgl-workspace/serve.sh << 'SCRIPT' ... SCRIPT`.\n\n**Important:** Avoid running scripts through nested `ssh → docker exec → bash -c` with inline heredocs — the quoting becomes unmanageable. Always copy scripts to the remote first, then run them simply with `bash serve.sh`.\n\n#### For each parallel config:\n\n**3a. 
Launch sglang server**\n\nLaunch in background so you can proceed to benchmarking:\n\n```bash\nMODEL_PATH=<MODEL_PATH> CONFIG=<CONFIG> MTP=<N> \\\nLOG_DIR=$BENCH_DIR/<CONFIG>_mtp<N> \\\nBACKGROUND=1 bash serve.sh\n```\n\n`serve.sh` writes the server's stdout+stderr to `$LOG_DIR/server_<LABEL>.log`, which is what satisfies Rule 2 (persistent server log) and what `wait_for_server.py` in 3b reads.\n\nIf the user already has a running server, skip the launch and use their URL.\n\n**3b. Wait for server ready**\n\nPer Rule 3 above, use the bundled `scripts/wait_for_server.py` — do NOT `sleep` blindly and do NOT roll your own `tail -f | grep` loop. The script already handles stall detection (≥ 5 min unchanged) and avoids matching benign substrings like `Ignore import error` / `UserWarning`.\n\n```bash\n# Script was copied to /sgl-workspace/ in 3-0 alongside serve.sh / bench.sh.\nSERVER_LOG=$(ls -t $BENCH_DIR/<CONFIG>_mtp<N>/server_*.log | head -1)\n\npython3 /sgl-workspace/wait_for_server.py \"$SERVER_LOG\"\n# exit codes:\n#   0 READY    — saw \"The server is fired up and ready to roll\"\n#   1 CRASHED  — saw \"Traceback\"; stop and report tail of $SERVER_LOG to user\n#   2 HUNG     — log stalled ≥ --stall-seconds (default 300s); stop and report\n#   3 TIMEOUT  — overall --overall-timeout (default 1800s) exceeded\n#   4 ERROR    — log file unreadable / never appeared\n```\n\nIf AITER JIT compilation legitimately produces long quiet periods on a particular config, bump `--stall-seconds` (and/or `--overall-timeout`) explicitly rather than swallowing a HUNG. On any non-zero exit, **stop** and report the log tail to the user — do NOT silently relaunch.\n\n**3c. Run benchmark**\n\n`bench.sh` no longer writes per-run logs itself. 
Set `OUTPUT_DIR`; per-run JSONL is written to `${OUTPUT_DIR}/jsonl_dir/` and **you MUST capture stdout+stderr with `2>&1 | tee $OUTPUT_DIR/<name>.log`**.\n\n```bash\nOUTPUT_DIR=$BENCH_DIR/<CONFIG>_mtp<N> \\\nMODEL_PATH=<MODEL_PATH> ISL=<ISL> OSL=<OSL> \\\nCONCURRENCY=\"<CON1> <CON2> <CON3>\" \\\nbash bench.sh 2>&1 | tee $OUTPUT_DIR/bench_ISL<X>_OSL<Y>.log\n```\n\nFor multiple ISL/OSL combinations, loop (remember `2>&1 | tee` per invocation):\n\n```bash\nexport OUTPUT_DIR=$BENCH_DIR/<CONFIG>_mtp<N>\nfor ISL in 128 512 1024 2048; do\n  for OSL in 128 512 1024; do\n    MODEL_PATH=<MODEL_PATH> ISL=$ISL OSL=$OSL \\\n    CONCURRENCY=\"1 16 64 128 256\" \\\n    bash bench.sh 2>&1 | tee $OUTPUT_DIR/bench_ISL${ISL}_OSL${OSL}.log\n  done\ndone\n```\n\n**3d. Stop server and repeat**\n\nKill sglang inside the container, then verify on the host (sibling-container PIDs are invisible from within the container):\n\n```bash\nssh <SSH_HOST> \"docker exec <CONTAINER> bash /sgl-workspace/stop.sh\"\nssh <SSH_HOST> bash /tmp/verify_stop.sh   # exit 0 = GPUs free; non-zero prints offending PIDs\n```\n\n**If a config crashes:** Report the error, run `stop.sh` then `verify_stop.sh`, and move on to the next config. Do NOT debug kernel issues or retry. Document the crash and error message in the final report.\n\nRepeat 3a–3d for each parallel config.\n\n### Step 4: Report\n\nAfter all configs are benchmarked, generate structured CSV data, a performance plot, and a Markdown report.\n\n#### 4a. Generate CSV from JSONL\n\nFor each config directory, run `jsonl_to_csv.py` to extract metrics into an InferenceX-compatible CSV:\n\n```bash\npython3 /sgl-workspace/jsonl_to_csv.py \\\n  --jsonl-dir $BENCH_DIR/<CONFIG>_mtp<N>/jsonl_dir \\\n  --hardware <HARDWARE> \\\n  --precision <PRECISION> \\\n  --model <MODEL_NAME> \\\n  --date <YYYY-MM-DD> \\\n  --output $BENCH_DIR/<CONFIG>_mtp<N>/<MODEL>_<HARDWARE>_<PRECISION>.csv\n```\n\nRequired args:\n- `--hardware`: GPU hardware name (e.g. 
`mi355x`, `b200`, `b300`)\n- `--precision`: weight precision (e.g. `fp4`, `fp8`, `bf16`)\n\nOptional args:\n- `--model`: model display name (default: auto-detected from model path)\n- `--date`: benchmark date (default: today)\n- `--output`: output CSV path (default: auto-named in jsonl-dir parent)\n\nThe CSV follows InferenceX format with all standard columns (throughput/GPU, TTFT, TPOT, interactivity, ITL, E2E latency, etc.). Time values are stored in **seconds** (matching InferenceX convention, despite column headers saying \"ms\"). Interactivity = 1000 / TPOT(ms).\n\n#### 4b. Generate performance plot\n\nRun `plot_interactivity.py` to produce a **Token Throughput per GPU vs. Interactivity** chart from one or more CSVs:\n\n```bash\npython3 /sgl-workspace/plot_interactivity.py \\\n  $BENCH_DIR/<CONFIG1>/<CSV1>.csv \\\n  $BENCH_DIR/<CONFIG2>/<CSV2>.csv \\\n  -o $BENCH_DIR/interactivity_plot.png\n```\n\nYou can also include reference CSVs (e.g. from InferenceX) alongside your benchmark CSVs to produce comparison plots. Optional args: `--title`, `--subtitle`, `--dpi` (default: 150).\n\n#### 4c. 
Write Markdown report\n\nWrite a Markdown report to `$BENCH_DIR/benchmark_report.md` that includes:\n\n- Configuration summary (model, GPUs, mode, MTP status)\n- Per-config results tables with all metrics + per-GPU throughput\n- Cross-config comparison highlighting the best performer for each metric\n- Reference to the generated CSV and plot files\n\nPresent the report to the user and walk them through the key findings.\n\n## File Organization\n\n```\n/sgl-workspace/<model_short>_<YYYYMMDD>/\n├── benchmark_report.md                          # final report\n├── DP4EP4_mtp0/                                 # per-config directory\n│   ├── server_DP4EP4_mtp0.log                   # sglang server log (from serve.sh)\n│   ├── bench_ISL4096_OSL1024.log                # bench.sh stdout/stderr (you capture via `2>&1 | tee`)\n│   └── jsonl_dir/                               # raw JSONL written by bench.sh --output-file\n│       ├── bench_ISL4096_OSL1024_CON64.jsonl\n│       ├── bench_ISL4096_OSL1024_CON128.jsonl\n│       └── ...\n├── TP8_mtp0/\n│   ├── server_TP8_mtp0.log\n│   └── ...\n└── DP8EP8_A2A_mtp1/\n    └── ...\n```\n\nEach config gets its own directory. `serve.sh` writes `server_<LABEL>.log` into `LOG_DIR`. 
`bench.sh` writes JSONL into `$OUTPUT_DIR/jsonl_dir/`; capture its stdout/stderr into `OUTPUT_DIR` itself via `2>&1 | tee $OUTPUT_DIR/<bench>.log`.\n","tags":{"latest":"0.1.0"},"stats":{"comments":0,"downloads":312,"installsAllTime":0,"installsCurrent":0,"stars":0,"versions":1},"createdAt":1778174512924,"updatedAt":1778492872010},"latestVersion":{"version":"0.1.0","createdAt":1778174512924,"changelog":"Initial release","license":"MIT-0"},"metadata":null,"owner":{"handle":"alexsun07","userId":"s1708fwj15tby70h9fdzhj9h11868a78","displayName":"Alex Sun","image":"https://avatars.githubusercontent.com/u/192758607?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: review.llm_review","engineVersion":"v2.4.24","updatedAt":1780090759116}}