NVIDIA CUDA
Use this skill when the task involves NVIDIA GPU training or inference, multi-GPU execution, CUDA/Triton kernels, or CUDA-specific performance/debugging work.
This skill is for implementation and review, not generic theory. It should push the work toward:
- correct GPU usage first
- stable and measurable performance second
- low-risk tuning before low-level kernel work
- frontier algorithms only when the workload shape justifies them
Hardware stance
Probe the actual machine before acting. This skill is written for modern NVIDIA Tensor Core systems and is especially opinionated for:
- H100 / Hopper
- H200
- B200 / Blackwell-class accelerators
It assumes the agent should actively choose precision, attention backends, sharding strategy, logging frequency, and dataloader settings instead of inheriting slow defaults.
Use this skill for
- PyTorch training or inference on CUDA
- GPU memory, throughput, or latency bottlenecks
- DDP / FSDP / NCCL decisions
- CUDA env var and runtime hygiene
- Triton or custom CUDA kernel review
- benchmark and profiler setup for NVIDIA GPUs
Do not use this skill for
- CPU-only optimization
- ROCm / AMD-specific work
- TPU / XLA-specific work
- vague "make it faster" requests without inspecting the actual GPU path first
First-pass workflow
Before changing code, inspect the actual runtime surface:
- Check hardware and driver:
nvidia-smi
- Check framework build:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
- Check device visibility and topology:
CUDA_VISIBLE_DEVICES
torch.cuda.device_count()
torch.cuda.get_device_properties(i)
- Check whether the bottleneck is:
- input pipeline
- host/device transfer
- eager Python overhead
- kernel efficiency
- distributed communication
- memory pressure / fragmentation / OOM
Do not jump to kernel rewrites before ruling out bad data movement, wrong dtype, graph breaks, or poor distributed setup.
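When the bundled probe script is not at hand, a minimal inline check looks like this (a sketch, not a replacement for scripts/cuda_env_probe.py):

```python
import torch

# Confirm the framework build actually sees the GPUs before touching code.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # Name, compute capability, and total memory are usually enough to pick
    # dtype, attention, and sharding strategy for the rest of the session.
    print(i, props.name, f"sm_{props.major}{props.minor}",
          f"{props.total_memory / 2**30:.1f} GiB")
```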
Bundled tooling
Use the included scripts when you need deterministic probes instead of ad hoc snippets:
- scripts/cuda_env_probe.py: collect CUDA, PyTorch, device, and env facts
- scripts/check_training_stack.py: scan Python code for high-cost anti-patterns
- scripts/benchmark_attention.py: benchmark SDPA, flash, cuDNN, and official flash-implementation activation paths
- scripts/training_step_benchmark.py: benchmark a synthetic transformer training step with dtype, compile, and .item() logging knobs
- scripts/dataloader_benchmark.py: benchmark DataLoader worker, pinning, and prefetch settings
- scripts/nccl_smoke.py: run a minimal NCCL all-reduce smoke test under torchrun
- scripts/ddp_fsdp_smoke.py: run a one-step DDP or FSDP training smoke test under torchrun
Run the probe first, then benchmark or scan the real workload path.
Planning hardware purchases
When the user wants current NVIDIA GPU recommendations, keep the output scenario-based:
- cost-sensitive local prototyping
- serious workstation development
- enterprise server deployment
- single-node training
- rack-scale training or reasoning
Do not recommend by peak FLOPS alone. Weight memory, interconnect, thermals, deployment form, and software maturity.
Reviewing Triton, CUDA, and distributed code
When the target includes Triton kernels, CUDA or C++ files, or distributed launcher code:
- use scripts/check_training_stack.py across the whole tree, not just Python subfolders
- treat Triton kernels as first-class review targets
- verify distributed paths with scripts/nccl_smoke.py or scripts/ddp_fsdp_smoke.py before claiming they are healthy
- benchmark flash attention implementation changes with scripts/benchmark_attention.py --list-flash-impls and explicit activation when available
Non-negotiable code conventions
High-cost anti-patterns to ban
Treat the following as red flags in reviews and optimization work:
- keeping transformer training or inference in full FP32 by default on Tensor Core GPUs
- refusing to try torch.compile on stable hot paths
- staying on a generic default SDPA path on H200/B200 when newer flash-backed implementations or cuDNN/TE attention backends are available and benchmarkable
- reaching for activation checkpointing before FSDP / ZeRO-style sharding when the real problem is replicated params, grads, or optimizer state
- calling .item() every step for logging and metrics
- leaving DataLoader workers, pinning, prefetching, and persistent workers untuned until GPU starvation shows up
If several of these exist at once, assume the GPU is underfed until proven otherwise.
1. Dtype policy
- On H100 / Hopper, prefer bfloat16 for training and inference unless there is a measured or correctness-driven reason not to.
- On H200 / B200-class systems, treat BF16 or FP8-capable paths as the default starting point for deep learning workloads.
- Full FP32 by default is a correctness mode, not a performance baseline.
- Use float16 only when a dependency or model explicitly benefits from it.
- If using float16 training, use torch.amp.GradScaler("cuda").
- For mixed precision, use the current API torch.amp.autocast("cuda", dtype=...), not the deprecated torch.cuda.amp.* entry points.
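A minimal sketch of this dtype policy with the current AMP API (the tiny nn.Linear model and synthetic data stand in for the real training loop):

```python
import torch
from torch import nn

device = "cuda"
model = nn.Linear(1024, 1024, device=device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

use_fp16 = False  # prefer BF16 on Hopper/Blackwell; FP16 only with an explicit reason
amp_dtype = torch.float16 if use_fp16 else torch.bfloat16
# GradScaler is only required for FP16; with enabled=False it is a no-op for BF16.
scaler = torch.amp.GradScaler("cuda", enabled=use_fp16)

for _ in range(10):
    x = torch.randn(64, 1024, device=device)
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast("cuda", dtype=amp_dtype):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```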
2. FP32 fallback policy
- If the model is nominally FP32 but does not require strict IEEE-style matmul precision, prefer enabling faster matmul paths with torch.set_float32_matmul_precision("high").
- Disable TF32 only for reproducibility or numerics investigations.
- Never silently mix "strict numerics" and "fast numerics" requirements in the same change. State the choice.
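A sketch of stating that choice explicitly in code:

```python
import torch

# Nominally FP32 model, strict IEEE matmul precision not required:
# allow TF32 tensor-core matmul paths.
torch.set_float32_matmul_precision("high")

# For a reproducibility or numerics investigation, flip back explicitly
# instead of silently mixing both behaviors in the same change:
# torch.set_float32_matmul_precision("highest")
```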
3. Device movement rules
- Keep tensors on device across the hot path.
- Move data in batches, not item-by-item.
- Use pin_memory=True in DataLoader when feeding CUDA.
- Use tensor.to("cuda", non_blocking=True) or equivalent only when the source is pinned or already async-safe.
- Avoid CPU round-trips in the training or inference loop.
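A sketch of these movement rules together, assuming a simple in-memory dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 256), torch.randint(0, 10, (10_000,)))
loader = DataLoader(
    dataset,
    batch_size=512,
    num_workers=4,
    pin_memory=True,          # required for truly async host-to-device copies
    persistent_workers=True,
    prefetch_factor=4,
)

for x, y in loader:
    # non_blocking=True is only meaningful because the source tensors are pinned
    x = x.to("cuda", non_blocking=True)
    y = y.to("cuda", non_blocking=True)
    # forward/backward stays on device; no per-item CPU round-trips
```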
4. Avoid accidental synchronization
Inside hot loops, avoid or justify:
- .item()
- .cpu()
- .numpy()
- print(cuda_tensor)
- timing without torch.cuda.synchronize()
These often force host synchronization and distort both throughput and profiling.
5. Benchmarking rules
- Warm up before measuring.
- Use CUDA events or a framework benchmark harness, not bare wall clock alone.
- Synchronize before reading timings.
- Record batch size, dtype, sequence/image shape, compile state, and number of warmup / measured iterations.
- For throughput and latency claims, include the exact command or code path used.
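A minimal CUDA-event timing harness following these rules (the model and shapes are arbitrary placeholders):

```python
import torch
from torch import nn

model = nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)
x = torch.randn(64, 4096, device="cuda", dtype=torch.bfloat16)

for _ in range(10):                 # warmup: allocator, lazy init, autotuning
    model(x)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
iters = 100
for _ in range(iters):
    model(x)
end.record()
torch.cuda.synchronize()            # wait for the GPU before reading the timing
print(f"{start.elapsed_time(end) / iters:.3f} ms/iter")
# Report batch size, dtype, shapes, compile state, and warmup/measured counts alongside this number.
```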
6. torch.compile rules
- Reach correctness first in eager mode.
- Then apply torch.compile to stable hot paths.
- Do not reject torch.compile categorically for CUDA models. Benchmark it.
- Prefer:
  - mode="default" for general speedups
  - mode="reduce-overhead" for small-batch / latency-sensitive CUDA paths with stable shapes
  - mode="max-autotune" when compile overhead is acceptable and the path is hot enough
- If shapes are unstable, be explicit about dynamic=True or accept recompiles intentionally.
- Use TORCH_LOGS=perf_hints or TORCH_LOGS=dynamic when diagnosing graph breaks or overspecialization.
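A sketch of applying these rules (the small MLP is a stand-in for the real hot path):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

# Correctness first in eager, then compile the stable hot path.
compiled = torch.compile(model, mode="default")
# Latency-sensitive, small-batch, shape-stable path:
# compiled = torch.compile(model, mode="reduce-overhead")
# Shapes vary at runtime? Say so, instead of eating silent recompiles:
# compiled = torch.compile(model, dynamic=True)

x = torch.randn(32, 1024, device="cuda")
out = compiled(x)   # benchmark eager vs compiled with the same harness
```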
6.1 Attention kernel policy
- For transformer attention, prefer native scaled_dot_product_attention and PyTorch attention backends before custom kernels.
- Use PyTorch SDPA / FlashAttention backends as the default fast path on CUDA.
- On H200 / B200-class systems, do not assume the default shipped SDPA backend is the best available path. Check whether FA3 / FA4 implementations are registered and benchmark them.
- Where Transformer Engine or cuDNN attention backends are available, benchmark them against the default path on Hopper/Blackwell instead of assuming SDPA wins.
- If masks or score transforms are unusual, consider torch.nn.attention.flex_attention only after confirming SDPA cannot express the workload cleanly.
- For variable-length packed batches, prefer varlen / packed attention paths rather than padding to worst-case sequence length.
- Do not keep a hand-rolled eager attention implementation in a hot path if SDPA, TE attention, or TensorRT-LLM fused attention can replace it.
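A sketch using the public SDPA API and backend selection context manager; which backends are eligible depends on the installed PyTorch build, dtype, and GPU:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# (batch, heads, seq, head_dim), BF16 on a Tensor Core GPU
q, k, v = (torch.randn(8, 16, 2048, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Default: let SDPA dispatch; it already prefers fused kernels when eligible.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# When comparing backends on Hopper/Blackwell, pin them explicitly and benchmark each.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.CUDNN_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```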
6.2 CUDA Graphs policy
- Use CUDA Graphs only for repeated, shape-stable hot paths where CPU launch overhead is material.
- Graph capture is a strong fit for:
- small-batch inference
- fixed-shape decode loops
- stable training microbatches
- Do not graph workloads with live shape churn, .item() sync points, or allocator churn in the captured region.
- Prefer graphing after eager correctness and after compile/engine selection are already settled.
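A capture/replay sketch following the standard PyTorch pattern, assuming a fixed-shape inference path:

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024, device="cuda").eval()
static_in = torch.randn(8, 1024, device="cuda")   # shape must stay fixed

# Warm up on a side stream so capture does not record one-time setup work.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = model(static_in)

# Replay: copy new data into the captured input buffer, never reallocate it.
static_in.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
result = static_out.clone()
```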
7. Distributed rules
- Use NCCL for distributed CUDA training.
- Use DDP when the model fits on one GPU and you mainly need data parallel scale-out.
- Use FSDP2 when the model does not fit on one GPU or optimizer state dominates memory.
- When tensor parallelism is enabled, sequence parallelism is the default companion unless a framework limitation blocks it.
- For long-context transformer training, consider context parallelism before blindly increasing tensor parallel degree.
- For MoE, use expert parallelism instead of replicating all experts everywhere.
- For large-scale runs, consider a distributed optimizer before inventing custom sharding logic.
- Use torchrun and one process per GPU.
- Set the current CUDA device from rank/local-rank early and only once.
- Do not tune NCCL environment variables by superstition. Change them only after evidence.
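A minimal torchrun/NCCL setup sketch showing the rank and device discipline (the script name and model are placeholders):

```python
# Launch: torchrun --nproc-per-node=8 train.py
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)          # set once, early, before any CUDA work
dist.init_process_group(backend="nccl")

model = DDP(nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
# training loop: one process per GPU, NCCL handles gradient all-reduce

dist.destroy_process_group()
```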
8. Memory rules
- Prefer algorithmic memory wins before allocator tweaks:
- BF16
- FSDP / sharding
- distributed optimizer
- smaller activation footprints
- sequence or batch shaping
- activation checkpointing
- When the memory problem is replicated model state, prioritize FSDP / ZeRO-2 / ZeRO-3 style sharding before gradient checkpointing.
- Use checkpointing when activation memory is the real limiter, not as a substitute for state sharding.
- If activation checkpointing is needed, prefer use_reentrant=False unless there is a specific reason otherwise.
- Treat allocator and workspace env vars as debugging/tuning levers, not first-line fixes.
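A sketch of non-reentrant activation checkpointing, for when activation memory is genuinely the limiter:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(32, 4096, device="cuda", requires_grad=True)

# Recompute this block's activations during backward instead of storing them.
# use_reentrant=False is the maintained path and composes better with autograd features.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```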
9. Profiling rules
- Start with Nsight Systems for timeline truth:
- CPU launch gaps
- H2D / D2H copies
- stream overlap
- NCCL overlap and stalls
- Unified Memory migrations
- Move to Nsight Compute only after isolating the hot kernels.
- For PyTorch-level attribution, use the profiler only if you need operator names or stack grouping.
Frontier algorithm ladder
Use this section when the user explicitly wants leading-edge optimization or when the baseline is already healthy.
Do not apply all of these at once. Climb the ladder in order and keep a benchmark after each step.
A. Hopper and Blackwell transformer path
For H100/H200/B200-class transformer workloads, the default advanced path is:
- BF16 baseline
- native SDPA / FlashAttention fast path
- torch.compile
- CUDA Graphs if shapes stabilize
- Transformer Engine FP8 if:
- the model is transformer-dominated
- BF16 numerics are already validated
- there is an accuracy gate
FP8 is not a blanket default. It is a targeted transformer optimization with strict measurement.
B. Large-model training algorithms
For models that are too large, too long-context, or too communication-heavy, consider:
- FSDP2 / Megatron FSDP for parameter, grad, and optimizer sharding
- distributed optimizer for optimizer-state memory pressure
- tensor parallelism for large hidden dimensions
- sequence parallelism when TP is enabled
- context parallelism for long sequences
- pipeline parallelism for very deep models
- expert parallelism for MoE
- activation checkpointing when the above is still insufficient
Selection heuristics:
- if the model barely misses single-GPU fit: start with FSDP2
- if hidden dimensions are very large: add TP
- if sequence length is the actual problem: prefer CP over just raising TP
- if the model is MoE: use EP rather than replicating experts
- if optimizer state dominates: distributed optimizer or FSDP shard first
C. LLM inference algorithms
For autoregressive serving, the advanced inference ladder is:
- fused attention path
- paged KV cache / paged context handling
- continuous or inflight batching
- chunked prefill / context chunking for large prompts
- CUDA Graphs for stable decode loops
- speculative decoding when low-batch latency matters and acceptance rate is healthy
- FP8 or weight-focused quantization only after accuracy validation
Strong fits:
- paged context / chunked prefill: long prompts and mixed request sizes
- speculative decoding: low-batch underutilized serving
- inflight fused batching: multi-request serving with scheduler pressure
- FP8 serving: Hopper/Blackwell inference paths with validated quality
D. Quantization policy
- On Hopper and newer, prefer BF16 first, FP8 second, INT8/INT4 only when memory or throughput targets require it.
- For training or finetuning quantized targets, prefer QAT or recipe-backed flows over ad hoc fake-quant code.
- For serving, keep a clear boundary between:
- engine build or calibration
- runtime benchmarking
- accuracy evaluation
Do not merge a quantization change without an explicit quality gate.
Training optimization playbook
When optimizing training code, prefer this order:
- Ensure correct device placement and eliminate CPU fallbacks
- Fix input pipeline:
  - pin_memory=True
  - enough workers
  - persistent_workers=True when the workload is steady and worker startup cost matters
  - tuned prefetch_factor
  - no CPU-heavy transforms inside the critical path unless justified
- Set precision policy:
- BF16 first on Hopper
- BF16 / FP8-capable path first on H200/B200
- FP16 only with explicit reason
- Remove accidental syncs and per-step Python overhead
- Try torch.compile
- If this is transformer-heavy, move attention to SDPA / FlashAttention / TE attention fast paths
- Scale to DDP or FSDP2 if single-GPU utilization or model size requires it
- Add TP + SP, CP, PP, or EP only according to the actual bottleneck
- Add activation checkpointing if memory still blocks batch size
- For Hopper/Blackwell transformer stacks, evaluate Transformer Engine FP8 behind an accuracy gate
- Profile and only then consider Triton or custom CUDA work
Inference optimization playbook
When optimizing inference code:
- Use model.eval() and torch.no_grad() / torch.inference_mode()
- Pick dtype deliberately:
- BF16 or FP16 on Hopper when numerically acceptable
- TF32-enabled FP32 if accuracy must stay near FP32 but pure FP32 is too slow
- Move attention and decode loops onto fused kernels or native SDPA paths
- Stabilize shapes if possible
- Apply torch.compile or engine compilation only after correctness baselines
- For LLM serving, evaluate paged KV / paged context and inflight batching early
- Use CUDA Graphs only for repeated, fixed-shape paths where launch overhead matters
- If low-batch latency dominates, evaluate speculative decoding with acceptance-rate tracking
- Benchmark after warmup, with synchronization
- If using TensorRT or TensorRT-LLM, profile runtime separately from engine build
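A sketch of the eval / inference_mode / dtype baseline that should exist before any compile or engine work (the MLP is a placeholder model):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
model = model.to("cuda", dtype=torch.bfloat16).eval()

x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
with torch.inference_mode():          # no autograd tracking in the serving path
    for _ in range(10):               # warmup before any timing claims
        model(x)
    torch.cuda.synchronize()
    out = model(x)
```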
Multi-GPU and NCCL playbook
Baseline expectations:
- launch with torchrun
- backend nccl
- one process per GPU
- clear MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK
Parallelism heuristics:
- DP first for ordinary scale-out
- FSDP2 when full replication is too expensive
- FSDP / ZeRO-2 / ZeRO-3 before checkpointing when optimizer, grad, or param state replication is the memory bottleneck
- TP + SP for large transformer blocks
- CP for long-context models
- PP for deep stacks that still do not fit
- EP for MoE
- combine only what the bottleneck requires
Debugging rules:
- First-line NCCL debug:
  - NCCL_DEBUG=INFO
  - NCCL_DEBUG_SUBSYS=COLL,GRAPH when debugging hangs or topology issues
- If interface auto-detection is wrong, set NCCL_SOCKET_IFNAME
- Use TORCH_NCCL_USE_COMM_NONBLOCKING=1 when you specifically want non-blocking NCCL error handling
Do not hardcode socket or thread tuning variables unless profiling or production evidence shows the default topology tuning is wrong.
Custom CUDA / Triton kernel rules
Only drop to Triton or CUDA when operator fusion, layout cleanup, torch.compile, or library kernels are not enough.
For custom kernels:
- favor coalesced global memory access
- treat shared memory as a tool, not a reflex
- watch register pressure; occupancy is a means, not the target
- use block sizes that are multiples of 32
- start experiments around 128 to 256 threads per block unless the kernel shape clearly suggests otherwise
- avoid giant one-block-per-SM thinking; multiple resident blocks often help latency hiding
- measure spills, occupancy, achieved bandwidth, and kernel time before and after changes
Do not claim a kernel is faster because occupancy increased. Show end-to-end or kernel-level timing.
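An illustrative Triton sketch of coalesced, masked global-memory access with a block size in the suggested range; it demonstrates the conventions above, not a performance claim:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, scale, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # contiguous -> coalesced
    mask = offsets < n_elements                            # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)

x = torch.randn(1_000_003, device="cuda")
out = torch.empty_like(x)
BLOCK_SIZE = 256   # start in the 128-256 range; a multiple of 32
grid = (triton.cdiv(x.numel(), BLOCK_SIZE),)
scale_kernel[grid](x, out, x.numel(), 2.0, BLOCK_SIZE=BLOCK_SIZE)
```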
Transformer Engine and FP8 rules
Use Transformer Engine when all of these are true:
- the workload is transformer-dominated
- the target GPUs are Hopper/Ada/Blackwell-class
- BF16 baseline is already correct
- you can run an accuracy gate
Rules:
- wrap only the forward pass in TE FP8 autocast
- use recipe-backed scaling, not home-grown FP8 metadata logic
- start with delayed scaling or hybrid FP8 recipes
- keep optimizer and master-weight behavior explicit
- if FP8 helps throughput but destabilizes training, fall back to BF16 instead of layering hacks
Prefer TE modules or established framework integration over custom FP8 plumbing.
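A sketch of recipe-backed FP8 via Transformer Engine, assuming transformer_engine is installed and the BF16 baseline is already validated; recipe parameters are illustrative, not tuned values:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

# Recipe-backed scaling, not hand-rolled FP8 metadata.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

# Wrap only the forward pass; optimizer and master-weight behavior stay explicit.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)
out.float().sum().backward()
```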
Attention and sequence-shape rules
For long-sequence transformer work:
- prefer fused attention over matmul-softmax-matmul Python compositions
- on H200/B200, actively test newer flash attention implementations when available instead of passively staying with the default SDPA stack
- prefer grouped or multi-query attention-aware fast paths when the model architecture already uses GQA or MQA
- prefer context parallelism and packed or varlen attention before wasting memory on blanket padding
- if custom masking is the only blocker, test FlexAttention before writing a new CUDA kernel
For decoder serving:
- KV cache layout is part of the algorithm, not just a storage detail
- paged KV and chunked context handling are first-class optimization choices
- scheduler choice affects TTFT and ITL, so benchmark serving algorithms at the scheduler level, not only kernel level
Logging and metric collection rules
- Do not call .item() every step for loss curves, metric dashboards, or debug prints.
- Aggregate metrics on device and synchronize periodically:
  - every N steps
  - end of an accumulation window
  - end of an evaluation window
- If exact per-step host-visible metrics are required for debugging, state that this is a debug-only mode.
- Prefer batched reductions and detached device-side buffers over step-wise host sync.
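A sketch of windowed, device-side metric aggregation (the synthetic loss stands in for the real training step):

```python
import torch

log_every = 100
loss_accum = torch.zeros((), device="cuda")

for step in range(1, 10_001):
    loss = (torch.randn(1024, device="cuda") ** 2).mean()  # stand-in for the real loss
    loss_accum += loss.detach()        # stays on device, no host sync
    if step % log_every == 0:
        # one host sync per window instead of one per step
        print(f"step {step}: mean loss {(loss_accum / log_every).item():.4f}")
        loss_accum.zero_()
```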
Data pipeline and prefetch rules
For CUDA training jobs, always inspect the input path before blaming kernels.
Required checks:
- num_workers
- pin_memory
- persistent_workers
- prefetch_factor
- CPU decode, tokenize, or augment cost
- storage throughput and remote filesystem latency
Rules:
- If GPU utilization dips between steps, inspect the dataloader before touching kernels.
- Tune prefetch_factor and worker count together; more workers without enough prefetch often just moves the bottleneck.
- Keep host preprocessing out of the critical path when possible.
- For very fast H100/H200/B200 training loops, assume untuned input pipelines are guilty until measurements say otherwise.
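A quick input-path isolation sketch (the bundled scripts/dataloader_benchmark.py is the preferred tool; worker and prefetch values here are starting points, not recommendations):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(200_000, 256))

def loader_throughput(num_workers, prefetch_factor):
    # Measure the input path by itself: if this is slower than the training step,
    # the GPU is starving and kernel work will not help.
    loader = DataLoader(dataset, batch_size=1024, num_workers=num_workers,
                        pin_memory=True, persistent_workers=num_workers > 0,
                        prefetch_factor=prefetch_factor if num_workers > 0 else None)
    t0 = time.perf_counter()
    n = sum(batch[0].shape[0] for batch in loader)
    return n / (time.perf_counter() - t0)

for workers, prefetch in [(2, 2), (4, 4), (8, 4)]:
    print(workers, prefetch, f"{loader_throughput(workers, prefetch):.0f} samples/s")
```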
Reproducibility and numerics rules
- Be explicit when a change trades accuracy for speed.
- cuDNN determinism is not guaranteed across architectures.
- Across devices or architectures, do not promise bitwise-identical outputs.
- If NaNs or Infs appear:
- check data first
- check loss scaling or dtype policy
- check reduced-precision reductions
- inspect unstable linalg paths
Environment variable policy
Allowed as targeted tools, not blanket defaults:
- CUDA_VISIBLE_DEVICES
- CUDA_LAUNCH_BLOCKING=1 for debugging only
- PYTORCH_ALLOC_CONF
- TORCH_CUDNN_V8_API_LRU_CACHE_LIMIT
- TORCH_ALLOW_TF32_CUBLAS_OVERRIDE
- TORCH_NCCL_USE_COMM_NONBLOCKING=1
- NCCL_DEBUG
- NCCL_DEBUG_SUBSYS
- NCCL_SOCKET_IFNAME
If you set an env var in code, config, or docs, explain:
- why it is needed
- whether it is for debugging, correctness, or performance
- when it should be removed
Review checklist
When reviewing or generating code in this domain, check for:
- tensors bouncing between CPU and GPU
- missing pin_memory=True on CUDA dataloaders
- missing non_blocking=True on pinned copies
- missing persistent_workers=True or untuned prefetch_factor on steady-state training jobs
- deprecated AMP API usage
- full FP32 training or inference on Hopper/Blackwell with no numerics justification
- FP16 used by default on Hopper with no reason
- no BF16 or FP8 path evaluation on H200/B200-class hardware
- no warmup in benchmarks
- no synchronize around timing
- .item() / .cpu() / .numpy() inside the step loop
- distributed code using non-NCCL backend for CUDA
- DDP used where model clearly does not fit, instead of FSDP2
- activation checkpointing used to paper over what should be FSDP or ZeRO sharding
- TP enabled without SP where the framework expects SP
- long-context workloads scaled by TP alone when CP is the better lever
- transformer stack on Hopper staying on unfused attention with no reason
- transformer stack on H200/B200 never testing newer flash attention implementations
- speculative decoding added with no acceptance-rate measurement
- quantization merged with no quality gate
- FP8 added without recipe-backed scaling or accuracy validation
- custom kernel work started before simpler wins were exhausted
Output contract
When using this skill, report in this order:
- runtime facts
- bottleneck hypothesis
- chosen optimization order
- changes made
- verification evidence
- remaining risks
If you need deeper justification or exact source-backed tuning notes, read references/official-notes.md.