Install
openclaw skills install @kkellyoffical/nvidia-cuda
Use when work targets NVIDIA GPUs for deep learning training, inference, distributed execution, CUDA/Triton kernels, or AI infra tuning. Enforces GPU-aware code conventions for PyTorch on CUDA, including dtype policy, memory movement, NCCL/DDP/FSDP choices, profiling, benchmarking, and H100/H200/B200 optimization habits.
Use this skill when the task involves NVIDIA GPU training or inference, multi-GPU execution, CUDA/Triton kernels, or CUDA-specific performance/debugging work.
This skill is for implementation and review, not generic theory. It should push the work toward:
Probe the actual machine before acting.
This skill is written for modern NVIDIA Tensor Core systems and is especially opinionated for:
It assumes the agent should actively choose precision, attention backends, sharding strategy, logging frequency, and dataloader settings instead of inheriting slow defaults.
Before changing code, inspect the actual runtime surface:
- nvidia-smi
- python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
- CUDA_VISIBLE_DEVICES
- torch.cuda.device_count()
- torch.cuda.get_device_properties(i)
Do not jump to kernel rewrites before ruling out bad data movement, wrong dtype, graph breaks, or poor distributed setup.
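As a rough illustration of the inspection step above, the sketch below gathers the same facts in one pass. The function name `probe_runtime` and the dictionary keys are illustrative assumptions and are not taken from the bundled scripts. PyTorch is imported lazily so the probe still reports something useful on a host where it is not installed.

```python
import os
import shutil
import subprocess


def probe_runtime() -> dict:
    """Collect basic facts about the CUDA runtime visible to this process."""
    info = {
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "nvidia_smi": None,  # stays None when the tool is not on PATH
        "torch": None,       # stays None when PyTorch is not installed
    }

    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,memory.total",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10, check=False,
            )
            info["nvidia_smi"] = result.stdout.strip().splitlines()
        except (OSError, subprocess.TimeoutExpired):
            info["nvidia_smi"] = []

    try:
        import torch
    except ImportError:
        return info

    devices = []
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            devices.append({
                "index": i,
                "name": props.name,
                "total_memory_gib": round(props.total_memory / 2**30, 1),
                "capability": f"{props.major}.{props.minor}",
            })
    info["torch"] = {
        "version": torch.__version__,
        "cuda": torch.version.cuda,
        "available": torch.cuda.is_available(),
        "devices": devices,
    }
    return info


if __name__ == "__main__":
    for key, value in probe_runtime().items():
        print(f"{key}: {value}")
```

Comparing the `nvidia_smi` list with the `devices` list is a quick way to notice that `CUDA_VISIBLE_DEVICES` is hiding GPUs from the process.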
Use the included scripts when you need deterministic probes instead of ad hoc snippets:
- scripts/cuda_env_probe.py: collect CUDA, PyTorch, device, and env facts
- scripts/check_training_stack.py: scan Python code for high-cost anti-patterns
- scripts/benchmark_attention.py: benchmark SDPA, flash, cuDNN, and official flash-implementation activation paths
- scripts/training_step_benchmark.py: benchmark a synthetic transformer training step with dtype, compile, and .item() logging knobs
- scripts/dataloader_benchmark.py: benchmark DataLoader worker, pinning, and prefetch settings
- scripts/nccl_smoke.py: run a minimal NCCL all-reduce smoke test under torchrun
- scripts/ddp_fsdp_smoke.py: run a one-step DDP or FSDP training smoke test under torchrun
Run the probe first, then benchmark or scan the real workload path.
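To make the idea of scanning for high-cost anti-patterns concrete, here is a minimal sketch built on the standard `ast` module. It is not the implementation of scripts/check_training_stack.py, whose internals are not shown here. The function name `find_sync_calls_in_loops` and the small rule set are assumptions chosen for illustration. It flags zero-argument `.item()`, `.cpu()`, and `.numpy()` calls and any `synchronize()` call that appears inside a `for` or `while` loop.

```python
import ast

# Tensor methods that usually copy to the host and therefore block on the GPU.
SYNC_METHODS = {"item", "cpu", "numpy"}


def find_sync_calls_in_loops(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, call) pairs for likely host-sync calls in loops."""
    findings = set()  # a set, because nested loops visit inner calls twice
    for loop in ast.walk(ast.parse(source)):
        if not isinstance(loop, (ast.For, ast.While)):
            continue
        for node in ast.walk(loop):
            if not (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)):
                continue
            name = node.func.attr
            if name in SYNC_METHODS and not node.args:
                findings.add((node.lineno, f".{name}()"))
            elif name == "synchronize":
                findings.add((node.lineno, ast.unparse(node.func) + "()"))
    return sorted(findings)


SAMPLE = '''
for step, batch in enumerate(loader):
    loss = model(batch)
    print(loss.item())
    torch.cuda.synchronize()
total = loss.item()
'''

if __name__ == "__main__":
    for line, call in find_sync_calls_in_loops(SAMPLE):
        print(f"line {line}: {call} inside a loop")
```

The final `loss.item()` in the sample is outside the loop and is deliberately not reported. A purely syntactic check like this produces false positives, so treat hits as prompts for review and not as proof of a problem.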
When the user wants current NVIDIA GPU recommendations, read:
Keep recommendation output scenario-based:
Do not recommend by peak FLOPS alone. Weight memory, interconnect, thermals, deployment form, and software maturity.
When the target includes Triton kernels, CUDA or C++ files, or distributed launcher code:
- scripts/check_training_stack.py across the whole tree, not just Python subfolders
- scripts/nccl_smoke.py or scripts/ddp_fsdp_smoke.py before claiming they are healthy
- scripts/benchmark_attention.py --list-flash-impls and explicit activation when available
Treat the following as red flags in reviews and optimization work:
- missing torch.compile on stable hot paths
- .item() every step for logging and metrics
If several of these exist at once, assume the GPU is underfed until proven otherwise.
- bfloat16 for training and inference unless there is a measured or correctness-driven reason not to.
- float16 only when a dependency or model explicitly benefits from it.
- float16 training, use torch.amp.GradScaler("cuda").
- torch.amp.autocast("cuda", dtype=...)
- torch.cuda.amp.*
- torch.set_float32_matmul_precision("high").
- pin_memory=True in DataLoader when feeding CUDA.
- tensor.to("cuda", non_blocking=True) or equivalent only when the source is pinned or already async-safe.
Inside hot loops, avoid or justify:
- .item()
- .cpu()
- .numpy()
- print(cuda_tensor)
- torch.cuda.synchronize()
These often force host synchronization and distort both throughput and profiling.
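One way to keep per-step logging from forcing a host synchronization is to accumulate the loss where it lives and convert it to a Python float only every few steps. The sketch below is an assumption-laden illustration: `DeferredLossLogger` and `demo_training_loop` are hypothetical names, the tiny model and random data are placeholders, and bfloat16 autocast is assumed to be supported on the target device. PyTorch is imported inside the demo so the logger itself has no dependencies.

```python
class DeferredLossLogger:
    """Accumulate losses and convert to float only once per `every` steps."""

    def __init__(self, every: int):
        self.every = every
        self._total = None
        self._count = 0
        self.history = []  # (step, mean_loss) pairs

    def update(self, step: int, loss) -> None:
        # `loss` stays where it is (for example a 0-dim CUDA tensor). Adding
        # tensors is queued on the GPU and does not block the host.
        self._total = loss if self._total is None else self._total + loss
        self._count += 1
        if step % self.every == 0:
            # float() is the only host sync, paid once per `every` steps.
            self.history.append((step, float(self._total) / self._count))
            self._total, self._count = None, 0


def demo_training_loop(steps: int = 100, log_every: int = 25):
    """Placeholder loop: pinned loader, non-blocking copies, bf16 autocast."""
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    model = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 1))
    model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    data = TensorDataset(torch.randn(steps * 32, 64), torch.randn(steps * 32, 1))
    # Pinned host memory is what makes non_blocking=True copies asynchronous.
    loader = DataLoader(data, batch_size=32, pin_memory=use_cuda)
    logger = DeferredLossLogger(every=log_every)

    for step, (x, y) in enumerate(loader, start=1):
        x = x.to(device, non_blocking=use_cuda)
        y = y.to(device, non_blocking=use_cuda)
        with torch.amp.autocast(device.type, dtype=torch.bfloat16):
            loss = nn.functional.mse_loss(model(x).float(), y)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
        logger.update(step, loss.detach())  # no .item() in the hot loop
    return logger.history


if __name__ == "__main__":
    print(demo_training_loop())
```

With `log_every=25` the loop pays for four synchronizations over 100 steps instead of 100. No `GradScaler` appears because the sketch uses bfloat16, not float16.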
torch.compile rules
- torch.compile to stable hot paths.
- torch.compile categorically for CUDA models. Benchmark it.
- mode="default" for general speedups
- mode="reduce-overhead" for small-batch / latency-sensitive CUDA paths with stable shapes
- mode="max-autotune" when compile overhead is acceptable and the path is hot enough
- dynamic=True or accept recompiles intentionally.
- TORCH_LOGS=perf_hints or TORCH_LOGS=dynamic when diagnosing graph breaks or overspecialization.

- scaled_dot_product_attention and PyTorch attention backends before custom kernels.
- torch.nn.attention.flex_attention only after confirming SDPA cannot express the workload cleanly.

- .item() sync points, or allocator churn in the captured region.

- torchrun and one process per GPU.

- use_reentrant=False unless there is a specific reason otherwise.

Use this section when the user explicitly wants leading-edge optimization or when the baseline is already healthy.
Do not apply all of these at once. Climb the ladder in order and keep a benchmark after each step.
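The "one rung at a time, benchmark after each" discipline can be sketched as a small harness. Everything here is hypothetical scaffolding: `time_step`, `climb_ladder`, and the 2% `min_gain` threshold are illustrative choices, not values from this skill. For GPU work, pass `sync=torch.cuda.synchronize` so the timer does not stop while kernels are still queued.

```python
import time
from typing import Callable, Optional


def time_step(fn: Callable[[], None], warmup: int = 3, iters: int = 10,
              sync: Optional[Callable[[], None]] = None) -> float:
    """Mean seconds per call, measured after warmup."""
    for _ in range(warmup):
        fn()
    if sync:
        sync()  # drain queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync:
        sync()  # make sure the timed work has actually finished
    return (time.perf_counter() - start) / iters


def climb_ladder(baseline, rungs, min_gain: float = 0.02, **timing):
    """Try each (name, build) rung in order and keep only measurable wins.

    `build(current_fn)` returns a new step function with exactly one
    optimization applied on top of the currently accepted one.
    """
    current = baseline
    best = time_step(current, **timing)
    report = [("baseline", best, True)]
    for name, build in rungs:
        candidate = build(current)
        elapsed = time_step(candidate, **timing)
        keep = elapsed < best * (1 - min_gain)
        report.append((name, elapsed, keep))
        if keep:
            current, best = candidate, elapsed
    return current, report


if __name__ == "__main__":
    # Stand-in workloads: sleeps play the role of training steps.
    rungs = [
        ("faster", lambda fn: (lambda: time.sleep(0.005))),
        ("slower", lambda fn: (lambda: time.sleep(0.03))),
    ]
    _, report = climb_ladder(lambda: time.sleep(0.02), rungs, warmup=1, iters=3)
    for name, seconds, kept in report:
        print(f"{name:10s} {seconds * 1e3:7.2f} ms  kept={kept}")
```

Because each rung is built on the last accepted step function, a rejected change never contaminates later measurements.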
For H100/H200/B200-class transformer workloads, the default advanced path is:
- torch.compile
FP8 is not a blanket default. It is a targeted transformer optimization with strict measurement.
For models that are too large, too long-context, or too communication-heavy, consider:
Selection heuristics:
For autoregressive serving, the advanced inference ladder is:
Strong fits:
Do not merge a quantization change without an explicit quality gate.
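An explicit quality gate can be as simple as a relative-regression check per metric. The sketch below is one possible shape for it. The `QualityGate` class, the metric names, and the thresholds are all hypothetical and should be replaced with whatever evaluation the project already trusts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityGate:
    metric: str
    higher_is_better: bool
    max_relative_regression: float  # 0.01 tolerates a 1% regression

    def passes(self, baseline: float, candidate: float) -> bool:
        if baseline == 0:
            raise ValueError("baseline must be non-zero for a relative gate")
        change = (candidate - baseline) / abs(baseline)
        # A regression is a drop for accuracy-like metrics and a rise for
        # loss-like metrics such as perplexity.
        regression = -change if self.higher_is_better else change
        return regression <= self.max_relative_regression


def failed_gates(gates, baseline: dict, candidate: dict) -> list:
    """Names of metrics that fail their gate; an empty list means mergeable."""
    return [g.metric for g in gates
            if not g.passes(baseline[g.metric], candidate[g.metric])]


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    gates = [
        QualityGate("accuracy", higher_is_better=True, max_relative_regression=0.005),
        QualityGate("perplexity", higher_is_better=False, max_relative_regression=0.02),
    ]
    baseline = {"accuracy": 0.80, "perplexity": 10.0}
    quantized = {"accuracy": 0.799, "perplexity": 10.5}
    print(failed_gates(gates, baseline, quantized))
```

In the example the accuracy drop is within tolerance, while the 5% perplexity increase exceeds the 2% budget and blocks the change.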
When optimizing training code, prefer this order:
- pin_memory=True
- persistent_workers=True when the workload is steady and worker startup cost matters
- prefetch_factor
- torch.compile
When optimizing inference code:
- model.eval() and torch.no_grad() / torch.inference_mode()
- torch.compile or engine compilation only after correctness baselines
Baseline expectations:
- torchrun
- nccl
- MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK
Parallelism heuristics:
- DP first for ordinary scale-out
- FSDP2 when full replication is too expensive
- FSDP / ZeRO-2 / ZeRO-3 before checkpointing when optimizer, grad, or param state replication is the memory bottleneck
- TP + SP for large transformer blocks
- CP for long-context models
- PP for deep stacks that still do not fit
- EP for MoE
Debugging rules:
- NCCL_DEBUG=INFO
- NCCL_DEBUG_SUBSYS=COLL,GRAPH when debugging hangs or topology issues
- NCCL_SOCKET_IFNAME
- TORCH_NCCL_USE_COMM_NONBLOCKING=1 when you specifically want non-blocking NCCL error handling
Do not hardcode socket or thread tuning variables unless profiling or production evidence shows the default topology tuning is wrong.
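A minimal sketch of the baseline distributed setup follows, assuming the process was launched by torchrun so that the rendezvous variables are already present. `DistEnv`, `read_dist_env`, and `init_nccl_and_smoke_test` are hypothetical helper names, and this is not the code of the bundled smoke-test scripts. Validating the environment before touching NCCL turns a confusing hang into an immediate error.

```python
import os
from dataclasses import dataclass

REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK")


@dataclass(frozen=True)
class DistEnv:
    master_addr: str
    master_port: int
    rank: int
    world_size: int
    local_rank: int


def read_dist_env(env=None) -> DistEnv:
    """Parse and sanity-check the variables torchrun exports per process."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if key not in env]
    if missing:
        raise RuntimeError(f"missing {missing}; launch the script with torchrun")
    cfg = DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        local_rank=int(env["LOCAL_RANK"]),
    )
    if not 0 <= cfg.rank < cfg.world_size:
        raise RuntimeError(f"rank {cfg.rank} outside world size {cfg.world_size}")
    return cfg


def init_nccl_and_smoke_test(cfg: DistEnv) -> None:
    """One process per GPU, NCCL backend, then a single all-reduce."""
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(cfg.local_rank)    # bind this process to its GPU
    dist.init_process_group(backend="nccl")  # reads the variables above
    try:
        ones = torch.ones(1, device="cuda")
        dist.all_reduce(ones)                # default reduction is SUM
        assert ones.item() == cfg.world_size
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    init_nccl_and_smoke_test(read_dist_env())
```

A typical launch would be `torchrun --nproc_per_node=<gpus> script.py`, adding `NCCL_DEBUG=INFO` only when a hang or topology problem needs diagnosing.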
Only drop to Triton or CUDA when operator fusion, layout cleanup, torch.compile, or library kernels are not enough.
For custom kernels:
- 128 to 256 threads per block unless the kernel shape clearly suggests otherwise
Do not claim a kernel is faster because occupancy increased. Show end-to-end or kernel-level timing.
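Kernel-level timing on CUDA is usually done with CUDA events instead of wall-clock timers, because kernel launches are asynchronous. The following is a sketch under the assumption that a CUDA-capable PyTorch build is installed. The helper names `time_cuda_ms` and `speedup` are illustrative, and the warmup and iteration counts are arbitrary defaults.

```python
import statistics


def time_cuda_ms(fn, warmup: int = 10, iters: int = 50) -> list:
    """Per-call GPU time in milliseconds, measured with CUDA events."""
    import torch

    if not torch.cuda.is_available():
        raise RuntimeError("CUDA device required for event timing")
    for _ in range(warmup):
        fn()  # lets autotuning, caching, and lazy init settle
    torch.cuda.synchronize()

    samples = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        end.synchronize()  # wait until the kernel has really finished
        samples.append(start.elapsed_time(end))
    return samples


def speedup(baseline_ms, candidate_ms) -> float:
    """Median-over-median speedup; values above 1.0 mean the candidate wins."""
    return statistics.median(baseline_ms) / statistics.median(candidate_ms)


if __name__ == "__main__":
    import torch

    if torch.cuda.is_available():
        a = torch.randn(4096, 4096, device="cuda")
        fp32 = time_cuda_ms(lambda: a @ a)
        b = a.to(torch.bfloat16)
        bf16 = time_cuda_ms(lambda: b @ b)
        print(f"bf16 vs fp32 matmul speedup: {speedup(fp32, bf16):.2f}x")
```

Medians are used because a few slow outliers, for example from clock ramp-up, can dominate a mean. A kernel-level win still needs an end-to-end measurement before it counts as a speedup for the workload.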
Use Transformer Engine when all of these are true:
Rules:
Prefer TE modules or established framework integration over custom FP8 plumbing.
For long-sequence transformer work:
For decoder serving:
- Avoid .item() every step for loss curves, metric dashboards, or debug prints.
- N steps
For CUDA training jobs, always inspect the input path before blaming kernels.
Required checks:
- num_workers
- pin_memory
- persistent_workers
- prefetch_factor
Rules:
- prefetch_factor and worker count together; more workers without enough prefetch often just moves the bottleneck.
Allowed as targeted tools, not blanket defaults:
- CUDA_VISIBLE_DEVICES
- CUDA_LAUNCH_BLOCKING=1 for debugging only
- PYTORCH_ALLOC_CONF
- TORCH_CUDNN_V8_API_LRU_CACHE_LIMIT
- TORCH_ALLOW_TF32_CUBLAS_OVERRIDE
- TORCH_NCCL_USE_COMM_NONBLOCKING=1
- NCCL_DEBUG
- NCCL_DEBUG_SUBSYS
- NCCL_SOCKET_IFNAME
If you set an env var in code, config, or docs, explain:
When reviewing or generating code in this domain, check for:
- pin_memory=True on CUDA dataloaders
- non_blocking=True on pinned copies
- persistent_workers=True or untuned prefetch_factor on steady-state training jobs
- .item() / .cpu() / .numpy() inside the step loop
When using this skill, report in this order:
If you need deeper justification or exact source-backed tuning notes, read references/official-notes.md.