
NCCL Optimizer

v1.1.0 · MIT-0 · by Rui@mitsuha-m

NCCL Optimizer

Finds the best NCCL communication configuration for distributed training with clear separation of intra-node and inter-node bandwidth metrics.

What it does

  1. GPU topology — runs nvidia-smi topo -m to detect NVLink vs PCIe.
  2. RDMA check — checks ibv_devinfo for a PORT_ACTIVE InfiniBand/RoCE port (see the shell sketch after this list).
    • ✅ RDMA → emit recommended NCCL_IB_* env-vars.
    • ❌ No RDMA → socket benchmark sweep.
  3. Intra-node all-reduce — sweeps NCCL_SOCKET_IFNAME × NCCL_NET_GDR_LEVEL × NCCL_IB_TIMEOUT, runs all_reduce_perf -g <N>, and picks the best bus bandwidth.
  4. Intra-node P2P — runs p2p_bw for GPU↔GPU pair bandwidth (if available).
  5. Inter-node benchmark — if nodes= is passed, runs MPI all_reduce_perf across the nodes; otherwise emits a ready-to-run command.
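
For orientation, a minimal shell sketch of what steps 1-3 amount to. The HCA/NIC names (mlx5, eth0) are placeholders and the env-var values are illustrative, not the skill's exact output:

# Step 1: NVLink shows up as NV# entries in the topology matrix.
nvidia-smi topo -m | grep -q "NV[0-9]" && echo "NVLink present"

# Step 2: an ACTIVE InfiniBand/RoCE port means RDMA can be used.
if ibv_devinfo 2>/dev/null | grep -q PORT_ACTIVE; then
  export NCCL_IB_DISABLE=0          # allow RDMA transport
  export NCCL_IB_HCA=mlx5           # placeholder HCA prefix
else
  export NCCL_IB_DISABLE=1          # fall back to TCP sockets
  export NCCL_SOCKET_IFNAME=eth0    # placeholder NIC name
fi

# Step 3: one point in the sweep — 8 local GPUs, 8 B to 1 GiB messages.
all_reduce_perf -b 8 -e 1G -f 2 -g 8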

Prerequisites

Tool             Purpose                  Install
nvidia-smi       GPU info + topology      NVIDIA driver
ibv_devinfo      RDMA detection           apt install ibverbs-utils
all_reduce_perf  Collective benchmark     See below
p2p_bw           Peer-to-peer benchmark   Same nccl-tests build
mpirun           Inter-node benchmark     apt install openmpi-bin
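
A quick way to confirm these are all on PATH before running the skill (a small sketch, assuming a POSIX shell):

for t in nvidia-smi ibv_devinfo all_reduce_perf p2p_bw mpirun; do
  command -v "$t" >/dev/null || echo "missing: $t"
done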

Build nccl-tests

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
# Pick the gencode for your GPU: V100 = sm_70, A100/A800 = sm_80, H100 = sm_90.
# The example below targets A100/A800 (sm_80):
make -j$(nproc) CUDA_HOME=/usr/local/cuda \
  NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"
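# For the inter-node MPI benchmark, nccl-tests may also need MPI support at
# build time, e.g. add MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
# (the MPI_HOME path varies by distro and MPI install).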
export PATH=$PWD/build:$PATH

Usage

# Intra-node only
openclaw skill run nccl_optimizer

# Include inter-node benchmark (requires passwordless SSH + MPI)
openclaw skill run nccl_optimizer "nodes=10.0.0.1,10.0.0.2"
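
For reference, the inter-node command the skill runs (or emits) will look roughly like the sketch below; the host list, slot counts, and env vars are illustrative, not the skill's exact output:

mpirun -np 16 -H 10.0.0.1:8,10.0.0.2:8 \
  -x NCCL_IB_HCA=mlx5 -x NCCL_SOCKET_IFNAME=eth0 \
  all_reduce_perf -b 8 -e 4G -f 2 -g 1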

Metrics explained

Metric                      What it measures
All-reduce bus BW (intra)   Collective throughput across local GPUs — relevant for single-node training
P2P bandwidth               GPU↔GPU direct copy speed (NVLink ≫ PCIe)
All-reduce bus BW (inter)   Collective throughput across nodes — bottleneck for multi-node training

Notes

  • Bus bandwidth normalises for GPU count (for all-reduce, nccl-tests reports busBw = 2(N-1)/N × data / time); only compare runs at the same N.
  • Multi-node training is almost always bottlenecked by inter-node bandwidth, not intra-node.
  • RDMA (InfiniBand/RoCE) typically gives 10-100× better inter-node bandwidth than TCP.
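  • Worked example of that formula: with 8 GPUs all-reducing 1 GiB in 10 ms, busBw = 2 × (7/8) × 1 GiB / 0.01 s ≈ 175 GiB/s.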

Version tags

latest → vk97d893rtavtfc8wah6fwv2ked839abw