Install
openclaw skills install nvidia-smiProvides detailed guidance on using the nvidia-smi command for real-time NVIDIA GPU monitoring, management, troubleshooting, and automation.
openclaw skills install nvidia-smiName: nvidia-smi-mastery
Description: Provides a comprehensive guide to using the nvidia-smi command for monitoring, managing, and troubleshooting NVIDIA GPU devices, from basic status checks to advanced scripting.
Keywords: ["nvidia-smi", "gpu monitoring", "nvml", "cuda", "deep learning", "gpu troubleshooting", "system administration"]
Objective
To transform users from basic nvidia-smi observers into power users capable of advanced GPU diagnostics, performance tuning, and automated monitoring.
nvidia-smi (NVIDIA System Management Interface) is the primary command-line utility for monitoring and managing NVIDIA GPUs. It acts as a "dashboard," providing real-time data on your GPU's health and performance.
nvidia-smi is a client for the NVIDIA Management Library (NVML), a C library that provides the interface to the GPU driver. This means you can also use NVML bindings in languages like Python (pynvml) to build custom monitoring tools.Running nvidia-smi provides a table of information. Here's how to interpret the key fields:
0, 1) in a multi-GPU system.Used / Total VRAM.Basic Monitoring
nvidia-smiwatch or the built-in loop option for real-time updates.
watch -n 1 nvidia-smi (Updates every 1 second)nvidia-smi -l 1 (Native loop, updates every 1 second)Targeted Queries
nvidia-smi -L provides a clean list of all detected GPUs and their UUIDs.nvidia-smi -i 0 restricts the view to GPU 0.nvidia-smi -i 0 -q gives a verbose, detailed report for GPU 0, including clock speeds, temperature limits, and ECC status.-d flag to display only specific categories.
nvidia-smi -q -d MEMORY,POWER shows only memory and power data.Process Management
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv provides a clean CSV output of all compute processes, which is perfect for parsing in scripts.Scenario 1: Diagnosing "CUDA out of memory"
nvidia-smi before starting your task to see how much VRAM is already occupied by other processes (e.g., a display manager).kill <PID> to terminate the process and free up the VRAM.Scenario 2: Troubleshooting Performance Drops
If your training loop suddenly slows down, use continuous monitoring (watch -n 0.5 nvidia-smi) and look for:
Temp followed by a change in Perf state (e.g., from P2 to P8) indicates the GPU is reducing its clock speed to cool down.Pwr:Usage is constantly hitting the Cap, the GPU might be limited by its power budget.Daemon Mode (dmon)
For a more compact, columnar output ideal for long-term monitoring, use the daemon mode.
nvidia-smi dmon -s pumt -d 1-s pumt: Specifies the metrics to show: power, utilization, memory, temperature.-d 1: Sets the update interval to 1 second.You can use nvidia-smi in shell scripts or Python to automate monitoring and logging.
**Python Scripting with **pynvml
Since nvidia-smi is a wrapper for NVML, you can use the pynvml library for more direct and faster access in Python.
import pynvml
import time
pynvml.nvmlInit()
device_count = pynvml.nvmlDeviceGetCount()
print(f"Found {device_count} GPU(s)")
try:
while True:
for i in range(device_count):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
name = pynvml.nvmlDeviceGetName(handle).decode('utf-8')
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0 # Convert mW to W
mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"[{i}] {name}")
print(f" Temp: {temp}°C")
print(f" Power: {power:.2f}W")
print(f" Memory: {mem_info.used // 1024**2} / {mem_info.total // 1024**2} MB")
print(f" GPU Util: {util.gpu}%")
print("-" * 20)
time.sleep(2)
except KeyboardInterrupt:
print("\nMonitoring stopped.")
pynvml.nvmlShutdown()