Nvidia smi tutorial

Automation

Provides detailed guidance on using the nvidia-smi command for real-time NVIDIA GPU monitoring, management, troubleshooting, and automation.

Install

openclaw skills install nvidia-smi

Skill: Deep Dive into 'nvidia-smi'

Name: nvidia-smi-mastery Description: Provides a comprehensive guide to using the nvidia-smi command for monitoring, managing, and troubleshooting NVIDIA GPU devices, from basic status checks to advanced scripting. Keywords: ["nvidia-smi", "gpu monitoring", "nvml", "cuda", "deep learning", "gpu troubleshooting", "system administration"]

Deep Dive into 'nvidia-smi'

Objective To transform users from basic nvidia-smi observers into power users capable of advanced GPU diagnostics, performance tuning, and automated monitoring.

Core Concept: The GPU's Dashboard

nvidia-smi (NVIDIA System Management Interface) is the primary command-line utility for monitoring and managing NVIDIA GPUs. It acts as a "dashboard," providing real-time data on your GPU's health and performance.

Under the Hood: nvidia-smi is a client for the NVIDIA Management Library (NVML), a C library that provides the interface to the GPU driver. This means you can also use NVML bindings in languages like Python (pynvml) to build custom monitoring tools.
Primary Use Cases:
- Monitoring GPU utilization and temperature during model training.
- Diagnosing "CUDA out of memory" (OOM) errors by identifying processes consuming VRAM.
- Troubleshooting performance bottlenecks like thermal throttling.
- Managing multi-GPU environments.

The Output: A Field Guide

Running nvidia-smi provides a table of information. Here's how to interpret the key fields:

GPU: The ID of the GPU (e.g., 0, 1) in a multi-GPU system.
Fan, Temp, Perf: Fan speed (%), GPU temperature (°C), and performance state (P0 is max performance, P12 is idle).
Pwr:Usage/Cap: Current power draw versus the board's power limit (TDP).
Memory-Usage: The most critical field for many. It shows Used / Total VRAM.
GPU-Util: The percentage of time the GPU's compute cores were busy. Note the distinction between high VRAM usage and high GPU utilization.
Processes: A list of PIDs and process names currently using the GPU, along with their individual memory consumption. This is invaluable for finding "zombie" processes.

Essential Commands and Workflows

Basic Monitoring

The Standard View: nvidia-smi
Continuous Monitoring: Instead of a one-time snapshot, use watch or the built-in loop option for real-time updates.
- watch -n 1 nvidia-smi (Updates every 1 second)
- nvidia-smi -l 1 (Native loop, updates every 1 second)

Targeted Queries

List GPUs: nvidia-smi -L provides a clean list of all detected GPUs and their UUIDs.
Focus on a Single GPU: nvidia-smi -i 0 restricts the view to GPU 0.
Get Detailed Info: nvidia-smi -i 0 -q gives a verbose, detailed report for GPU 0, including clock speeds, temperature limits, and ECC status.
Filter Information: Use the -d flag to display only specific categories.
- nvidia-smi -q -d MEMORY,POWER shows only memory and power data.

Process Management

Find the Culprit: When you get an OOM error, the main table's "Processes" section is your first stop.
Clean Output for Scripting: nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv provides a clean CSV output of all compute processes, which is perfect for parsing in scripts.

Advanced Diagnostics and Troubleshooting

Scenario 1: Diagnosing "CUDA out of memory"

Check Baseline: Run nvidia-smi before starting your task to see how much VRAM is already occupied by other processes (e.g., a display manager).
Identify Process: If an OOM occurs, use the process list to find the PID of the job that consumed all the memory.
Kill the Process: Use kill <PID> to terminate the process and free up the VRAM.

Scenario 2: Troubleshooting Performance Drops If your training loop suddenly slows down, use continuous monitoring (watch -n 0.5 nvidia-smi) and look for:

Thermal Throttling: A sudden spike in Temp followed by a change in Perf state (e.g., from P2 to P8) indicates the GPU is reducing its clock speed to cool down.
Power Capping: If Pwr:Usage is constantly hitting the Cap, the GPU might be limited by its power budget.

Daemon Mode (dmon) For a more compact, columnar output ideal for long-term monitoring, use the daemon mode.

nvidia-smi dmon -s pumt -d 1
-s pumt: Specifies the metrics to show: power, utilization, memory, temperature.
-d 1: Sets the update interval to 1 second.

Automation and Scripting

You can use nvidia-smi in shell scripts or Python to automate monitoring and logging.

**Python Scripting with **pynvml Since nvidia-smi is a wrapper for NVML, you can use the pynvml library for more direct and faster access in Python.

import pynvml
import time

pynvml.nvmlInit()
device_count = pynvml.nvmlDeviceGetCount()

print(f"Found {device_count} GPU(s)")

try:
    while True:
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle).decode('utf-8')
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # Convert mW to W
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)

            print(f"[{i}] {name}")
            print(f"  Temp: {temp}°C")
            print(f"  Power: {power:.2f}W")
            print(f"  Memory: {mem_info.used // 1024**2} / {mem_info.total // 1024**2} MB")
            print(f"  GPU Util: {util.gpu}%")
            print("-" * 20)
        time.sleep(2)
except KeyboardInterrupt:
    print("\nMonitoring stopped.")
    pynvml.nvmlShutdown()