GPU Container Setup FlagOS

Automation

Automatically detects the GPU vendor, finds an appropriate PyTorch container image, launches it with the correct mounts, and validates GPU functionality. Supports NVIDIA, Ascend, Metax, Iluvatar, and AMD/ROCm. Use when the user says "setup container" or "start pytorch container", or invokes /gpu-container-setup.

Install

openclaw skills install gpu-container-setup-flagos

GPU Container Setup Skill

This skill automates multi-vendor GPU container setup for PyTorch workloads.

Supported GPU Vendors

| Vendor | PyTorch Backend | Detection |
| --- | --- | --- |
| NVIDIA | CUDA | nvidia-smi |
| AMD | ROCm (HIP) | rocm-smi, /opt/rocm |
| Ascend | torch_npu | npu-smi, /usr/local/Ascend |
| Metax | torch_musa | mx-smi, /opt/metax |
| Iluvatar | torch_corex | ixsmi, /opt/iluvatar |

Execution Flow

When invoked, follow these steps:

Step 1: Parse Arguments

Check whether the user provided any of the following:

  • --vendor <name> - Force specific vendor (skip detection)
  • --image <image> - Force specific container image
  • --data <path> - Force specific data mount path
  • --name <name> - Container name (default: pytorch-gpu)
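The four flags above map naturally onto a small argument parser. The following is a minimal sketch, not the skill's actual parsing logic: the function name `parse_args` and the lowercase vendor identifiers (other than "ascend", which appears in the detection output) are assumptions made for illustration.

```python
import argparse

# Lowercase vendor identifiers are an assumption; only "ascend" is
# confirmed by the detection output shown in this document.
VENDORS = ["nvidia", "amd", "ascend", "metax", "iluvatar"]


def parse_args(argv=None):
    """Parse the optional overrides; every flag may be omitted."""
    parser = argparse.ArgumentParser(prog="gpu-container-setup")
    parser.add_argument("--vendor", choices=VENDORS, default=None,
                        help="force a specific vendor (skip detection)")
    parser.add_argument("--image", default=None,
                        help="force a specific container image")
    parser.add_argument("--data", default=None,
                        help="force a specific data mount path")
    parser.add_argument("--name", default="pytorch-gpu",
                        help="container name")
    return parser.parse_args(argv)


args = parse_args(["--vendor", "ascend", "--data", "/mnt/data"])
print(args.vendor, args.image, args.data, args.name)
```

A value of None for --vendor, --image, or --data signals that the corresponding detection step should still run.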

Step 2: Detect GPU Vendor

Run the detection script:

python3 .claude/skills/gpu-container-setup/scripts/detect_gpu.py

Expected output:

{"vendor": "ascend", "devices": ["Ascend 910B"], "count": 8}

If detection fails and no --vendor flag was provided, ask the user which vendor to use.
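For orientation, detection can be approximated by probing for each vendor's management tool and install directory, as listed in the Supported GPU Vendors table. This is a simplified sketch and not the contents of detect_gpu.py: it reports only the vendor, not the device list or count, and the probe order and the `detect_vendor` name are assumptions.

```python
import json
import os
import shutil

# (vendor, management CLI, install directory) taken from the vendor table.
# NVIDIA has no install directory listed there, hence None.
PROBES = [
    ("nvidia", "nvidia-smi", None),
    ("amd", "rocm-smi", "/opt/rocm"),
    ("ascend", "npu-smi", "/usr/local/Ascend"),
    ("metax", "mx-smi", "/opt/metax"),
    ("iluvatar", "ixsmi", "/opt/iluvatar"),
]


def detect_vendor(which=shutil.which, isdir=os.path.isdir):
    """Return the first vendor whose CLI is on PATH or whose directory exists.

    `which` and `isdir` are injectable so the logic can be exercised
    without real hardware.
    """
    for vendor, tool, path in PROBES:
        if which(tool) or (path is not None and isdir(path)):
            return vendor
    return None


print(json.dumps({"vendor": detect_vendor()}))
```

A result of null corresponds to the "detection fails" case, where the user is asked for the vendor.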

Step 3: Find Data Disk

Run the data disk detection:

python3 .claude/skills/gpu-container-setup/scripts/find_data_disk.py

Expected output:

{"data_disk": "/mnt/data", "found": true, "size": "2.0T", "available": "1.5T"}

If no suitable disk is found, ask the user for a data mount path.
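One plausible way to produce output of this shape is to check a list of candidate mount points and report the first with enough free space. The sketch below is an assumption about the approach, not the contents of find_data_disk.py: the candidate paths, the 100 GiB threshold, and the helper names are all hypothetical.

```python
import json
import os
import shutil

# Hypothetical candidate mount points, checked in order.
CANDIDATES = ["/mnt/data", "/data", "/mnt"]


def human(num_bytes):
    """Format a byte count with a single-letter binary unit, e.g. 2.0T."""
    value = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if value < 1024 or unit == "T":
            return f"{value:.1f}{unit}"
        value /= 1024


def find_data_disk(candidates=CANDIDATES, min_free_bytes=100 * 1024**3):
    """Return the first existing directory with at least min_free_bytes free."""
    for path in candidates:
        if not os.path.isdir(path):
            continue
        usage = shutil.disk_usage(path)
        if usage.free >= min_free_bytes:
            return {"data_disk": path, "found": True,
                    "size": human(usage.total),
                    "available": human(usage.free)}
    return {"data_disk": None, "found": False}


print(json.dumps(find_data_disk()))
```

When "found" is false, the flow falls back to asking the user, as described above.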

Step 4: Find Container Image

Follow this strict priority order, moving to the next source only if the current one fails:

1. Primary Vendor Hub (hardcoded) → 2. BAAI Harbor → 3. Web Search → 4. Local Images → 5. Ask User

Step 4.1: Primary Vendor Hub (hardcoded URLs)

| Vendor | Registry | API/Query |
| --- | --- | --- |
| NVIDIA | nvcr.io | https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags |
| Ascend | ascendhub.huawei.com | Portal: https://ascendhub.huawei.com |
| Metax | registry.metax-tech.com | https://registry.metax-tech.com/v2/pytorch/metax-pytorch/tags/list |
| Iluvatar | hub.iluvatar.com | https://hub.iluvatar.com/v2/pytorch/iluvatar-pytorch/tags/list |
| AMD | docker.io (rocm/pytorch) | https://hub.docker.com/v2/repositories/rocm/pytorch/tags |

# Example: Query NGC for latest NVIDIA PyTorch
TAG=$(curl -s "https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags" | jq -r '.tags[].name' | grep -E '^[0-9]{2}\.[0-9]{2}-py3$' | sort -rV | head -1)
IMAGE="nvcr.io/nvidia/pytorch:${TAG}"
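The tag-selection part of the pipeline above (the grep filter plus the version sort) can also be expressed in Python, which makes the rule easier to test in isolation. This sketch covers only the selection step and assumes the tag names have already been fetched; `latest_tag` is a hypothetical helper name.

```python
import re

# Same pattern as the grep above: YY.MM-py3, nothing before or after.
TAG_RE = re.compile(r"^(\d{2})\.(\d{2})-py3$")


def latest_tag(tags):
    """Return the newest YY.MM-py3 tag, or None if no tag matches."""
    matches = [
        (int(m.group(1)), int(m.group(2)), tag)
        for tag in tags
        if (m := TAG_RE.match(tag))
    ]
    if not matches:
        return None
    # Tuples compare by year, then month, so max() picks the newest release.
    return max(matches)[2]


tags = ["24.01-py3", "23.12-py3", "24.01-py3-igpu", "latest", "24.10-py3"]
print(latest_tag(tags))  # 24.10-py3
```

Comparing the year and month as integers gives the same ordering that sort -rV produces for tags of this shape.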

Step 4.2: BAAI Harbor (fallback)

Use only if Step 4.1 fails (registry unreachable, no matching image, or pull fails).

# Query BAAI Harbor
curl -s "https://harbor.baai.ac.cn/api/v2.0/projects/flagrelease-public/repositories?page_size=100" | jq -r '.[].name' | grep "flagrelease-<vendor>"

Step 4.3: Web Search (fallback)

Only if Steps 4.1 and 4.2 fail. Search for "<vendor> pytorch docker official".

Step 4.4: Local Images (fallback)

Only if Steps 4.1-4.3 fail. Check docker images | grep pytorch.

Test Before Use

docker pull "${IMAGE}" && docker run --rm "${IMAGE}" python -c "import torch; print(torch.__version__)"

If test fails, try next source. If all fail, ask user for image.
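The priority order and the test-before-use rule combine into a simple loop: try each source in turn and accept the first image that passes the smoke test. The sketch below illustrates that control flow under stated assumptions: `pick_image`, `docker_smoke_test`, and the registry.example image names are hypothetical, and the demo stubs out both the sources and the test so no Docker daemon is required.

```python
import subprocess


def docker_smoke_test(image):
    """Pull the image and import torch inside it; True only if both succeed."""
    if subprocess.run(["docker", "pull", image]).returncode != 0:
        return False
    check = subprocess.run(
        ["docker", "run", "--rm", image,
         "python", "-c", "import torch; print(torch.__version__)"]
    )
    return check.returncode == 0


def pick_image(sources, works=docker_smoke_test):
    """Return (source_name, image) for the first source whose image passes.

    `sources` is an ordered list of (name, finder) pairs; each finder
    returns an image reference or None. A finder that raises is treated
    as a failed source.
    """
    for name, find in sources:
        try:
            image = find()
        except Exception:
            image = None
        if image and works(image):
            return name, image
    return None, None  # every source failed: ask the user


# Demo with stubbed sources and a stubbed test.
demo_sources = [
    ("vendor_hub", lambda: None),
    ("baai_harbor", lambda: "registry.example/pytorch:broken"),
    ("web_search", lambda: "registry.example/pytorch:good"),
    ("local_images", lambda: "pytorch:local"),
]
print(pick_image(demo_sources, works=lambda image: image.endswith(":good")))
```

Returning the source name alongside the image lets the caller tell when the image came from web search, which is the condition that triggers Step 4.5.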

Step 4.5: Update Skill (self-improvement)

IMPORTANT: If an image found via Web Search (Step 4.3) passes all tests, update references/image-sources.md to add the newly discovered vendor hub as a primary source. This makes future lookups faster.

# After successful web search discovery:
# 1. Verify image works (pull + pytorch test + GPU test)
# 2. Extract registry URL pattern
# 3. Update references/image-sources.md Step 1 section with new vendor hub

Step 5: Build Docker Command

Refer to references/mount-requirements.md for vendor-specific requirements.

NVIDIA:

docker run -d --gpus all \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity

AMD/ROCm:

docker run -d \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity

Ascend:

docker run -d \
  --device=/dev/davinci0 --device=/dev/davinci1 ... \
  --device=/dev/davinci_manager \
  --device=/dev/devmm_svm \
  --device=/dev/hisi_hdc \
  -v /usr/local/Ascend:/usr/local/Ascend:ro \
  -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity

Metax:

docker run -d \
  --device=/dev/mx0 --device=/dev/mx1 ... \
  -v /opt/metax:/opt/metax:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity

Iluvatar:

docker run -d \
  --device=/dev/bi0 --device=/dev/bi1 ... \
  -v /opt/iluvatar:/opt/iluvatar:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
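The five commands above share a common tail and differ only in their device and mount flags, so they can be assembled from a small lookup table. The following sketch assumes that numbered device nodes (such as /dev/davinci0, /dev/mx0, /dev/bi0) can be enumerated with a glob; that enumeration, along with the names `build_command` and `numbered_devices`, is an assumption for illustration and should be checked against references/mount-requirements.md.

```python
import glob
import re

# Vendor-specific flags copied from the commands above.
VENDOR_FLAGS = {
    "nvidia": ["--gpus", "all"],
    "amd": ["--device=/dev/kfd", "--device=/dev/dri",
            "--group-add", "video", "--group-add", "render"],
    "ascend": ["--device=/dev/davinci_manager", "--device=/dev/devmm_svm",
               "--device=/dev/hisi_hdc",
               "-v", "/usr/local/Ascend:/usr/local/Ascend:ro",
               "-v", "/usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro"],
    "metax": ["-v", "/opt/metax:/opt/metax:ro"],
    "iluvatar": ["-v", "/opt/iluvatar:/opt/iluvatar:ro"],
}

# Prefix of the per-device nodes for vendors that pass them individually.
DEVICE_PREFIX = {"ascend": "/dev/davinci", "metax": "/dev/mx",
                 "iluvatar": "/dev/bi"}


def numbered_devices(prefix, listing=None):
    """Return nodes named <prefix><N>, sorted by N (assumed naming scheme)."""
    paths = glob.glob(prefix + "[0-9]*") if listing is None else listing
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    hits = [(int(m.group(1)), p) for p in paths if (m := pattern.match(p))]
    return [p for _, p in sorted(hits)]


def build_command(vendor, image, data_disk, name="pytorch-gpu", devices=None):
    """Assemble the docker run argv for one vendor."""
    cmd = ["docker", "run", "-d"]
    if vendor in DEVICE_PREFIX:
        found = numbered_devices(DEVICE_PREFIX[vendor], devices)
        cmd += [f"--device={path}" for path in found]
    cmd += VENDOR_FLAGS[vendor]
    cmd += ["--name", name, "--shm-size=16g",
            "-v", f"{data_disk}:/data", image, "sleep", "infinity"]
    return cmd


print(" ".join(build_command("nvidia", "nvcr.io/nvidia/pytorch:24.01-py3",
                             "/mnt/data")))
```

Building the command as a list rather than a string avoids shell-quoting problems when it is later passed to subprocess.run.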

Step 6: Start Container

Execute the docker run command. If a container with the same name already exists:

  1. If it is running, offer to use the existing container or replace it
  2. If it is stopped, offer to restart or replace it
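The name-collision check can be done with docker inspect before running the command. This is a minimal sketch under assumptions: `container_state` and `options_for` are hypothetical helpers, and any state other than "running" (for example "exited" or "created") is treated as stopped.

```python
import subprocess


def container_state(name, run=subprocess.run):
    """Return the container's status string, or None if it does not exist."""
    proc = run(
        ["docker", "inspect", "--format", "{{.State.Status}}", name],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return None  # docker inspect fails when there is no such container
    return proc.stdout.strip()


def options_for(state):
    """Map a container state to the choices offered to the user."""
    if state is None:
        return ["create"]
    if state == "running":
        return ["use existing", "replace"]
    return ["restart", "replace"]


for state in (None, "running", "exited"):
    print(state, "->", options_for(state))
```

In real use, the result of container_state("pytorch-gpu") would be passed to options_for to decide which prompt to show.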

Step 7: Validate PyTorch GPU

Copy the validation script into the container and run it:

docker cp .claude/skills/gpu-container-setup/scripts/validate_pytorch.py pytorch-gpu:/tmp/
docker exec pytorch-gpu python3 /tmp/validate_pytorch.py

Expected output:

{
  "status": "PASS",
  "backend": "npu",
  "device_count": 8,
  "device_names": ["Ascend 910B", ...],
  "tests": {
    "device_detection": true,
    "tensor_creation": true,
    "matrix_multiply": true,
    "gpu_to_cpu_transfer": true
  }
}
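A report of this shape can be checked mechanically before summarizing results. The sketch below assumes the script prints a single JSON object with the fields shown above; `validation_passed` is a hypothetical helper, and the sample report is shortened to two device names purely for illustration.

```python
import json


def validation_passed(raw):
    """Return (ok, failed_tests) for a validation report given as JSON text."""
    report = json.loads(raw)
    tests = report.get("tests", {})
    failed = sorted(name for name, ok in tests.items() if ok is not True)
    ok = (
        report.get("status") == "PASS"
        and report.get("device_count", 0) > 0
        and not failed
    )
    return ok, failed


sample = json.dumps({
    "status": "PASS",
    "backend": "npu",
    "device_count": 2,
    "device_names": ["Ascend 910B", "Ascend 910B"],
    "tests": {
        "device_detection": True,
        "tensor_creation": True,
        "matrix_multiply": True,
        "gpu_to_cpu_transfer": True,
    },
})
print(validation_passed(sample))
```

The list of failed test names gives a starting point for the "show detailed error, suggest fixes" action in the error-handling table.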

Step 8: Report Results

Summarize to user:

  • GPU vendor and devices detected
  • Container name and image used
  • Data mount path
  • Validation status
  • How to access: docker exec -it pytorch-gpu bash

Error Handling

| Error | Action |
| --- | --- |
| No GPU detected | Ask user for vendor or check drivers |
| Image pull fails | Try alternative registry or web search |
| Container start fails | Check device permissions, show error |
| Validation fails | Show detailed error, suggest fixes |

Reference Files

  • references/gpu-detection.md - Detection methods by vendor
  • references/image-sources.md - Image discovery guide (registry APIs, priority order, selection criteria)
  • references/mount-requirements.md - Vendor mount specifications

Example Usage

User: /gpu-container-setup
User: setup a pytorch container
User: start container with ascend GPU
User: /gpu-container-setup --image nvcr.io/nvidia/pytorch:24.01-py3
User: /gpu-container-setup --image harbor.baai.ac.cn/flagrelease-public/ngctorch:2601