Install
openclaw skills install @sdk-team/alibabacloud-ecs-gpu-diagnosis
Diagnose Alibaba Cloud ECS GPU instances to detect GPU device status, driver issues, and hardware failures. Use this Skill when users report GPU instance anomalies, deep learning task failures, GPU not visible, or when troubleshooting GPU hardware issues. Supports automatic Alibaba Cloud CLI installation, diagnosis report creation, and polling for diagnosis results.
Initiate diagnosis on a specified ECS GPU instance to detect GPU device status and output diagnosis results.
Check Alibaba Cloud CLI Environment
- Run which aliyun or aliyun --version to check if the CLI is installed
- If it is not installed, follow references/cli-installation.md
- Run aliyun version to confirm version >= 3.0.299
- Run aliyun plugin update to ensure local plugins are up-to-date
- aliyun configure
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-ecs-gpu-diagnosis"
aliyun configure ai-mode disable
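The minimum-version requirement from the environment check (>= 3.0.299) can be tested mechanically. The following is a minimal sketch, assuming that aliyun version prints a plain dotted numeric version string; version_ge is a hypothetical helper name, not part of the CLI.

```shell
# version_ge A B: succeed when dotted numeric version A is >= version B.
# Fields are compared left to right as integers; missing fields count as 0.
# Note: POSIX sh has no local variables, so a, b, x, y are globals.
version_ge() {
  a=$1
  b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}
    y=${b%%.*}
    # Drop the field just read; an exhausted version string becomes empty.
    case $a in *.*) a=${a#*.} ;; *) a= ;; esac
    case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    x=${x:-0}
    y=${y:-0}
    if [ "$x" -gt "$y" ]; then return 0; fi
    if [ "$x" -lt "$y" ]; then return 1; fi
  done
  return 0
}

# Usage sketch (assumes the aliyun CLI is on PATH):
#   version_ge "$(aliyun version)" 3.0.299 || echo "CLI is older than 3.0.299"
```

A field-by-field integer comparison is used instead of a string comparison because, compared as text, 3.0.3 would sort after 3.0.299.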
Required RAM permissions are listed in references/ram-policies.md.
Obtain Required Parameters
- Confirm INSTANCE_ID is provided (ECS instance ID; the format MUST match the regular expression ^i-[a-z0-9]{20}$)
- Confirm REGION_ID is provided (region ID, such as cn-shanghai)
Validate Parameters
- Verify that INSTANCE_ID matches the regex pattern ^i-[a-z0-9]{20}$
aliyun ecs describe-regions --user-agent AlibabaCloud-Agent-Skills/alibabacloud-ecs-gpu-diagnosis
- Extract the Regions.Region[].RegionId list from the response
- Confirm that REGION_ID exists in the list
Check Instance Operating System Type
aliyun ecs describe-instances --user-agent AlibabaCloud-Agent-Skills/alibabacloud-ecs-gpu-diagnosis --RegionId "${REGION_ID}" --InstanceIds "[\"${INSTANCE_ID}\"]"
- Extract the Instances.Instance[0].OSType field from the response
- If OSType is "linux": continue with the subsequent diagnosis process
- If OSType is not "linux": notify the user and terminate the process:
The current instance ${INSTANCE_ID} has operating system ${OSType}.
This Skill currently only supports Linux operating system instances, other operating systems are not supported.
No further diagnosis process is needed.
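The parameter and operating-system checks above can be sketched as small shell helpers. The function names are hypothetical. The region helper assumes the RegionId values have already been extracted from the describe-regions response, one per line (for example with a JSON tool such as jq, which this document does not require).

```shell
# valid_instance_id ID: succeed when ID matches ^i-[a-z0-9]{20}$.
# LC_ALL=C keeps the a-z range limited to ASCII lowercase letters.
valid_instance_id() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^i-[a-z0-9]{20}$'
}

# region_in_list REGION_ID: succeed when REGION_ID exactly equals one of
# the region IDs supplied on stdin, one per line.
region_in_list() {
  grep -Fxq -- "$1"
}

# check_os_type INSTANCE_ID OS_TYPE: succeed for "linux"; otherwise print
# the termination notice from this document and fail.
check_os_type() {
  if [ "$2" = "linux" ]; then
    return 0
  fi
  printf 'The current instance %s has operating system %s.\n' "$1" "$2"
  printf '%s\n' 'This Skill currently only supports Linux operating system instances, other operating systems are not supported.'
  printf '%s\n' 'No further diagnosis process is needed.'
  return 1
}
```

Note that the shortened placeholder ID used in this document's examples (i-bp1xxxxxxxxx) has only 12 characters after the i- prefix, so it would not pass valid_instance_id; a real instance ID has 20.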
Create Diagnostic Report
MUST Use the following command to initiate GPU diagnosis. Using alternatives such as RunCommand, nvidia-smi, DescribeInvocationResults, etc., for GPU diagnostics is prohibited.
aliyun ecs create-diagnostic-report \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-ecs-gpu-diagnosis \
--RegionId "${REGION_ID}" \
--ResourceId "${INSTANCE_ID}" \
--MetricSetId 'dms-instanceGPUdevice' \
--output cols=ReportId
Extract ReportId from the output and save it for subsequent queries.
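One way to capture the ReportId for later queries is to filter the command output for a token that looks like a report ID. This is only a sketch: the dr- prefix is inferred from the example ID shown in this document (dr-xxxxxxxx) and is an assumption, not a documented format. extract_report_id is a hypothetical helper name.

```shell
# extract_report_id: read CLI output on stdin and print the first token of
# the form dr-<letters/digits>. Fails when no such token is present.
# Assumption: report IDs start with "dr-", as in this document's example.
extract_report_id() {
  id=$(grep -Eo 'dr-[A-Za-z0-9]+' | head -n 1)
  if [ -z "$id" ]; then
    return 1
  fi
  printf '%s\n' "$id"
}

# Usage sketch (requires a configured aliyun CLI; the arguments are the
# ones given in the create-diagnostic-report command above):
#   REPORT_ID=$(aliyun ecs create-diagnostic-report ... | extract_report_id)
```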
Poll Diagnostic Results
MUST Use the following command to query the diagnosis report status:
aliyun ecs describe-diagnostic-reports \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-ecs-gpu-diagnosis \
--RegionId "${REGION_ID}" \
--ReportIds.1 "${REPORT_ID}"
Handle based on the returned Status field:
- InProgress: wait 5 seconds, then query again
- Finished: check the Issues field content
  - If Issues is empty or does not exist, report "GPU diagnosis normal, no anomalies detected"
  - If Issues contains content, extract each Issue's IssueId, MetricId, Severity, and MetricCategory, and output diagnosis results and recommended actions according to the IssueId mapping table below
Set a timeout: poll up to 60 times (approximately 5 minutes); if the report is still not complete, prompt the user to query manually later.
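The polling rule above (at most 60 queries over roughly 5 minutes, which implies about 5 seconds between queries) can be sketched as a bounded loop. poll_status is a hypothetical helper. It takes the name of a caller-supplied function that prints the current Status value, so how Status is extracted from the describe-diagnostic-reports response is left to the caller. Treating any value other than InProgress or Finished as an error is an assumption.

```shell
# poll_status FETCH [MAX_TRIES] [INTERVAL_SECONDS]
# Calls FETCH (a function or command that prints the report Status) until it
# prints "Finished". Returns 0 when finished, 1 on timeout, 2 when FETCH
# itself fails, and 3 on any other status value.
poll_status() {
  fetch=$1
  max=${2:-60}
  interval=${3:-5}
  i=1
  while [ "$i" -le "$max" ]; do
    status=$("$fetch") || return 2
    case $status in
      Finished) return 0 ;;
      InProgress) ;;  # not done yet; fall through and wait
      *)
        printf 'unexpected status: %s\n' "$status" >&2
        return 3
        ;;
    esac
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$max" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 1
}
```

On a return code of 1 the caller would prompt the user to query the report manually later, as described above.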
After diagnosis is complete, the output should include:
The Issues field returned in the diagnosis report is an array in which each Issue contains the IssueId, MetricId, Severity, and MetricCategory fields. Output the diagnostic description and handling measures according to the IssueId mapping table below:
| IssueId | Diagnostic Description | Exception Handling Measures |
|---|---|---|
| GuestOS.GPU.MemoryEccCheckError | Detect GPU Double Bit Error conditions | Prompt user to restart instance based on error count |
| GuestOS.GPU.InfoRomCorrupted | Detect GPU infoROM firmware information | O&M notification will be sent to user |
| GuestOS.GPU.DriverVersionMismatch | Detect driver anomalies caused by kernel upgrades | User needs to uninstall and reinstall driver |
| GuestOS.GPU.FabricmanagerCheck | Detect Fabricmanager component running status | User needs to install or start Fabricmanager component service |
| GuestOS.GPU.PowerCableError | Detect GPU power cable and power supply status | O&M notification will be sent to user |
| GuestOS.GPU.DeviceLost | Detect GPU card loss conditions | O&M notification will be sent to user |
| GuestOS.GPU.DriverNotInstalled | Detect GPU driver installation status | User needs to install driver |
| GuestOS.GPU.NVXidError | Detect GPU Xid error anomalies | Prompt user to restart instance based on different XID errors |
| GuestOS.GPU.RmInitAdapterError | Detect GPU card initialization anomalies, manifested as driver card loss | O&M notification will be sent to user |
| GuestOS.GPU.NVLinkError | Check GPU NVLink status | O&M notification will be sent to user |
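The third column of the table can be encoded as a lookup so that each returned IssueId resolves to its handling measure. This is a sketch with a hypothetical function name (issue_handling); the strings are copied from the table, and the behavior for an IssueId that is not in the table is an assumption.

```shell
# issue_handling ISSUE_ID: print the handling measure from the mapping table.
# Fails (and prints a note to stderr) for an IssueId the table does not list.
issue_handling() {
  case $1 in
    GuestOS.GPU.MemoryEccCheckError) m='Prompt user to restart instance based on error count' ;;
    GuestOS.GPU.InfoRomCorrupted) m='O&M notification will be sent to user' ;;
    GuestOS.GPU.DriverVersionMismatch) m='User needs to uninstall and reinstall driver' ;;
    GuestOS.GPU.FabricmanagerCheck) m='User needs to install or start Fabricmanager component service' ;;
    GuestOS.GPU.PowerCableError) m='O&M notification will be sent to user' ;;
    GuestOS.GPU.DeviceLost) m='O&M notification will be sent to user' ;;
    GuestOS.GPU.DriverNotInstalled) m='User needs to install driver' ;;
    GuestOS.GPU.NVXidError) m='Prompt user to restart instance based on different XID errors' ;;
    GuestOS.GPU.RmInitAdapterError) m='O&M notification will be sent to user' ;;
    GuestOS.GPU.NVLinkError) m='O&M notification will be sent to user' ;;
    *)
      printf 'no mapping for IssueId: %s\n' "$1" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$m"
}
```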
Output Format Example:
Diagnosis Complete! Instance: i-bp1xxxxxxxxx (cn-shanghai)
Report ID: dr-xxxxxxxx
1 anomaly found:
[1] GuestOS.GPU.DriverNotInstalled
Severity: Warn
Diagnostic Description: Detect GPU driver installation status
Handling Measures: User needs to install driver
Diagnostic Recommendations:
- Please install the corresponding version of NVIDIA GPU driver
- Installation Guide: https://help.aliyun.com/document_detail/108460.html
Special Reminder: When the exception handling measure is "O&M notification will be sent to user", append the following reminder to the output:
⚠️ Important Reminder:
- Alibaba Cloud will send you O&M event notifications
- Please go to the ECS console to view event details
- Pay attention to whether you receive O&M events and handle them as required
If Issues is an empty array or does not exist, output:
Diagnosis Complete! Instance: i-bp1xxxxxxxxx (cn-shanghai)
Report ID: dr-xxxxxxxx
GPU diagnosis normal, no anomalies detected.
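The two output formats above, plus the O&M reminder, can be produced by one formatter. The sketch below uses a hypothetical function, print_report, and assumes the issues have already been reduced to tab-separated lines. It omits the free-text Diagnostic Recommendations block, and the plural wording for more than one anomaly is an assumption, since the example only shows a single anomaly.

```shell
# print_report INSTANCE_ID REGION_ID REPORT_ID
# Reads issues on stdin, one per line, as four tab-separated non-empty fields:
#   IssueId, Severity, diagnostic description, handling measure
print_report() {
  issues=$(cat)
  printf 'Diagnosis Complete! Instance: %s (%s)\n' "$1" "$2"
  printf 'Report ID: %s\n' "$3"
  if [ -z "$issues" ]; then
    printf 'GPU diagnosis normal, no anomalies detected.\n'
    return 0
  fi
  count=$(printf '%s\n' "$issues" | grep -c .)
  # Plural form is an assumption; the document only shows "1 anomaly found:".
  if [ "$count" -eq 1 ]; then noun=anomaly; else noun=anomalies; fi
  printf '%s %s found:\n' "$count" "$noun"
  n=0
  remind=no
  tab=$(printf '\t')
  while IFS=$tab read -r id severity desc measure; do
    if [ -z "$id" ]; then continue; fi
    n=$((n + 1))
    printf '[%s] %s\n' "$n" "$id"
    printf 'Severity: %s\n' "$severity"
    printf 'Diagnostic Description: %s\n' "$desc"
    printf 'Handling Measures: %s\n' "$measure"
    if [ "$measure" = "O&M notification will be sent to user" ]; then
      remind=yes
    fi
  done <<EOF
$issues
EOF
  # Append the reminder once if any issue is handled via O&M notification.
  if [ "$remind" = yes ]; then
    printf '%s\n' '⚠️ Important Reminder:' \
      '- Alibaba Cloud will send you O&M event notifications' \
      '- Please go to the ECS console to view event details' \
      '- Pay attention to whether you receive O&M events and handle them as required'
  fi
}
```

The issues are fed to the loop through a here-document rather than a pipe so that the loop runs in the current shell and the reminder flag survives after it ends.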
User: Help me diagnose this GPU server i-bp1xxxxxxxxx
Agent:
1. Check CLI is installed
2. Ask for region (user did not provide)
3. User replies: cn-shanghai
4. Check instance OS type is Linux
5. Execute CreateDiagnosticReport, get ReportId: dr-xxxxxxxx
6. Poll DescribeDiagnosticReports
7. Status=InProgress, wait 5 seconds...
8. Query again, Status=Finished
9. Output Issues content to user