Install
`openclaw skills install alibabacloud-pai-dlc-job-diagnostics`

PAI-DLC job diagnostics and health inspection: queuing-stuck root cause analysis, failed-job localization, and cluster health checks. Companion to the `alibabacloud-pai-dlc-job` skill (read-only, no writes). Triggers: "diagnose", "diagnose job", "job stuck", "why queuing", "queue stuck", "stuck in queue", "job failed", "failure reason", "healthcheck", "health check", "inspect job", "inspection".

Read-only diagnostic analysis for PAI-DLC distributed training jobs, covering three scenarios:
- Scenario 1: Queuing-stuck root cause analysis
- Scenario 2: Failed-job localization
- Scenario 3: Health inspection
Architecture: PAI-DLC Job (read-only queries) + PAI Studio Resource Diagnosis API (queuing scenario).
This skill performs read-only diagnostics only. All write operations
(create / update / stop jobs, resource discovery, etc.) live in the companion
skill alibabacloud-pai-dlc-job. The two skills are complementary in
responsibility and share a common field contract.
| Prerequisite Skill | Role | When to switch to it |
|---|---|---|
| `alibabacloud-pai-dlc-job` | Write ops (create/update/stop) + AIWorkSpace resource discovery | Creating / modifying / stopping jobs, or discovering Image / Dataset / CodeSource |
| This skill | Read-only diagnostics (logs / events / sanity-check / queuing root cause) | Job already exists — troubleshooting or health inspection |
Discover and install the prerequisite skill:
# Discover available skills
npx skills add aliyun/alibabacloud-aiops-skills --skill alibabacloud-find-skills
# Install the alibabacloud-pai-dlc-job skill itself
npx skills add aliyun/alibabacloud-aiops-skills --skill alibabacloud-pai-dlc-job
Cross-skill field contract: The --job-id / --pod-id values this skill
consumes are produced verbatim by alibabacloud-pai-dlc-job via
list-jobs / get-job --cli-query "Pods[0].PodId" — no transformation needed.
--region / --workspace-id follow the same resolution rules in both skills.
Pre-check: Aliyun CLI >= 3.3.1 required. Run `aliyun version` to verify >= 3.3.1. If not installed or version too low, see references/cli-installation-guide.md. Then [MUST] run `aliyun configure set --auto-plugin-install true`.

Note on `--user-agent`: Every API-invoking `aliyun` command in this skill MUST include `--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics`. Client-side helpers (`aliyun version`, `aliyun configure ...`, `aliyun plugin ...`, `aliyun <product> --help`) do not invoke remote APIs and therefore do not require the flag.
aliyun version
aliyun configure set --auto-plugin-install true
aliyun plugin update
aliyun pai-dlc --help
aliyun paistudio --help >/dev/null 2>&1 || aliyun plugin install --names aliyun-cli-paistudio
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics"
# After session: aliyun configure ai-mode disable
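The version pre-check above can be scripted. The following is a minimal sketch, not part of the skill: `version_ge` and `extract_version` are hypothetical helper names, the comparison relies on `sort -V` from GNU coreutils, and the exact output format of `aliyun version` is an assumption, which is why the version token is extracted with a pattern instead of being read verbatim.

```shell
# Hypothetical helper (not an aliyun CLI feature): succeeds when version $1 >= version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  local have="$1" need="$2"
  # The lower version sorts first; if that is $need, then $have >= $need.
  [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n 1)" = "$need" ]
}

# Hypothetical helper: pull the first dotted x.y.z token out of arbitrary text.
# Assumption: `aliyun version` prints a version of that shape somewhere in its output.
extract_version() {
  grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Example usage (commented out so the block has no side effects):
#   installed="$(aliyun version | extract_version)"
#   version_ge "$installed" 3.3.1 || echo "CLI too old; see references/cli-installation-guide.md"
```

Both helpers are local text processing only, so they do not need the `--user-agent` flag.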
Pre-check: Alibaba Cloud Credentials Required
Security Rules:
- NEVER read, echo, or print AK/SK values
- NEVER ask the user to input AK/SK directly
- ONLY use `aliyun configure list` to check credential status

Run `aliyun configure list` and check the output for a valid profile (AK, STS, or OAuth identity).
If no valid profile exists, STOP here.
- Obtain credentials from Alibaba Cloud Console
- Configure credentials outside of this session
- Return and re-run after `aliyun configure list` shows a valid profile
[MUST] Permission Failure Handling: When any command fails due to permission errors:
- Read `references/ram-policies.md` for the full permission list
- Use the `ram-permission-diagnose` skill to guide the user
- Pause and wait until the user confirms permissions have been granted
| Product | Permissions | Purpose |
|---|---|---|
| pai-dlc | pai:GetJob, pai:GetPodLogs, pai:GetJobEvents, pai:GetPodEvents, pai:ListJobSanityCheckResults | Job information collection |
| paistudio | paistudio:GetQuotaWorkloadDiagnosis | Queuing resource diagnosis |
IMPORTANT: Parameter Confirmation — Before executing any command, ALL user-customizable parameters (RegionId, JobId, etc.) MUST be confirmed with the user.
| Parameter | Required | Description |
|---|---|---|
| region | Yes | Region where the job runs |
| job_id | Yes | DLC job ID (e.g., dlcXXX) |
When a diagnostic request arrives, first call get-job to fetch job status,
then route by status:
| Job status | Route to scenario |
|---|---|
| Queuing / Creating | → Queuing-stuck root cause analysis |
| Failed | → Failed-job localization |
| Running | → Health inspection |
| Stopped | Inform the user "job was actively stopped", no diagnosis |
| Succeeded | → Historical review (follow Scenario 3 Execution steps) |
Edge case — job was queuing but is now Stopped/Succeeded: If the user
describes the job as "stuck in queue" but get-job shows Stopped or
Succeeded, still route to Scenario 1 (queuing analysis) but expect the
resource diagnosis API to return HTTP 400. Follow the "Fallback on API Failure"
procedure in Scenario 1.
Users may also directly request a specific scenario (e.g., "run a health inspection" even when status is not Running).
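The routing rules above, including the "stuck in queue" edge case, can be sketched as a single function. This is an illustration only: `route_by_status` and the route labels it prints are hypothetical names, and the status string is assumed to be passed in after it has been read from the `get-job` output.

```shell
# Hypothetical routing helper mirroring the status table.
# $1 = job status from get-job
# $2 = "yes" if the user described the job as stuck in queue (default "no")
# Prints an illustrative route label; these labels are not API values.
route_by_status() {
  local status="$1" user_reports_queuing="${2:-no}"

  # Edge case: the user reports queuing but the job is already Stopped/Succeeded.
  # Still run the queuing analysis, expecting the diagnosis API to return HTTP 400.
  if [ "$user_reports_queuing" = "yes" ]; then
    case "$status" in
      Stopped|Succeeded) echo "scenario-1-queuing-analysis-with-fallback"; return 0 ;;
    esac
  fi

  case "$status" in
    Queuing|Creating) echo "scenario-1-queuing-analysis" ;;
    Failed)           echo "scenario-2-failed-job-localization" ;;
    Running)          echo "scenario-3-health-inspection" ;;
    Succeeded)        echo "scenario-3-historical-review" ;;
    Stopped)          echo "no-diagnosis-job-was-stopped" ;;
    *)                echo "unknown-status"; return 1 ;;
  esac
}
```

An explicit user request for a specific scenario would bypass this function entirely, since the text allows a health inspection even when the status is not Running.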
aliyun pai-dlc get-job --region <r> --job-id <id> \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun pai-dlc get-job-events --region <r> --job-id <id> --max-events-num 50 \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun pai-dlc get-pod-events --region <r> --job-id <id> --pod-id <pod> --max-events-num 20 \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun pai-dlc get-pod-logs --region <r> --job-id <id> --pod-id <pod> --max-lines 100 \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun pai-dlc list-job-sanity-check-results --region <r> --job-id <id> \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun pai-dlc get-job-sanity-check-result --region <r> --job-id <id> --sanity-check-number 1 \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun paistudio GET /api/v1/quotas/{quota_id}/workloads/{job_id}/diagnosis \
--region <r> --header "Content-Type=application/json" --force \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
Hard constraint: quota_id MUST come from get-job's ResourceId field.
If ResourceId is empty (public pay-as-you-go), this API is unavailable.
Full API structure: see references/resource-diagnosis-api.md.
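The hard constraint on `quota_id` can be enforced before the request is built. A minimal sketch, assuming the `ResourceId` and job ID have already been read from `get-job`; `diagnosis_path` is a hypothetical helper name and it only assembles the path shown above.

```shell
# Hypothetical helper: build the request path for the resource diagnosis API.
# $1 = quota id taken from get-job's ResourceId field
# $2 = job id
# Returns non-zero when ResourceId is empty (public pay-as-you-go), because the
# API is unavailable in that case.
diagnosis_path() {
  local quota_id="$1" job_id="$2"
  if [ -z "$quota_id" ]; then
    echo "ResourceId is empty: resource diagnosis API unavailable" >&2
    return 1
  fi
  if [ -z "$job_id" ]; then
    echo "job id is required" >&2
    return 1
  fi
  printf '/api/v1/quotas/%s/workloads/%s/diagnosis\n' "$quota_id" "$job_id"
}

# Example usage (commented out; <r>, $resource_id and $job_id are placeholders):
#   path="$(diagnosis_path "$resource_id" "$job_id")" && \
#     aliyun paistudio GET "$path" --region <r> \
#       --header "Content-Type=application/json" --force \
#       --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
```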
Scenario 1: Queuing-stuck root cause analysis
Trigger: job status = Queuing / Creating and user reports it cannot be scheduled.
Tools: get-job → paistudio resource diagnosis → (optional) get-job-events.
Hard constraints:
- `ResourceId` empty → resource diagnosis unavailable; mine events for clues
- `ResourceId` non-empty → resource diagnosis API is the primary instrument

CRITICAL: pai-dlc and paistudio are two different products

| Product | Scope | Commands |
|---|---|---|
| pai-dlc | Job/Pod lifecycle (GetJob, GetJobEvents, ListJobs, GetPodLogs) | `aliyun pai-dlc get-job ...` |
| paistudio | Platform-level services including resource diagnosis | `aliyun paistudio GET /api/v1/quotas/...` |

The resource diagnosis API belongs to paistudio, NOT pai-dlc. Do NOT call `pai-dlc GetResourceQuota`, `pai-dlc ListResourceQuotas`, or any `pai-dlc GET /api/v1/resourcequotas/...`; these are the wrong APIs and will fail. The correct command is: `aliyun paistudio GET /api/v1/quotas/{quota_id}/workloads/{job_id}/diagnosis ...`
Pattern knowledge: resource diagnosis returns 4 checks
(self_quota / ancestor_quota / user_limit / queue_strategy), plus node
scheduling and hyper-node analysis. Common patterns:
references/diagnostic-patterns.md §1.
Agent latitude: decide whether to compute the quota gap, whether to pull events for corroboration, and how verbose the report should be.
Fallback on API Failure
The PAI Studio resource diagnosis API may fail with HTTP 400/404 when the
job is no longer in an active queuing state (e.g., already Stopped by user).
In this case:
- Report the API error: "Resource diagnosis API unavailable: HTTP {code}: {error message}"
- Collect the configuration-based inputs:
  - get-job → ResourceRequest (GPU/CPU/Memory demand per pod)
  - get-job → PodCount × per-pod resources = total demand
  - get-job → EcsSpec (instance type and its per-node capacity)
  - get-job-events → scheduling event timeline and queuing duration
- Report using this template:

## Resource Diagnosis (API Unavailable, Configuration-Based Analysis)
- Diagnosis API: unavailable (HTTP {code}: {message})
- Resource demand: {PodCount} pods × {GPU/pod} GPU = {total} GPU cards
- Instance type: {EcsSpec}
- Queuing duration: {hours}h {minutes}m
- Quota ID: {ResourceId}
- Job final status: {status} ({ReasonCode})
- Conclusion: The job requested {total} GPU cards which could not be
fulfilled within the queuing window before the job was {stopped/completed}.
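Two of the template fields above are plain arithmetic: total demand is PodCount × per-pod GPUs, and the queuing duration is split into hours and minutes. A minimal sketch follows, with hypothetical helper names. It assumes the queuing duration has already been derived in seconds from the event timestamps; that derivation is not shown.

```shell
# Hypothetical helper: total GPU cards requested.
# $1 = PodCount, $2 = GPUs per pod (from ResourceRequest)
total_gpu() {
  echo $(( $1 * $2 ))
}

# Hypothetical helper: format a duration in seconds as "{hours}h {minutes}m".
# $1 = queuing duration in whole seconds
format_duration() {
  local secs="$1"
  printf '%dh %dm\n' $(( secs / 3600 )) $(( (secs % 3600) / 60 ))
}
```

For example, 16 pods with 8 GPUs each give `total_gpu 16 8` = 128 cards, and 9000 seconds of queuing is reported as `2h 30m`.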
Scenario 2: Failed-job localization
Trigger: job status = Failed.
Tools: get-job → (as needed) get-job-events / get-pod-events / get-pod-logs.
Hard constraints:
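- Stopped is not a failure; do not diagnose
- Resource-shortage failures are identified by `ResourceAllocateFailed` (insufficient resources)
- Preemption/eviction failures are identified by the `ReasonMessage` (it contains `preempted` or `evicted`)

Mandatory output template for preemption/eviction/resource-shortage failures: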
## Diagnosis Conclusion
- Failure reason: {classification} ({ReasonCode}: {ReasonMessage})
- Affected pods: {pod list with SubStatus}
- Timeline: {key timestamps from events}
- Evidence: {quoted ReasonMessage or event details}
For quota policy or resource allocation adjustments, please contact your
platform administrator.
Pattern knowledge: failure-classification priority (network > image > runtime > resource > config > system), exit-code meanings, keyword-matching rules. See references/diagnostic-patterns.md §2.
Agent latitude: when ReasonCode is clear, logs may be unnecessary; when
logs already explain the issue, events may be unnecessary. Decide investigation
depth based on information sufficiency.
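The failure-classification priority quoted above (network > image > runtime > resource > config > system) amounts to a "first match in a fixed order" rule. The sketch below illustrates only that ordering and the common shell convention that exit codes above 128 mean termination by a signal. The helper names are hypothetical, and the keyword-matching rules that produce the candidate categories live in references/diagnostic-patterns.md and are not reproduced here.

```shell
# Hypothetical helper: given the categories that matched (as arguments),
# print the one with the highest priority. Returns non-zero if none is recognised.
primary_category() {
  local candidate seen
  # Priority order from the text, highest first.
  for candidate in network image runtime resource config system; do
    for seen in "$@"; do
      if [ "$seen" = "$candidate" ]; then
        echo "$candidate"
        return 0
      fi
    done
  done
  return 1
}

# Hypothetical helper: exit codes above 128 conventionally mean
# "terminated by signal (code - 128)". Prints the signal number,
# or nothing for ordinary exit codes.
exit_signal() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  fi
}
```

`exit_signal 137` prints 9 (SIGKILL), which on its own does not distinguish an OOM kill from an external kill, so the surrounding events and `ReasonMessage` still have to be consulted.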
Scenario 3: Health inspection
Trigger: job status = Running and user requests inspection / health check.
Tools: get-job + get-job-events + get-pod-logs + list-job-sanity-check-results.

Execution steps:
1. get-job → obtain status, WorkspaceId, Pod list, whether EnableSanityCheck is set
2. get-job-events → event-chain analysis (scheduling, restarts, etc.)
3. get-pod-logs → target the master pod (rank=0) first; extract structured training metrics if present
4. list-job-sanity-check-results → execute only if EnableSanityCheck=true in job settings
5. Append the console links and the closing notice:
   - Job overview: https://pai.console.aliyun.com/?regionId={region}&workspaceId={workspace_id}#/dlc/jobs/{job_id}/overview
   - Monitoring dashboard: https://pai.console.aliyun.com/?regionId={region}&workspaceId={workspace_id}#/dlc/jobs/{job_id}/monitor

   > GPU/memory real-time resource utilization metrics require the monitoring
   > dashboard above. This skill's CLI commands do not support fetching
   > runtime utilization data directly.
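The two console links differ only in their final path segment, so they can be filled from one template. A minimal sketch, with `console_url` as a hypothetical helper name. The region, workspace ID, and job ID are assumed to come from the confirmed parameters and the `get-job` output.

```shell
# Hypothetical helper: fill the console URL template.
# $1 = region, $2 = workspace id (WorkspaceId from get-job),
# $3 = job id, $4 = page ("overview" or "monitor")
console_url() {
  local region="$1" workspace_id="$2" job_id="$3" page="$4"
  case "$page" in
    overview|monitor) ;;
    *) echo "unknown page: $page" >&2; return 1 ;;
  esac
  printf 'https://pai.console.aliyun.com/?regionId=%s&workspaceId=%s#/dlc/jobs/%s/%s\n' \
    "$region" "$workspace_id" "$job_id" "$page"
}
```

Calling it once with `overview` and once with `monitor` yields both links for the inspection report.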
Dimension matrix (mandatory vs optional):
| Dimension | Mandatory/Optional | Notes |
|---|---|---|
| Training throughput | Optional | Extract from master pod logs |
| Hang detection | Running only | Skip for Succeeded jobs |
| Hardware health | Optional | Requires EnableSanityCheck=true |
| Restart stability | Mandatory | Read RestartCount directly from get-job |
Note on resource utilization: GPU/memory metrics are NOT available via this skill's CLI commands. The monitoring dashboard link (step 5) and its closing notice are mandatory in every inspection report — do not omit them.
Interpretation rules per dimension: see references/healthcheck-dimensions.md.
Agent latitude: kilo-card jobs focus on Hang + SanityCheck; small jobs look at training throughput from logs. Decide depth and report verbosity from scale and user intent.
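The dimension matrix above can be read as a small decision rule: restart stability is always reported, hang detection applies only to Running jobs, and hardware health needs `EnableSanityCheck=true`. The sketch below encodes that rule. `inspection_dimensions` and the labels it prints are hypothetical, and training throughput is listed unconditionally even though it only yields a finding when the master pod logs contain metrics.

```shell
# Hypothetical helper: list the inspection dimensions that apply to a job.
# $1 = job status from get-job
# $2 = EnableSanityCheck value ("true" enables the hardware-health dimension)
inspection_dimensions() {
  local status="$1" sanity="$2"
  echo "restart-stability"      # mandatory: RestartCount comes from get-job
  echo "training-throughput"    # optional: depends on metrics in master pod logs
  if [ "$status" = "Running" ]; then
    echo "hang-detection"       # skipped for Succeeded jobs
  fi
  if [ "$sanity" = "true" ]; then
    echo "hardware-health"      # requires list-job-sanity-check-results
  fi
}
```

A dimension that does not apply would be rated N/A in the report instead of being dropped from it.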
- get-job first, route second: never assume status; query and then route
- Use `--max-lines 100` / `--max-events-num 50` to avoid context blow-up
- In the get-job Pods list, focus on pods with Status=Failed/Unknown or those with a ReasonMessage
- exit 137 may be OOM or external kill; combine with context
- When ReasonCode already states the cause, skip the full log dump
- Read-only: never call stop / update / create

Inspection report table:

| Dimension | Rating | Key Finding |
|---|---|---|
| Training throughput | ✅/⚠️/❌/N/A | ... |
| Hang detection | ✅/⚠️/❌/N/A | ... |
| Hardware health | ✅/⚠️/❌/N/A | ... |
| Restart stability | ✅/⚠️/❌ | ... |
| Document | Contents |
|---|---|
| references/resource-diagnosis-api.md | PAI Studio resource diagnosis API full reference |
| references/diagnostic-patterns.md | Failure pattern knowledge base (queuing / failure / hang / restart) |
| references/healthcheck-dimensions.md | Health inspection dimensions and interpretation rules |
| references/ram-policies.md | RAM permission policies |
| references/related-commands.md | CLI command quick reference |
| references/acceptance-criteria.md | Acceptance criteria |
| references/verification-method.md | Verification method |
| references/cli-installation-guide.md | CLI installation guide |