Install
openclaw skills install alibabacloud-pai-dlc-job
Alibaba Cloud PAI-DLC (Deep Learning Containers) job management skill. Use for creating, managing, and monitoring DLC training jobs and managing reusable job templates. Triggers: "DLC", "PAI-DLC", "JobTemplate", "create-job-template", "list-job-templates", "set-job-template-default-version", "create-tensorboard", "list-tensorboards", "get-dashboard".
Manage deep learning training jobs on the Alibaba Cloud PAI-DLC (Platform for AI - Deep Learning Containers) service.
PAI-DLC is a distributed training service provided by Alibaba Cloud's AI Platform PAI, supporting:
- Saving CreateJob configurations as templates with multi-version management and field constraints
Architecture: PAI Workspace + DLC Job + Computing Resources (ECS public pay-as-you-go or Lingjun dedicated quota) + AIWorkSpace catalog (images / datasets / code sources / quotas / workspaces).
Pre-check: Aliyun CLI >= 3.3.1 required. Run aliyun version to verify version >= 3.3.1. If not installed or version is too low, see references/cli-installation-guide.md for installation instructions. Then [Required] run aliyun configure set --auto-plugin-install true to enable automatic plugin installation.
Note on --user-agent: Every API-invoking aliyun command in this skill MUST include --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job. Client-side helpers (aliyun version, aliyun configure ..., aliyun plugin ..., aliyun <product> --help) do not invoke remote APIs and therefore do not require the flag.
aliyun version
aliyun configure set --auto-plugin-install true
aliyun pai-dlc --help
# JobTemplate (§7.7) requires aliyun-cli-pai-dlc >= 0.3.1.
# If create-job-template --help fails: aliyun plugin update --name aliyun-cli-pai-dlc
aliyun aiworkspace --help >/dev/null 2>&1 || aliyun plugin install --names aliyun-cli-aiworkspace
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job"
# After session: aliyun configure ai-mode disable
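The version pre-check can also be scripted. The following is a minimal sketch, assuming aliyun version prints a bare dotted version string such as 3.3.1 (verify this against your install); version_ge is a hypothetical helper name, not part of the CLI.

```shell
# Hypothetical helper (not part of the aliyun CLI): succeeds when dotted
# version $1 is greater than or equal to dotted version $2.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      # Missing components count as 0, so 3.3 is treated as 3.3.0
      xi = (i <= na) ? x[i] + 0 : 0
      yi = (i <= nb) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Usage (assumption: `aliyun version` prints only the version string):
#   version_ge "$(aliyun version)" 3.3.1 || echo "aliyun CLI is older than 3.3.1" >&2
```

Comparing component by component avoids the string-comparison trap where 3.10.0 would sort below 3.3.1.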
This skill does not require any custom environment variables. Credentials are handled by the Alibaba Cloud CLI configuration (see Authentication below). Optionally:
| Variable | Required | Purpose |
|---|---|---|
| ALIBABA_CLOUD_PROFILE | Optional | Selects a non-default aliyun configure profile |
| ALIBABA_CLOUD_REGION_ID | Optional | Default region when --region is omitted (still recommended to pass --region explicitly) |
Do NOT export ALIBABA_CLOUD_ACCESS_KEY_ID / ALIBABA_CLOUD_ACCESS_KEY_SECRET from
within this session; configure them outside (aliyun configure or shell profile).
Pre-check: Alibaba Cloud Credentials Required
Security Rules:
- NEVER read, echo, or print AK/SK values (e.g., echo $ALIBABA_CLOUD_ACCESS_KEY_ID is FORBIDDEN)
- NEVER ask the user to input AK/SK directly in the conversation or command line
- NEVER use aliyun configure set with literal credential values
- ONLY use aliyun configure list to check credential status
aliyun configure list
Check the output for a valid profile (AK, STS, or OAuth identity).
If no valid profile exists, STOP here.
- Obtain credentials from Alibaba Cloud Console
- Configure credentials outside of this session (via aliyun configure in terminal or environment variables in shell profile)
- Return and re-run after aliyun configure list shows a valid profile
[MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:
- Read references/ram-policies.md to get the full list of permissions required by this SKILL
- Use ram-permission-diagnose skill to guide the user through requesting the necessary permissions
- Pause and wait until the user confirms that the required permissions have been granted
For detailed permission list, see references/ram-policies.md.
Required Permissions Overview:
| Operation | Required Permission |
|---|---|
| Create Job | pai:CreateJob |
| List Jobs | pai:ListJobs |
| Get Job Details | pai:GetJob |
| Get Pod Logs | pai:GetPodLogs |
| Get Job Events | pai:GetJobEvents |
| Get Job Metrics | pai:GetJobMetrics |
| Update Job | pai:UpdateJob |
| Stop Job | pai:StopJob |
| Create / Read / Update Job Template | paidlc:CreateJobTemplate / paidlc:GetJobTemplate / paidlc:ListJobTemplates / paidlc:UpdateJobTemplate / paidlc:SetJobTemplateDefaultVersion |
| AIWorkSpace Resource Discovery | paiworkspace:ListWorkspaces / paiimage:ListImages,GetImage / paidataset:ListDatasets,GetDataset / paicodesource:ListCodeSources,GetCodeSource |
AIWorkSpace authorization note: Image / DataSourceId / CodeSourceId / WorkspaceId field values for create-job come from the AIWorkSpace resource-discovery APIs. --resource-id (QuotaId) is manually provided by the user. RAM users MUST hold the corresponding AIWorkSpace-namespaced permissions listed above (do not abbreviate as aiworkspace:*).
IMPORTANT: Parameter Confirmation — Before executing any command or API call, ALL user-customizable parameters (e.g., RegionId, instance names, CIDR blocks, passwords, domain names, resource specifications, etc.) MUST be confirmed with the user. Do NOT assume or use default values without explicit user approval.
| Parameter | Required | Notes |
|---|---|---|
| --region | Yes | e.g., cn-hangzhou |
| --workspace-id | Yes | From aliyun aiworkspace list-workspaces |
| --job-type | Yes | PyTorchJob, TFJob, RayJob, etc. |
| --display-name | Yes | Meaningful name (project + model + date) |
| --job-specs[].Image | Yes | Verbatim ImageUri from list-images (see §7.6 red line) |
| --user-command | Yes | e.g., python train.py |
| --job-specs[].EcsSpec | Conditional | Public pay-as-you-go (mutually exclusive with ResourceConfig) |
| --resource-id + ResourceConfig | Conditional | Dedicated quota path (mutually exclusive with EcsSpec). User MUST manually provide the QuotaId. |
| --data-sources / --code-source | Optional | From list-datasets / list-code-sources |
| --template-id | Conditional | When creating Job from JobTemplate |
For all parameters: aliyun pai-dlc create-job --help.
Mutual exclusion summary:
- EcsSpec and ResourceConfig are mutually exclusive within a single TaskSpec.
- Uri and DataSourceId within --data-sources[] are mutually exclusive.
- Uri and CodeSourceId within --code-source are mutually exclusive.
For full parameter reference: see references/related-apis.md.
Before calling create-job, determine the resource path:
- Public pay-as-you-go: set EcsSpec in TaskSpec; do NOT pass --resource-id.
  Example: "EcsSpec": "ecs.gn6i-c4g1.xlarge"
- Dedicated quota: set ResourceConfig in TaskSpec AND pass --resource-id <QuotaId>.
  Example: --resource-id quotaXXX + "ResourceConfig": {"CPU": "4", "Memory": "8Gi", "GPU": "1"}
EcsSpec and ResourceConfig MUST NOT both appear in the same TaskSpec.
Also required before create-job: --job-specs[].Image MUST come from aliyun aiworkspace list-images; --data-sources[].DataSourceId from list-datasets; --code-source.CodeSourceId from list-code-sources. Full discovery flow → see §7.6.
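The resource-path rule can be checked locally before create-job is called. This is a rough sketch under stated assumptions: resource_path is a hypothetical helper, it inspects one TaskSpec JSON object at a time, and it matches key names as plain text rather than parsing JSON, so treat it as a coarse guard only.

```shell
# Hypothetical helper: classifies one TaskSpec JSON object by resource path.
# Plain-text key matching, not a JSON parser, so it is only a coarse guard.
# Prints "public" (EcsSpec), "quota" (ResourceConfig) or "none";
# fails when both keys appear in the same TaskSpec.
resource_path() {
  has_ecs=0
  has_rc=0
  case $1 in *'"EcsSpec"'*) has_ecs=1 ;; esac
  case $1 in *'"ResourceConfig"'*) has_rc=1 ;; esac
  if [ "$has_ecs" -eq 1 ] && [ "$has_rc" -eq 1 ]; then
    echo "error: EcsSpec and ResourceConfig are mutually exclusive" >&2
    return 1
  fi
  if [ "$has_ecs" -eq 1 ]; then
    echo public
  elif [ "$has_rc" -eq 1 ]; then
    echo quota
  else
    echo none
  fi
}
```

A result of quota is a reminder that --resource-id must also be passed; a result of public means it must be omitted.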
Distributed architecture choices:
| Topology | JobSpecs shape |
|---|---|
| Single-node | One Worker only |
| TFJob PS-Worker | Both PS (CPU) and Worker (GPU) roles |
| PyTorch multi-node | One Worker with PodCount > 1 |
Optional flags: --enable-gang-scheduling true (all-or-nothing scheduling),
Settings.EnableRDMA: true (high-performance network for multi-node GPU),
Settings.EnableSanityCheck: true (GPU health verification).
# Minimal single-node PyTorch job (public pay-as-you-go)
aliyun pai-dlc create-job \
--region <region> \
--workspace-id <workspace-id> \
--display-name "my-pytorch-training" \
--job-type PyTorchJob \
--job-specs '[{
"Type": "Worker",
"PodCount": 1,
"Image": "<ImageUri-from-aiworkspace-list-images>",
"EcsSpec": "ecs.gn6i-c4g1.xlarge"
}]' \
--user-command 'python train.py' \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
For multi-node topologies, see §7.1. For Spot, RDMA, data mounting parameters, use aliyun pai-dlc create-job --help.
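For the PyTorch multi-node shape (one Worker with PodCount > 1), the --job-specs value differs from the single-node example only in PodCount. A small sketch, assuming the same four fields as the example above; worker_job_specs is a hypothetical helper and does not escape double quotes inside its arguments.

```shell
# Hypothetical helper: prints a --job-specs value for one Worker role.
# Arguments: <image-uri> <ecs-spec> <pod-count>
worker_job_specs() {
  printf '[{"Type":"Worker","PodCount":%d,"Image":"%s","EcsSpec":"%s"}]\n' \
    "$3" "$1" "$2"
}

# Two-node PyTorch layout: a single Worker entry with PodCount 2.
# The image placeholder must be replaced with a verbatim ImageUri.
worker_job_specs '<ImageUri-from-aiworkspace-list-images>' 'ecs.gn6i-c4g1.xlarge' 2
```

The printed string can be passed as the --job-specs argument of aliyun pai-dlc create-job.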
# List running jobs (status filter: Creating/Queuing/Running/Succeeded/Failed/Stopped)
aliyun pai-dlc list-jobs \
--region <region> \
--status Running \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Get job detail
aliyun pai-dlc get-job \
--region <region> \
--job-id <job-id> \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Get a specific PodId (for log/event queries)
aliyun pai-dlc get-job \
--region <region> \
--job-id <job-id> \
--cli-query "Pods[0].PodId" \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
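When scripting around list-jobs or get-job, it can help to group the status values from the filter comment above into coarse phases. The sketch below is illustrative: job_phase is a hypothetical helper, and the split into "active" and "terminal" is an assumption based on the status names, not something the service defines.

```shell
# Hypothetical helper: groups a job Status value into a coarse phase.
# The active/terminal grouping is an assumption based on the status names.
job_phase() {
  case $1 in
    Creating|Queuing|Running) echo active ;;
    Succeeded|Failed|Stopped) echo terminal ;;
    *) echo unknown ;;
  esac
}
```

Anything outside the six listed values maps to unknown so that a new or misspelled status is not silently treated as finished.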
IMPORTANT: Always limit return size: --max-lines 100 for logs, --max-events-num 50 for events.
# Get PodId first, then query logs/events/metrics
POD_ID=$(aliyun pai-dlc get-job --region <region> --job-id <job-id> \
--cli-query "Pods[0].PodId" --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job)
aliyun pai-dlc get-pod-logs --region <region> --job-id <job-id> --pod-id "$POD_ID" --max-lines 100 --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
aliyun pai-dlc get-pod-events --region <region> --job-id <job-id> --pod-id "$POD_ID" --max-events-num 20 --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
aliyun pai-dlc get-job-events --region <region> --job-id <job-id> --max-events-num 50 --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
aliyun pai-dlc get-job-metrics --region <region> --job-id <job-id> --metric-type GpuCoreUsage --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
Metric types: GpuCoreUsage, GpuMemoryUsage, CpuCoreUsage, MemoryUsage, NetworkInputRate, NetworkOutputRate, DiskReadRate, DiskWriteRate.
Diagnosis order: get-job (status) → get-job-events → get-pod-logs → get-pod-events.
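The diagnosis order above can be wrapped in one function so the four queries always run in the same sequence with the size limits applied. This is a sketch only: diagnose_job and the UA variable are hypothetical names, and every flag is taken from the commands shown earlier in this section.

```shell
UA=AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job

# Hypothetical wrapper: runs the four diagnosis queries in the order above
# and stops at the first command that fails.
# Arguments: <region> <job-id> <pod-id>
diagnose_job() {
  aliyun pai-dlc get-job --region "$1" --job-id "$2" --user-agent "$UA" || return 1
  aliyun pai-dlc get-job-events --region "$1" --job-id "$2" \
    --max-events-num 50 --user-agent "$UA" || return 1
  aliyun pai-dlc get-pod-logs --region "$1" --job-id "$2" --pod-id "$3" \
    --max-lines 100 --user-agent "$UA" || return 1
  aliyun pai-dlc get-pod-events --region "$1" --job-id "$2" --pod-id "$3" \
    --max-events-num 50 --user-agent "$UA" || return 1
}
```

Stopping at the first failure is a design choice that keeps a permission or lookup error visible; drop the || return 1 parts if you would rather collect whatever output is available.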
# All sanity check results
aliyun pai-dlc list-job-sanity-check-results \
--region <region> \
--job-id <job-id> \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Single sanity check result
aliyun pai-dlc get-job-sanity-check-result \
--region <region> \
--job-id <job-id> \
--sanity-check-number 1 \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
Discovery flow: list-workspaces → list-image-labels →
list-images → list-datasets → list-code-sources → pai-dlc create-job.
Quota (--resource-id): User MUST manually provide the QuotaId. No CLI discovery step.
# Step 1: Pick a workspace (yields --workspace-id)
aliyun aiworkspace list-workspaces \
--region <region> --page-number 1 --page-size 20 \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Step 2: Discover available image labels (MUST run before list-images)
# list-image-labels returns all label Key-Value pairs available in this region.
# Use this to discover valid --labels filters for list-images.
aliyun aiworkspace list-image-labels \
--region <region> \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# How to use list-image-labels results:
# - Extract label Keys (e.g. system.chipType, system.framework, system.cudaVersion)
# and their available Values to construct --labels filters
# - Multiple labels can be combined with comma: --labels "key1=val1,key2=val2"
# - Labels format: --labels "Key=Value" (single key-value pair), NOT JSON or spaces
# Step 3: Pick an image (yields WorkerSpec.Image / --job-specs[].Image)
# Labels MUST come from list-image-labels output — NEVER guess or invent label values
# NOTE: Do NOT pass --workspace-id to list-images; official images are global
aliyun aiworkspace list-images \
--region <region> \
--labels "<Key1=Value1,Key2=Value2>" \
--page-size 20 \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# RED LINE: --job-specs[].Image MUST be a verbatim ImageUri from list-images.
# NEVER invent, rewrite, or copy Name/ImageId instead of ImageUri.
# Step 4: Pick a dataset (yields DataSources[].DataSourceId)
aliyun aiworkspace list-datasets \
--region <region> --workspace-id <workspace-id> --page-size 20 \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Step 5: Pick a code source (yields CodeSource.CodeSourceId)
aliyun aiworkspace list-code-sources \
--region <region> --workspace-id <workspace-id> --page-size 20 \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
Red line (also applies in Section 7.7): Do NOT fall back to ROA generic invocations (--pathPattern / --method GET|POST|PUT|DELETE) when a plugin is missing or returns an error. Install/upgrade the plugin instead.
Field-mapping, full parameters, and error codes: see references/related-apis.md and references/verification-method.md.
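The --labels format described in Step 2 (comma-joined Key=Value pairs, no JSON, no spaces) is easy to get wrong by hand. A minimal sketch of a builder follows; join_labels is a hypothetical helper, and key1=val1 style arguments are placeholders, since real keys and values must come from list-image-labels output.

```shell
# Hypothetical helper: joins Key=Value arguments into one --labels value.
# Rejects arguments that contain spaces or are not in Key=Value form.
join_labels() {
  result=""
  for pair in "$@"; do
    case $pair in
      *' '*) echo "error: spaces are not allowed: $pair" >&2; return 1 ;;
      ?*=*) ;;
      *) echo "error: expected Key=Value: $pair" >&2; return 1 ;;
    esac
    if [ -z "$result" ]; then result=$pair; else result="$result,$pair"; fi
  done
  printf '%s\n' "$result"
}
```

For example, join_labels key1=val1 key2=val2 prints key1=val1,key2=val2, which can be passed to list-images as --labels.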
JobTemplate stores a CreateJob configuration (JobSpecs, UserCommand,
DataSources, etc.) as a versioned, reusable template. aliyun-cli-pai-dlc >= 0.3.1
exposes these subcommands:
create-job-template, get-job-template, list-job-templates,
update-job-template, set-job-template-default-version. A Job can be launched from a template via
aliyun pai-dlc create-job --template-id <id>.
Constraints format: When passing --constraints, use escaped-quote JSON: --constraints '{\"JobSpecs[0].Image\":\"locked\",\"UserCommand\":\"locked\"}'.
For full CRUD workflow, Constraints semantics, JSONPath rules, and pitfalls, see references/job-template-management.md.
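If the constraints are first written as ordinary JSON, the escaped-quote form described above can be produced mechanically. A small sketch, assuming the only transformation needed is a backslash before each double quote; escape_json_quotes is a hypothetical helper name.

```shell
# Hypothetical helper: rewrites plain JSON into the escaped-quote form shown
# above by putting a backslash in front of every double quote.
escape_json_quotes() {
  printf '%s\n' "$1" | sed 's/"/\\"/g'
}

escape_json_quotes '{"JobSpecs[0].Image":"locked","UserCommand":"locked"}'
# prints {\"JobSpecs[0].Image\":\"locked\",\"UserCommand\":\"locked\"}
```

Writing the JSON in plain form first makes it easier to validate before escaping.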
Stop is a high-risk operation. Before proceeding, query status with
get-job, present the result to the user, and require explicit confirmation.
- Relevant job statuses: Running or Queuing.
For the full pre-check + confirmation + execution templates, plus the
update-job low-risk path and get-web-terminal / get-token sharing
commands, see references/job-management.md.
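In a script, the explicit-confirmation requirement can be enforced with a small gate that runs before any stop command. The sketch below is illustrative only: confirm_stop is a hypothetical helper, it reads the answer from standard input, and it does not itself call any aliyun command.

```shell
# Hypothetical gate: shows the job id and current status, reads one line
# from stdin, and succeeds only when that line is exactly "yes".
# Arguments: <job-id> <status>
confirm_stop() {
  printf 'Stop job %s (current status: %s)? Type "yes" to confirm: ' "$1" "$2" >&2
  IFS= read -r answer || return 1
  [ "$answer" = "yes" ]
}
```

Requiring the full word rather than a single keystroke makes an accidental confirmation less likely; the status shown should come from a fresh get-job call.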
Discover available instance types before choosing EcsSpec in --job-specs.
Results from list-ecs-specs provide the exact EcsSpec value to use.
# GPU public pay-as-you-go instances
aliyun pai-dlc list-ecs-specs \
--region <region> \
--accelerator-type GPU \
--resource-type ECS \
--sort-by GPU \
--order desc \
--page-size 20 \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Lingjun dedicated instances
# Note: --quota-id is only available for whitelisted users
Copy the returned EcsSpec value verbatim into --job-specs[].EcsSpec.
For full parameters see aliyun pai-dlc list-ecs-specs --help.
TensorBoard visualizes training metrics. Seven subcommands under aliyun pai-dlc:
create-tensorboard, list-tensorboards, get-tensorboard, start-tensorboard,
stop-tensorboard, update-tensorboard, get-tensorboard-shared-url.
--job-id and --data-sources are mutually exclusive in create.
# Create from a job (most common)
aliyun pai-dlc create-tensorboard \
--region <region> \
--job-id <job-id> \
--display-name "my-training-tb" \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Create from a dataset summary path
aliyun pai-dlc create-tensorboard \
--region <region> \
--data-sources '[{"DataSourceId":"<dataset-id>","MountPath":"/mnt/logs"}]' \
--summary-path /mnt/logs \
--display-name "dataset-tb" \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
For full parameters and lifecycle, see references/related-apis.md TensorBoard section.
Both get-dashboard and get-ray-dashboard return a URL only for RayJob type
jobs. For non-Ray jobs, the response is empty.
# Generic DLC dashboard (RayJob only)
aliyun pai-dlc get-dashboard \
--region <region> \
--job-id <job-id> \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Ray-specific dashboard with optional sharing
aliyun pai-dlc get-ray-dashboard \
--region <region> \
--job-id <job-id> \
--is-shared true \
--token <sharing-token> \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
For shared access, first obtain a token via get-token --target-type job,
then pass it to get-ray-dashboard --token <token> --is-shared true.
For step-by-step end-to-end verification scripts (resource discovery → CreateJob → log query → cleanup, plus JobTemplate CRUD verification), see references/verification-method.md.
Quick verification:
- get-job → Status should be Creating / Queuing / Running shortly after create-job returns.
- list-jobs --status Running → Should return the freshly created Job until it finishes or is stopped.
- get-pod-logs → Should return non-empty log content once the Pod is past EnvPreparing.
A flat list of every CLI command used by this skill (Product / Command / Description) is in references/related-commands.md.
- Use a meaningful display name (project + model + date), e.g. resnet50-imagenet-20260320.
- Set --job-max-running-time-minutes as an auto-stop guard for any long-running experiment.
- Enable Settings.EnableSanityCheck: true to verify GPU devices before training starts.
- In a JobTemplate, mark Image / DataSources as locked and UserCommand / Envs as overridable so consumers focus on business parameters via create-job --template-id.
- Run list-ecs-specs --accelerator-type GPU before choosing EcsSpec to confirm which instance types are available in the region.

| Reference Document | Description |
|---|---|
| references/related-apis.md | Complete API and CLI command reference |
| references/related-commands.md | Flat list of all CLI commands |
| references/ram-policies.md | RAM permission policy details |
| references/verification-method.md | End-to-end verification scripts |
| references/job-management.md | High-risk Stop/Delete/Update flow + Web Terminal |
| references/job-template-management.md | JobTemplate CRUD + Constraints + version management |
| references/acceptance-criteria.md | Skill testing acceptance criteria |
| references/cli-installation-guide.md | CLI installation guide |