Alibaba Cloud Compute Provision

Alibaba Cloud Compute Provision - Automatically selects an Alibaba Cloud compute resource (ECS, FC, ACK, PAI) based on user intent, then creates instances and executes scripts. Use this skill when the user needs to run compute jobs, execute scripts, train models, or deploy containerized applications on Alibaba Cloud, or mentions keywords such as cpu_bound, gpu, vCPU, budget, training, A100, or qwen. Provides a full loop of resource selection, pricing, budget control, instance creation, and script execution.

Install

openclaw skills install alibabacloud-compute-provision

Alibaba Cloud Compute Provision

Automatically selects an Alibaba Cloud compute resource based on user intent, then creates instances and executes scripts.

How this skill works

This skill operates by writing and executing Python code that calls Alibaba Cloud APIs. The scripts/ directory contains ready-made Python modules (ECS, FC, ACK, PAI, VPC, etc.) that wrap the Alibaba Cloud OpenAPI. To accomplish any task in this skill, you write Python code snippets that import and call functions from these modules — you do NOT use CLI tools, Terraform, or the web console.

Typical workflow:

  1. Read the reference doc for the product you're about to use (see Reference Index below).
  2. Write a Python code block that imports from the skill's scripts/ modules.
  3. Execute the code to call Alibaba Cloud APIs (query instance types, check pricing, create resources, run scripts, etc.).
  4. Read the output and decide the next step.

⛔ MUST-READ RULE: Before calling ANY function from scripts/, you MUST first read its reference doc (e.g. references/ecs.md for ECS functions, references/fc.md for FC functions). The reference docs contain exact function signatures, parameter names, constraints, and usage examples. Do NOT guess parameter names — incorrect parameters waste tool calls and may create/leak cloud resources. Use the defaults when in doubt.

Prerequisites

Step 0: Environment bootstrap (MUST run first)

Before doing anything else, execute the following code block to set up the Python path and ensure all dependencies are installed. This MUST be the very first code you run in every session — do NOT skip it or defer it.

```python
import sys
sys.path.insert(0, "${SKILL_DIR}/scripts")

from bootstrap import ensure_dependencies
ensure_dependencies()
```

bootstrap.py is a standalone module with zero third-party dependencies (stdlib only), so it can always be imported even before any pip packages are installed. ensure_dependencies() automatically:

  • Checks that the Python version is >= 3.8 (exits with a clear error if not).
  • Detects missing pip packages (alibabacloud_credentials, alibabacloud_tea_openapi, darabonba-core) and installs them.

If this step fails, fix the reported issue (e.g. install a newer Python) before proceeding — all subsequent steps depend on it.
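For orientation, the behavior described above can be sketched as follows. This is an illustrative approximation, not the real scripts/bootstrap.py — the function body and the package-name handling here are assumptions (the real module also maps pip package names like darabonba-core to their import names):

```python
import importlib
import subprocess
import sys

# Sketch of what bootstrap.ensure_dependencies() roughly does (assumption;
# the authoritative implementation is scripts/bootstrap.py).
def ensure_dependencies(packages=("alibabacloud_credentials",
                                  "alibabacloud_tea_openapi")):
    if sys.version_info < (3, 8):
        sys.exit("Python >= 3.8 is required")      # clear error, then stop
    for pkg in packages:
        try:
            importlib.import_module(pkg)           # already installed?
        except ImportError:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", pkg])
```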

Credentials

Credentials are resolved via the Alibaba Cloud default credential provider chain (environment variables, ~/.alibabacloud/credentials, ~/.aliyun/config.json, ECS RAM role, etc.). Do NOT hardcode AK/SK or read them explicitly.

ALIBABA_CLOUD_REGION   # optional, defaults to cn-hangzhou
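Reading the variable with its documented default is a one-liner (the variable name and default come from the note above; how the skill's modules consume it internally is not shown here):

```python
import os

# ALIBABA_CLOUD_REGION is optional and falls back to cn-hangzhou.
region = os.environ.get("ALIBABA_CLOUD_REGION", "cn-hangzhou")
print(region)
```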

Step 1: Intent Parsing and Resource Selection

1.1 Parse user intent

Extract the following elements from the user's input:

| Element | Description | Example |
|---|---|---|
| Task type | One-shot script / long-running service / AI training | "deploy nginx" → long-running service |
| Compute requirement | CPU / GPU / memory | "8 vCPU, 16 GB" |
| Budget | Cost cap | "$50" |
| Script / intent | Explicit script or task description | "a.sh" or "deploy an nginx site" |
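The four elements can be carried through the workflow as a simple structure. This class and its field names are illustrative only — the skill does not define it:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical container for the parsed intent elements listed above.
@dataclass
class Intent:
    task_type: str                      # "one-shot" | "long-running" | "training"
    vcpus: Optional[int] = None         # compute requirement
    memory_gb: Optional[float] = None
    budget_usd: Optional[float] = None  # cost cap
    script: Optional[str] = None        # explicit script, if the user gave one

# "deploy an nginx site" parses to a long-running service with no explicit script:
nginx = Intent(task_type="long-running")
```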

1.2 Script generation (when no explicit script is provided)

When the user provides intent rather than a script (e.g. "deploy an nginx site"), generate the script automatically. Key rules:

  • Script-image coupling: package managers depend on the OS — Ubuntu uses apt-get, CentOS/Alinux uses yum. Finalize the script only after the image is decided; if the image changes later, re-check script compatibility.
  • Long-running service scripts must use background/systemd commands (e.g. systemctl start nginx), not foreground-blocking ones.
  • One-shot task scripts simply exit when finished.

1.3 Resource selection

If the user explicitly specifies a product, use that product directly and skip selection comparison.

⛔ PRODUCT-LOCK RULE: When the user explicitly specifies a product (e.g. "use ECS", "use FC"), you are locked to that product for the entire task. If you encounter errors (out of stock, quota limits, etc.), you MUST retry within the same product — try different availability zones, regions, or instance types. NEVER silently switch to a different product. If all retries within the specified product are exhausted, report the failure to the user and ask for guidance — do NOT auto-switch.

For ECS, use ecs.find_available_instance_type() to search across zones/regions for available stock and pricing, then after cost confirmation use ecs.create_instance_with_infra() to create the instance.

When unspecified, follow the decision tree in references/select-resource.md:

User specified a product?  → use it directly
Long-running service?      → ECS or ACK (FC / PAI-DLC are not suitable for long-running)
AI / ML training?          → PAI or FC (GPU) → if both viable, MUST compare in Step 1.5
K8s / containers?          → ACK
Multiple products viable?  → MUST compare in Step 1.5
Default (single match)     → ECS

⛔ ANTI-BIAS: The decision tree only narrows candidates. When 2+ products remain, you MUST proceed to Step 1.5 for real API-based comparison — never assume one is "obviously cheaper" from general knowledge.

1.4 Region selection — MANDATORY BEFORE resource creation

⛔ HARD RULE: Region selection MUST be performed explicitly as a documented step — not deferred to or assumed during resource creation. The chosen region directly affects network connectivity, package installation success, and end-to-end reliability.

Decision flow (execute in order):

  1. Detect external dependency requirements — scan the script (user-provided or agent-generated) and the task intent for signals that the workload will access overseas sources at runtime:

    • Package managers pulling from default mirrors: pip install, npm install, apt-get install, yum install, go get, cargo build, gem install, composer install
    • Downloads from GitHub, Docker Hub, PyPI, npmjs.com, Hugging Face, or other overseas hosts
    • curl / wget to non-Chinese URLs
    • Git clone from github.com / gitlab.com
  2. Apply region rule:

    | Condition | Region | Rationale |
    |---|---|---|
    | Script installs external dependencies from overseas sources (pip, npm, apt, GitHub, etc.) | Overseas region (prefer ap-southeast-1 Singapore) | Domestic regions have poor/unstable connectivity to overseas package registries, causing timeouts and failures |
    | Task deploys a website/service with no overseas dependencies | Domestic region (e.g. cn-hangzhou, cn-shanghai) | Lower latency for end users |
    | AI training downloading models/datasets from Hugging Face, GitHub, etc. | Overseas region | Model downloads from China often time out |
    | No external network access needed (pure compute, local data) | Domestic region (e.g. cn-hangzhou) | Default, lowest latency |
    | User explicitly specified a region | User's specified region | Respect user choice |

    Pitfall: deploying a website seems "domestic", but if the setup script runs npm install / pip install, the packages come from overseas — choose an overseas region. Always check the script's dependency commands, not just the service purpose.

  3. Output the chosen region and reason to the user before proceeding:

    Region: ap-southeast-1 (Singapore)
    Reason: The task requires installing packages via pip/npm from overseas sources.
            Domestic regions may cause installation timeouts.
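The decision flow above reduces to a small rule function. This is a sketch, not part of the skill's API — the signal list mirrors step 1 but is illustrative and not exhaustive, and the region names come from the rule table:

```python
import re
from typing import Optional

# Heuristic signals from step 1: commands/hosts that imply overseas downloads.
OVERSEAS_SIGNALS = [
    r"\bpip3?\s+install\b", r"\bnpm\s+install\b", r"\bapt-get\s+install\b",
    r"\byum\s+install\b", r"\bgo\s+get\b", r"\bcargo\s+build\b",
    r"github\.com", r"gitlab\.com", r"pypi\.org",
    r"registry\.npmjs\.org", r"huggingface\.co", r"hub\.docker\.com",
]

def pick_region(script: str, user_region: Optional[str] = None) -> str:
    """Apply the Step 1.4 region rule (illustrative sketch)."""
    if user_region:                    # rule: respect explicit user choice
        return user_region
    if any(re.search(p, script) for p in OVERSEAS_SIGNALS):
        return "ap-southeast-1"        # overseas dependencies -> overseas region
    return "cn-hangzhou"               # domestic default

print(pick_region("apt-get install -y nginx && pip install flask"))
```

This also encodes the pitfall above: a "domestic" website deploy still routes overseas if its setup script runs pip/npm installs.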
    

1.5 Multi-option parallel comparison — MANDATORY SUB-AGENT DISPATCH

⛔ HARD RULE: When the user has NOT explicitly specified a product AND the decision tree yields more than one candidate, you MUST launch parallel sub-agents — one per candidate product. It is strictly forbidden to compare in the main thread using documentation knowledge or heuristics alone.

Dispatch rules:

  1. One sub-agent per candidate product — launch them in parallel (e.g. one for ECS, one for FC).
  2. Each sub-agent MUST call real APIs — instance-type queries (DescribeInstanceTypes), inventory checks (DescribeAvailableResource), and pricing queries (DescribePrice or product-specific formulas). Memorized prices are NOT acceptable.
  3. Return structured results — format defined in the "Sub-agent task template" in references/select-resource.md.
  4. Main agent aggregates and presents — build the comparison table (template in references/select-resource.md), recommend the best option, and wait for user confirmation.

Comparison dimensions (all required): end-to-end time, estimated cost (from API), complexity, resource cleanup.

When uncertain about API usage, search the docs with scripts/doc_search.py:

```python
from doc_search import search_and_format
print(search_and_format("DescribeInstanceTypes", product="ecs"))
```

Step 2: Create Compute Resources

After selecting a product, read its reference doc (linked below) for full API usage — especially function signatures and parameter constraints — then create resources. Use the region from Step 1.4; if Step 1.4 is not yet done, go back and complete it first.

| Product | Reference | Workflow summary |
|---|---|---|
| ECS | references/ecs.md | find_available_instance_type() → cost confirmation → create_instance_with_infra() (VPC/SG/image handled internally) |
| FC | references/fc.md | choose spec → cost confirmation → create function → invoke function |
| ACK | references/ack.md | choose node spec → cost confirmation → VPC/SG → create cluster → submit K8s Job |
| PAI | references/pai.md | list_ecs_specs → choose CPU/GPU → cost confirmation → create_training_job |

Network preparation (ACK only; ECS is handled by create_instance_with_infra): see references/vpc.md

MANDATORY RULE: Cost confirmation

⛔ HARD BLOCK: Before calling ANY resource-creation API (RunInstances, CreateFunction, CreateCluster, CreateTrainingJob), you MUST estimate cost and get user confirmation. The agent may NOT self-approve — regardless of how low the cost is.

Flow:

  1. Estimate cost — use the product's pricing API or formula (see each product's reference doc).
  2. Output the cost estimate using the template below — do not omit or summarize it.
  3. Wait for user confirmation — stop and do nothing further until the user replies affirmatively (e.g. "yes", "ok", "confirm"). Silence or implied consent does NOT count.
  4. Proceed only after receiving confirmation.
  5. If over budget — recommend a cheaper alternative, re-estimate, and repeat from step 2.

Skip-confirmation exception: if the user has explicitly stated in the current conversation that no confirmation is needed (e.g. "run it directly, no need to confirm", "skip confirmation", "just do it, no need to ask"), then still output the cost estimate (step 2) for the record, but proceed immediately without waiting — skip steps 3-4.

Cost display template:

Cost estimate:
  Spec:        ecs.t6-c1m2.large (2 vCPU, 4 GB)
  Unit price:  CNY 0.017 / hour
  Duration:    ~5 minutes
  Total:       CNY 0.002
  Billing:     PostPaid (pay-as-you-go)

Proceed with creation?

Exchange-rate reference: $1 ≈ CNY 7.2
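The arithmetic behind the template is just duration × unit price. A minimal sketch — the function names and 4-decimal rounding are assumptions, and real unit prices must come from the pricing APIs (e.g. DescribePrice), never from memory:

```python
# Cost-estimate arithmetic for the template above. Unit prices are inputs
# obtained from the product's pricing API, not hardcoded knowledge.
def estimate_cny(unit_price_per_hour: float, minutes: float) -> float:
    return round(unit_price_per_hour * minutes / 60, 4)

def cny_to_usd(cny: float, rate: float = 7.2) -> float:
    # Exchange-rate reference from above: $1 ≈ CNY 7.2
    return round(cny / rate, 4)

print(estimate_cny(0.017, 5))   # ecs.t6-c1m2.large for ~5 minutes
```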

Step 3: Execute the Script

MANDATORY PRE-EXECUTION CHECK: Script & Resource Validation

⛔ HARD BLOCK: Before executing any script, the following validation steps are required and non-skippable. If validation fails, you MUST stop the flow and report the error to the user. It is strictly forbidden to generate a placeholder/stub script, fabricate execution output, or silently proceed when a required file is missing.

Validation flow (apply before every execution):

  1. Determine script source type:

    • (A) User-provided script path — the user referenced a specific file (e.g. /home/user/train.py, ./scripts/run.sh).
    • (B) User-provided script content — the user pasted the script inline or its content is already in the conversation.
    • (C) Agent-generated script — no explicit script was provided; the agent generated one from intent (per Step 1.2). In this case, the agent already holds the full content — skip to step 3.
  2. For type (A) — verify file existence and content:

    • Local path: use Read tool or ls / cat to confirm the file exists at the given path and is non-empty. If the file is on a remote instance (ECS), run the check via Cloud Assistant (test -f <path> && wc -l <path>).
    • If the file does NOT exist or is empty: immediately stop and report to the user:
      ❌ Script not found: <path>
      The specified script file does not exist or is empty. Please verify the path and try again.
      
      Do NOT create a replacement script, guess the content, or continue execution.
    • If the file exists: read its content to confirm it is a valid, complete script (not a stub or template with only comments/placeholders).
  3. Content completeness check (for all source types):

    • The script must contain actual executable logic — not just comments, empty functions, or pass/TODO placeholders.
    • For training scripts (PAI / GPU tasks): verify the script references the expected framework entry points (e.g. model.fit(), trainer.train(), torch.distributed.launch).
    • If the content appears incomplete, ask the user for clarification before proceeding.
  4. Dependency & environment pre-check (best effort):

    • If the script imports packages or references external data paths, note them so the execution environment can be prepared accordingly (e.g. pip install in the startup command, mount data volumes).

Rationale: creating compute resources costs money. Running a missing or placeholder script wastes that cost and misleads the user into thinking the task succeeded.
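The content-completeness check (step 3) can be sketched as a simple filter. The function and its placeholder patterns are illustrative, not part of the skill:

```python
import re

# Step 3 sketch: a script must contain at least one executable line that is
# not a comment or a pass/TODO placeholder. Patterns are illustrative.
PLACEHOLDERS = [r"^\s*pass\s*$", r"\bTODO\b", r"\bFIXME\b"]

def is_complete_script(content: str) -> bool:
    executable = [
        line for line in content.splitlines()
        if line.strip()
        and not line.strip().startswith("#")
        and not any(re.search(p, line) for p in PLACEHOLDERS)
    ]
    return bool(executable)

print(is_complete_script("# setup\napt-get install -y nginx"))  # → True
print(is_complete_script("# TODO: write the real script"))      # → False
```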

Execution methods

| Product | Task type | Call |
|---|---|---|
| ECS | One-shot (run and release) | ecs.run_command_and_cleanup(instance_id, script, infra=infra) |
| ECS | Long-running (keep alive) | ecs.run_command_and_wait(instance_id, script) |
| FC | One-shot | fc.create_and_invoke(script_path=path) or fc.create_and_invoke(script_content=code, script_type="shell") |
| ACK | K8s Job | ack.run_script_as_job(cluster_id, script) |
| PAI | Training job | script is set at create_training_job time |

⛔ ECS cleanup rule: For one-shot tasks, you MUST use run_command_and_cleanup() with the infra parameter (from create_instance_with_infra()). This releases the instance + security group, and only deletes VSwitch/VPC if they were freshly created (shared resources are preserved). Forgetting to release ECS instances causes ongoing charges.

Use run_command_and_wait() (without cleanup) only when the user explicitly needs the instance to stay running (e.g. "deploy a website", "keep the service online").

Error Handling

The whole flow uses retry-with-adjustment:

| Error | Strategy |
|---|---|
| Out of stock | Try in order: switch availability zone → switch region → downgrade instance type. For ECS, use find_available_instance_type(regions=[...]), which searches across regions automatically. NEVER switch to a different product. |
| Quota exceeded | Prompt the user to raise the quota |
| Over budget | Downgrade the spec or shrink the scale |
| Script execution failed | Analyze the error, adjust environment/dependencies, then retry |
| Unknown error | Search the docs with doc_search.search(error_message, product) |

Keep adjusting and retrying until the instance is created and the script is running.
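The out-of-stock row reduces to a generic retry loop over an ordered candidate list (zone → region → cheaper type) within a single product. The exception class and the fake creator below are illustrative stand-ins for real ECS inventory errors:

```python
# Generic shape of the retry-with-adjustment loop. Candidates are tried in
# order; the product itself is never changed, matching the product-lock rule.
class OutOfStock(Exception):
    pass

def create_with_fallback(candidates, create):
    errors = []
    for region, zone, itype in candidates:
        try:
            return create(region, zone, itype)
        except OutOfStock as exc:          # record and try the next candidate
            errors.append((region, zone, itype, str(exc)))
    raise RuntimeError(f"all candidates exhausted: {errors}")

def fake_create(region, zone, itype):      # stand-in: stock only in zone b
    if zone != "cn-hangzhou-b":
        raise OutOfStock(zone)
    return f"{itype} in {zone}"

print(create_with_fallback(
    [("cn-hangzhou", "cn-hangzhou-a", "ecs.g6.large"),
     ("cn-hangzhou", "cn-hangzhou-b", "ecs.g6.large")],
    fake_create,
))
```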

Reference Index

| Document | Content |
|---|---|
| references/select-resource.md | Comparison of the four products and the selection decision tree |
| references/vpc.md | VPC / VSwitch API quick reference |
| references/ecs.md | Full ECS API quick reference (specs / inventory / pricing / creation / execution) |
| references/fc.md | FC API quick reference + script-packaging method |
| references/ack.md | ACK cluster API quick reference + K8s Job execution |
| references/pai.md | PAI-DLC training-job API quick reference + GPU spec table |
| references/ram-policies.md | RAM least-privilege permission list and policy JSON |