Alibaba Cloud Compute Provision

Alibaba Cloud Compute Provision - Automatically selects an Alibaba Cloud compute resource (ECS, FC, ACK, PAI) based on user intent, then creates instances and executes scripts. Use this skill when the user needs to run compute jobs, execute scripts, train models, or deploy containerized applications on Alibaba Cloud, or mentions keywords such as cpu_bound, gpu, vCPU, budget, training, A100, or qwen. Provides a full loop of resource selection, pricing, budget control, instance creation, and script execution.

Install

openclaw skills install alibabacloud-compute-provision

Alibaba Cloud Compute Provision

Automatically selects an Alibaba Cloud compute resource based on user intent, then creates instances and executes scripts.

How this skill works

This skill operates by writing and executing Python code that calls Alibaba Cloud APIs. The scripts/ directory contains ready-made Python modules (ECS, FC, ACK, PAI, VPC, etc.) that wrap the Alibaba Cloud OpenAPI. To accomplish any task in this skill, you write Python code snippets that import and call functions from these modules — you do NOT use CLI tools, Terraform, or the web console.

Typical workflow:

  1. Read the reference doc for the product you're about to use (see Reference Index below).
  2. Write a Python code block that imports from the skill's scripts/ modules.
  3. Execute the code to call Alibaba Cloud APIs (query instance types, check pricing, create resources, run scripts, etc.).
  4. Read the output and decide the next step.

⛔ MUST-READ RULE: Before calling ANY function from scripts/, you MUST first read its reference doc (e.g. references/ecs.md for ECS functions, references/fc.md for FC functions). The reference docs contain exact function signatures, parameter names, constraints, and usage examples. Do NOT guess parameter names — incorrect parameters waste tool calls and may create/leak cloud resources. Use the defaults when in doubt.

Prerequisites

Step 0: Environment bootstrap (MUST run first)

Before doing anything else, execute the following code block to set up the Python path and ensure all dependencies are installed. This MUST be the very first code you run in every session — do NOT skip it or defer it.

```python
import sys
sys.path.insert(0, "${SKILL_DIR}/scripts")

from bootstrap import ensure_dependencies
ensure_dependencies()
```

bootstrap.py is a standalone module with zero third-party dependencies (stdlib only), so it can always be imported even before any pip packages are installed. ensure_dependencies() automatically:

  • Checks that the Python version is >= 3.8 (exits with a clear error if not).
  • Detects missing pip packages (alibabacloud_credentials, alibabacloud_tea_openapi, darabonba-core) and installs them.

If this step fails, fix the reported issue (e.g. install a newer Python) before proceeding — all subsequent steps depend on it.
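For orientation, the behavior described above can be sketched as follows. This is an illustrative approximation, not the real scripts/bootstrap.py — the function body and the package-name handling here are assumptions (the real module also maps pip package names like darabonba-core to their import names):

```python
import importlib
import subprocess
import sys

# Sketch of what bootstrap.ensure_dependencies() roughly does (assumption;
# the authoritative implementation is scripts/bootstrap.py).
def ensure_dependencies(packages=("alibabacloud_credentials",
                                  "alibabacloud_tea_openapi")):
    if sys.version_info < (3, 8):
        sys.exit("Python >= 3.8 is required")      # clear error, then stop
    for pkg in packages:
        try:
            importlib.import_module(pkg)           # already installed?
        except ImportError:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", pkg])
```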

Credentials

Credentials are resolved via the Alibaba Cloud default credential provider chain (environment variables, ~/.alibabacloud/credentials, ~/.aliyun/config.json, ECS RAM role, etc.). Do NOT hardcode AK/SK or read them explicitly.

ALIBABA_CLOUD_REGION   # optional, defaults to cn-hangzhou
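Reading the variable with its documented default is a one-liner (the variable name and default come from the note above; how the skill's modules consume it internally is not shown here):

```python
import os

# ALIBABA_CLOUD_REGION is optional and falls back to cn-hangzhou.
region = os.environ.get("ALIBABA_CLOUD_REGION", "cn-hangzhou")
print(region)
```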

Step 1: Intent Parsing and Resource Selection

1.1 Parse user intent

Extract the following elements from the user's input:

| Element | Description | Example |
|---|---|---|
| Task type | One-shot script / long-running service / AI training | "deploy nginx" → long-running service |
| Compute requirement | CPU / GPU / memory | "8 vCPU, 16 GB" |
| Budget | Cost cap | "$50" |
| Script / intent | Explicit script or task description | "a.sh" or "deploy an nginx site" |
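The four elements can be carried through the workflow as a simple structure. This class and its field names are illustrative only — the skill does not define it:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical container for the parsed intent elements listed above.
@dataclass
class Intent:
    task_type: str                      # "one-shot" | "long-running" | "training"
    vcpus: Optional[int] = None         # compute requirement
    memory_gb: Optional[float] = None
    budget_usd: Optional[float] = None  # cost cap
    script: Optional[str] = None        # explicit script, if the user gave one

# "deploy an nginx site" parses to a long-running service with no explicit script:
nginx = Intent(task_type="long-running")
```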

1.2 Script generation (when no explicit script is provided)

When the user provides intent rather than a script (e.g. "deploy an nginx site"), generate the script automatically. Key rules:

  • Script-image coupling: package managers depend on the OS — Ubuntu uses apt-get, CentOS/Alinux uses yum. Finalize the script only after the image is decided; if the image changes later, re-check script compatibility.
  • Long-running service scripts must use background/systemd commands (e.g. systemctl start nginx), not foreground-blocking ones.
  • One-shot task scripts simply exit when finished.

1.3 Resource selection

If the user explicitly specifies a product, use that product directly and skip selection comparison.

⛔ PRODUCT-LOCK RULE: When the user explicitly specifies a product (e.g. "use ECS", "use FC"), you are locked to that product for the entire task. If you encounter errors (out of stock, quota limits, etc.), you MUST retry within the same product — try different availability zones, regions, or instance types. NEVER silently switch to a different product. If all retries within the specified product are exhausted, report the failure to the user and ask for guidance — do NOT auto-switch.

For ECS, use ecs.find_available_instance_type() to search across zones/regions for available stock and pricing, then after cost confirmation use ecs.create_instance_with_infra() to create the instance.

When unspecified, follow the decision tree in references/select-resource.md:

User specified a product?  → use it directly
Long-running service?      → ECS or ACK (FC / PAI-DLC are not suitable for long-running)
AI / ML training?          → PAI or FC (GPU) → if both viable, MUST compare in Step 1.5
K8s / containers?          → ACK
Multiple products viable?  → MUST compare in Step 1.5
Default (single match)     → ECS

⛔ ANTI-BIAS: The decision tree only narrows candidates. When 2+ products remain, you MUST proceed to Step 1.5 for real API-based comparison — never assume one is "obviously cheaper" from general knowledge.

1.4 Region selection — MANDATORY BEFORE resource creation

⛔ HARD RULE: Region selection MUST be performed explicitly as a documented step — not deferred to or assumed during resource creation. The chosen region directly affects network connectivity, package installation success, and end-to-end reliability.

Decision flow (execute in order):

  1. Detect external dependency requirements — scan the script (user-provided or agent-generated) and the task intent for signals that the workload will access overseas sources at runtime:

    • Package managers pulling from default mirrors: pip install, npm install, apt-get install, yum install, go get, cargo build, gem install, composer install
    • Downloads from GitHub, Docker Hub, PyPI, npmjs.com, Hugging Face, or other overseas hosts
    • curl / wget to non-Chinese URLs
    • Git clone from github.com / gitlab.com
  2. Apply region rule:

    | Condition | Region | Rationale |
    |---|---|---|
    | Script installs external dependencies from overseas sources (pip, npm, apt, GitHub, etc.) | Overseas region (prefer ap-southeast-1 Singapore) | Domestic regions have poor/unstable connectivity to overseas package registries, causing timeouts and failures |
    | Task deploys a website/service with no overseas dependencies | Domestic region (e.g. cn-hangzhou, cn-shanghai) | Lower latency for end users |
    | AI training downloading models/datasets from Hugging Face, GitHub, etc. | Overseas region | Model downloads from China often time out |
    | No external network access needed (pure compute, local data) | Domestic region (e.g. cn-hangzhou) | Default, lowest latency |
    | User explicitly specified a region | User's specified region | Respect user choice |

    Pitfall: deploying a website seems "domestic", but if the setup script runs npm install / pip install, the packages come from overseas — choose an overseas region. Always check the script's dependency commands, not just the service purpose.

  3. Output the chosen region and reason to the user before proceeding:

    Region: ap-southeast-1 (Singapore)
    Reason: The task requires installing packages via pip/npm from overseas sources.
            Domestic regions may cause installation timeouts.
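The decision flow above reduces to a small rule function. This is a sketch, not part of the skill's API — the signal list mirrors step 1 but is illustrative and not exhaustive, and the region names come from the rule table:

```python
import re
from typing import Optional

# Heuristic signals from step 1: commands/hosts that imply overseas downloads.
OVERSEAS_SIGNALS = [
    r"\bpip3?\s+install\b", r"\bnpm\s+install\b", r"\bapt-get\s+install\b",
    r"\byum\s+install\b", r"\bgo\s+get\b", r"\bcargo\s+build\b",
    r"github\.com", r"gitlab\.com", r"pypi\.org",
    r"registry\.npmjs\.org", r"huggingface\.co", r"hub\.docker\.com",
]

def pick_region(script: str, user_region: Optional[str] = None) -> str:
    """Apply the Step 1.4 region rule (illustrative sketch)."""
    if user_region:                    # rule: respect explicit user choice
        return user_region
    if any(re.search(p, script) for p in OVERSEAS_SIGNALS):
        return "ap-southeast-1"        # overseas dependencies -> overseas region
    return "cn-hangzhou"               # domestic default

print(pick_region("apt-get install -y nginx && pip install flask"))
```

This also encodes the pitfall above: a "domestic" website deploy still routes overseas if its setup script runs pip/npm installs.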
    

1.5 Multi-option parallel comparison — MANDATORY SUB-AGENT DISPATCH

⛔ HARD RULE: When the user has NOT explicitly specified a product AND the decision tree yields more than one candidate, you MUST launch parallel sub-agents — one per candidate product. It is strictly forbidden to compare in the main thread using documentation knowledge or heuristics alone.

Dispatch rules:

  1. One sub-agent per candidate product — launch them in parallel (e.g. one for ECS, one for FC).
  2. Each sub-agent MUST call real APIs — instance-type queries (DescribeInstanceTypes), inventory checks (DescribeAvailableResource), and pricing queries (DescribePrice or product-specific formulas). Memorized prices are NOT acceptable.
  3. Return structured results — format defined in the "Sub-agent task template" in references/select-resource.md.
  4. Main agent aggregates and presents — build the comparison table (template in references/select-resource.md), recommend the best option, and wait for user confirmation.

Comparison dimensions (all required): end-to-end time, estimated cost (from API), complexity, resource cleanup.

When uncertain about API usage, search the docs with scripts/doc_search.py:

```python
from doc_search import search_and_format
print(search_and_format("DescribeInstanceTypes", product="ecs"))
```

Step 2: Create Compute Resources

After selecting a product, read its reference doc (linked below) for full API usage — especially function signatures and parameter constraints — then create resources. Use the region from Step 1.4; if Step 1.4 is not yet done, go back and complete it first.

| Product | Reference | Workflow summary |
|---|---|---|
| ECS | references/ecs.md | find_available_instance_type() → cost confirmation → create_instance_with_infra() (VPC/SG/image handled internally) |
| FC | references/fc.md | choose spec → cost confirmation → create function → invoke function |
| ACK | references/ack.md | choose node spec → cost confirmation → VPC/SG → create cluster → submit K8s Job |
| PAI | references/pai.md | list_ecs_specs → choose CPU/GPU → cost confirmation → create_training_job |

Network preparation (ACK only; ECS is handled by create_instance_with_infra): see references/vpc.md

MANDATORY RULE: Cost confirmation

⛔ HARD BLOCK: Before calling ANY resource-creation API (RunInstances, CreateFunction, CreateCluster, CreateTrainingJob), you MUST estimate cost and get user confirmation. The agent may NOT self-approve — regardless of how low the cost is.

Flow:

  1. Estimate cost — use the product's pricing API or formula (see each product's reference doc).
  2. Output the cost estimate using the template below — do not omit or summarize it.
  3. Wait for user confirmation — stop and do nothing further until the user replies affirmatively (e.g. "yes", "ok", "confirm"). Silence or implied consent does NOT count.
  4. Proceed only after receiving confirmation.
  5. If over budget — recommend a cheaper alternative, re-estimate, and repeat from step 2.

Skip-confirmation exception: if the user has explicitly stated in the current conversation that no confirmation is needed (e.g. "run it directly, no need to confirm", "skip confirmation", "just do it, no need to ask"), then still output the cost estimate (step 2) for the record, but proceed immediately without waiting — skip steps 3-4.

Cost display template:

Cost estimate:
  Spec:        ecs.t6-c1m2.large (2 vCPU, 4 GB)
  Unit price:  CNY 0.017 / hour
  Duration:    ~5 minutes
  Total:       CNY 0.002
  Billing:     PostPaid (pay-as-you-go)

Proceed with creation?

Exchange-rate reference: $1 ≈ CNY 7.2
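The arithmetic behind the template is just duration × unit price. A minimal sketch — the function names and 4-decimal rounding are assumptions, and real unit prices must come from the pricing APIs (e.g. DescribePrice), never from memory:

```python
# Cost-estimate arithmetic for the template above. Unit prices are inputs
# obtained from the product's pricing API, not hardcoded knowledge.
def estimate_cny(unit_price_per_hour: float, minutes: float) -> float:
    return round(unit_price_per_hour * minutes / 60, 4)

def cny_to_usd(cny: float, rate: float = 7.2) -> float:
    # Exchange-rate reference from above: $1 ≈ CNY 7.2
    return round(cny / rate, 4)

print(estimate_cny(0.017, 5))   # ecs.t6-c1m2.large for ~5 minutes
```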

Step 3: Execute the Script

MANDATORY PRE-EXECUTION CHECK: Script & Resource Validation

⛔ HARD BLOCK: Before executing any script, the following validation steps are required and non-skippable. If validation fails, you MUST stop the flow and report the error to the user. It is strictly forbidden to generate a placeholder/stub script, fabricate execution output, or silently proceed when a required file is missing.

Validation flow (apply before every execution):

  1. Determine script source type:

    • (A) User-provided script path — the user referenced a specific file (e.g. /home/user/train.py, ./scripts/run.sh).
    • (B) User-provided script content — the user pasted the script inline or its content is already in the conversation.
    • (C) Agent-generated script — no explicit script was provided; the agent generated one from intent (per Step 1.2). In this case, the agent already holds the full content — skip to step 3.
  2. For type (A) — verify file existence and content:

    • Local path: use Read tool or ls / cat to confirm the file exists at the given path and is non-empty. If the file is on a remote instance (ECS), run the check via Cloud Assistant (test -f <path> && wc -l <path>).
    • If the file does NOT exist or is empty: immediately stop and report to the user:
      ❌ Script not found: <path>
      The specified script file does not exist or is empty. Please verify the path and try again.
      
      Do NOT create a replacement script, guess the content, or continue execution.
    • If the file exists: read its content to confirm it is a valid, complete script (not a stub or template with only comments/placeholders).
  3. Content completeness check (for all source types):

    • The script must contain actual executable logic — not just comments, empty functions, or pass/TODO placeholders.
    • For training scripts (PAI / GPU tasks): verify the script references the expected framework entry points (e.g. model.fit(), trainer.train(), torch.distributed.launch).
    • If the content appears incomplete, ask the user for clarification before proceeding.
  4. Dependency & environment pre-check (best effort):

    • If the script imports packages or references external data paths, note them so the execution environment can be prepared accordingly (e.g. pip install in the startup command, mount data volumes).

Rationale: creating compute resources costs money. Running a missing or placeholder script wastes that cost and misleads the user into thinking the task succeeded.
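The content-completeness check (step 3) can be sketched as a simple filter. The function and its placeholder patterns are illustrative, not part of the skill:

```python
import re

# Step 3 sketch: a script must contain at least one executable line that is
# not a comment or a pass/TODO placeholder. Patterns are illustrative.
PLACEHOLDERS = [r"^\s*pass\s*$", r"\bTODO\b", r"\bFIXME\b"]

def is_complete_script(content: str) -> bool:
    executable = [
        line for line in content.splitlines()
        if line.strip()
        and not line.strip().startswith("#")
        and not any(re.search(p, line) for p in PLACEHOLDERS)
    ]
    return bool(executable)

print(is_complete_script("# setup\napt-get install -y nginx"))  # → True
print(is_complete_script("# TODO: write the real script"))      # → False
```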

Execution methods

| Product | Task type | Call |
|---|---|---|
| ECS | One-shot (run and release) | ecs.run_command_and_cleanup(instance_id, script, infra=infra) |
| ECS | Long-running (keep alive) | ecs.run_command_and_wait(instance_id, script) |
| FC | One-shot | fc.create_and_invoke(script_path=path) or fc.create_and_invoke(script_content=code, script_type="shell") |
| ACK | K8s Job | ack.run_script_as_job(cluster_id, script) |
| PAI | Training job | script is set at create_training_job time |

⛔ ECS cleanup rule: For one-shot tasks, you MUST use run_command_and_cleanup() with the infra parameter (from create_instance_with_infra()). This releases the instance + security group, and only deletes VSwitch/VPC if they were freshly created (shared resources are preserved). Forgetting to release ECS instances causes ongoing charges.

Use run_command_and_wait() (without cleanup) only when the user explicitly needs the instance to stay running (e.g. "deploy a website", "keep the service online").

Error Handling

The whole flow uses retry-with-adjustment:

| Error | Strategy |
|---|---|
| Out of stock | Try in order: switch availability zone → switch region → downgrade instance type. For ECS, use find_available_instance_type(regions=[...]), which searches across regions automatically. NEVER switch to a different product. |
| Quota exceeded | Prompt the user to raise the quota |
| Over budget | Downgrade the spec or shrink the scale |
| Script execution failed | Analyze the error, adjust environment/dependencies, then retry |
| Unknown error | Search the docs with doc_search.search(error_message, product) |

Keep adjusting and retrying until the instance is created and the script is running.
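The out-of-stock row reduces to a generic retry loop over an ordered candidate list (zone → region → cheaper type) within a single product. The exception class and the fake creator below are illustrative stand-ins for real ECS inventory errors:

```python
# Generic shape of the retry-with-adjustment loop. Candidates are tried in
# order; the product itself is never changed, matching the product-lock rule.
class OutOfStock(Exception):
    pass

def create_with_fallback(candidates, create):
    errors = []
    for region, zone, itype in candidates:
        try:
            return create(region, zone, itype)
        except OutOfStock as exc:          # record and try the next candidate
            errors.append((region, zone, itype, str(exc)))
    raise RuntimeError(f"all candidates exhausted: {errors}")

def fake_create(region, zone, itype):      # stand-in: stock only in zone b
    if zone != "cn-hangzhou-b":
        raise OutOfStock(zone)
    return f"{itype} in {zone}"

print(create_with_fallback(
    [("cn-hangzhou", "cn-hangzhou-a", "ecs.g6.large"),
     ("cn-hangzhou", "cn-hangzhou-b", "ecs.g6.large")],
    fake_create,
))
```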

Reference Index

| Document | Content |
|---|---|
| references/select-resource.md | Comparison of the four products and the selection decision tree |
| references/vpc.md | VPC / VSwitch API quick reference |
| references/ecs.md | Full ECS API quick reference (specs / inventory / pricing / creation / execution) |
| references/fc.md | FC API quick reference + script-packaging method |
| references/ack.md | ACK cluster API quick reference + K8s Job execution |
| references/pai.md | PAI-DLC training-job API quick reference + GPU spec table |
| references/ram-policies.md | RAM least-privilege permission list and policy JSON |