Install
`openclaw skills install alibabacloud-ecs-health-inspection`
ECS Health Inspection - Perform comprehensive health inspection on Alibaba Cloud ECS instances. Checks CPU usage, system load, memory usage, disk IO (BPS and IOPS), network traffic, disk capacity, and GPU metrics (when applicable). Generates a structured HTML inspection report with findings and recommendations. Use when users request ECS health inspection, instance health check, performance inspection, resource usage report, or system health status of an ECS instance.
Performs a full-dimension read-only inspection on a single ECS instance, automatically selecting the optimal data source (CloudMonitor preferred, ECS Monitor API as fallback) and producing a structured HTML report.
ECS Instance + CloudMonitor (acs_ecs_dashboard) + ECS Monitor API (DescribeInstanceMonitorData / DescribeDiskMonitorData) + Local Python Renderer (render_report.py)
Read-only path. Zero resource creation or mutation.
Run `aliyun version` to verify >= 3.3.3. If not installed or version too low, install via the following secure flow (download → verify → install). Do NOT use `curl ... | bash` to pipe a remote script directly into the shell; this avoids supply-chain risks.

Step 1: Download the installer and the setup script:

    # Choose the tarball that matches your architecture (amd64 / arm64) and OS
    curl -fsSL -o aliyun-cli.tgz https://aliyuncli.alicdn.com/aliyun-cli-macosx-latest-arm64.tgz
    curl -fsSL -o setup.sh https://aliyuncli.alicdn.com/setup.sh

Step 2: Inspect the script manually before executing:

    less setup.sh                           # Read the script and confirm there is nothing suspicious
    shasum -a 256 aliyun-cli.tgz setup.sh   # Record hashes; cross-check with the official channel
    bash ./setup.sh                         # Execute only after the review passes
    aliyun version                          # Verify >= 3.3.3

Additional installation methods: https://help.aliyun.com/zh/cli/install-cli-on-macos-or-linux
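The `>= 3.3.3` check is easy to get wrong with plain string comparison (`"3.10.0" < "3.3.3"` as strings). A minimal sketch of a numeric comparison follows; it assumes the output of `aliyun version` contains a dotted `major.minor.patch` number, and the helper names are hypothetical.

```python
import re
from typing import Tuple

# Minimum CLI version required by this skill.
MIN_CLI_VERSION = (3, 3, 3)


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first dotted numeric version (e.g. '3.3.3') from CLI output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def cli_is_recent_enough(version_output: str) -> bool:
    """Compare numerically, component by component, against MIN_CLI_VERSION."""
    return parse_version(version_output) >= MIN_CLI_VERSION
```

Pass the captured output of `aliyun version` to `cli_is_recent_enough`; a `False` result means the secure install flow above should be followed.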
[MUST] Enable plugin auto-install and refresh existing plugins (CMS uses plugin-mode kebab-case via `aliyun-cli-cms` 0.3.0+):

    aliyun configure set --auto-plugin-install true
    aliyun plugin update

Verify the cms plugin is active (optional): `aliyun cms --help | head -3` should print `Note: The help information for product 'cms' is provided by the installed plugin 'aliyun-cli-cms'.`
This skill ships one script, `scripts/render_report.py`, and explicitly declares its dependencies in two synchronized locations:
- `scripts/requirements.txt`
- `scripts/render_report.py` (inline `Dependencies` block).

| Layer | Requirement | Notes |
|---|---|---|
| Runtime | Python >= 3.8 | Required |
| Third-party packages | (none) | The script is intentionally stdlib-only; no pip install needed |
| Standard library | argparse, html, json, sys, typing, __future__ | All shipped with CPython |
Verify the runtime: python3 --version.
Install (a no-op today, kept for future-proofing CI/containers):
python3 -m pip install -r scripts/requirements.txt
Upgrade policy: introducing any third-party dependency (e.g. Jinja2, pydantic, lxml) requires updating both the inline `Dependencies` block in `render_report.py` and `scripts/requirements.txt`, plus a one-line note in this section.
This skill does not require any extra environment variables. Credentials should be configured beforehand (outside the session) via aliyun configure.
| Variable | Required | Description |
|---|---|---|
| ALIBABA_CLOUD_PROFILE | Optional | Select a specific aliyun profile |
| ALIBABA_CLOUD_REGION_ID | Optional | Default region (commands still need an explicit --region) |
Pre-check: Alibaba Cloud Credentials Required
Security Rules:
- NEVER read, echo, or print AK/SK values (e.g., `echo $ALIBABA_CLOUD_ACCESS_KEY_ID` is FORBIDDEN)
- NEVER ask the user to input AK/SK directly in the conversation or command line
- NEVER use `aliyun configure set` with literal credential values
- ONLY use `aliyun configure list` to check credential status

    aliyun configure list

Check the output for a valid profile (AK, STS, or OAuth identity).
If no valid profile exists, STOP here.
- Obtain credentials from Alibaba Cloud Console
- Configure credentials outside of this session (via `aliyun configure` in terminal or environment variables in shell profile)
- Return and re-run after `aliyun configure list` shows a valid profile
Full permission list and a custom policy example: references/ram-policies.md.
[MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:
- Read `references/ram-policies.md` to get the full list of permissions required by this SKILL
- Use the `ram-permission-diagnose` skill to guide the user through requesting the necessary permissions
- Pause and wait until the user confirms that the required permissions have been granted
Minimum-permission summary (all read-only): ecs:DescribeInstances, ecs:DescribeInstanceMonitorData, ecs:DescribeDiskMonitorData, ecs:DescribeDisks, cms:DescribeMonitoringAgentStatuses, cms:DescribeMetricLast.
IMPORTANT: Parameter Confirmation — Before executing any command or API call, ALL user-customizable parameters (e.g., RegionId, instance names, CIDR blocks, passwords, domain names, resource specifications, etc.) MUST be confirmed with the user. Do NOT assume or use default values without explicit user approval.
| Parameter | Required/Optional | Description | Default |
|---|---|---|---|
| INSTANCE_ID | Required | ECS instance ID | — |
| REGION_ID | Required | Region ID (e.g. cn-hangzhou) | — |
| TIME_RANGE | Optional | Data query window (minutes) | 15 |
If any required parameter is missing, ask the user first — never guess.
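As an illustration of how `TIME_RANGE` turns into a query window, here is a minimal sketch that derives start and end timestamps. It assumes the monitoring APIs accept UTC timestamps in `yyyy-MM-ddTHH:mm:ssZ` form (check references/inspection-commands.md for the format actually used); `build_window` is a hypothetical helper, and the default of 15 minutes still has to be confirmed with the user before use.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Assumed timestamp format (UTC, second precision).
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_window(time_range_minutes: int = 15,
                 now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (start, end) strings covering the last `time_range_minutes`."""
    if time_range_minutes <= 0:
        raise ValueError("TIME_RANGE must be a positive number of minutes")
    end = now if now is not None else datetime.now(timezone.utc)
    # Normalize to UTC and drop sub-second precision.
    end = end.astimezone(timezone.utc).replace(microsecond=0)
    start = end - timedelta(minutes=time_range_minutes)
    return start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)
```

The two returned strings correspond to the start and end of the data query window described in the parameter table.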
Commands Reference: All `aliyun` CLI commands are recorded in references/inspection-commands.md. Read the matching subsection before executing each step.
- Parallel batches run in a single shell invocation (one `run_shell_command` / one `bash -c` call) using `&` + `wait`. Splitting a parallel batch into multiple sequential tool calls is forbidden: the evaluator counts each batch as one shell command, and serialized calls fail the parallelism check.
- Agent status `running` → 3A, otherwise → 3B.
- If `aliyun` returns a parameter-missing or format error, NEVER pad with mock data; parse the error log, complete the parameters, and retry. After two consecutive failures, report the error code to the user and terminate the workflow.
- CMS commands (`aliyun-cli-cms` plugin) use lowercase-hyphenated actions and parameters.
- `N/A` placeholder rows (e.g. there is no load on Windows).

[MUST] Run before any CLI call:

    aliyun configure ai-mode enable 2>/dev/null || true
    aliyun configure ai-mode set-user-agent \
      --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-ecs-health-inspection" 2>/dev/null || true

AI-mode only serves Agent Skill calls; the matching `disable` in Step 7 must be executed at every exit point (success/failure/cancel).
Read inspection-commands.md § Step 1.
Extract: Status (must be Running) / InstanceName / OSType / InstanceType / CPU / Memory / InstanceNetworkType / GPUAmount / GPUSpec.
`GPUAmount > 0`, or `InstanceType` matches the prefixes `ecs.gn` / `ecs.ga` / `ecs.ebmgn` / `ecs.vgn` → enable Step 3A.9.
Read inspection-commands.md § Step 2.
| Agent Status | Path |
|---|---|
running | → Step 3A |
stopped / empty / InvalidOperation.NoPermission / 403 / InvalidAuthorization | → Step 3B |
[MUST] A permission error is NOT a direct jump to 3B. When the agent-status query returns `403` / `InvalidAuthorization` / `Forbidden`, you still must fire all 3A.1–3A.7 `describe-metric-last` requests in parallel (plus 3A.9 for GPU instances) and record an execution checkpoint such as "XX succeeded / YY returned 403". Only when every 3A request fails are you allowed to enter 3B; declare the fallback in `narrative`.
[IMPORTANT] Before deciding to enter 3B, independently check the 3A.8 trigger: if either `CPUUtilization` or `memory_usedutilization` from Batch 1 exceeds 80%, immediately fire 3A.8 process-level queries in parallel. This is independent of any 403 on the other 3A metrics, and must NOT be deferred until the fallback decision.
Full execution sequence and MetricName retry list: references/degradation-and-validation.md § 1.
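The Step 1 GPU check, the Step 2 path decision, and the 3A.8 trigger can be sketched as three small predicates. This is only an illustration under stated assumptions: the function names are hypothetical, the agent status is assumed to arrive as a plain string, and `probe_ok` is an assumed list of per-request success flags collected from the 3A requests that must still be fired after a permission error.

```python
from typing import Iterable, Optional

# Instance-type prefixes that enable the GPU step (3A.9).
GPU_PREFIXES = ("ecs.gn", "ecs.ga", "ecs.ebmgn", "ecs.vgn")

# Agent-status results treated as permission errors (assumed to arrive as strings).
PERMISSION_ERRORS = {
    "InvalidOperation.NoPermission", "403", "InvalidAuthorization", "Forbidden",
}


def is_gpu_instance(gpu_amount: int, instance_type: str) -> bool:
    """GPUAmount > 0, or InstanceType starts with a GPU family prefix."""
    return gpu_amount > 0 or instance_type.startswith(GPU_PREFIXES)


def choose_path(agent_status: str, probe_ok: Iterable[bool] = ()) -> str:
    """Return '3A' or '3B'.

    A permission error is not a direct jump to 3B: the 3A requests are fired
    first, and 3B is chosen only when every one of them failed.
    """
    if agent_status == "running":
        return "3A"
    if agent_status in PERMISSION_ERRORS:
        return "3A" if any(probe_ok) else "3B"
    return "3B"  # stopped / empty


def needs_process_query(cpu: Optional[float], memory: Optional[float]) -> bool:
    """3A.8 trigger: CPU or memory utilization strictly above 80%."""
    return any(v is not None and v > 80 for v in (cpu, memory))
```

Keeping `needs_process_query` separate from `choose_path` mirrors the rule that the 3A.8 trigger is evaluated independently of the fallback decision.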
Read inspection-commands.md § 3A — Parallel Batch Execution and bundle 3A.1–3A.7 + Step 4 (+ 3A.9 for GPU instances) into a single parallel batch.
| Sub | MetricName | Description | Unit |
|---|---|---|---|
| 3A.1 | CPUUtilization | CPU utilization | % |
| 3A.2 | load_1m, load_5m, load_15m | System load | — |
| 3A.3 | memory_usedutilization | Memory utilization | % |
| 3A.4 | DiskReadBPS, DiskWriteBPS | Disk IO throughput | bytes/s |
| 3A.5 | DiskReadIOPS, DiskWriteIOPS | Disk IOPS | count/s |
| 3A.6 | networkin_rate, networkout_rate | Network traffic | bits/s |
| 3A.7 | diskusage_utilization | Disk-usage percentage | % |
| 3A.9 | instance_gpu_* | GPU temperature / utilization / memory | °C / % / % |
Batch 2 (conditional) — fired when Batch 1 shows CPU > 80% or Memory > 80%:
| Sub | MetricName | Description | Unit |
|---|---|---|---|
| 3A.8 | process.cpu, process.memory | Top 5 CPU / memory processes | % |
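To illustrate the "one shell invocation, `&` + `wait`" rule for these batches, here is a minimal sketch that assembles a single script from a list of commands. The real `aliyun` command lines live in references/inspection-commands.md; `build_parallel_batch` and the `(argv, output_file)` job shape are assumptions made for this example.

```python
import shlex
from typing import List, Sequence, Tuple

# One job = (command argv, file that receives its stdout).
Job = Tuple[Sequence[str], str]


def build_parallel_batch(jobs: Sequence[Job]) -> str:
    """Return one shell script that backgrounds every job, then waits for all."""
    lines: List[str] = []
    for argv, output_file in jobs:
        # shlex.join / shlex.quote keep arguments with spaces or quotes intact.
        lines.append(f"{shlex.join(argv)} > {shlex.quote(output_file)} &")
    lines.append("wait")
    return "\n".join(lines)
```

The returned string is meant to be passed to a single `bash -c` call, so the whole batch counts as one shell command rather than several sequential ones.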
[MUST] 3A error handling and mandatory rules:
- 3A.8 is an independent conditional branch, unaffected by 3A.1–3A.7 failures: if CPU or Memory exceeds 80%, it must be fired immediately and must NOT be skipped or deferred because of 403s on other metrics. Even if the overall path has been downgraded to 3B, 3A.8 must still run first.
- 3A.8 returns 403 / empty: the report should suggest a manual `top -bn1 | head -15`; do NOT abort the workflow.
- 3A.9 GPU metrics must all be fired; a single failure must not cancel the rest. Failed items are labeled `"N/A — query restricted"`.
- [MetricName is locked, byte-for-byte] For every `describe-metric-last` call (especially 3A.9 GPU metrics), use the literal `MetricName` from the table above. Do NOT change case, add prefixes/suffixes, swap the namespace, or invent variants such as `gpu_temperature` / `GPUUtilization` / `instance.gpu.temp`. If the FIRST attempt returns `400 metric not exist` / `404`, STOP immediately: the metric does not exist for this instance/region; mark it `"N/A — query restricted"` and continue. Do NOT loop with name variants.
Full ruleset: references/degradation-and-validation.md § 2.
For CPU, take the latest plus the avg/max within the window; for other metrics take the latest. Empty data → label "N/A".
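The aggregation rule above can be sketched as follows. The datapoint shape is an assumption (a list of dicts with `timestamp` and `Average` keys); the real payload layout should be checked against references/inspection-commands.md, and both helper names are hypothetical.

```python
from typing import Dict, Sequence, Union

Number = Union[int, float]
Value = Union[Number, str]


def latest_value(datapoints: Sequence[Dict[str, Number]]) -> Value:
    """Newest sample by timestamp, or "N/A" when there is no data."""
    if not datapoints:
        return "N/A"
    return max(datapoints, key=lambda p: p["timestamp"])["Average"]


def summarize_cpu(datapoints: Sequence[Dict[str, Number]]) -> Dict[str, Value]:
    """CPU gets latest plus avg/max over the window; empty data maps to "N/A"."""
    if not datapoints:
        return {"latest": "N/A", "avg": "N/A", "max": "N/A"}
    values = [p["Average"] for p in datapoints]
    return {
        "latest": latest_value(datapoints),
        "avg": round(sum(values) / len(values), 2),
        "max": max(values),
    }
```

Returning the string `"N/A"` instead of `None` keeps the result directly usable in the report JSON.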
Read inspection-commands.md § Step 3B.
| Metric | Available | Source |
|---|---|---|
| CPU / Memory* / Network / IO BPS+IOPS | ✓ | DescribeInstanceMonitorData |
| Per-disk BPS + IOPS + Latency | ✓ | DescribeDiskMonitorData |
| System Load / Disk Usage % | ✗ | Not available (label "N/A — requires CloudMonitor agent") |
| Process-level CPU / Memory (Top 5) | ✗ | 3A path only |
| GPU temperature / utilization / memory | ✗ | 3A path only |
* Memory is unavailable for some instance families.
[MUST: HARD STOP] Entering 3B: the moment 3B is selected, abort every `cms describe-metric-last` call. CloudMonitor is unreachable on this path; re-issuing any `cms` command inside 3B is forbidden and will be flagged as a Skill failure. The path switch is one-way: 3A → 3B, never back.
[MUST] The 3B path MUST call all three ECS APIs; capacity-only is a Skill failure:
- `aliyun ecs describe-instance-monitor-data`: instance-level CPU / Memory / Network / IO
- `aliyun ecs describe-disks`: disk list + capacity
- `aliyun ecs describe-disk-monitor-data`: per-disk BPS / IOPS / Latency, called per `DiskId`

[MUST] Mandatory parallel template: fire all three ECS APIs in one single shell invocation. Splitting the batch into separate tool calls is a Skill failure:

    aliyun ecs describe-instance-monitor-data --region-id $REGION --instance-id $INSTANCE_ID \
      --start-time $START --end-time $END --period 60 \
      > /tmp/ecs_3b_imd.json &
    aliyun ecs describe-disks --region-id $REGION --instance-id $INSTANCE_ID \
      > /tmp/ecs_3b_disks.json &
    for d in $DISK_IDS; do
      aliyun ecs describe-disk-monitor-data --region-id $REGION --disk-id "$d" \
        --start-time $START --end-time $END --period 60 \
        > "/tmp/ecs_3b_dmd_${d}.json" &
    done
    wait

Self-check after `wait`: `/tmp/ecs_3b_imd.json`, `/tmp/ecs_3b_disks.json`, and at least one `/tmp/ecs_3b_dmd_*.json` must all be present and non-empty. If any of the three commands is missing from the executed shell history, abort the run as a Skill failure.
Unavailable metrics MUST be declared explicitly in the report (label them `"N/A — fallback path triggered by permission limit"` in `dimensions[]`); do not silently omit them. GPU instances on the 3B path must still attempt the 3A.9 GPU queries; mark them unavailable only when CMS is fully out of reach.
Full call list and metric-loss declaration rules: references/degradation-and-validation.md § 3.
[Independent parallel step] This step is independent of whichever monitoring path was chosen. Regardless of 3A or 3B, `aliyun ecs describe-disks` MUST be executed to obtain disk-capacity information. Skipping it on the fallback path is forbidden.
Read inspection-commands.md § Step 4. Per disk, extract: DiskId / Size / Category / Type (system|data) / Device / Status.
On the CloudMonitor path, merge the 3A.7 disk-usage % by mount point with the disk info gathered here.
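On the fallback path, the file self-check required after the 3B batch (instance monitor data, disk list, and at least one per-disk file, all present and non-empty) can be sketched as below. The file names are the ones used by the 3B template; `missing_3b_outputs` and its `tmp_dir` parameter are hypothetical and exist mainly to make the check testable.

```python
import glob
import os
from typing import List


def missing_3b_outputs(tmp_dir: str = "/tmp") -> List[str]:
    """Return the outputs that are absent or empty; an empty list means pass."""
    problems: List[str] = []
    for name in ("ecs_3b_imd.json", "ecs_3b_disks.json"):
        path = os.path.join(tmp_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(path)
    # At least one per-disk monitoring file must exist and be non-empty.
    per_disk_pattern = os.path.join(tmp_dir, "ecs_3b_dmd_*.json")
    if not any(os.path.getsize(p) > 0 for p in glob.glob(per_disk_pattern)):
        problems.append(per_disk_pattern)
    return problems
```

A non-empty result lists exactly which outputs are missing, which is the information needed to report the failure instead of silently continuing.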
Threshold table:
| Metric | ⚠️ Warning | 🔴 Critical |
|---|---|---|
| CPU usage | > 80% avg | > 95% avg |
| System load | > CPU cores | > 2× CPU cores |
| Memory usage | > 80% | > 95% |
| Disk usage | > 80% | > 95% |
| Disk IOPS / BPS | Approaching instance limit | Above instance limit |
| GPU temperature | > 75°C | > 85°C |
| GPU utilization / memory | > 80% | > 95% |
Every anomaly must come with concrete remediation. If 3A.8 process-level queries were triggered, fold the Top-5 process tables into the root-cause analysis.
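The percentage and load rows of the threshold table translate into simple comparisons. The sketch below is illustrative only: the function names and the `"ok"` / `"warning"` / `"critical"` labels are assumptions, and the IOPS / BPS rows are left out because they depend on per-instance limits not given here.

```python
from typing import Union

Value = Union[int, float, str]


def classify(value: Value, warning: float = 80.0, critical: float = 95.0) -> str:
    """Map a metric to ok / warning / critical using strict '>' thresholds.

    Defaults match the CPU, memory, disk-usage and GPU utilization rows.
    """
    if value == "N/A":
        return "N/A"
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def classify_load(load: Value, cpu_cores: int) -> str:
    """System load: warning above the core count, critical above twice that."""
    return classify(load, warning=cpu_cores, critical=2 * cpu_cores)


def classify_gpu_temperature(celsius: Value) -> str:
    """GPU temperature: warning above 75°C, critical above 85°C."""
    return classify(celsius, warning=75.0, critical=85.0)
```

For example, a 4-core instance with a load of 5 classifies as `warning`, and a value of exactly 95% stays at `warning` because the table uses strict "greater than".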
Since 2026-05-12, the LLM emits structured JSON only; scripts/render_report.py renders the HTML, cutting end-to-end latency by ~50%.
Step 6.1 — Build the JSON
LLM-mandatory fields: assessment.health_score / grade / grade_label / one_liner / narrative / dimensions[] / anomalies[] / cost_evaluation / cost_suggestion / recommendations.{immediate,short_term,long_term}[].
Full schema: python3 scripts/render_report.py --schema.
[MUST] JSON construction and validation:
- Required fields cannot be empty: `dimensions[].value` / `anomalies[].detail` / `narrative` must not be empty strings; for empty data fill `"N/A"`, never `null`.
- [Data-loss guard] Before writing `dimensions[]` / `metrics.*`, traverse every API response file you produced (`/tmp/ecs_3b_*.json`, the `cms describe-metric-last` payloads, etc.). If a metric carries a valid numeric value in the raw response, you MUST surface that value; silently overwriting it as `null` or `"N/A"` is a Skill failure (e.g. system-load values returned by 3A.2 must reach `metrics.load.*` and the corresponding `dimensions[]` row). The only legitimate triggers for `"N/A"` are: (a) API returned empty `Datapoints` / empty array, (b) HTTP `403` / `404` / `InvalidAuthorization`, or (c) the metric appears on the Step 3B unavailable list. `null` is never permitted under any circumstance; use `"N/A"`.
- Type constraints: `dimensions[].value` / `current` must be plain numbers or `"N/A"`. Range strings such as `"99-100%"` or `"around 50%"` are forbidden.
- Unit enforcement: disk latency in `μs` (NOT `ms`), network in `bits/s`, IO throughput in `bytes/s`.
- Grading logic: `grade` / `grade_label` must strictly follow `health_score` (e.g., `>=90` → A, `[40,59]` → D, `<40` → F). A low score combined with a `one_liner` like "everything is fine" is forbidden.
- Hard guard: if any `metrics.disk_latency.*` or `disks[].latency_*` field carries an `ms` suffix or has a value < 1, abort and fix immediately. The raw microsecond value from the API must pass through unchanged; never apply your own arithmetic conversion.
Grade-mapping table and full pre-validation checklist: references/degradation-and-validation.md § 4.
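A few of the rules above lend themselves to a quick local pre-check. The sketch below is not a replacement for the renderer's own validation; the helper names are hypothetical, and `expected_grade` deliberately returns `None` for scores between 60 and 89 because only the A, D and F bands are stated in this section (the full table is in references/degradation-and-validation.md § 4).

```python
from typing import Optional


def is_valid_value(value: object) -> bool:
    """dimensions[].value / current: a plain number or the string "N/A"."""
    if isinstance(value, bool):  # bool is a subclass of int; reject it explicitly
        return False
    return value == "N/A" or isinstance(value, (int, float))


def expected_grade(health_score: float) -> Optional[str]:
    """Grade bands stated above; other bands are deferred to the reference."""
    if health_score >= 90:
        return "A"
    if health_score < 40:
        return "F"
    if 40 <= health_score <= 59:
        return "D"
    return None


def latency_problem(value: object) -> Optional[str]:
    """Hard guard for disk latency: raw microseconds, no suffix, never < 1."""
    if value == "N/A":
        return None
    if isinstance(value, str):
        return "latency must be a plain number in μs, not a string with a unit"
    if isinstance(value, (int, float)) and value < 1:
        return "value < 1 suggests a conversion to ms; pass the raw μs value"
    return None
```

Each helper returns a simple verdict so that a failing field can be fixed before the JSON is handed to the renderer.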
[MUST] `--validate` is mandatory before rendering:

    python3 scripts/render_report.py --validate --input /tmp/ecs_inspect_data.json

A non-zero exit code means the JSON must be fixed and re-validated until it passes; never render with broken data.
Step 6.2 — Invoke the renderer
[MUST] The LLM is forbidden from writing the HTML template or hand-assembling the report. All structured data MUST flow strictly through `python3 scripts/render_report.py`. If `render_report.py` fails, output the full error log and tell the user how to fix it; never fall back to manually generated HTML. A render failure is a Skill failure.

    cat > /tmp/ecs_inspect_data.json <<'JSON_EOF'
    { ... structured data produced by the LLM ... }
    JSON_EOF
    python3 scripts/render_report.py \
      --input /tmp/ecs_inspect_data.json \
      --output "ecs-${INSTANCE_ID}-inspection-report-$(date +%Y%m%d-%H%M%S).html"

If the script returns a non-zero exit code, output the full error log and check the JSON against the schema (`python3 scripts/render_report.py --schema`).
Step 6.3 — Naming convention: `ecs-{INSTANCE_ID}-inspection-report-{YYYYMMDD-HHMMSS}.html`, saved to the workspace root.
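For callers that build the output path outside the shell, the naming convention can be reproduced as follows. This is a small sketch; `report_filename` is a hypothetical helper, and it uses local time to match the `date` call in the shell command.

```python
from datetime import datetime
from typing import Optional


def report_filename(instance_id: str, now: Optional[datetime] = None) -> str:
    """Build ecs-{INSTANCE_ID}-inspection-report-{YYYYMMDD-HHMMSS}.html."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return f"ecs-{instance_id}-inspection-report-{stamp}.html"
```

Passing an explicit `now` keeps the name reproducible, which is convenient when the same timestamp also appears inside the report.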
[MUST] On every exit (success, failure, cancel, or any exception) always run:

    aliyun configure ai-mode disable 2>/dev/null || true
    rm -f /tmp/ecs_inspect_*.json

A residual AI-mode contaminates the next session, so disabling it is mandatory. This skill is read-only; no cloud-side cleanup is needed.
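One way to guarantee the cleanup on every exit path is to pair the AI-mode enable commands with the disable step in a `try` / `finally`. The sketch below wraps that in a context manager; `ai_mode_session`, the injectable `run` callable and the `tmp_pattern` parameter are hypothetical names introduced for this example, while the `aliyun configure ai-mode ...` commands are the ones shown in this document.

```python
import contextlib
import glob
import os
import subprocess
from typing import Callable, Iterator, Sequence

USER_AGENT = "AlibabaCloud-Agent-Skills/alibabacloud-ecs-health-inspection"


def _run_quietly(argv: Sequence[str]) -> None:
    """Run a command and ignore failures (mirrors `2>/dev/null || true`)."""
    try:
        subprocess.run(list(argv), stderr=subprocess.DEVNULL, check=False)
    except OSError:
        pass  # e.g. the aliyun binary is not on PATH


@contextlib.contextmanager
def ai_mode_session(
    run: Callable[[Sequence[str]], None] = _run_quietly,
    tmp_pattern: str = "/tmp/ecs_inspect_*.json",
) -> Iterator[None]:
    """Enable AI-mode on entry; always disable it and remove temp files on exit."""
    run(["aliyun", "configure", "ai-mode", "enable"])
    run(["aliyun", "configure", "ai-mode", "set-user-agent",
         "--user-agent", USER_AGENT])
    try:
        yield
    finally:
        # Runs on success, failure, and cancellation alike.
        run(["aliyun", "configure", "ai-mode", "disable"])
        for path in glob.glob(tmp_pattern):
            with contextlib.suppress(OSError):
                os.remove(path)
```

Wrapping the whole inspection in `with ai_mode_session(): ...` means an exception anywhere inside the block still triggers the disable step and the temp-file removal.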
End-to-end acceptance:
- Status=Running ✓

If any item fails, tell the user the failure reason. Never fabricate data.
This skill is read-only; there are no cloud resources to reclaim. Step 7 already covers the local cleanup:
- `aliyun configure ai-mode disable`
- Remove the `/tmp/ecs_inspect_*.json` intermediate files

The HTML report stays in the workspace root; the user decides whether to keep it.
Full CLI command list and field semantics: references/inspection-commands.md. Quick view of the common product/action pairs:
| Product | Command | Purpose |
|---|---|---|
| ecs | aliyun ecs describe-instances | Instance existence + spec |
| ecs | aliyun ecs describe-disks | Disk capacity + mount mapping |
| ecs | aliyun ecs describe-instance-monitor-data | Instance monitoring data (fallback path) |
| ecs | aliyun ecs describe-disk-monitor-data | Per-disk monitoring data (fallback path) |
| cms | aliyun cms describe-monitoring-agent-statuses | Agent-status decision |
| cms | aliyun cms describe-metric-last | Full metric query (primary path) |

- Missing data is labeled `"N/A"`: no fabrication, no extrapolation, no "close enough".
- Once `aliyun-cli-cms` 0.3.0+ is installed, CMS supports plugin-mode kebab-case. The `auto-plugin-install true` + `aliyun plugin update` in the Installation section already handle this. The plugin notice at the top of `aliyun cms --help` confirms it is active.

| File | Purpose |
|---|---|
| references/inspection-commands.md | All CLI commands and parallel batch templates |
| references/degradation-and-validation.md | Permission-fallback execution sequence + JSON pre-validation rules |
| references/ram-policies.md | RAM permission list + custom policy + failure handling |
| scripts/render_report.py | HTML rendering script (stdlib only) |
| scripts/requirements.txt | Canonical Python dependency declaration for scripts/ |