Install
openclaw skills install @huaweiclouddev/huawei-cloud-dws-io-diag
DWS cluster I/O overload root cause diagnosis skill, based on KooCLI v3.2.0+ and DWS Autopilot MCP Server. Automatically collects I/O metrics, analyzes root causes (customer-side / system-side) via a three-stage decision tree, and outputs a standardized diagnosis report. Applicable to DWS cluster scenarios such as high I/O usage, I/O alarms, and abnormal disk I/O load. Trigger words: "I/O高", "I/O告警", "I/O诊断", "I/O过载", "IO高", "IO告警", "IO诊断", "IO过载", "磁盘IO负载异常", "util打满", "await高", "high I/O", "I/O alarm", "I/O diagnosis"
This skill is dedicated to DWS cluster I/O overload root cause diagnosis. When a cluster triggers a high I/O usage alarm, it automatically collects metric data, analyzes root causes (customer-side / system-side) via a three-stage decision tree, and outputs a standardized diagnosis report.
Architecture: KooCLI (hcloud) → DWS Autopilot API → Cluster monitoring metrics; MCP Server (dws_autopilot) → Fallback channel for the same API
Applicable Scenarios:
Typical Use Cases:
Important Rules: All diagnosis conclusions must come from actual tool return results. Fabricating or assuming values is prohibited. Output only contains the diagnosis report; adding remediation suggestions, outputting SQL optimization statements, and using emoji are prohibited.
Background Knowledge: RAID architecture, I/O key metric thresholds, three major I/O scenario indicator features, and important principles are documented in I/O Background Knowledge. Must read before diagnosis.
hcloud version
hcloud configure list to check credential status
# Query metric data
hcloud DWS ListMetricsData --cli-region=<region> --cluster_id=<id> --metric_name=<name> --project_id=<pid> --offset=0 --limit=200 --from=<from_ts> --to=<to_ts>
# Query host information
hcloud DWS ListHostOverview --cli-region=<region> --project_id=<pid> --offset=0 --limit=200
Choose between KooCLI and MCP Server, preferring KooCLI. Step 0 checks hcloud availability:
Fallback and Termination Strategy:
- NETWORK_ERROR (connection timeout, network unreachable, etc.), automatically fall back to MCP Server and use MCP mode for all subsequent steps
- KooCLI network unavailable and MCP Server connection failed. Please check KooCLI network configuration or DWS Autopilot MCP Server configuration and retry
- MCP Server authentication failed. Please check DWS Autopilot MCP Server authentication configuration and retry

Once a tool is selected, use it throughout without switching (except for NETWORK_ERROR fallback). If a call fails, retry once (maximum 2 attempts). If still failing, mark the metric as "unavailable" and continue to the next step. When all metric queries fail, generate the diagnosis report directly.
| Common Parameter | hcloud Parameter | MCP Parameter |
|---|---|---|
| Region | --cli-region | (built into MCP connection) |
| Project ID | --project_id | project_id |
| Cluster ID | --cluster_id | cluster_id |
| Metric Name | --metric_name | metric_name |
| Start Time | --from | from_ts |
| End Time | --to | to_ts |
| Pagination Offset | --offset | offset |
| Pagination Limit | --limit | limit |
| Sort Field | (not supported) | order_by |
| Sort Direction | (not supported) | sort_by |
| Tool Name | Purpose | Key Parameters |
|---|---|---|
| dws_autopilot_get_clusters | Query cluster list | project_id, cluster_id, limit, offset |
| dws_autopilot_get_hosts | Query host information | project_id, cluster_id, limit, offset |
| dws_autopilot_get_metric | Query metric data | project_id, cluster_id, metric_name, from_ts, to_ts, limit, offset, order_by, sort_by |
metric_data Parameter Notes: metric_data does not support filtering by instance_name (no such parameter); query returns full cluster data and must be filtered by inst_name field for target instances; metric_data does not support period parameter (sampling period is automatically determined by the platform).
Available metric_name: IOStat, CpuStat, cpu_io_diagnose_detail, business_concurrency, business_query_monitor, business_thread_wait, bussiness_conflict_lock
Time Protocol: from_ts/to_ts must use Unix millisecond timestamps; all times are in UTC timezone; recommended time window: from 5 minutes before alarm time to alarm time (from_ts = first_alarm_time - 300000).
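The time window arithmetic above can be sketched as follows. This is a minimal illustration, not part of the skill itself: the function name `metric_time_window` and the seconds-versus-milliseconds sanity check are assumptions added for clarity.

```python
ALARM_LOOKBACK_MS = 300_000  # 5 minutes in milliseconds, per the Time Protocol


def metric_time_window(first_alarm_time_ms):
    """Return (from_ts, to_ts) as Unix millisecond timestamps (UTC)."""
    # Assumed heuristic: values below 10**12 are almost certainly seconds,
    # not milliseconds, so reject them early.
    if first_alarm_time_ms < 10**12:
        raise ValueError("first_alarm_time must be a millisecond timestamp")
    return first_alarm_time_ms - ALARM_LOOKBACK_MS, first_alarm_time_ms


from_ts, to_ts = metric_time_window(1750000000000)
print(from_ts, to_ts)  # 1749999700000 1750000000000
```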
Return Format: Success {"code": 0, "data": [...]}; Failure {"code": -1, "message": "error description"}. On failure, retry once; if still failing, use degradation path and mark as "unavailable" in the report.
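One way to express the retry-then-degrade behaviour is sketched below. The helper name `query_with_retry` and the shape of its return value are hypothetical; only the `code`, `data`, and `message` fields and the two-attempt limit come from the text above.

```python
def query_with_retry(call, max_attempts=2):
    """Run a zero-argument query callable, retrying once on failure.

    `call` is assumed to return the parsed response dict. When every attempt
    fails, the metric is reported as unavailable instead of raising.
    """
    last_message = ""
    for _ in range(max_attempts):
        try:
            resp = call()
        except Exception as exc:  # transport-level failure
            last_message = str(exc)
            continue
        if isinstance(resp, dict) and resp.get("code") == 0:
            return {"available": True, "data": resp.get("data", [])}
        if isinstance(resp, dict):
            last_message = resp.get("message", "")
        else:
            last_message = "unexpected response"
    return {"available": False, "data": [], "message": last_message}
```

A caller would treat `available == False` as the signal to write "unavailable" for that metric in the report and move on to the next step.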
Key Fields per Metric:
For query differences per step, see Metric Reference.
All tool calls (MCP and hcloud) must use paginated queries to prevent single responses from being too large and exceeding token limits.
Pagination Rules:
- limit=200 uniformly (do not use 800 or other large values)
- offset=0; if returned data count = 200, then offset+=200 and continue querying

MCP Call Example (using IOStat):
Page 1: dws_autopilot_get_metric(project_id, cluster_id, metric_name="IOStat", from_ts, to_ts, limit=200, offset=0)
If returned data length = 200:
Page 2: dws_autopilot_get_metric(project_id, cluster_id, metric_name="IOStat", from_ts, to_ts, limit=200, offset=200)
If returned data length < 200: Stop pagination, merge page 1 + page 2 data
hcloud Call Example (using IOStat):
Page 1: hcloud DWS ListMetricsData --cli-region=<region> --cluster_id=<id> --metric_name=IOStat --project_id=<pid> --offset=0 --limit=200 --from=<from_ts> --to=<to_ts>
If returned data count = 200:
Page 2: hcloud DWS ListMetricsData --cli-region=<region> --cluster_id=<id> --metric_name=IOStat --project_id=<pid> --offset=200 --limit=200 --from=<from_ts> --to=<to_ts>
If returned data count < 200: Stop pagination, merge data
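The pagination loop shown in both examples can be sketched generically. `fetch_page` is a hypothetical callable standing in for either the MCP tool or the hcloud command, and `max_pages` is an assumed safety cap that the rules above do not mention.

```python
PAGE_SIZE = 200  # the uniform limit required by the Pagination Rules


def fetch_all_pages(fetch_page, page_size=PAGE_SIZE, max_pages=1000):
    """Collect rows page by page until a short page signals the end."""
    rows = []
    offset = 0
    for _ in range(max_pages):
        page = fetch_page(offset=offset, limit=page_size)
        rows.extend(page)
        if len(page) < page_size:  # fewer rows than requested: last page
            break
        offset += page_size
    return rows


# Usage with an in-memory stand-in for the real API.
fake_rows = list(range(450))
merged = fetch_all_pages(lambda offset, limit: fake_rows[offset:offset + limit])
print(len(merged))  # 450
```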
Before diagnosis, create an execution plan based on Steps 0-7, then execute sequentially. For tool selection strategy, see the "KooCLI Command Format Standard" section; subsequent steps will not repeat this. All MCP tool calls follow the "Pagination Specification" section; subsequent steps will not repeat pagination details.
Parameter Resolution: If project_id or region_id is not provided by the user, read from MCP Server config file conf/dws_config.yaml:
- project_id is missing: Execute python -c "import yaml; c=yaml.safe_load(open('conf/dws_config.yaml')); print(c.get('project_id',''))" to obtain it
- region_id is missing: Execute python -c "import yaml; c=yaml.safe_load(open('conf/dws_config.yaml')); print(c.get('region_id',''))" to obtain it

Execute hcloud version; version >= 3.2.0 → tool_mode=hcloud, otherwise tool_mode=mcp.
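The version gate can be sketched as below. The exact text printed by `hcloud version` is not specified here, so this sketch assumes only that the output contains a dotted `major.minor.patch` number; the function name `select_tool_mode` is hypothetical.

```python
import re

MIN_HCLOUD_VERSION = (3, 2, 0)


def select_tool_mode(version_output):
    """Return "hcloud" when the reported version is >= 3.2.0, else "mcp"."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output or "")
    if not match:
        return "mcp"  # hcloud missing or output unreadable
    version = tuple(int(part) for part in match.groups())
    # Tuple comparison is numeric per component, so 3.10.1 ranks above 3.2.0.
    return "hcloud" if version >= MIN_HCLOUD_VERSION else "mcp"
```

Comparing integer tuples rather than raw strings avoids the common mistake of treating "3.10.1" as older than "3.2.0".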
hcloud Network Availability Probe: If tool_mode=hcloud, execute a lightweight API call (e.g., hcloud DWS ListClusters --cli-region=<region> --project_id=<project_id> --offset=0 --limit=1) to verify network connectivity:
Call metric query with metric_name="IOStat", time window: from_ts=first_alarm_time - 300000, to_ts=first_alarm_time. In MCP mode, use limit=200 paginated query.
Autopilot Unavailable Determination: Returns 50201/RDS.9999 error → Skip Steps 1-6, mark all metrics as "unavailable", proceed directly to Step 7.
Parsing: Group by host_id and disk_name, find the latest I/O data for each disk on each node. Extract util, await, r_await, w_await, r_s, w_s, read_iops, write_iops, rMB_s, wMB_s, read_throughput, write_throughput, avgrq_sz, avgqu_sz for each disk.
I/O Overload Determination: Whether any disk has util > 90% or await > 100 or throughput/IOPS reaching upper limit.
I/O Imbalance Determination: Whether the deviation of util or await between nodes > 20%.
TOP Node Filtering: Find the two nodes with the highest I/O load.
Phenomenon Distribution Determination: Whether a node has single disk I/O high or all disks I/O high (used for Step 4 scenario routing).
High IO Time Points: From the IOStat time series data, find all ctime points where any disk util > 90% on the top 2 IO nodes. Collect these ctime values as high_io_ctimes (list of millisecond timestamps). Also record the ctime with the highest util as peak_io_ctime. If no ctime has util > 90%, high_io_ctimes is empty and peak_io_ctime is the ctime with the highest util value.
Output: io_by_host_disk, io_by_host, max_io_hosts, problem_host_id, problem_host_util, problem_host_await, problem_host_avgrq_sz, problem_host_avgqu_sz, cluster_avg_util, is_imbalanced, io_deviation, high_io_nodes, single_disk_high_nodes, all_disk_high_nodes, high_io_ctimes, peak_io_ctime
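The high IO time point logic can be sketched as follows. The row shape (a list of dicts with `host_id`, `disk_name`, `ctime`, `util`) and the choice to rank nodes by their peak `util` are assumptions for illustration; the text above does not specify how "highest I/O load" is ranked.

```python
from collections import defaultdict

UTIL_THRESHOLD = 90.0  # percent


def analyze_iostat(rows, top_n=2):
    """Derive max_io_hosts, high_io_ctimes and peak_io_ctime from IOStat rows."""
    peak_by_host = defaultdict(float)
    for row in rows:
        host = row["host_id"]
        peak_by_host[host] = max(peak_by_host[host], float(row["util"]))

    # Assumption: rank nodes by the highest util observed on any of their disks.
    max_io_hosts = sorted(peak_by_host, key=peak_by_host.get, reverse=True)[:top_n]
    top_rows = [r for r in rows if r["host_id"] in max_io_hosts]

    high_io_ctimes = sorted(
        {r["ctime"] for r in top_rows if float(r["util"]) > UTIL_THRESHOLD}
    )
    # peak_io_ctime is the highest-util sample even when nothing exceeds 90%.
    peak_io_ctime = (
        max(top_rows, key=lambda r: float(r["util"]))["ctime"] if top_rows else None
    )
    return {
        "max_io_hosts": max_io_hosts,
        "high_io_ctimes": high_io_ctimes,
        "peak_io_ctime": peak_io_ctime,
    }
```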
Call host information query (MCP mode limit=200 paginated), build host_id → {host_name, ip} mapping table. Only query host information for nodes involved in max_io_hosts from Step 1 output.
Output: host_id_to_info_map, problem_host_ip, node_name
Call metric query with metric_name="CpuStat". In MCP mode, use limit=200 paginated query.
Parsing: Extract iowait for each node. iowait > 30% indicates significant I/O pressure. High iowait but low CPU usage → typical I/O bottleneck.
Output: cpu_by_host, high_iowait_nodes
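A minimal sketch of the iowait check follows. It assumes CpuStat rows are dicts carrying `host_id`, `ctime`, and `iowait` (in percent) and that the latest sample per node is the one evaluated; both points are assumptions, since the row format is defined in Metric Reference rather than here.

```python
IOWAIT_THRESHOLD = 30.0  # percent


def high_iowait_nodes(cpu_rows):
    """Return host_ids whose most recent iowait sample exceeds 30%."""
    latest = {}
    for row in cpu_rows:
        prev = latest.get(row["host_id"])
        if prev is None or row["ctime"] > prev["ctime"]:
            latest[row["host_id"]] = row
    return sorted(
        host for host, row in latest.items()
        if float(row["iowait"]) > IOWAIT_THRESHOLD
    )
```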
Based on Step 1 IOStat data and Step 3 CpuStat data, determine I/O scenario via three-stage decision tree.
Precondition: await persistently > 100 and avgqu_sz persistently > 100?
Stage 1: Clear I/O Anomaly — IOPS and throughput far below minimum specification?
Stage 2: Clear I/O Overload — IOPS or throughput exceeds maximum specification?
Stage 3: Gray Zone — I/O close to minimum specification, or exceeds minimum but below maximum?
Phenomenon Distribution Determination (based on Step 1 IOStat data, analyzed by host_id + disk_name dimension):
I/O Type Determination (for overload scenario routing):
Output: io_scenario, io_type, phenomenon_distribution, is_gray_zone, gray_zone_result
5a: Query Business Concurrency
Call metric query with metric_name="business_concurrency". In MCP mode, use limit=200 paginated query.
Output: max_concurrency, avg_concurrency, is_high_concurrency
5b: Query I/O Diagnosis Details for Top 2 Nodes
Call dws_service_autopilot_metric_data, parameters same as Step 3, metric_name changed to "cpu_io_diagnose_detail". Note: metric_data does not support filtering by host_id; query returns full cluster data, then filter by the host_id of the top 2 IO nodes.
Paginated Query: This metric has the largest data volume (multi-node × multi-instance × multi-timepoint × multi-query), so a single page is rarely enough. Follow the Pagination Rules: first page offset=0, limit=200; if returned data count = 200, then offset+=200 and continue querying until returned data count < 200, then merge all paginated data.
High IO Time Period Filtering: From the returned data, only keep sampling points where ctime ∈ {{steps.iostat_metrics.output.high_io_ctimes}}, excluding irrelevant queries from normal time periods. If high_io_ctimes is empty (no high IO time points), take the data from the time point with the highest util, annotate as "IO did not persistently exceed threshold, the following is query reference at the highest IO moment".
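The filtering rule can be sketched as below, assuming each returned sample is a dict with a `ctime` field and that `high_io_ctimes` and `peak_io_ctime` come from Step 1. The function name and the `(rows, note)` return shape are hypothetical.

```python
FALLBACK_NOTE = (
    "IO did not persistently exceed threshold, "
    "the following is query reference at the highest IO moment"
)


def filter_high_io_samples(rows, high_io_ctimes, peak_io_ctime):
    """Keep only samples taken at high IO time points.

    Falls back to the single highest-util time point, with an annotation,
    when no time point exceeded the threshold.
    """
    if high_io_ctimes:
        wanted, note = set(high_io_ctimes), None
    else:
        wanted, note = {peak_io_ctime}, FALLBACK_NOTE
    return [r for r in rows if r["ctime"] in wanted], note
```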
Parse returned data, extract for each node:
Spill to Disk Determination Rule: Infer based on active statement characteristics, mark spill_detected=true if any of the following conditions is met:
- wait file write wait event
- full_scan_queries (reuse this field to record spill SQL), spill_info records details of the first spill SQL

Full Table Scan Determination Rule: Infer based on active statement characteristics, mark full_scan_detected=true if the following condition is met:
- full_scan_queries

Data Skew Inference Rule: Infer based on IOStat inter-node IO deviation, non-confirmatory judgment:
- is_data_skewed=true

IO Contribution TopN Statistics Rule (time window aggregation, aligned with DMS monitoring item logic):
- io_data_available=false. Note: io_data_available=true is a rare scenario (only when cluster nodes have deployed pidstat collection plugin and collection is normal, io_read/io_write will have non-zero values, in most cases this field is 0), therefore io_data_available=false is the norm
- cpu_rate < 20% AND io_read < 1024 AND io_write < 1024, filter out that row (AND relationship, all three below threshold to filter, any one exceeding threshold is retained), retained rows participate in TopN statistics

Time Annotation Rule: ctime returned by cpu_io_diagnose_detail is the Autopilot collection snapshot time, not the actual SQL start time. If active_queries contains duration_ms field, then SQL start time = ctime - duration_ms, annotated as "start time"; if duration_ms is unavailable, use ctime directly, annotated as "collection time" (do not annotate collection time as "start time").
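The row filter and the time annotation rule can be sketched as two small helpers. The field names `cpu_rate`, `io_read`, `io_write`, and `duration_ms` come from the rules above; the function names and the dict row shape are assumptions.

```python
def keep_for_topn(row):
    """Return False only when all three values are below their thresholds."""
    all_below = (
        row["cpu_rate"] < 20
        and row["io_read"] < 1024
        and row["io_write"] < 1024
    )
    return not all_below


def annotate_time(ctime_ms, duration_ms=None):
    """Return (timestamp_ms, label) following the Time Annotation Rule."""
    if duration_ms is not None:
        return ctime_ms - duration_ms, "start time"
    # Without duration_ms the snapshot time must not be called a start time.
    return ctime_ms, "collection time"


print(annotate_time(10_000, 4_000))  # (6000, 'start time')
print(annotate_time(10_000))         # (10000, 'collection time')
```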
Active User Statistics Rule: Only count users and connections with state=active; idle state not counted. Group by userName and count active query numbers, identify top users.
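A minimal sketch of the active user count, assuming each query record is a dict with `state` and `userName` keys. The default of three top users mirrors the Top3 outputs elsewhere in this step and is an assumption here.

```python
from collections import Counter


def top_active_users(queries, top_n=3):
    """Count active queries per userName, ignoring idle connections."""
    counts = Counter(
        q["userName"] for q in queries if q.get("state") == "active"
    )
    return counts.most_common(top_n)
```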
Output (steps.top_node_io_diagnose.output):
io_query_details: # Node IO query details [{host_id, queries, io_stats}]
spill_detected: # Whether spill to disk is detected (boolean)
spill_info: # Spill information {query, query_id, username, host_id, io_write_rate}
full_scan_detected: # Whether full table scan is detected (boolean)
full_scan_queries: # Full table scan query list [{query, query_id, username, host_id, duration_ms}]
is_data_skewed: # Whether data skew is possible (inferred from IOStat inter-node deviation, non-confirmatory) (boolean)
skew_info: # Skew inference information {max_deviation_node, min_deviation_node, deviation_pct, description} (inferred from IOStat inter-node IO deviation, marked when deviation > 20%, cannot confirm specific table names)
topn_sql_io: # Top3 SQL IO contribution (time window aggregation) [{query_preview, query_id, user_name, total_io_read, total_io_write, frequency, cpu_rate, io_characteristic}]
topn_user_io: # Top3 User IO contribution (time window aggregation) [{user_name, total_io_read, total_io_write, frequency}]
io_data_available: # Whether io_read/io_write data is available (boolean)
peak_io_ctime: # Highest util time point ctime in high IO time period
5c: Query Wait Events and Lock Conflicts
Call metric query with metric_name="business_thread_wait" and metric_name="bussiness_conflict_lock". In MCP mode, use limit=200 paginated query.
Parsing: Identify I/O-related wait events (e.g., wait wal sync, WALWriteLock, I/O wait). Identify lock conflicts causing I/O anomalies.
Output: wait_events, io_wait_detected, lock_conflicts, lock_conflict_detected
Based on data collected in Steps 1-5, combined with Step 4 three-stage decision tree result and phenomenon distribution, route to corresponding scenario and execute full investigation direction analysis.
Time Formatting: All Unix millisecond timestamps (first_alarm_time, ctime, etc.) are in UTC timezone. In the report, they must be converted to Beijing time (UTC+8) string YYYY-MM-DD HH:MM:SS. Can use python -c "from datetime import datetime,timezone,timedelta; print(datetime.fromtimestamp({ms}/1000,tz=timezone(timedelta(hours=8))).strftime('%Y-%m-%d %H:%M:%S'))". Do not mentally calculate timestamp values.
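The same conversion as the one-liner above, written as a reusable function. The name `ms_to_beijing` is hypothetical; the format string and the fixed UTC+8 offset are taken from the rule.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # fixed UTC+8 offset, no DST


def ms_to_beijing(ms):
    """Convert a Unix millisecond timestamp to a Beijing time string."""
    return datetime.fromtimestamp(ms / 1000, tz=BEIJING).strftime(
        "%Y-%m-%d %H:%M:%S"
    )


print(ms_to_beijing(0))  # 1970-01-01 08:00:00
```

Passing an explicit `tz` keeps the result independent of the machine's local timezone setting.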
User Identity Judgment (based on database user, i.e., cpu_io_diagnose_detail username):
Based on Step 4's io_scenario and phenomenon_distribution, route to the corresponding scenario and execute full investigation. Must read IO_DIAGNOSIS_REF.md for the complete scenario routing table, investigation direction list, customer-side/system cause descriptions and output examples; do not analyze based solely on the scenario overview below.
Scenario Overview:
Comprehensive Judgment Rules:
| Condition | Marker |
|---|---|
| problem_host_util > 90% | Disk utilization too high |
| problem_host_await > 100 | I/O response time too long |
| io_deviation > 20% | Cluster I/O load imbalanced, possible data skew |
| is_high_concurrency = true and spill_detected = true | High concurrency spill to disk causing high I/O |
| full_scan_detected = true | Full table scan causing high I/O, specific statement identified |
| io_wait_detected = true | I/O-related wait events detected |
| lock_conflict_detected = true | Lock conflict affecting I/O performance |
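The table above maps directly onto a list of condition and marker pairs. In this sketch the input is assumed to be a dict of Step 1-5 outputs, with util and io_deviation expressed in percent and await in milliseconds; the function name is hypothetical.

```python
def judgment_markers(d):
    """Return the markers whose conditions hold, in table order."""
    rules = [
        (d.get("problem_host_util", 0) > 90, "Disk utilization too high"),
        (d.get("problem_host_await", 0) > 100, "I/O response time too long"),
        (d.get("io_deviation", 0) > 20,
         "Cluster I/O load imbalanced, possible data skew"),
        (bool(d.get("is_high_concurrency")) and bool(d.get("spill_detected")),
         "High concurrency spill to disk causing high I/O"),
        (bool(d.get("full_scan_detected")),
         "Full table scan causing high I/O, specific statement identified"),
        (bool(d.get("io_wait_detected")), "I/O-related wait events detected"),
        (bool(d.get("lock_conflict_detected")),
         "Lock conflict affecting I/O performance"),
    ]
    return [marker for condition, marker in rules if condition]
```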
Statistics and Aggregation Requirements:
No Anomaly Determination: When Step 4 three-stage decision tree determines that the I/O scenario is normal (i.e., no scenario determination needed, util far below 90%, await far below 100ms, iowait far below 30%), then io_scenario=normal, no scenario routing is needed, and the diagnosis conclusion is "No anomaly detected in diagnosis". In this case:
- root_cause_category = "normal", io_scenario = "normal", all other flags (spill_detected, full_scan_detected, io_wait_detected, lock_conflict_detected, is_data_skewed) = false
- io_investigation_info = empty
- topn_sql_io and topn_user_io are still calculated per Step 5b TopN statistics rules, annotated as reference information rather than root cause evidence

Output: root_cause_category, io_scenario, phenomenon_distribution, io_type, summary, specific_sql, is_data_skewed, spill_detected, full_scan_detected, io_wait_detected, lock_conflict_detected, io_investigation_info, topn_sql_io, topn_user_io, io_data_available
Generate an HTML report following the template in the "Output Format" section. After generating the report, save the HTML file to the current working directory (workspace root folder) with the filename dws_io_diagnosis_report_{timestamp}.html, where {timestamp} is the current machine local time formatted as yyyyMMdd_HHmmss (e.g., dws_io_diagnosis_report_20260623_150421.html).
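The filename rule can be sketched as follows. The pattern yyyyMMdd_HHmmss corresponds to the `strftime` format `%Y%m%d_%H%M%S`; the function name and the optional `now` parameter (useful for testing) are assumptions.

```python
from datetime import datetime


def report_filename(now=None):
    """Build the report filename from machine local time."""
    now = now or datetime.now()  # naive local time, as the rule requires
    return f"dws_io_diagnosis_report_{now.strftime('%Y%m%d_%H%M%S')}.html"


print(report_filename(datetime(2026, 6, 23, 15, 4, 21)))
# dws_io_diagnosis_report_20260623_150421.html
```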
In addition to the HTML report, output {{diagnosis_json}} data structure. Must read diagnosis_json Format Reference for the complete data structure, content field format rules (io_data_available=true/false), system-side/no-anomaly examples, and Ruby user constraint.
Report Section Fill Rules for No Anomaly:
io_scenario=normal (diagnosis conclusion is "No anomaly detected"):
- diagnosis_summary: Output <div class="conclusion-item normal">诊断暂未发现异常</div><div class="conclusion-item normal">诊断时间范围: {from_time} ~ {to_time}</div>
- io_investigation_content: Display current active query reference information, annotated as "IO未发现异常,以下为当前活跃查询参考信息". Follow the same io_data_available=true/false rendering rules as the anomaly scenario, but replace "IO排查方向" with "IO参考信息" and annotate as reference rather than investigation direction
- topn_sql_io_content and topn_user_io_content: Follow the same io_data_available=true/false rendering rules as the anomaly scenario, but annotate as reference information rather than root cause evidence
- diagnosis_json: content = "诊断暂未发现异常", addition.advice = "暂无"

io_scenario is not normal (anomaly detected):
- io_investigation_content: Fill according to io_data_available=true/false rules

# hcloud
hcloud DWS ListClusters --cli-region=<region> --project_id=<pid> --offset=0 --limit=200
# MCP
dws_autopilot_get_clusters(project_id=<pid>)
# hcloud
hcloud DWS ListHostOverview --cli-region=<region> --project_id=<pid> --offset=0 --limit=200
# MCP
dws_autopilot_get_hosts(project_id=<pid>, cluster_id=<cid>, limit=200, offset=0)
# hcloud
hcloud DWS ListMetricsData --cli-region=<region> --cluster_id=<cid> --metric_name=<name> --project_id=<pid> --offset=0 --limit=200 --from=<from_ts> --to=<to_ts>
# MCP
dws_autopilot_get_metric(project_id=<pid>, cluster_id=<cid>, metric_name=<name>, from_ts=<from>, to_ts=<to>, limit=200, offset=0, order_by="ctime", sort_by="desc")
| Parameter | Required/Optional | Description | Default |
|---|---|---|---|
| alarm_serial_number | Required | Alarm serial number | N/A |
| cluster_id | Required | Cluster ID | N/A |
| first_alarm_time | Required | First alarm time (millisecond timestamp) | N/A |
| alarm_name | Required | Alarm name | N/A |
| project_id | Optional | Project ID | Read from MCP Server config file (conf/dws_config.yaml) project_id field |
| region_id | Optional | Region identifier | Read from MCP Server config file (conf/dws_config.yaml) region_id field |
| node_name | Optional | Alert node name | Empty (cluster-level alarm) |
| cluster_name | Optional | Cluster name | Use cluster_id |
| alarm_severity | Optional | Alarm severity | N/A |
Parameter Resolution Rules: When project_id or region_id is not provided by the user, read from MCP Server config file conf/dws_config.yaml:
- project_id: Read the project_id field from conf/dws_config.yaml
- region_id: Read the region_id field from conf/dws_config.yaml
- python -c "import yaml; c=yaml.safe_load(open('conf/dws_config.yaml')); print(c.get('project_id',''))" for project_id, and python -c "import yaml; c=yaml.safe_load(open('conf/dws_config.yaml')); print(c.get('region_id',''))" for region_id

Strictly output and return according to the template in Output Format. Do not analyze or summarize the template content, do not omit any part, do not modify the template structure. The output must be consistent with the template.
python -c "from datetime import datetime,timezone,timedelta; print(datetime.fromtimestamp({ms}/1000,tz=timezone(timedelta(hours=8))).strftime('%Y-%m-%d %H:%M:%S'))"

| Document | Description |
|---|---|
| CLI Installation Guide | KooCLI installation and configuration |
| MCP Installation Guide | DWS Autopilot MCP Server installation and configuration |
| IAM Policies | Required permissions and policy JSON |
| Metric Reference | Metric key fields and query differences |
| I/O Background Knowledge | RAID architecture, I/O thresholds, scenario features, important principles |
| Output Format | HTML template and fill rules |
| diagnosis_json Format | diagnosis_json data structure, content field format rules, Ruby user constraint |
| I/O Diagnosis Reference | Scenario routing table, investigation directions, and output examples |