Install
openclaw skills install @huaweiclouddev/huawei-cloud-dws-dymem-diag

Root cause diagnosis skill for high memory usage on DWS clusters, based on KooCLI v3.2.0+ and DWS Autopilot MCP Server. It automatically collects memory metrics, analyzes root causes (customer-side / system-side), and outputs a standardized diagnosis report. Applicable to high DWS cluster memory usage, memory alarms, and OOM scenarios. Trigger words: "内存高", "内存告警", "内存诊断", "内存使用率过高", "内存不足", "OOM", "内存溢出", "动态内存使用率超阈值", "high memory", "memory alarm", "memory diagnosis"
This skill is dedicated to root cause diagnosis of high memory usage on DWS clusters. When a cluster triggers a high memory usage alarm, it automatically collects metric data, analyzes root causes (customer-side / system-side), and outputs a standardized diagnosis report.
Architecture: KooCLI (hcloud) → DWS Autopilot API → Cluster monitoring metrics; MCP Server (dws_autopilot) → Fallback channel for the same API
Applicable Scenarios:
Typical Use Cases:
Important Rules: All diagnosis conclusions must come from actual tool return results. Fabricating or assuming values is prohibited. Output only contains the diagnosis report; adding remediation suggestions, outputting SQL optimization statements, and using emoji are prohibited.
Background Knowledge: Memory usage formulas, CN/DN instance distinction, memory_pool analysis, memory thresholds, and important principles are documented in Memory Background Knowledge. Must read before diagnosis.
Run hcloud version to check the installed version, and hcloud configure list to check credential status.

# Query metric data
hcloud DWS ListMetricsData --cli-region=<region> --cluster_id=<id> --metric_name=<name> --project_id=<pid> --offset=0 --limit=200 --from=<from_ts> --to=<to_ts>
# Query host information
hcloud DWS ListHostOverview --cli-region=<region> --project_id=<pid> --offset=0 --limit=200
Choose between KooCLI and MCP Server, preferring KooCLI. Step 0 checks hcloud availability:
Fallback and Termination Strategy:
- On NETWORK_ERROR (connection timeout, network unreachable, etc.), automatically fall back to MCP Server and use MCP mode for all subsequent steps.
- If the MCP Server connection also fails, terminate with: "KooCLI network unavailable and MCP Server connection failed. Please check KooCLI network configuration or DWS Autopilot MCP Server configuration and retry"
- If MCP Server authentication fails, terminate with: "MCP Server authentication failed. Please check DWS Autopilot MCP Server authentication configuration and retry"
- Once a tool is selected, use it throughout without switching (except for NETWORK_ERROR fallback). If a call fails, retry once (maximum 2 attempts). If it still fails, mark the metric as "unavailable" and continue to the next step. When all metric queries fail, generate the diagnosis report directly.

| Common Parameter | hcloud Parameter | MCP Parameter |
|---|---|---|
| Region | --cli-region | (built into MCP connection) |
| Project ID | --project_id | project_id |
| Cluster ID | --cluster_id | cluster_id |
| Metric Name | --metric_name | metric_name |
| Start Time | --from | from_ts |
| End Time | --to | to_ts |
| Pagination Offset | --offset | offset |
| Pagination Limit | --limit | limit |
| Sort Field | (not supported) | order_by |
| Sort Direction | (not supported) | sort_by |
| Tool Name | Purpose | Key Parameters |
|---|---|---|
| dws_autopilot_get_clusters | Query cluster list | project_id, cluster_id, limit, offset |
| dws_autopilot_get_hosts | Query host information | project_id, cluster_id, limit, offset |
| dws_autopilot_get_metric | Query metric data | project_id, cluster_id, metric_name, from_ts, to_ts, limit, offset, order_by, sort_by |

metric_data Parameter Notes: metric_data does not support filtering by instance_name (no such parameter); query returns full cluster data and must be filtered by inst_name field for target instances; metric_data does not support period parameter (sampling period is automatically determined by the platform).
Available metric_name: MemStat, InstanceMemory, memory_diagnose_detail
Time Protocol: from_ts/to_ts must use Unix millisecond timestamps; all times are in UTC timezone; recommended time window: from 20 minutes before alarm time to alarm time (from_ts = first_alarm_time - 1200000).
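The recommended window can be computed directly from the alarm timestamp. The following is a minimal sketch; the function name `alarm_window` and the constant name are illustrative and not part of the skill.

```python
# 20 minutes expressed in milliseconds, matching from_ts = first_alarm_time - 1200000.
ALARM_LOOKBACK_MS = 1_200_000


def alarm_window(first_alarm_time_ms: int) -> tuple:
    """Return (from_ts, to_ts) as Unix millisecond timestamps in UTC."""
    return first_alarm_time_ms - ALARM_LOOKBACK_MS, first_alarm_time_ms


if __name__ == "__main__":
    from_ts, to_ts = alarm_window(1_750_000_000_000)
    print(from_ts, to_ts)
```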
Return Format: Success {"code": 0, "data": [...]}; Failure {"code": -1, "message": "error description"}. On failure, retry once; if still failing, use degradation path and mark as "unavailable" in the report.
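The retry rule can be sketched as a small wrapper around any query. This is an illustration only: `call_with_retry` is a hypothetical helper, and `call` stands in for whichever hcloud or MCP invocation is in use; it is assumed to return a dict in the documented `{"code": ..., "data": ...}` shape.

```python
from typing import Callable, Optional


def call_with_retry(call: Callable[[], dict], max_attempts: int = 2) -> Optional[list]:
    """Return the "data" list on success, or None when every attempt fails.

    A None result means the caller should mark the metric as "unavailable".
    """
    for _ in range(max_attempts):
        response = call()
        if response.get("code") == 0:
            return response.get("data", [])
    return None
```

A caller would treat a `None` result as the degradation path and continue with the next step.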
Key Fields per Metric:
For query differences per step, see Metric Reference.
All tool calls (MCP and hcloud) must use paginated queries to prevent single responses from being too large and exceeding token limits.
Pagination Rules:
- Use limit=200 uniformly (do not use 800 or other large values)
- Start at offset=0; if returned data count = 200, then offset += 200 and continue querying

MCP Call Example (using MemStat):
Page 1: dws_autopilot_get_metric(project_id, cluster_id, metric_name="MemStat", from_ts, to_ts, limit=200, offset=0)
If returned data length = 200:
Page 2: dws_autopilot_get_metric(project_id, cluster_id, metric_name="MemStat", from_ts, to_ts, limit=200, offset=200)
If returned data length < 200: Stop pagination, merge page 1 + page 2 data
hcloud Call Example (using MemStat):
Page 1: hcloud DWS ListMetricsData --cli-region=<region> --cluster_id=<id> --metric_name=MemStat --project_id=<pid> --offset=0 --limit=200 --from=<from_ts> --to=<to_ts>
If returned data count = 200:
Page 2: hcloud DWS ListMetricsData --cli-region=<region> --cluster_id=<id> --metric_name=MemStat --project_id=<pid> --offset=200 --limit=200 --from=<from_ts> --to=<to_ts>
If returned data count < 200: Stop pagination, merge data
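The same loop applies to both tools and can be written generically. The sketch below assumes a hypothetical `fetch_page(offset, limit)` callable that wraps either the MCP tool or the hcloud command and returns the list of records for one page.

```python
from typing import Callable, List

PAGE_SIZE = 200  # the uniform limit required by the pagination rules


def fetch_all_pages(fetch_page: Callable[[int, int], List[dict]],
                    page_size: int = PAGE_SIZE) -> List[dict]:
    """Request pages until one comes back short, then return the merged records."""
    merged: List[dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        merged.extend(page)
        if len(page) < page_size:
            break  # a short page means there is no further data
        offset += page_size
    return merged
```

When the total record count is an exact multiple of 200, the loop issues one extra request that returns an empty page and then stops.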
Before diagnosis, create an execution plan based on Steps 0-7, then execute sequentially. For tool selection strategy, see the "KooCLI Command Format Standard" section; subsequent steps will not repeat this. All MCP tool calls follow the "Pagination Specification" section; subsequent steps will not repeat pagination details.
Execute hcloud version; version >= 3.2.0 → tool_mode=hcloud, otherwise tool_mode=mcp.
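The version gate can be sketched as follows. The exact output format of `hcloud version` is not specified here, so this sketch assumes only that the output contains a dotted `major.minor.patch` number somewhere; `select_tool_mode` is a hypothetical helper name.

```python
import re

MIN_HCLOUD_VERSION = (3, 2, 0)


def select_tool_mode(version_output: str) -> str:
    """Return "hcloud" when the reported version is >= 3.2.0, otherwise "mcp"."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output or "")
    if match is None:
        return "mcp"  # hcloud missing or output unreadable
    version = tuple(int(part) for part in match.groups())
    # Tuple comparison is numeric per component, so 3.10.1 ranks above 3.2.0.
    return "hcloud" if version >= MIN_HCLOUD_VERSION else "mcp"
```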
hcloud Network Availability Probe: If tool_mode=hcloud, execute a lightweight API call (e.g., hcloud DWS ListClusters --cli-region=<region> --project_id=<project_id> --offset=0 --limit=1) to verify network connectivity:
Call metric query with metric_name="MemStat", time window: from_ts=first_alarm_time - 1200000, to_ts=first_alarm_time. In MCP mode, use limit=200 paginated query.
Autopilot Unavailable Determination: Returns 50201/RDS.9999 error → Skip Steps 1-6, mark all metrics as "unavailable", proceed directly to Step 7.
Parsing: Group by host_id and take the latest memory data for each node. Memory usage = (mem_total - mem_free - cached - buffers) / mem_total * 100%. Determine whether usage is too high (> 80%), globally high (all nodes > 70%), or imbalanced (deviation > 30%). Identify the two nodes with the highest memory usage and the top 3.
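A minimal sketch of the parsing rules above. The thresholds and the usage formula come from this step; how "deviation" is measured is not defined here, so the sketch assumes it is the gap between the highest and lowest node usage in percentage points. The function names are illustrative.

```python
def memory_usage_pct(mem_total: float, mem_free: float, cached: float, buffers: float) -> float:
    """Node memory usage in percent, per the formula in this step."""
    return (mem_total - mem_free - cached - buffers) / mem_total * 100


def classify_cluster(usage_by_host: dict) -> dict:
    """Classify a non-empty {host_id: usage_pct} mapping of latest per-node values."""
    values = list(usage_by_host.values())
    deviation = max(values) - min(values)  # assumption: max-min spread
    ranked = sorted(usage_by_host, key=usage_by_host.get, reverse=True)
    return {
        "high_mem_nodes": sorted(h for h, u in usage_by_host.items() if u > 80),
        "is_global_high": all(u > 70 for u in values),
        "is_imbalanced": deviation > 30,
        "mem_deviation": deviation,
        "cluster_avg_mem": sum(values) / len(values),
        "max_mem_hosts": ranked[:2],
        "top3_mem_nodes": ranked[:3],
    }
```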
Node Scope Classification (based on high_mem_nodes count):
Output: mem_by_host, max_mem_hosts, problem_host_id, problem_host_mem_usage, problem_host_used, problem_host_total, problem_host_available, top3_mem_nodes, cluster_avg_mem, is_imbalanced, is_global_high, mem_deviation, high_mem_nodes, mem_scope
Call host information query (MCP mode limit=200 paginated), build host_id → {host_name, ip} mapping table. Only query host information for nodes involved in max_mem_hosts from Step 1 output.
Output: host_id_to_info_map, problem_host_ip, node_name
Call metric query with metric_name="InstanceMemory", time window same as Step 1. In MCP mode, use limit=200 paginated query. Note: metric_data does not support filtering by instance_name; query returns full cluster data, must filter by inst_name field for target instances.
Parsing: Extract memory usage for each instance (CN/DN). Dynamic memory usage = (dynamic_used_memory / max_dynamic_memory) * 100%; Process memory usage = (process_used_memory / max_process_memory) * 100%. Find instances with highest memory usage.
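The two ratios can be computed per instance as sketched below. The key names mirror the terms in the formulas above; whether InstanceMemory records expose them under exactly these keys is an assumption, and the helper names are illustrative.

```python
def instance_usage(record: dict) -> dict:
    """Compute dynamic and process memory usage (percent) for one instance record."""
    dynamic_pct = record["dynamic_used_memory"] / record["max_dynamic_memory"] * 100
    process_pct = record["process_used_memory"] / record["max_process_memory"] * 100
    return {
        "inst_name": record["inst_name"],
        "dynamic_pct": dynamic_pct,
        "process_pct": process_pct,
    }


def top_dynamic_instances(records: list, n: int = 3) -> list:
    """Return the n instances with the highest dynamic memory usage."""
    usages = [instance_usage(r) for r in records]
    return sorted(usages, key=lambda u: u["dynamic_pct"], reverse=True)[:n]
```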
CN/DN Instance Distinction (based on inst_name field, InstanceMemory has no instance_type field):
Memory Type Determination (based on InstanceMemory dynamic memory vs process memory ratio):
Output: instance_memory_data, high_memory_instances, instance_type_distribution, top3_dynamic_instances, inst_type, mem_type
Call metric query with metric_name="memory_diagnose_detail", time window same as Step 1. Note: metric_data does not support filtering by instance_name; query returns full cluster data, must filter by max_mem_hosts from Step 1 output. MCP mode must use limit=200 paginated query (this metric has the largest data volume), until returned count < 200, merge all paginated data before filtering.
Parsing: Extract active query statements, execution users (userName), memory usage, session information, SQL-level memory statistics.
Time Annotation Rule: ctime in memory_diagnose_detail is the Autopilot collection snapshot time, not the actual SQL start time. If active_sessions contains duration_ms field, SQL start time = ctime - duration_ms, annotated as "start time"; if duration_ms is unavailable, use ctime directly, annotated as "collection time" (do not annotate collection time as "start time").
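The annotation rule reduces to one branch. In this sketch, `annotate_time` is a hypothetical helper and a missing duration_ms is represented as `None`.

```python
from typing import Optional, Tuple


def annotate_time(ctime_ms: int, duration_ms: Optional[int]) -> Tuple[int, str]:
    """Return (timestamp_ms, label) following the time annotation rule."""
    if duration_ms is not None:
        # SQL start time is derived from the snapshot time and the elapsed duration.
        return ctime_ms - duration_ms, "start time"
    # Without a duration, only the collection snapshot time is known.
    return ctime_ms, "collection time"
```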
Active User Statistics Rule: Only count users and connections with state=active; idle state not counted. Group by userName and aggregate memory usage to identify top users.
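A sketch of the active-user aggregation. The `userName` and `state` fields are named in this step; the per-session memory key `used_memory` is a placeholder assumption, since the actual field name is not given here.

```python
from collections import defaultdict


def top_active_users(sessions: list, n: int = 5) -> list:
    """Sum memory per userName over active sessions and return the top n pairs."""
    totals = defaultdict(int)
    for session in sessions:
        if session.get("state") != "active":
            continue  # idle sessions are not counted
        totals[session["userName"]] += session["used_memory"]  # hypothetical key
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
```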
Output: user_memory_top5, session_memory_top5, sql_memory_top5, total_memory_by_users, total_memory_by_sqls, high_memory_sql_detected, high_memory_sql_info, high_freq_queries, heavy_queries, memory_pool_data, idle_session_with_high_mem
Based on data collected in Steps 1-4, combined with user identity for memory high cause analysis.
Diagnosis Priority: Look at scope first, then find causes; business anomalies first, configuration and system last.
Time Formatting: All Unix millisecond timestamps (first_alarm_time, ctime, etc.) are in UTC timezone. In the report, they must be converted to Beijing time (UTC+8) string YYYY-MM-DD HH:MM:SS. Can use python -c "from datetime import datetime,timezone,timedelta; print(datetime.fromtimestamp({ms}/1000,tz=timezone(timedelta(hours=8))).strftime('%Y-%m-%d %H:%M:%S'))". Do not mentally calculate timestamp values.
User Identity Judgment (based on database user, i.e., memory_diagnose_detail active_sessions userName):
memory_pool Analysis (from Step 4 memory_pool_data):
Note: Single SQL and multi-user concurrent can both be matched simultaneously, each listed as an independent anomaly item, not mutually exclusive.
Comprehensive Judgment Rules:
| Condition | Marker |
|---|---|
| All nodes usage > 70% | Cluster memory globally high |
| Inter-node deviation > 30% | Cluster memory load imbalanced, possible data skew |
| Single instance dynamic memory usage significantly abnormal | Single instance memory anomaly |
| high_memory_sql_detected = true | Single SQL causing high memory, specific SQL identified |
| omm/Ruby proportion in user_memory_top5 > 30% | System internal tasks consuming high memory (system-side) |
| MemStat/InstanceMemory time series persistently monotonically increasing | Possible memory leak |
| work_mem usage rate > 80% | work_mem configuration too large or SQL sort/hash spill |
| shared_pool usage rate > 80% | Shared cache pressure, excessive connections |
Statistics and Aggregation Requirements:
width = Math.round(ratio / max_ratio * 200), where max_ratio is the top user's ratio. Only count database users with state=active. If fewer than 3 data points, list actual count.

Output: root_cause_category, mem_scope, inst_type, mem_type, summary, high_memory_sql_info, session_info, top3_memory_users, top3_memory_statements, high_freq_queries, heavy_queries, memory_pool_summary, concurrent_mode
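If the width formula in the statistics requirements is evaluated in Python rather than JavaScript, note that Python's built-in `round` rounds halves to the nearest even integer, whereas `Math.round` rounds halves up. The sketch below reproduces the `Math.round` behavior; `bar_width` is an illustrative name.

```python
import math


def bar_width(ratio: float, max_ratio: float) -> int:
    """Scale a user's ratio to a width where the top user's ratio maps to 200."""
    scaled = ratio / max_ratio * 200
    # floor(x + 0.5) rounds halves up, like JavaScript's Math.round.
    return math.floor(scaled + 0.5)
```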
Prioritize getting cluster name from input parameter cluster_name. If empty, call dws_autopilot_get_clusters with project_id and cluster_id to get cluster_name. If call also fails, use cluster_id as resource_name.
Output: resource_name, resource_id
Generate an HTML report following the template in the "Output Format" section. After generating the report, save the HTML file to the current working directory (workspace root folder) with the filename dws_mem_diagnosis_report_{timestamp}.html, where {timestamp} is the current machine local time formatted as yyyyMMdd_HHmmss (e.g., dws_mem_diagnosis_report_20260623_150421.html).
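The filename rule can be sketched as below. The yyyyMMdd_HHmmss pattern corresponds to `%Y%m%d_%H%M%S` in Python's `strftime`; `report_filename` is a hypothetical helper, and the optional `now` argument exists only to make the example deterministic.

```python
from datetime import datetime
from typing import Optional


def report_filename(now: Optional[datetime] = None) -> str:
    """Build the report filename from the machine's local time."""
    now = now or datetime.now()  # naive local time, as the rule requires
    return f"dws_mem_diagnosis_report_{now.strftime('%Y%m%d_%H%M%S')}.html"


if __name__ == "__main__":
    print(report_filename(datetime(2026, 6, 23, 15, 4, 21)))
```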
# hcloud
hcloud DWS ListClusters --cli-region=<region> --project_id=<pid> --offset=0 --limit=200
# MCP
dws_autopilot_get_clusters(project_id=<pid>)
# hcloud
hcloud DWS ListHostOverview --cli-region=<region> --project_id=<pid> --offset=0 --limit=200
# MCP
dws_autopilot_get_hosts(project_id=<pid>, cluster_id=<cid>, limit=200, offset=0)
# hcloud
hcloud DWS ListMetricsData --cli-region=<region> --cluster_id=<cid> --metric_name=<name> --project_id=<pid> --offset=0 --limit=200 --from=<from_ts> --to=<to_ts>
# MCP
dws_autopilot_get_metric(project_id=<pid>, cluster_id=<cid>, metric_name=<name>, from_ts=<from>, to_ts=<to>, limit=200, offset=0, order_by="ctime", sort_by="desc")
| Parameter | Required/Optional | Description | Default |
|---|---|---|---|
| alarm_serial_number | Required | Alarm serial number | N/A |
| project_id | Required | Project ID | N/A |
| cluster_id | Required | Cluster ID | N/A |
| first_alarm_time | Required | First alarm time (millisecond timestamp) | N/A |
| alarm_name | Required | Alarm name | N/A |
| region_id | Optional | Region identifier | N/A |
| node_name | Optional | Alert node name | Empty (cluster-level alarm) |
| instance_name | Optional | Instance name | Empty |
| cluster_name | Optional | Cluster name | Use cluster_id |
| alarm_severity | Optional | Alarm severity | N/A |
Strictly output and return according to the template in Output Format. Do not analyze or summarize the template content, do not omit any part, do not modify the template structure. The output must be consistent with the template.
python -c "from datetime import datetime,timezone,timedelta; print(datetime.fromtimestamp({ms}/1000,tz=timezone(timedelta(hours=8))).strftime('%Y-%m-%d %H:%M:%S'))"

| Document | Description |
|---|---|
| CLI Installation Guide | KooCLI installation and configuration |
| MCP Installation Guide | DWS Autopilot MCP Server installation and configuration |
| IAM Policies | Required permissions and policy JSON |
| Metric Reference | Metric key fields and query differences |
| Memory Background Knowledge | Memory formulas, CN/DN distinction, memory_pool analysis, thresholds |
| Output Format | HTML template and fill rules |