Install
openclaw skills install alibabacloud-ecs-reboot-or-crash-diagnosis
Diagnose ECS instance reboot or crash issues. First checks for abnormal maintenance events, then uses Cloud Assistant to check for internal restarts or kernel panics. Use this skill when users report an unexpected ECS instance reboot, crash, abnormal shutdown, kernel panic, or OOM. Supports vmcore file analysis, kdump configuration, system log analysis, and Windows crash dump analysis.
Diagnose the root cause of an unexpected ECS instance reboot or crash. Uses a standard workflow: check platform maintenance events first, then check internal system logs. Supports both Linux and Windows systems.
Before starting the diagnosis, obtain the following parameters from the user:
| Parameter | Description | Example |
|---|---|---|
| INSTANCE_ID | ECS instance ID | i-bp1a2b3c4d5e6f7g8h9j |
| REGION_ID | Region ID | cn-hangzhou |
If the user has not provided both parameters, ask for the missing one first. Do not start the diagnosis until both are available.
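The parameter gate above can be sketched as a small shell check. This is a minimal sketch, not part of the skill itself: the function name `check_params` and its messages are hypothetical, and it only tests that both values are non-empty.

```shell
# Hypothetical helper: prints "ok" when both parameters are present,
# otherwise names the first missing one and returns non-zero.
check_params() {
  instance_id="$1"
  region_id="$2"
  if [ -z "$instance_id" ]; then
    echo "missing INSTANCE_ID"
    return 1
  fi
  if [ -z "$region_id" ]; then
    echo "missing REGION_ID"
    return 1
  fi
  echo "ok"
}
```

A caller would stop and ask the user whenever the function returns non-zero, for example `check_params "$INSTANCE_ID" "$REGION_ID" || exit 1`.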
- Check Cloud Assistant status with DescribeCloudAssistantStatus before running diagnostic commands. If it is not running, provide alternative diagnostic approaches.
- Follow references/output-format.md and output strictly according to the template structure. No free-form output, no omitted sections, no changed hierarchy. Every placeholder {...} in the template must be filled with actual data.

Before using aliyun CLI commands, configure AI-Mode:
# Enable AI-Mode
aliyun configure ai-mode enable
# Set user-agent for skill identification
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-ecs-reboot-or-crash-diagnosis"
# Update plugins
aliyun plugin update
After the diagnosis is complete, disable AI-Mode:
aliyun configure ai-mode disable
Credentials must be pre-configured outside the agent session. The agent only verifies them:
aliyun configure list
- DescribeCloudAssistantStatus API

See RAM Policies for the complete permission list and custom policy example.
Verify instance exists and get basic information:
aliyun ecs describe-instances \
--biz-region-id <REGION_ID> \
--region <REGION_ID> \
--instance-ids '["<INSTANCE_ID>"]'
Confirm from returned JSON:
- RegionId: Region ID (must match the user-provided value)
- Status: Instance status (Running/Stopped)
- InstanceName: Instance name
- OSType: Operating system type (windows / linux)

Record OSType for Step 3 branch selection.
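The branch selection driven by OSType can be sketched as follows. The function name `step3_branch` is hypothetical; the mapping assumes Step 3A is the Linux path and Step 3B is the Windows path, as the later steps in this document suggest.

```shell
# Hypothetical helper: maps the recorded OSType value to a Step 3 branch.
step3_branch() {
  case "$1" in
    linux)   echo "3A" ;;
    windows) echo "3B" ;;
    *)       echo "unknown"; return 1 ;;  # unexpected value: stop and re-check Step 1
  esac
}
```

Treating any other value as an error avoids silently running the wrong diagnostic script.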
Query instance historical system events to determine if platform maintenance caused reboot:
aliyun ecs describe-instance-history-events \
--biz-region-id <REGION_ID> \
--region <REGION_ID> \
--instance-id <INSTANCE_ID> \
--event-cycle-status Executed
Event Analysis:
| Event Type | Meaning | Determination | Next Step |
|---|---|---|---|
| SystemMaintenance.Reboot | Reboot caused by system maintenance | Platform-initiated maintenance | Inform user, no further investigation needed |
| SystemFailure.Reboot | Reboot caused by underlying hardware/system failure | Platform infrastructure failure | Suggest instance migration or contact support |
| InstanceFailure.Reboot | Reboot caused by instance-level failure | Instance internal issue detected by platform | Must continue to Step 3 for system log check |
| InstanceExpiration.Stop | Instance stopped due to expiration | Billing issue | Need renewal, no further investigation |
| No relevant events | No platform maintenance events found | Not platform-initiated | Continue to Step 3 |
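The table above amounts to a lookup from event type to next action. A minimal sketch, assuming the event type string has already been extracted from the API response; the function name and the short action labels are hypothetical.

```shell
# Hypothetical helper: maps an event type from the table to a short action label.
next_step_for_event() {
  case "$1" in
    SystemMaintenance.Reboot) echo "inform-user" ;;
    SystemFailure.Reboot)     echo "suggest-migration-or-support" ;;
    InstanceFailure.Reboot)   echo "continue-to-step-3" ;;
    InstanceExpiration.Stop)  echo "renew-instance" ;;
    *)                        echo "continue-to-step-3" ;;  # no relevant event found
  esac
}
```

Both InstanceFailure.Reboot and the no-event case lead to Step 3, so the default branch deliberately shares that label.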
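Important note for InstanceFailure.Reboot: the platform detected an instance-internal issue, so the diagnosis must continue to Step 3 for the system log check.
If a maintenance event is found: follow the Next Step column of the table above.
If no maintenance event is found: continue to Step 3.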
Before executing diagnostic commands, verify Cloud Assistant is running:
aliyun ecs describe-cloud-assistant-status \
--biz-region-id <REGION_ID> \
--region <REGION_ID> \
--instance-id <INSTANCE_ID>
Check the response:
{
  "InstanceCloudAssistantStatusSet": {
    "InstanceCloudAssistantStatus": [
      {
        "InstanceId": "i-xxx",
        "RegionId": "cn-xxx",
        "CloudAssistantStatus": "true",
        "LastHeartbeatTime": "2026-04-09T07:26:58Z"
      }
    ]
  }
}
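Because CloudAssistantStatus arrives as the string "true" or "false", it is safest to compare it as a string, and to treat the heartbeat as stale once it is older than a few minutes. A minimal sketch under stated assumptions: both function names are hypothetical, the 300-second default is an illustrative choice, and the timestamps are assumed to be already converted to epoch seconds.

```shell
# Hypothetical helpers; names and the 300-second default are illustrative assumptions.

# Succeeds only when the status string is exactly "true".
cloud_assistant_running() {
  [ "$1" = "true" ]
}

# Succeeds when the heartbeat is no older than max_age seconds.
# Arguments: heartbeat epoch seconds, current epoch seconds, optional max age.
heartbeat_is_recent() {
  hb="$1"
  now="$2"
  max_age="${3:-300}"
  [ $((now - hb)) -le "$max_age" ]
}
```

Keeping the time conversion outside the helper keeps the comparison portable, since date parsing options differ between platforms.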
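Important Notes:
- CloudAssistantStatus is a string ("true"/"false"), not a boolean.
- Check LastHeartbeatTime to ensure it is recent (within the last few minutes).
- RHEL/CentOS/Alibaba Cloud Linux: the service is kdump, and crash files are named vmcore-*.
- Ubuntu/Debian: the service is kdump-tools, and crash files are named dump.* and dmesg.*.

If CloudAssistantStatus is false or the command fails: provide alternative diagnostic approaches instead of running the script.
If CloudAssistantStatus is true: proceed with the Linux diagnostic script.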
Execute Linux diagnostic script via Cloud Assistant to check:
- Reboot history and system logs (last reboot, /var/log/messages or /var/log/syslog)
- Kernel messages (dmesg)
- vm.panic_on_oom configuration

Complete diagnostic commands: see diagnostic-commands.md
Linux Result Analysis:
| Finding | Possible Cause | Suggestion |
|---|---|---|
| Kernel Panic + crash dump (vmcore/dump.*) | Kernel crash, dump file generated | Read dmesg.* file for panic reason, contact Alibaba Cloud technical support for deep analysis |
| Kernel Panic + no crash dump | Kernel crash, but kdump not configured or not working | Proceed to Step 5: Recommend Kdump configuration for future crash capture |
| OOM + panic_on_oom=1 | OOM triggered kernel panic | Disable panic_on_oom or increase memory |
| OOM Killer | Memory insufficient causing process killed | Optimize memory usage or upgrade instance type |
| SysRq triggered crash | Manual crash trigger via /proc/sysrq-trigger | Check if intentional test, review bash history and audit logs |
| Normal reboot records | User or program triggered reboot | Check cron jobs or ops scripts |
| No abnormal records | No system-level issues found | May be external factors, suggest monitoring |
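To show how log lines might be sorted into the findings in the table above, here is a minimal sketch. The function name is hypothetical, and the match strings are assumptions based on common Linux kernel messages rather than patterns taken from diagnostic-commands.md; real logs vary by kernel version.

```shell
# Hypothetical helper: assigns a single log line to one finding category.
# The match strings are assumed typical kernel messages, not an exhaustive list.
classify_log_line() {
  case "$1" in
    *"Kernel panic"*)                  echo "kernel-panic" ;;
    *"Out of memory"*|*"oom-killer"*)  echo "oom" ;;
    *[Ss]ys[Rr]q*"Trigger a crash"*)   echo "sysrq-crash" ;;
    *)                                 echo "none" ;;
  esac
}
```

A line such as `Kernel panic - not syncing: Fatal exception` falls into the first category, while ordinary service messages fall through to `none`.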
Before executing diagnostic commands, verify Cloud Assistant is running:
aliyun ecs describe-cloud-assistant-status \
--biz-region-id <REGION_ID> \
--region <REGION_ID> \
--instance-id <INSTANCE_ID>
Check the response:
- CloudAssistantStatus is true: Cloud Assistant is running, proceed to Step 3B.2
- CloudAssistantStatus is false: Cloud Assistant is not running
Execute Windows diagnostic script via Cloud Assistant to check:
Complete diagnostic commands: see diagnostic-commands.md
Windows Result Analysis:
| Finding | Possible Cause | Suggestion |
|---|---|---|
| Event 41 (Kernel-Power) | Unexpected shutdown/crash | Check for BSOD, dump files |
| Dump configured + dump file exists | System crashed and captured dump | Contact Alibaba Cloud technical support for dump file analysis |
| Dump configured + no dump file | Crash occurred but no dump captured | Check pagefile and disk space |
| Dump not configured | Crash dumps disabled | Enable memory dump for diagnosis |
| BSOD events found | Blue screen crash occurred | Check bug check code in dump |
| No abnormal events | No system-level crash records | May be power issue or external factor |
After executing diagnostic script via RunCommand, query the execution result:
aliyun ecs describe-invocations \
--biz-region-id <REGION_ID> \
--region <REGION_ID> \
--instance-id <INSTANCE_ID> \
--invoke-id <INVOKE_ID>
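The Output field in the result is Base64-encoded, so it has to be decoded before the diagnostic text can be read. A minimal sketch, assuming the encoded value has already been extracted from the response and that a `base64` utility accepting `-d` is available; the function name is hypothetical.

```shell
# Hypothetical helper: decodes a Base64 string passed as the first argument.
# Assumes a base64 utility that accepts -d (for example GNU coreutils).
decode_output() {
  printf '%s' "$1" | base64 -d
}
```

For example, `decode_output "a2VybmVsIHBhbmlj"` prints `kernel panic`.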
Important Notes:
- Use --instance-id (not --instance-id.1) for the describe-invocations API.
- InvokeId is returned by the RunCommand API call.
- Decode the Output field from Base64 to get the diagnostic results.
- Check InvokeStatus to ensure command execution completed successfully.

If Step 3 found crash dump files (vmcore on Linux, MEMORY.DMP/minidump on Windows), perform preliminary analysis.
Complete analysis commands: see diagnostic-commands.md
Important: If Linux vmcore files need deep analysis or Windows dump files (MEMORY.DMP/minidump) are found, recommend the user contact Alibaba Cloud technical support team for professional crash dump analysis assistance.
If Step 3A found Kernel Panic records but no vmcore files, advise the user to configure Kdump.
Indicators that Kdump is not configured or not working:
- /var/crash has no vmcore files
- The kdump service is inactive or failed
- /proc/cmdline does not contain the crashkernel= parameter

Why Kdump is needed: Without Kdump, kernel crashes will not generate vmcore files, making root cause analysis impossible.
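Two of the Kdump checks mentioned here, the crashkernel= parameter and the presence of vmcore files, can be sketched as small shell helpers. This is an illustration under assumptions, not the skill's own script: both function names are hypothetical, and `has_vmcore` only looks for files whose names start with `vmcore`, so it does not cover the dump.* naming used by kdump-tools.

```shell
# Hypothetical helpers for two of the Kdump indicators.

# Succeeds when the given kernel command line contains a crashkernel= parameter.
has_crashkernel() {
  case "$1" in
    *crashkernel=*) return 0 ;;
    *)              return 1 ;;
  esac
}

# Succeeds when at least one file named vmcore* exists under the given directory.
has_vmcore() {
  [ -n "$(find "$1" -type f -name 'vmcore*' 2>/dev/null | head -n 1)" ]
}
```

On a live instance the inputs would typically be `"$(cat /proc/cmdline)"` and `/var/crash`.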
Configuration requirements:
- crashkernel= kernel parameter
- /var/crash (or configured path)

Configuration reference: Provide guidance from diagnostic-commands.md
RHEL/CentOS/Alibaba Cloud Linux:
1. Install the package: yum install -y kexec-tools
2. Add crashkernel=auto to the kernel parameters in /etc/default/grub
3. Regenerate the GRUB configuration: grub2-mkconfig -o /boot/grub2/grub.cfg
4. Enable and start the service: systemctl enable --now kdump

Ubuntu/Debian:
1. Install the package: apt-get install -y kdump-tools
2. Set USE_KDUMP=1 in /etc/default/kdump-tools
3. Run update-grub (the crashkernel parameter is usually added automatically)
4. Verify the service: systemctl status kdump-tools

If Step 3B found BSOD events but no dump files:
- Verify that the CrashDumpEnabled registry value is not 0

After all diagnostic steps are complete, do both of the following:
- references/output-format.md: get the complete output format template