Alibabacloud Ecs Reboot Or Crash Diagnosis
v0.0.1Diagnose ECS instance reboot or crash issues. First checks for abnormal maintenance events, then uses Cloud Assistant to check for internal restarts or kerne...
ECS Instance Reboot/Crash Diagnosis
Diagnose root cause of ECS instance unexpected reboot or crash. Uses standard workflow: check platform maintenance events first, then check internal system logs. Supports both Linux and Windows systems.
Required Parameters
Before starting diagnosis, must obtain the following parameters from user:
| Parameter | Description | Example |
|---|---|---|
INSTANCE_ID | ECS instance ID | i-bp1a2b3c4d5e6f7g8h9j |
REGION_ID | Region ID | cn-hangzhou |
If user does not provide any of the above parameters, must ask user first. Do not start diagnosis.
Mandatory Execution Rules
- Must obtain parameters first — Instance ID and Region ID are required. Must ask user if missing.
- Standard workflow cannot be skipped — Must execute in order: Maintenance Event Check → OSType Detection → System Log Check
- Must check Cloud Assistant status before diagnostics — Before executing Step 3A/3B, must verify Cloud Assistant is running via
DescribeCloudAssistantStatus. If not running, provide alternative diagnostic approaches. - All diagnostic conclusions must be based on actual data — No fabrication, speculation, or assumptions
- Output format must be strictly followed — After diagnosis, must read the complete template in
references/output-format.md, output strictly according to template structure. No free-form output, no omitted sections, no changed hierarchy. Every placeholder{...}in the template must be filled with actual data.
Prerequisites
CLI Tools
- aliyun-cli 3.3.3+ (required) — For calling Alibaba Cloud API
- Installation & configuration: see CLI Installation Guide
AI-Mode Configuration (Required)
Before using aliyun CLI commands, must configure AI-Mode:
# Enable AI-Mode
aliyun configure ai-mode enable
# Set user-agent for skill identification
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-ecs-reboot-or-crash-diagnosis"
# Update plugins
aliyun plugin update
After diagnosis complete, disable AI-Mode:
aliyun configure ai-mode disable
Alibaba Cloud Credentials
Credentials must be pre-configured outside of agent session. Agent only verifies:
aliyun configure list
Instance Requirements
- Cloud Assistant client must be installed and running on the instance
- Alibaba Cloud Linux: Pre-installed by default
- Ubuntu/CentOS/Other: May require manual installation, check with
DescribeCloudAssistantStatusAPI - Installation guide: https://help.aliyun.com/document_detail/64930.html
- Instance status must be Running
- Note: If Cloud Assistant is not running, diagnostic commands cannot be executed remotely. Must provide manual diagnostic steps to user.
Required RAM Permissions
See RAM Policies for the complete permission list and custom policy example.
Step 1: Confirm Instance Information (Cannot Skip)
Verify instance exists and get basic information:
aliyun ecs describe-instances \
--biz-region-id <REGION_ID> \
--region <REGION_ID> \
--instance-ids '["<INSTANCE_ID>"]'
Confirm from returned JSON:
RegionId— Region ID (matches user provided)Status— Instance status (Running/Stopped)InstanceName— Instance nameOSType— Operating system type (windows / linux)
Record OSType for Step 3 branch selection.
Step 2: Check ECS Maintenance Events
Query instance historical system events to determine if platform maintenance caused reboot:
aliyun ecs describe-instance-history-events \
--biz-region-id <REGION_ID> \
--region <REGION_ID> \
--instance-id <INSTANCE_ID> \
--event-cycle-status Executed
Event Analysis:
| Event Type | Meaning | Determination | Next Step |
|---|---|---|---|
SystemMaintenance.Reboot | Reboot caused by system maintenance | Platform-initiated maintenance | Inform user, no further investigation needed |
SystemFailure.Reboot | Reboot caused by underlying hardware/system failure | Platform infrastructure failure | Suggest instance migration or contact support |
InstanceFailure.Reboot | Reboot caused by instance-level failure | Instance internal issue detected by platform | Must continue to Step 3 for system log check |
InstanceExpiration.Stop | Instance stopped due to expiration | Billing issue | Need renewal, no further investigation |
| No relevant events | No platform maintenance events found | Not platform-initiated | Continue to Step 3 |
Important Notes for InstanceFailure.Reboot:
- This event indicates the platform detected an instance-level anomaly and triggered automatic recovery
- Common causes: kernel panic, OOM, system hang, critical process failure
- Must execute Step 3 to check system logs for root cause
- Even if no obvious errors in logs, the instance may have been unresponsive at kernel level
If maintenance event found:
- Clearly inform user of reboot cause (event type, time, reason)
- Provide handling suggestions
- End diagnosis flow
If no maintenance event found:
- Continue to Step 3, check internal system logs based on OSType
Step 3A: Linux System Diagnosis (Execute when OSType is linux)
Step 3A.1: Check Cloud Assistant Status (Mandatory)
Before executing diagnostic commands, verify Cloud Assistant is running:
aliyun ecs describe-cloud-assistant-status \
--biz-region-id <REGION_ID> \
--region <REGION_ID> \
--instance-id <INSTANCE_ID>
Check the response:
{
"InstanceCloudAssistantStatusSet": {
"InstanceCloudAssistantStatus": [
{
"InstanceId": "i-xxx",
"RegionId": "cn-xxx",
"CloudAssistantStatus": "true",
"LastHeartbeatTime": "2026-04-09T07:26:58Z"
}
]
}
}
Important Notes:
CloudAssistantStatusis a string ("true"/"false"), not boolean- Check
LastHeartbeatTimeto ensure it's recent (within last few minutes) - Even if status is "true", RunCommand may still fail if service is unstable
- Always check RunCommand execution result and handle failures gracefully
- Ubuntu vs RHEL differences:
- RHEL/CentOS/Alibaba Cloud Linux: Service name is
kdump, crash files namedvmcore-* - Ubuntu/Debian: Service name is
kdump-tools, crash files nameddump.*anddmesg.* - Diagnostic script now checks both service names and all crash file types
- RHEL/CentOS/Alibaba Cloud Linux: Service name is
If CloudAssistantStatus is false or command fails:
- Cloud Assistant is not installed or not running on the instance
- Cannot proceed with remote diagnostic commands
- Alternative approaches:
- Guide user to SSH into the instance and check logs manually
- Provide manual diagnostic commands for user to execute
- Suggest installing Cloud Assistant: Installation Guide
- Check instance monitoring data via CloudMonitor API
If CloudAssistantStatus is true:
- Proceed to Step 3A.2
Step 3A.2: Execute Linux Diagnostic Script
Execute Linux diagnostic script via Cloud Assistant to check:
- System reboot records (
last reboot,/var/log/messagesor/var/log/syslog) - Kernel Panic records (
dmesg) - OOM records and
vm.panic_on_oomconfiguration - Kdump configuration and crash dump file status
- Crash dump files: vmcore (RHEL/CentOS) or dump./dmesg. (Ubuntu/Debian)
Complete diagnostic commands: see diagnostic-commands.md
Linux Result Analysis:
| Finding | Possible Cause | Suggestion |
|---|---|---|
| Kernel Panic + crash dump (vmcore/dump.*) | Kernel crash, dump file generated | Read dmesg.* file for panic reason, contact Alibaba Cloud technical support for deep analysis |
| Kernel Panic + no crash dump | Kernel crash, but kdump not configured or not working | Proceed to Step 5: Recommend Kdump configuration for future crash capture |
| OOM + panic_on_oom=1 | OOM triggered kernel panic | Disable panic_on_oom or increase memory |
| OOM Killer | Memory insufficient causing process killed | Optimize memory usage or upgrade instance type |
| SysRq triggered crash | Manual crash trigger via /proc/sysrq-trigger | Check if intentional test, review bash history and audit logs |
| Normal reboot records | User or program triggered reboot | Check cron jobs or ops scripts |
| No abnormal records | No system-level issues found | May be external factors, suggest monitoring |
Step 3B: Windows System Diagnosis (Execute when OSType is windows)
Step 3B.1: Check Cloud Assistant Status (Mandatory)
Before executing diagnostic commands, verify Cloud Assistant is running:
aliyun ecs describe-cloud-assistant-status \
--biz-region-id <REGION_ID> \
--region <REGION_ID> \
--instance-id <INSTANCE_ID>
Check the response:
CloudAssistantStatus: true— Cloud Assistant is running, proceed to Step 3B.2CloudAssistantStatus: false— Cloud Assistant is not running- Cannot proceed with remote diagnostic commands
- Guide user to SSH/RDP into instance and run diagnostics manually
- Suggest reinstalling Cloud Assistant: Windows Installation Guide
Step 3B.2: Execute Windows Diagnostic Script
Execute Windows diagnostic script via Cloud Assistant to check:
- System uptime and unexpected shutdown events (Event ID 41, 1074, 6008, 6006)
- Memory dump configuration and pagefile settings
- MEMORY.DMP and minidump files existence
- BSOD events and application crashes
Complete diagnostic commands: see diagnostic-commands.md
Windows Result Analysis:
| Finding | Possible Cause | Suggestion |
|---|---|---|
| Event 41 (Kernel-Power) | Unexpected shutdown/crash | Check for BSOD, dump files |
| Dump configured + dump file exists | System crashed and captured dump | Contact Alibaba Cloud technical support for dump file analysis |
| Dump configured + no dump file | Crash occurred but no dump captured | Check pagefile and disk space |
| Dump not configured | Crash dumps disabled | Enable memory dump for diagnosis |
| BSOD events found | Blue screen crash occurred | Check bug check code in dump |
| No abnormal events | No system-level crash records | May be power issue or external factor |
Step 3.5: Get Cloud Assistant Command Output (Required after Step 3)
After executing diagnostic script via RunCommand, query the execution result:
aliyun ecs describe-invocations \
--biz-region-id <REGION_ID> \
--region <REGION_ID> \
--instance-id <INSTANCE_ID> \
--invoke-id <INVOKE_ID>
Important Notes:
- Use
--instance-id(not--instance-id.1) for describe-invocations API - The
InvokeIdis returned by theRunCommandAPI call - Decode the
Outputfield from Base64 to get diagnostic results - Check
InvokeStatusto ensure command execution completed successfully
Step 4: Analyze Crash Dump Files
If Step 3 found crash dump files (vmcore on Linux, MEMORY.DMP/minidump on Windows), perform preliminary analysis.
Complete analysis commands: see diagnostic-commands.md
Important: If Linux vmcore files need deep analysis or Windows dump files (MEMORY.DMP/minidump) are found, recommend the user contact Alibaba Cloud technical support team for professional crash dump analysis assistance.
Step 5: Recommend Kdump Configuration (If Not Configured)
If Step 3A found Kernel Panic records but no vmcore files, must advise user to configure Kdump.
When to Recommend Kdump Configuration
- Kernel panic records found in dmesg or system logs, but
/var/crashhas no vmcore files - Kdump service status shows
inactiveorfailed /proc/cmdlinedoes not containcrashkernel=parameter
Key Points to Communicate
-
Why Kdump is needed: Without Kdump, kernel crashes will not generate vmcore files, making root cause analysis impossible.
-
Configuration requirements:
- Reserve memory for crash kernel via
crashkernel=kernel parameter - Enable and start the kdump (RHEL/CentOS) or kdump-tools (Ubuntu/Debian) service
- Ensure sufficient disk space in
/var/crash(or configured path)
- Reserve memory for crash kernel via
-
Configuration reference: Provide guidance from diagnostic-commands.md
Kdump Configuration Steps Summary
RHEL/CentOS/Alibaba Cloud Linux:
- Install:
yum install -y kexec-tools - Add
crashkernel=autoto kernel parameters in/etc/default/grub - Run
grub2-mkconfig -o /boot/grub2/grub.cfg - Reboot the instance
- Enable:
systemctl enable --now kdump
Ubuntu/Debian:
- Install:
apt-get install -y kdump-tools - Set
USE_KDUMP=1in/etc/default/kdump-tools - Run
update-grub(crashkernel parameter usually auto-added) - Reboot the instance
- Verify:
systemctl status kdump-tools
Windows Memory Dump Configuration
If Step 3B found BSOD events but no dump files:
- Verify pagefile is configured and has sufficient size
- Enable memory dump: System Properties → Advanced → Startup and Recovery → Settings
- Select "Automatic memory dump" or "Kernel memory dump"
- Ensure
CrashDumpEnabledregistry value is not 0
Final Output (Must execute after diagnosis complete)
After all diagnostic steps complete, must do both of the following:
- Read
references/output-format.md— Get complete output format template - Output strictly according to template structure — Choose corresponding template based on actual result
References
- Output Format — Diagnostic result output template
- Common Scenarios — Typical problem diagnosis examples
- Diagnostic Commands — Complete diagnostic scripts and analysis commands
