Alibaba Cloud ECS Reboot or Crash Diagnosis

v0.0.1

Diagnose ECS instance reboot or crash issues. First checks for abnormal maintenance events, then uses Cloud Assistant to check for internal restarts or kerne...

License: MIT-0
by alibabacloud-skills-team@sdk-team

ECS Instance Reboot/Crash Diagnosis

Diagnose root cause of ECS instance unexpected reboot or crash. Uses standard workflow: check platform maintenance events first, then check internal system logs. Supports both Linux and Windows systems.

Required Parameters

Before starting the diagnosis, obtain the following parameters from the user:

| Parameter | Description | Example |
|---|---|---|
| INSTANCE_ID | ECS instance ID | i-bp1a2b3c4d5e6f7g8h9j |
| REGION_ID | Region ID | cn-hangzhou |

If either parameter is missing, ask the user for it first. Do not start the diagnosis until both are provided.
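The parameter gate can be sketched as a small check. This is a minimal illustration only: the function name `missing_parameters` and the dictionary layout are hypothetical, and the check covers presence only, not the format of the IDs.

```python
# Required parameter names, taken from the parameter table.
REQUIRED_PARAMETERS = ("INSTANCE_ID", "REGION_ID")


def missing_parameters(params):
    """Return the required parameter names that are absent or blank."""
    missing = []
    for name in REQUIRED_PARAMETERS:
        value = params.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    provided = {"INSTANCE_ID": "i-bp1a2b3c4d5e6f7g8h9j", "REGION_ID": ""}
    gaps = missing_parameters(provided)
    if gaps:
        # Diagnosis must not start until the user supplies these.
        print("Ask the user for: " + ", ".join(gaps))
    else:
        print("All required parameters present")
```

An empty result means the diagnosis may proceed to Step 1.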

Mandatory Execution Rules

  1. Must obtain parameters first — Instance ID and Region ID are required. Must ask user if missing.
  2. Standard workflow cannot be skipped — Must execute in order: Maintenance Event Check → OSType Detection → System Log Check
  3. Must check Cloud Assistant status before diagnostics — Before executing Step 3A/3B, must verify Cloud Assistant is running via DescribeCloudAssistantStatus. If not running, provide alternative diagnostic approaches.
  4. All diagnostic conclusions must be based on actual data — No fabrication, speculation, or assumptions
  5. Output format must be strictly followed — After diagnosis, must read the complete template in references/output-format.md, output strictly according to template structure. No free-form output, no omitted sections, no changed hierarchy. Every placeholder {...} in the template must be filled with actual data.

Prerequisites

CLI Tools

  • aliyun-cli 3.3.3+ (required) — For calling Alibaba Cloud API
  • Installation & configuration: see CLI Installation Guide

AI-Mode Configuration (Required)

Before using aliyun CLI commands, must configure AI-Mode:

# Enable AI-Mode
aliyun configure ai-mode enable

# Set user-agent for skill identification
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-ecs-reboot-or-crash-diagnosis"

# Update plugins
aliyun plugin update

After diagnosis complete, disable AI-Mode:

aliyun configure ai-mode disable

Alibaba Cloud Credentials

Credentials must be pre-configured outside of agent session. Agent only verifies:

aliyun configure list

Instance Requirements

  • Cloud Assistant client must be installed and running on the instance
  • Instance status must be Running
  • Note: If Cloud Assistant is not running, diagnostic commands cannot be executed remotely. Must provide manual diagnostic steps to user.

Required RAM Permissions

See RAM Policies for the complete permission list and custom policy example.


Step 1: Confirm Instance Information (Cannot Skip)

Verify instance exists and get basic information:

aliyun ecs describe-instances \
  --biz-region-id <REGION_ID> \
  --region <REGION_ID> \
  --instance-ids '["<INSTANCE_ID>"]'

Confirm from returned JSON:

  • RegionId — Region ID (matches user provided)
  • Status — Instance status (Running/Stopped)
  • InstanceName — Instance name
  • OSType — Operating system type (windows / linux)

Record OSType for Step 3 branch selection.
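Extracting these fields from the returned JSON could look like the sketch below. It assumes the response follows the usual DescribeInstances layout (`Instances.Instance[]`); the function name `summarize_instance` and the sample payload are made up for illustration.

```python
import json

# Fields the step asks to confirm.
FIELDS = ("RegionId", "Status", "InstanceName", "OSType")

# Illustrative payload only, trimmed to the fields used here.
SAMPLE_RESPONSE = json.dumps({
    "Instances": {
        "Instance": [
            {
                "InstanceId": "i-bp1a2b3c4d5e6f7g8h9j",
                "RegionId": "cn-hangzhou",
                "Status": "Running",
                "InstanceName": "demo",
                "OSType": "linux",
            }
        ]
    }
})


def summarize_instance(response_text, instance_id):
    """Return the confirmed fields for instance_id, or None if it is not listed."""
    data = json.loads(response_text)
    for inst in data.get("Instances", {}).get("Instance", []):
        if inst.get("InstanceId") == instance_id:
            return {key: inst.get(key) for key in FIELDS}
    return None


if __name__ == "__main__":
    info = summarize_instance(SAMPLE_RESPONSE, "i-bp1a2b3c4d5e6f7g8h9j")
    print(info)
```

A `None` result means the instance was not found in the given region, so the diagnosis should stop and the parameters should be re-confirmed with the user.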


Step 2: Check ECS Maintenance Events

Query instance historical system events to determine if platform maintenance caused reboot:

aliyun ecs describe-instance-history-events \
  --biz-region-id <REGION_ID> \
  --region <REGION_ID> \
  --instance-id <INSTANCE_ID> \
  --event-cycle-status Executed

Event Analysis:

| Event Type | Meaning | Determination | Next Step |
|---|---|---|---|
| SystemMaintenance.Reboot | Reboot caused by system maintenance | Platform-initiated maintenance | Inform user, no further investigation needed |
| SystemFailure.Reboot | Reboot caused by underlying hardware/system failure | Platform infrastructure failure | Suggest instance migration or contact support |
| InstanceFailure.Reboot | Reboot caused by instance-level failure | Instance internal issue detected by platform | Must continue to Step 3 for system log check |
| InstanceExpiration.Stop | Instance stopped due to expiration | Billing issue | Need renewal, no further investigation |
| No relevant events | No platform maintenance events found | Not platform-initiated | Continue to Step 3 |

Important Notes for InstanceFailure.Reboot:

  • This event indicates the platform detected an instance-level anomaly and triggered automatic recovery
  • Common causes: kernel panic, OOM, system hang, critical process failure
  • Must execute Step 3 to check system logs for root cause
  • Even if no obvious errors in logs, the instance may have been unresponsive at kernel level

If a maintenance event is found (any type other than InstanceFailure.Reboot):

  • Clearly inform user of reboot cause (event type, time, reason)
  • Provide handling suggestions
  • End diagnosis flow

If no maintenance event is found, or the event is InstanceFailure.Reboot:

  • Continue to Step 3, check internal system logs based on OSType
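The branching in this step can be written as a lookup over event type names. This is a sketch under assumptions: the function name `plan_after_event_check` is hypothetical, and it takes a plain list of event type names rather than parsing the API response.

```python
# Event type -> (determination, next step), copied from the event analysis table.
EVENT_ACTIONS = {
    "SystemMaintenance.Reboot": (
        "Platform-initiated maintenance",
        "Inform user, no further investigation needed",
    ),
    "SystemFailure.Reboot": (
        "Platform infrastructure failure",
        "Suggest instance migration or contact support",
    ),
    "InstanceFailure.Reboot": (
        "Instance internal issue detected by platform",
        "Must continue to Step 3 for system log check",
    ),
    "InstanceExpiration.Stop": (
        "Billing issue",
        "Need renewal, no further investigation",
    ),
}

# Event types that still require the system log check in Step 3.
CONTINUE_TO_STEP_3 = {"InstanceFailure.Reboot"}


def plan_after_event_check(event_names):
    """Summarize relevant events and decide whether Step 3 is still needed."""
    relevant = [name for name in event_names if name in EVENT_ACTIONS]
    findings = [(name,) + EVENT_ACTIONS[name] for name in relevant]
    continue_to_step_3 = (not relevant) or any(
        name in CONTINUE_TO_STEP_3 for name in relevant
    )
    return {"findings": findings, "continue_to_step_3": continue_to_step_3}


if __name__ == "__main__":
    print(plan_after_event_check(["InstanceFailure.Reboot"]))
```

An empty list and an `InstanceFailure.Reboot` event both lead to Step 3; the other three event types end the flow after the user is informed.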

Step 3A: Linux System Diagnosis (Execute when OSType is linux)

Step 3A.1: Check Cloud Assistant Status (Mandatory)

Before executing diagnostic commands, verify Cloud Assistant is running:

aliyun ecs describe-cloud-assistant-status \
  --biz-region-id <REGION_ID> \
  --region <REGION_ID> \
  --instance-id <INSTANCE_ID>

Check the response:

{
  "InstanceCloudAssistantStatusSet": {
    "InstanceCloudAssistantStatus": [
      {
        "InstanceId": "i-xxx",
        "RegionId": "cn-xxx",
        "CloudAssistantStatus": "true",
        "LastHeartbeatTime": "2026-04-09T07:26:58Z"
      }
    ]
  }
}

Important Notes:

  • CloudAssistantStatus is a string ("true"/"false"), not boolean
  • Check LastHeartbeatTime to ensure it's recent (within last few minutes)
  • Even if status is "true", RunCommand may still fail if service is unstable
  • Always check RunCommand execution result and handle failures gracefully
  • Ubuntu vs RHEL differences:
    • RHEL/CentOS/Alibaba Cloud Linux: Service name is kdump, crash files named vmcore-*
    • Ubuntu/Debian: Service name is kdump-tools, crash files named dump.* and dmesg.*
    • The diagnostic script checks both service names and all crash file types

If CloudAssistantStatus is false or command fails:

  • Cloud Assistant is not installed or not running on the instance
  • Cannot proceed with remote diagnostic commands
  • Alternative approaches:
    1. Guide user to SSH into the instance and check logs manually
    2. Provide manual diagnostic commands for user to execute
    3. Suggest installing Cloud Assistant: Installation Guide
    4. Check instance monitoring data via CloudMonitor API

If CloudAssistantStatus is true:

  • Proceed to Step 3A.2
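The status notes above can be turned into a small check. The sketch below is illustrative: the function name `cloud_assistant_ready` and the five-minute freshness window are assumptions (the text says only "within last few minutes"), and it expects the response shape shown in the sample above.

```python
import json
from datetime import datetime, timedelta, timezone

# Sample response copied from the step above.
SAMPLE_STATUS = json.dumps({
    "InstanceCloudAssistantStatusSet": {
        "InstanceCloudAssistantStatus": [
            {
                "InstanceId": "i-xxx",
                "RegionId": "cn-xxx",
                "CloudAssistantStatus": "true",
                "LastHeartbeatTime": "2026-04-09T07:26:58Z",
            }
        ]
    }
})


def cloud_assistant_ready(response_text, instance_id, now=None,
                          max_age=timedelta(minutes=5)):
    """Return True if the status string is "true" and the heartbeat is recent."""
    now = now or datetime.now(timezone.utc)
    data = json.loads(response_text)
    entries = data.get("InstanceCloudAssistantStatusSet", {}).get(
        "InstanceCloudAssistantStatus", []
    )
    for entry in entries:
        if entry.get("InstanceId") != instance_id:
            continue
        # The field is the string "true"/"false"; bool("false") would be True,
        # so compare text instead of relying on truthiness.
        running = str(entry.get("CloudAssistantStatus", "")).lower() == "true"
        heartbeat_raw = entry.get("LastHeartbeatTime")
        if not running or not heartbeat_raw:
            return False
        heartbeat = datetime.strptime(
            heartbeat_raw, "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        return now - heartbeat <= max_age
    return False


if __name__ == "__main__":
    check_time = datetime(2026, 4, 9, 7, 28, 0, tzinfo=timezone.utc)
    print(cloud_assistant_ready(SAMPLE_STATUS, "i-xxx", now=check_time))
```

A `True` result only means it is reasonable to attempt RunCommand; the execution result still has to be checked afterwards.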

Step 3A.2: Execute Linux Diagnostic Script

Execute Linux diagnostic script via Cloud Assistant to check:

  • System reboot records (last reboot, /var/log/messages or /var/log/syslog)
  • Kernel Panic records (dmesg)
  • OOM records and vm.panic_on_oom configuration
  • Kdump configuration and crash dump file status
  • Crash dump files: vmcore (RHEL/CentOS) or dump.*/dmesg.* (Ubuntu/Debian)

Complete diagnostic commands: see diagnostic-commands.md

Linux Result Analysis:

| Finding | Possible Cause | Suggestion |
|---|---|---|
| Kernel Panic + crash dump (vmcore/dump.*) | Kernel crash, dump file generated | Read dmesg.* file for panic reason, contact Alibaba Cloud technical support for deep analysis |
| Kernel Panic + no crash dump | Kernel crash, but kdump not configured or not working | Proceed to Step 5: Recommend Kdump configuration for future crash capture |
| OOM + panic_on_oom=1 | OOM triggered kernel panic | Disable panic_on_oom or increase memory |
| OOM Killer | Memory insufficient causing process killed | Optimize memory usage or upgrade instance type |
| SysRq triggered crash | Manual crash trigger via /proc/sysrq-trigger | Check if intentional test, review bash history and audit logs |
| Normal reboot records | User or program triggered reboot | Check cron jobs or ops scripts |
| No abnormal records | No system-level issues found | May be external factors, suggest monitoring |
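The result analysis table can be applied mechanically once the script output has been reduced to a few yes/no observations. The sketch below assumes that reduction has already happened; the function name `linux_findings` and its boolean parameters are hypothetical, and it returns every matching row rather than ranking them.

```python
def linux_findings(kernel_panic=False, crash_dump=False, oom=False,
                   panic_on_oom=False, sysrq_crash=False, normal_reboot=False):
    """Map observations to (finding, suggestion) rows of the analysis table."""
    rows = []
    if kernel_panic and crash_dump:
        rows.append(("Kernel Panic + crash dump",
                     "Read dmesg.* file for panic reason, contact Alibaba Cloud "
                     "technical support for deep analysis"))
    elif kernel_panic:
        rows.append(("Kernel Panic + no crash dump",
                     "Proceed to Step 5: Recommend Kdump configuration"))
    if oom and panic_on_oom:
        rows.append(("OOM + panic_on_oom=1",
                     "Disable panic_on_oom or increase memory"))
    elif oom:
        rows.append(("OOM Killer",
                     "Optimize memory usage or upgrade instance type"))
    if sysrq_crash:
        rows.append(("SysRq triggered crash",
                     "Check if intentional test, review bash history and audit logs"))
    if normal_reboot:
        rows.append(("Normal reboot records",
                     "Check cron jobs or ops scripts"))
    if not rows:
        rows.append(("No abnormal records",
                     "May be external factors, suggest monitoring"))
    return rows


if __name__ == "__main__":
    for finding, suggestion in linux_findings(kernel_panic=True, oom=True):
        print(finding + ": " + suggestion)
```

Each observation must come from actual log data; a flag set without evidence would violate the rule against speculative conclusions.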

Step 3B: Windows System Diagnosis (Execute when OSType is windows)

Step 3B.1: Check Cloud Assistant Status (Mandatory)

Before executing diagnostic commands, verify Cloud Assistant is running:

aliyun ecs describe-cloud-assistant-status \
  --biz-region-id <REGION_ID> \
  --region <REGION_ID> \
  --instance-id <INSTANCE_ID>

Check the response:

  • CloudAssistantStatus: true — Cloud Assistant is running, proceed to Step 3B.2
  • CloudAssistantStatus: false — Cloud Assistant is not running
    • Cannot proceed with remote diagnostic commands
    • Guide user to SSH/RDP into instance and run diagnostics manually
    • Suggest reinstalling Cloud Assistant: Windows Installation Guide

Step 3B.2: Execute Windows Diagnostic Script

Execute Windows diagnostic script via Cloud Assistant to check:

  • System uptime and unexpected shutdown events (Event ID 41, 1074, 6008, 6006)
  • Memory dump configuration and pagefile settings
  • MEMORY.DMP and minidump files existence
  • BSOD events and application crashes

Complete diagnostic commands: see diagnostic-commands.md

Windows Result Analysis:

| Finding | Possible Cause | Suggestion |
|---|---|---|
| Event 41 (Kernel-Power) | Unexpected shutdown/crash | Check for BSOD, dump files |
| Dump configured + dump file exists | System crashed and captured dump | Contact Alibaba Cloud technical support for dump file analysis |
| Dump configured + no dump file | Crash occurred but no dump captured | Check pagefile and disk space |
| Dump not configured | Crash dumps disabled | Enable memory dump for diagnosis |
| BSOD events found | Blue screen crash occurred | Check bug check code in dump |
| No abnormal events | No system-level crash records | May be power issue or external factor |
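To read the event log results, it helps to separate the four Event IDs listed in Step 3B.2 into unexpected and planned shutdowns. The sketch below is illustrative: the grouping function `group_shutdown_events` is hypothetical, and the one-line meanings are the commonly documented ones for these IDs, not text from the diagnostic script.

```python
# Event ID -> (category, commonly documented meaning).
EVENT_MEANINGS = {
    41: ("unexpected", "Kernel-Power: system rebooted without shutting down cleanly"),
    6008: ("unexpected", "EventLog: the previous system shutdown was unexpected"),
    1074: ("planned", "User32: shutdown or restart initiated by a user or process"),
    6006: ("planned", "EventLog: the Event Log service was stopped (clean shutdown)"),
}


def group_shutdown_events(event_ids):
    """Group event IDs into unexpected, planned, and other."""
    groups = {"unexpected": [], "planned": [], "other": []}
    for event_id in event_ids:
        category, _meaning = EVENT_MEANINGS.get(event_id, ("other", ""))
        groups[category].append(event_id)
    return groups


if __name__ == "__main__":
    print(group_shutdown_events([6006, 41, 6008]))
```

IDs in the `unexpected` group point toward the crash rows of the table, while only `planned` IDs suggest a user or program initiated the restart.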

Step 3.5: Get Cloud Assistant Command Output (Required after Step 3)

After executing diagnostic script via RunCommand, query the execution result:

aliyun ecs describe-invocations \
  --biz-region-id <REGION_ID> \
  --region <REGION_ID> \
  --instance-id <INSTANCE_ID> \
  --invoke-id <INVOKE_ID>

Important Notes:

  • Use --instance-id (not --instance-id.1) for describe-invocations API
  • The InvokeId is returned by the RunCommand API call
  • Decode the Output field from Base64 to get diagnostic results
  • Check InvokeStatus to ensure command execution completed successfully
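Decoding the Output field is a plain Base64 decode. A minimal sketch follows; the function name `decode_invocation_output` is hypothetical, and treating the decoded bytes as UTF-8 is an assumption, so undecodable bytes are replaced rather than raising an error.

```python
import base64


def decode_invocation_output(output_b64):
    """Decode the Base64 Output field into readable text."""
    raw = base64.b64decode(output_b64)
    # UTF-8 is assumed; replace undecodable bytes instead of failing.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Round-trip a made-up line to show the call; real input comes from the API.
    encoded = base64.b64encode(b"reboot   system boot\n").decode("ascii")
    print(decode_invocation_output(encoded), end="")
```

Decode only after InvokeStatus shows the command has completed; output from a command that is still running may be partial.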

Step 4: Analyze Crash Dump Files

If Step 3 found crash dump files (vmcore on Linux, MEMORY.DMP/minidump on Windows), perform preliminary analysis.

Complete analysis commands: see diagnostic-commands.md

Important: If Linux vmcore files need deep analysis or Windows dump files (MEMORY.DMP/minidump) are found, recommend the user contact Alibaba Cloud technical support team for professional crash dump analysis assistance.


Step 5: Recommend Kdump Configuration (If Not Configured)

If Step 3A found Kernel Panic records but no vmcore files, must advise user to configure Kdump.

When to Recommend Kdump Configuration

  • Kernel panic records found in dmesg or system logs, but /var/crash has no vmcore files
  • Kdump service status shows inactive or failed
  • /proc/cmdline does not contain crashkernel= parameter
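These three conditions can be checked together and reported as reasons. The sketch below is an illustration only: the function name `kdump_gaps` and its parameters are hypothetical, and it assumes the service state string and the /proc/cmdline contents have already been collected by the diagnostic script.

```python
def kdump_gaps(panic_found, crash_files, kdump_service_state, cmdline):
    """Return the reasons, if any, for recommending Kdump configuration."""
    reasons = []
    if panic_found and not crash_files:
        reasons.append("kernel panic recorded but no crash dump files in /var/crash")
    if kdump_service_state in ("inactive", "failed"):
        reasons.append("kdump service is " + kdump_service_state)
    # /proc/cmdline is a space-separated list of kernel parameters.
    if not any(token.startswith("crashkernel=") for token in cmdline.split()):
        reasons.append("crashkernel= missing from /proc/cmdline")
    return reasons


if __name__ == "__main__":
    gaps = kdump_gaps(
        panic_found=True,
        crash_files=[],
        kdump_service_state="inactive",
        cmdline="BOOT_IMAGE=/vmlinuz root=/dev/vda1 ro",
    )
    for reason in gaps:
        print("Recommend Kdump: " + reason)
```

An empty list means none of the conditions apply and no Kdump recommendation is needed.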

Key Points to Communicate

  1. Why Kdump is needed: Without Kdump, kernel crashes will not generate vmcore files, making root cause analysis impossible.
  2. Configuration requirements:
    • Reserve memory for crash kernel via crashkernel= kernel parameter
    • Enable and start the kdump (RHEL/CentOS) or kdump-tools (Ubuntu/Debian) service
    • Ensure sufficient disk space in /var/crash (or configured path)
  3. Configuration reference: Provide guidance from diagnostic-commands.md

Kdump Configuration Steps Summary

RHEL/CentOS/Alibaba Cloud Linux:

  1. Install: yum install -y kexec-tools
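  2. Add crashkernel=auto to kernel parameters in /etc/default/grub (RHEL 9 and later no longer accept crashkernel=auto; specify an explicit reservation size there instead)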
  3. Run grub2-mkconfig -o /boot/grub2/grub.cfg
  4. Reboot the instance
  5. Enable: systemctl enable --now kdump

Ubuntu/Debian:

  1. Install: apt-get install -y kdump-tools
  2. Set USE_KDUMP=1 in /etc/default/kdump-tools
  3. Run update-grub (crashkernel parameter usually auto-added)
  4. Reboot the instance
  5. Verify: systemctl status kdump-tools

Windows Memory Dump Configuration

If Step 3B found BSOD events but no dump files:

  1. Verify pagefile is configured and has sufficient size
  2. Enable memory dump: System Properties → Advanced → Startup and Recovery → Settings
  3. Select "Automatic memory dump" or "Kernel memory dump"
  4. Ensure CrashDumpEnabled registry value is not 0
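When reporting the registry check, the numeric CrashDumpEnabled value can be translated into the dump type it selects. The mapping below lists the commonly documented values of this setting (under the CrashControl registry key); the function name `describe_crash_dump_setting` is hypothetical, and any value outside the table is flagged for manual review rather than interpreted.

```python
# Commonly documented CrashDumpEnabled values.
CRASH_DUMP_MODES = {
    0: "None (crash dumps disabled)",
    1: "Complete memory dump",
    2: "Kernel memory dump",
    3: "Small memory dump (minidump)",
    7: "Automatic memory dump",
}


def describe_crash_dump_setting(value):
    """Describe a CrashDumpEnabled value and whether dumps are enabled."""
    known = value in CRASH_DUMP_MODES
    return {
        "value": value,
        "mode": CRASH_DUMP_MODES.get(value, "Unrecognized value, review manually"),
        "dump_enabled": known and value != 0,
    }


if __name__ == "__main__":
    print(describe_crash_dump_setting(7))
```

A value of 0 corresponds to the "Dump not configured" row of the Windows result analysis table.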

Final Output (Must execute after diagnosis complete)

After all diagnostic steps complete, must do both of the following:

  1. Read references/output-format.md — Get complete output format template
  2. Output strictly according to template structure — Choose corresponding template based on actual result
