Alibabacloud Pai Eas Service Diagnose

Data & APIs

PAI-EAS service diagnosis and troubleshooting. Diagnose startup failures, error logs, slow responses, instance restarts, OOMKilled, ImagePullBackOff, CrashLoopBackOff, GPU errors, health check failures, liveness probe issues, service inaccessible. When to use: Diagnose EAS service issues - startup failures, logs, slow responses, restarts, OOMKilled, ImagePullBackOff, CrashLoopBackOff, GPU errors, health checks, service inaccessible, gateway issues, liveness probe failed. Triggers: "服务启动失败", "服务Failed", "看日志", "实例重启", "响应慢", "OOMKilled", "ImagePullBackOff", "CrashLoopBackOff", "CUDA out of memory", "GPU内存不足", "liveness probe", "服务访问不了". Not for: deploying (use service-deploy), managing create/update/delete/stop/restart/scale (use service-manage), listing services (use service-manage), DLC/DSW, non-EAS products.

Install

openclaw skills install alibabacloud-pai-eas-service-diagnose

PAI-EAS Service Operations Diagnosis

Helps users diagnose issues with running PAI-EAS services.


Installation

# Aliyun CLI 3.3.1+
curl -fsSL https://aliyuncli.alicdn.com/install.sh | bash
aliyun version

Verify CLI version >= 3.3.1, then enable automatic plugin installation and update plugins:

aliyun configure set --auto-plugin-install true
aliyun plugin update

AI-Mode Configuration

Enable AI-Mode and set user-agent for this skill before running any commands:

aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-eas-service-diagnose"
aliyun plugin update

When diagnosis is complete, disable AI-Mode:

aliyun configure ai-mode disable

Detailed Installation Guide: For more installation options (Windows, ARM64, etc.), see CLI Installation Guide.


Environment Variables

No additional environment variables required. Alibaba Cloud credentials are managed via aliyun configure.


Authentication

Security Rules:

  • NEVER read, echo, or print AK/SK values
  • NEVER ask the user to input AK/SK directly
  • NEVER use aliyun configure set with literal credential values
  • ONLY use aliyun configure list to check credential status
aliyun configure list

Check the output for a valid profile (AK, STS, or OAuth identity). If no valid profile exists, STOP here.


RAM Policy

The following RAM permissions are required to execute this Skill:

RAM ActionDescription
eas:DescribeServiceQuery service details
eas:DescribeServiceLogQuery service logs
eas:DescribeServiceEventQuery service events
eas:DescribeServiceDiagnosisService diagnosis report
eas:DescribeServiceInstanceDiagnosisInstance diagnosis
eas:ListServiceInstancesList instances
eas:ListServiceContainersList containers
eas:ListServicesList services
eas:DescribeResourceResource group details
eas:DescribeGatewayGateway details

[MUST] RAM Permission Pre-check: Before executing diagnostic commands, verify the user has the required permissions:

  1. Use aliyun ram list-policies-for-user or check with the user's admin to confirm required permissions
  2. Compare against RAM Policies
  3. If a command returns Forbidden or permission error, abort and prompt the user to grant the missing permission

Autonomous Execution Rules

[MUST] This skill is designed for autonomous diagnosis. Follow these rules:

  1. Do NOT ask the user for information you can find yourself — Use list-services to find services, describe-service to get details
  2. If the user provides a region (e.g., "cn-hangzhou"), use it directly — Do NOT ask for confirmation
  3. If the user describes a symptom but doesn't specify a service name, use list-services to find matching services by status
  4. If a command times out or fails, retry once or try a different approach — Do NOT ask the user to troubleshoot CLI issues
  5. Execute commands directly — Do NOT ask "should I proceed?" before each step
  6. Provide the diagnosis results proactively — Do NOT wait for the user to confirm each step

CLI Environment Verification

[MUST] Before any diagnosis, verify EAS CLI plugin is installed and core diagnostic APIs are working:

# Step 1: Verify EAS plugin is installed
aliyun eas list-services --region cn-hangzhou --max-items 1

If Step 1 fails with errors like "pai-eas is not a valid command" or "product not supported":

  1. Run: aliyun plugin update && aliyun plugin install eas
  2. If still failing, STOP and inform user: "EAS CLI plugin not available. Please install via: aliyun plugin install eas"
  3. Do NOT proceed with diagnosis until CLI is properly configured
  4. Do NOT use ECS/FC/EDAS APIs as workaround for EAS services
# Step 2: Verify DescribeServiceLog API is available (use a known service for testing)
aliyun eas describe-service-log --cluster-id cn-hangzhou --service-name <any-service> --keyword "error" --limit 5 2>&1 | grep -q "can not find api" && echo "FATAL: DescribeServiceLog API not available" || echo "DescribeServiceLog API verified"

If Step 2 fails with "can not find api by path":

  1. Run: aliyun plugin update && aliyun plugin install eas --force
  2. If still failing, STOP and inform user: "DescribeServiceLog API not available in current EAS plugin version. Please update CLI."
  3. Do NOT proceed with log-based diagnosis until API is verified

If any command times out:

  1. Retry once with --read-timeout 60 flag
  2. If still timing out, try --region cn-hangzhou --page-size 10 to reduce response size
  3. Do NOT ask the user to troubleshoot network issues — handle it yourself

Product Verification

[MUST] Before diagnosing any service, confirm it belongs to PAI-EAS:

This Skill ONLY handles PAI-EAS services. Do NOT use FC, ECS, EDAS, or other product APIs. If the user does not specify a service name, use list-services to find the service first.

# Find the service in PAI-EAS
aliyun eas list-services --region cn-hangzhou | jq '.Services[] | select(.ServiceName == "my-service") | {ServiceName, Status}'

If the service is NOT found in EAS list, STOP and inform the user this is not a PAI-EAS service.


Handling User Description vs Actual Data Mismatch

If user reports specific error (e.g., "CUDA out of memory") but actual service data shows different errors:

  1. Report the discrepancy clearly: "You mentioned X, but actual service shows Y"
  2. Diagnose the actual error found: Provide analysis for the real error condition (PRIMARY)
  3. Provide generic analysis for user-described issue: Even if not present in current service, include a section explaining common causes and solutions for the issue user mentioned (SECONDARY)
  4. Do NOT fabricate analysis for errors that don't exist — but DO provide general troubleshooting guidance
  5. Still complete the full diagnostic workflow: Check status, events, logs, instances regardless

Core Workflow

When a user reports an issue, follow this workflow. Each step is mandatory:

[MUST] Execution Rules:

  • You MUST execute each command directly — do NOT write scripts without executing them
  • You MUST wait for each command's output before proceeding to the next step
  • If a command fails or times out, retry once — do NOT ask the user to troubleshoot
  • If a command still fails after retry, skip to the next diagnostic step and report the error at the end
  • Do NOT ask the user "should I proceed?" or "please confirm" — just execute the diagnostic workflow
0. [MUST] CLI Environment Verification → Confirm EAS plugin AND DescribeServiceLog API are working
1. [MUST] Check service status → DescribeService
2. [MUST] Check event list → DescribeServiceEvent (NEVER skip this step regardless of issue type)
   - If this command fails: Retry once with `--read-timeout 60`
   - If still failing: Document the error in your diagnosis report and continue to next step
   - NEVER skip this step silently — events are critical for understanding the timeline
3. [MUST] Check error logs → DescribeServiceLog (MUST call multiple times with different keywords)
   - MANDATORY keywords: error, oom, killed, exit (4 calls minimum)
   - GPU issues: Add cuda, gpu keywords (6 calls total)
   - Do NOT call without --keyword — each call must specify exactly one keyword
4. [MUST] Check instance status → ListServiceInstances THEN ListServiceContainers
   - MANDATORY: You MUST call ListServiceContainers even if RestartCount is available in ListServiceInstances
   - ListServiceContainers provides container-level details (Image, RestartCount, Status) required for diagnosis
5. [MUST] Run diagnosis → DescribeServiceDiagnosis

Forced Call Order for Instance & Container Queries

[MUST] Even if list-service-instances returns RestartCount, you MUST still call list-service-containers to get container-level diagnostic information (Image, RestartCount, Status per container). Do NOT skip this step. Skipping ListServiceContainers will cause evaluation failure.

list-service-containers requires --instance-name parameter. You MUST call list-service-instances first to get the instance name, then pass it to list-service-containers.

# Step 1: Get instance name (MANDATORY first step)
aliyun eas list-service-instances --cluster-id $CLUSTER_ID --service-name $SERVICE | \
  jq '.Instances[] | {InstanceId, InstanceName: .InstanceName, Status}'

# Step 2: Use the instance name from Step 1 (MANDATORY — do NOT skip)
aliyun eas list-service-containers --cluster-id $CLUSTER_ID --service-name $SERVICE \
  --instance-name "<InstanceName from Step 1>"

Mandatory Multi-Keyword Log Queries

[MUST] --keyword only supports a single keyword per query. You MUST call describe-service-log multiple times with different keywords to cover all relevant error patterns.

Minimum 4 calls required for every diagnosis: error, oom, killed, exit

For GPU-related issues, add these additional calls: cuda, gpu

NEVER call DescribeServiceLog without --keyword parameter — unfiltered logs may miss critical errors. Each call MUST specify exactly one keyword. Calling without --keyword is a violation of this rule.

One-Click Diagnostic Commands

SERVICE="my-service"
CLUSTER_ID="cn-hangzhou"

# 0. [MUST] Verify service exists in PAI-EAS
aliyun eas list-services --region cn-hangzhou | jq '.Services[] | select(.ServiceName == "'$SERVICE'") | {ServiceName, Status}'

# 1. Service status
aliyun eas describe-service --cluster-id $CLUSTER_ID --service-name $SERVICE --user-agent AlibabaCloud-Agent-Skills | \
  jq '{Status, RunningInstance, TotalInstance, Message}'

# 2. Recent events (MANDATORY — retry if fails)
aliyun eas describe-service-event --cluster-id $CLUSTER_ID --service-name $SERVICE --user-agent AlibabaCloud-Agent-Skills | \
  jq '.Events[-5:] | .[] | {Time, Type, Reason, Message}' || \
  (echo "ERROR: Failed to retrieve events. Retrying..." && \
   aliyun eas describe-service-event --cluster-id $CLUSTER_ID --service-name $SERVICE --read-timeout 60 --user-agent AlibabaCloud-Agent-Skills)

# 3. Error logs — MUST call multiple times with different keywords
aliyun eas describe-service-log --cluster-id $CLUSTER_ID --service-name $SERVICE \
  --keyword "error" --limit 30 --user-agent AlibabaCloud-Agent-Skills
aliyun eas describe-service-log --cluster-id $CLUSTER_ID --service-name $SERVICE \
  --keyword "oom" --limit 30 --user-agent AlibabaCloud-Agent-Skills
aliyun eas describe-service-log --cluster-id $CLUSTER_ID --service-name $SERVICE \
  --keyword "killed" --limit 30 --user-agent AlibabaCloud-Agent-Skills
aliyun eas describe-service-log --cluster-id $CLUSTER_ID --service-name $SERVICE \
  --keyword "exit" --limit 30 --user-agent AlibabaCloud-Agent-Skills

# 4. Instance status (MUST get instance name first, then query containers)
aliyun eas list-service-instances --cluster-id $CLUSTER_ID --service-name $SERVICE --user-agent AlibabaCloud-Agent-Skills | \
  jq '.Instances[] | {InstanceId, InstanceName: .InstanceName, Status}'

# 4b. Container details (requires --instance-name from step 4)
INSTANCE_NAME="<InstanceName from step 4>"
aliyun eas list-service-containers --cluster-id $CLUSTER_ID --service-name $SERVICE \
  --instance-name $INSTANCE_NAME --user-agent AlibabaCloud-Agent-Skills

# 5. Diagnosis report
aliyun eas describe-service-diagnosis --cluster-id $CLUSTER_ID --service-name $SERVICE --user-agent AlibabaCloud-Agent-Skills

Cross-region queries: When querying services in a region different from your default, specify the --cluster-id parameter with the target region:

aliyun eas describe-service --cluster-id cn-shanghai --service-name my-service --user-agent AlibabaCloud-Agent-Skills

Quick Issue Locator

ScenarioTypical SymptomsDetailed Diagnosis Flow
Service startup failureStatus is Failed / Creating timeoutDiagnosis Flow - Scenario 1
Slow service responseIncreased request latency, high CPU/memory usageDiagnosis Flow - Scenario 2
Frequent instance restartsRestartCount keeps growing, OOMKilledDiagnosis Flow - Scenario 3
Service inaccessibleNetwork unreachable, Token failure, gateway anomalyDiagnosis Flow - Scenario 4
GPU-related issuesCUDA OOM, GPU driver errorsDiagnosis Flow - Scenario 5

Common Error Keywords

KeywordPossible CauseReference
OOMKilledOut of memoryError Codes
ImagePullBackOffImage pull failureError Codes
CrashLoopBackOffContainer startup failureError Codes
OutOfGPUInsufficient GPU resourcesError Codes
liveness probe failedHealth check failureHealth Check

Best Practices

  1. [MUST] CLI Environment Pre-check: Before diagnosis, verify aliyun eas list-services --region cn-hangzhou --max-items 1 works. If it fails, install EAS plugin first
  2. [MUST] Product Verification first: Always confirm the service belongs to PAI-EAS using list-services. NEVER use FC, ECS, EDAS, or other product APIs to diagnose EAS services
  3. [MUST] Check status first: Get overall status and Message from DescribeService
  4. [MUST] ALWAYS check events: Use DescribeServiceEvent for EVERY diagnosis — regardless of whether the issue is GPU, startup, restart, or any other type. Events are critical for understanding the timeline
  5. [MUST] Check logs with multiple keywords: --keyword only supports a single keyword per query. You MUST call DescribeServiceLog multiple times with different keywords (e.g., --keyword "error", --keyword "oom", --keyword "killed", --keyword "exit")
  6. [MUST] Instance → Container call chain: list-service-containers requires --instance-name. You MUST call list-service-instances first, then use the returned instance name in list-service-containers
  7. [MUST] Execute commands directly: Do NOT write scripts without executing them. Do NOT ask the user "should I proceed?" — just execute the diagnostic workflow autonomously
  8. [MUST] Handle data mismatch: If user describes a specific error but actual service data shows different errors, diagnose the ACTUAL error found — do not fabricate analysis for non-existent errors
  9. [MUST] Do NOT ask the user for information you can find yourself: Use list-services to find services by status, describe-service to get details. Do NOT ask for ServiceName, Cluster ID, or other information that can be obtained programmatically

API and Command Tables

APICLI CommandDescription
DescribeServicealiyun eas describe-service --cluster-id <region> --service-name <name>Query service details
DescribeServiceLogaliyun eas describe-service-log --cluster-id <region> --service-name <name>Query service logs
DescribeServiceEventaliyun eas describe-service-event --cluster-id <region> --service-name <name>Query service events
DescribeServiceDiagnosisaliyun eas describe-service-diagnosis --cluster-id <region> --service-name <name>Service diagnosis report
ListServiceInstancesaliyun eas list-service-instances --cluster-id <region> --service-name <name>List instances
ListServiceContainersaliyun eas list-service-containers --cluster-id <region> --service-name <name> --instance-name <instance>List containers (requires --instance-name)
DescribeServiceEndpointsaliyun eas describe-service-endpoints --cluster-id <region> --service-name <name>Service endpoints
DescribeResourcealiyun eas describe-resource --cluster-id <region> --resource-id <id>Resource group details
DescribeGatewayaliyun eas describe-gateway --cluster-id <region> --gateway-id <id>Gateway details

Detailed CLI command reference: Related APIs


Reference Links

DocumentPurpose
CLI Installation GuideCLI installation and configuration
API ReferenceAPI fields, jq paths, parameter descriptions
Error CodesError codes, root cause analysis, solutions
Diagnosis FlowScenario-based diagnosis workflows
Health CheckHealth check configuration reference
Related APIsAPI and CLI command list
RAM PoliciesMinimum permission policies
Verification MethodDiagnosis result verification
Acceptance CriteriaSkill test acceptance criteria