Install
openclaw skills install alibabacloud-elasticsearch-instance-diagnose
Alibaba Cloud Elasticsearch instance diagnosis skill. Use for cluster health checks, troubleshooting, and performance analysis on Elasticsearch instances. Triggers (English): Elasticsearch diagnosis, ES instance issues, slow search, write failures, cluster Red/Yellow, unassigned shards, node disconnected, load imbalance, thread pool 429, JVM/OOM/circuit breaker, disk watermark / read-only index, instance activating / change stuck, service avalanche / all shards failed. Triggers (Chinese): ES诊断、阿里云ES、Elasticsearch诊断、ES集群/实例故障排查、ES健康检查、集群红灯/变红/黄灯/变黄、集群异常、分片未分配、主分片未分配、节点掉线/离线、负载不均衡、搜索/查询变慢、慢查询、写入失败/变慢/拒绝、线程池打满、HTTP 429、内存过高、OOM、断路器、磁盘满/水位、索引只读、实例激活中/activating、变更卡住/未完成、雪崩、服务不可用、all shards failed。
Collect signals from Alibaba Cloud OpenAPI (control plane) and the Elasticsearch REST API (data plane), combine them with the SOP knowledge base under references/, and produce root-cause analysis, an evidence chain, prioritized remediation guidance, and, when multiple dimensions fire, a recency-ordered incident timeline (severity vs time in window; see Timeline and recency (MUST) in §5 Step 4).
Architecture: Alibaba Cloud Elasticsearch OpenAPI + Alibaba CloudMonitor (CMS) + Elasticsearch REST API + diagnostic SOPs
Closure: If MUST applies and ES_* is set, finish authenticated ES API evidence before the final report (see Feasibility order in §5).
Pre-check: Aliyun CLI >= 3.3.1 required (for RAM permission checks and OpenAPI CLI fallback). Run `aliyun version` to verify the version is >= 3.3.1. If the CLI is missing or too old, see `references/cli-installation-guide.md`. After installation, run `aliyun configure set --auto-plugin-install true` to enable automatic plugin installation (do not pass plaintext AccessKey pairs on this command line; see §1.2).
Security rules (mandatory):
- NEVER read, echo, or print AccessKey ID or AccessKey Secret values.
- NEVER prompt or ask the user to paste plaintext AccessKeys in the conversation.
- NEVER embed AccessKeys in scripts, CLI arguments, or `curl` URLs.
- NEVER use `aliyun configure set` (or similar) to pass literal AccessKey ID/Secret on the command line.
- NEVER accept AccessKeys that the user pastes into the chat, even if offered voluntarily.
- ONLY use configured CLI profiles (`aliyun configure`) or environment variables such as `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` that the user has set in their local shell (the agent must not echo those values in the session).
⚠️ If the user provides AccessKeys in the chat (e.g. “my AK is xxx”)
- Stop immediately: do not run any Alibaba Cloud command that requires credentials.
- Decline politely and give only the names of approved configuration methods (do not repeat any secret the user may have leaked):
  - Recommended: run `aliyun configure` in a local terminal and enter credentials when prompted; credentials are stored in the local profile file.
  - Alternatively: set `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` in the local shell (the user types values only in the terminal, not in chat).
- Resume the diagnosis request only after credentials are configured correctly.
Verify credentials without exposing secrets:
aliyun configure list
aliyun --profile <profile_name> sts get-caller-identity
Credential policy:
- Prefer an `aliyun configure` profile (default or `--profile`).
- If there is no valid identity (`configure list` / `get-caller-identity` fails), STOP and guide the user to configure locally; do not guess or fabricate credentials.
- Never pass plaintext AccessKeys through the conversation.
- NEVER ask the user to paste `ES_PASSWORD` in chat; NEVER echo, print, or log the password; NEVER copy a password from chat into commands, hooks, or repo files.
- Shell expansion for `curl -u "$ES_USERNAME:$ES_PASSWORD"` (or equivalent) is allowed when vars are pre-exported in the user’s local shell; NEVER put the secret as a literal in chat, scripts checked into repos, or command output.
- If the user tries to send a password in chat: STOP as well and ask them to set `ES_PASSWORD` only locally via `export` (see §2.2).
All control-plane and CMS data collection for this skill uses the Aliyun CLI.
[MUST] `elasticsearch` / `cms`: plugin-mode shell only (avoid legacy CLI)
Whenever the agent emits executable `aliyun` lines (chat, reproducibility exports, or copy-paste steps), use plugin subcommands (lowercase-hyphenated) and kebab-case flags, the same shape as `scripts/openapi_cli_collect.py` and references/verification-method.md.
- Do not use legacy POP-style invocations: a PascalCase verb immediately after `elasticsearch` or `cms` on the same `aliyun` line (the old “action name = subcommand” style), or CamelCase flags like `--InstanceId`, `--Namespace`, `--StartTime` in new commands. Use plugin verbs only (`describe-instance`, `describe-metric-list`, …).
- Naming split: `DescribeInstance`, `ListSearchLog`, `DescribeMetricList`, etc. are OpenAPI action names (PascalCase: docs, RAM, console). The token after `aliyun elasticsearch` or `aliyun cms` in a shell must be the CLI plugin name (`describe-instance`, `list-search-log`, `describe-metric-list`, …).
- Prefer `python3 scripts/check_es_instance_health.py` for the standard control-plane + CMS bundle so subprocess calls stay aligned with this repo.
- CLI references: Elasticsearch CLI Center, CloudMonitor CLI Center.
AI-Mode and plugin baseline (required) — wrap every diagnosis session that runs aliyun OpenAPI/CMS commands:
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose"
aliyun plugin update
# … diagnosis: aliyun / python3 scripts/check_es_instance_health.py …
aliyun configure ai-mode disable
`configure ai-mode` missing or failing: Skip the wrapper above; use `ALIBABA_CLOUD_USER_AGENT` (next block). Log the CLI failure (e.g. subcommand unavailable). Whether the profile is valid is determined only by `aliyun configure list` and `sts get-caller-identity`; write valid / validity, not vaild.
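The enable/disable wrapper above only helps if the `disable` step also runs when a diagnosis fails midway. A minimal sketch of that idea in Python, assuming the commands are launched through `subprocess`; the `ai_mode_session` helper and its injectable `run` parameter are hypothetical names for illustration and are not part of this repo's scripts.

```python
import subprocess
from contextlib import contextmanager

USER_AGENT = "AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose"


@contextmanager
def ai_mode_session(run=subprocess.run):
    """Enable AI-Mode for one diagnosis session and always disable it afterwards.

    `run` is injectable (hypothetical design choice) so the wrapper can be
    exercised without the Aliyun CLI installed.
    """
    # If this first call fails, the caller should fall back to
    # ALIBABA_CLOUD_USER_AGENT as described in the text.
    run(["aliyun", "configure", "ai-mode", "enable"], check=True)
    try:
        run(["aliyun", "configure", "ai-mode", "set-user-agent",
             "--user-agent", USER_AGENT], check=True)
        run(["aliyun", "plugin", "update"], check=True)
        yield
    finally:
        # Runs even when the diagnosis body raises.
        run(["aliyun", "configure", "ai-mode", "disable"], check=False)
```

Usage would look like `with ai_mode_session(): ...` around the `check_es_instance_health.py` invocation.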
User-Agent (required): set a User-Agent for Alibaba Cloud API calls:
export ALIBABA_CLOUD_USER_AGENT="AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose"
CLI hardening (recommended): when authoring raw aliyun commands, use §2.1 MUST plugin shape first, then add --connect-timeout 3 --read-timeout 10 (increase read-timeout for large responses or CMS), consistent with the instance-management skill examples, to avoid indefinite hangs on network faults. If the global User-Agent is not set, add --user-agent AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose per invocation. For optional Elasticsearch probes inside check_es_instance_health.py (when ES_* is set), the same knobs exist as --connect-timeout / --read-timeout on that script — they map to curl for engine calls only, not to the Aliyun OpenAPI client.
Run before diagnosis:
aliyun version
aliyun configure list
aliyun --profile <profile_name> sts get-caller-identity
Have the user set connection variables in a local terminal after you confirm the Elasticsearch endpoint (VPC or public) and admin credentials; do not hardcode user-specific values in chat:
export ES_ENDPOINT="http://<elasticsearch-endpoint-ip>:9200"
export ES_USERNAME="elastic"
export ES_PASSWORD="<elasticsearch-admin-password>"
Public access and `http` vs `https`: From `DescribeInstance`, use `publicDomain` / `domain` and the reported `protocol`. When `protocol` is `HTTP` (typical public listener), set `ES_ENDPOINT` to `http://<publicDomain>:9200`. Using `https://` against an HTTP-only endpoint causes TLS errors (e.g. `WRONG_VERSION_NUMBER`). Use `https://` only when `protocol` is `HTTPS` (or TLS is actually enabled on the port you use), and supply CA / fingerprint options as in HTTPS options below.
If `http://` “does not work”, when to try `https://`: Treat `DescribeInstance` `protocol` as the source of truth for the REST listener. `000`, timeouts, or connection refused on `http://` usually mean network path / allowlist / security group / wrong host or port, not “try HTTPS next” when `protocol` is still `HTTP`. Do switch to `https://` when `protocol` is `HTTPS` (or the console / product doc states TLS on that endpoint) and the failure on `http://` is a TLS or scheme symptom (e.g. `WRONG_VERSION_NUMBER`, `error:0A00010B`, immediate SSL alert while probing with the wrong scheme). If `protocol` is `HTTP` and only plain TCP is advertised, HTTPS is not a fallback for reachability.
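The scheme rule above can be sketched as a small helper. This is an illustration only: it assumes the relevant part of the `DescribeInstance` response has already been extracted into a flat dict with the `publicDomain`, `domain`, and `protocol` fields named in the text (the real response nesting may differ), and `endpoint_from_instance` is a hypothetical name.

```python
def endpoint_from_instance(instance: dict, port: int = 9200) -> str:
    """Build ES_ENDPOINT from DescribeInstance fields, trusting `protocol` for the scheme."""
    protocol = str(instance.get("protocol", "")).upper()
    # Prefer the public domain when present, else the VPC domain.
    host = instance.get("publicDomain") or instance.get("domain")
    if not host:
        raise ValueError("instance has neither publicDomain nor domain")
    if protocol not in ("HTTP", "HTTPS"):
        raise ValueError(f"unexpected protocol: {protocol!r}")
    # The scheme follows the reported protocol; HTTPS is never a reachability fallback.
    return f"{protocol.lower()}://{host}:{port}"
```

The helper deliberately refuses to guess when `protocol` is missing, which mirrors the "source of truth" rule rather than trying both schemes.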
Credential safety
- NEVER echo, print, or log `ES_PASSWORD`; NEVER copy credentials from chat into shell history or saved files.
- NEVER ask the user to paste the password in plaintext in chat.
- ONLY use the following checks to verify that variables are set:
[[ -n "$ES_ENDPOINT" ]] && echo "ES_ENDPOINT: $ES_ENDPOINT" || echo "ES_ENDPOINT: NOT SET"
[[ -n "$ES_PASSWORD" ]] && echo "ES_PASSWORD: SET" || echo "ES_PASSWORD: NOT SET"
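The same "SET / NOT SET" check can be expressed in Python when a script needs it. A minimal sketch, assuming the variables live in the process environment; `es_env_status` is a hypothetical helper, and like the shell check it reports the endpoint value but never the password.

```python
import os


def es_env_status(env=os.environ) -> dict:
    """Report whether ES_* variables are set without exposing the password."""
    endpoint = env.get("ES_ENDPOINT", "")
    return {
        # The endpoint is not a secret, so its value may be shown.
        "ES_ENDPOINT": endpoint if endpoint else "NOT SET",
        # Only the fact that a password exists is reported, never its value.
        "ES_PASSWORD": "SET" if env.get("ES_PASSWORD") else "NOT SET",
    }
```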
Network connectivity and access control
| Issue | How to check | Mitigation |
|---|---|---|
| Public network access disabled | Elasticsearch console → Network | Enable public access or use the VPC endpoint |
| Public access allowlist | Console → Security → Public access allowlist | Add the agent host’s public IP |
| VPC isolation | e.g. `telnet <ES_IP> 9200` | VPC peering, Express Connect, or equivalent |
| Security group | Inbound rules on the ECS/security group hosting Elasticsearch | Allow TCP 9200 (or the configured port) |
Connectivity probe:
curl -sS -o /dev/null -w "%{http_code}" --connect-timeout 5 "${ES_ENDPOINT}"
- HTTP code `000` usually means the path is unreachable.
- `401` without `-u` is normal (auth required); if `ES_PASSWORD` is SET, proceed to authenticated `GET /_cluster/health` (§7).
- `401` with `-u` → wrong credentials.
- `000` / refused / timeout → network, allowlist, or TLS/scheme mismatch.
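The probe outcomes above form a small decision table. The sketch below encodes it, assuming the probe's `%{http_code}` output is passed in as a string; the function name `classify_probe` and the returned labels are illustrative, not output of any script in this repo.

```python
def classify_probe(http_code: str, used_auth: bool, password_set: bool) -> str:
    """Map a connectivity-probe result to the failure class described in the text."""
    if http_code == "000":
        # curl prints 000 when no HTTP response was received at all.
        return "transport: network, allowlist, or TLS/scheme mismatch"
    if http_code == "401":
        if used_auth:
            return "auth: wrong credentials"
        if password_set:
            # Not a blocker: retry with -u against /_cluster/health.
            return "expected: proceed to authenticated GET /_cluster/health"
        return "expected: auth required, ES_PASSWORD not set"
    if http_code.startswith("2"):
        return "reachable"
    return f"other: HTTP {http_code}"
```

Keeping the unauthenticated 401 separate from the authenticated one avoids reporting "wrong credentials" before any credentials were actually sent.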
HTTPS: prerequisites (what must be true)
- Listener: The Elasticsearch HTTP port you call (9200 unless changed) must actually speak TLS; align with `DescribeInstance` `protocol` (`HTTPS`) or console/network documentation.
- URL: `https://<host>:<port>` with the same host (e.g. `publicDomain`) you would use for HTTP.
- Client trust of the server certificate: Your client must trust the cluster’s certificate chain (cluster / cloud CA PEM, or corporate proxy CA if TLS is intercepted). `curl`: prefer `curl --cacert /path/to/ca.crt ...`; `-k` / `--insecure` only for short, non-production diagnosis.
- Auth: Same `ES_USERNAME` / `ES_PASSWORD` as for HTTP (Basic auth over TLS).
HTTPS: how this skill documents it
- Manual `curl` (§7 and es-api-call-failures.md): Add `--cacert` (or `-k` for testing) to every `curl` when using `https://` if the default trust store does not include your cluster CA.
- `check_es_instance_health.py` optional ES probes: They invoke `curl` with `-u` only; they do not read `ES_CA_CERTS` / `ES_SSL_FINGERPRINT` / `ES_VERIFY_CERTS` (those names are common for Python Elasticsearch clients). For HTTPS instances, use §7 `curl` with `--cacert` for deep checks, or extend the script later to pass `--cacert` from an env var.
- Python-style env vars (reference for other tooling): `ES_CA_CERTS`, `ES_SSL_FINGERPRINT`, `ES_VERIFY_CERTS=false` (testing only); not wired into this repo’s optional `curl` path today.
[MUST] RAM permission pre-check
Before running this skill, verify the principal has the required RAM permissions. See `references/ram-policies.md` for the full list. If the user reports insufficient permissions, direct them to attach the corresponding policies in the RAM console.
IMPORTANT: Parameter confirmation Confirm the following with the user before any command or API call. Do not assume undeclared defaults or hardcode user-specific parameters.
Boundary controls (MUST)
- Region and `instance-id` must not be guessed or taken from unverified defaults; if they disagree with `DescribeInstance` or the user’s explicit statement, reconfirm.
- Do not apply metrics, logs, or `DescribeInstance` conclusions from instance A to instance B; `ES_ENDPOINT` must match the instance under diagnosis (see Pre-flight validation for Elasticsearch API below).
- This skill is read-only diagnosis: do not invoke mutating control-plane APIs (create, resize, restart, delete instance, etc.). If the user requests a change, provide recommendations only; execution belongs in the console or an approved change workflow.
| Parameter | Required | Description | Default |
|---|---|---|---|
| instance-id | Yes | Elasticsearch instance ID, e.g. es-cn-xxxxx. aliyun flag is --instance-id (not --InstanceId). | - |
| region | Yes | Region ID (e.g. cn-hangzhou). aliyun flag is --region (not --region-id). | - |
| profile | No | Aliyun CLI profile (explicit --profile recommended) | default |
| ES_ENDPOINT | No | Elasticsearch endpoint (direct API access only) | - |
| ES_PASSWORD | No | Elasticsearch admin password (direct API access only) | - |
| --window | No | check_es_instance_health.py: analysis window in minutes | 60 |
| --connect-timeout, --read-timeout | No | check_es_instance_health.py: curl timeouts in seconds for optional ES engine probes when ES_* is set (--connect-timeout → curl --connect-timeout; --read-timeout contributes to curl -m together with connect). | 5 / 10 |
Aliyun CLI shape: For `aliyun elasticsearch` and `aliyun cms`, follow §2.1 MUST (plugin mode only) in every new executable command; do not resurrect legacy `DescribeInstance` / `ListSearchLog`-as-subcommand lines or `--InstanceId`-style flags in session exports or user-facing step lists (they drift from `openapi_cli_collect.py` and fail static checks).
OpenAPI/CMS cannot replace MUST engine APIs. For any §5 MUST table row or `check_es_instance_health.py` rule-engine MUST, Alibaba Cloud OpenAPI and CloudMonitor do not replace the listed Elasticsearch REST calls for engine-level root cause. When feasibility holds, run those `curl` endpoints (see §7); they are complementary layers, not interchangeable.
Feasibility is decided only by checks, not by assumption. Whether the agent may call Elasticsearch must be determined by actually running the Feasibility order (§5): at minimum verify `ES_ENDPOINT` / `ES_PASSWORD` per §2.2, align `ES_ENDPOINT` with `DescribeInstance`, then run an authenticated `GET /_cluster/health`. Do not assume `ES_*` is unset or the path is unreachable without performing these steps in the session.
For Elasticsearch incidents, follow these four steps; each has a distinct role.
Full policy: es-api-diagnosis-strategy.md
Data-plane `curl` collection requires both feasibility and necessity. Feasibility means `ES_ENDPOINT` and `ES_PASSWORD` are set and the network path works; necessity is defined in es-api-diagnosis-strategy.md.
For endpoints listed under a fired MUST table row or rule-engine MUST, necessity for those calls is already satisfied by the trigger; still require feasibility (Feasibility order). For optional engine `curl` not in those lists, apply feasibility and necessity per es-api-diagnosis-strategy.md.
MUST triggers (if any CMS condition below holds, collect the listed Elasticsearch evidence):
| Trigger | Scenario | Required Elasticsearch evidence |
|---|---|---|
| ClusterStatus max ≥ Yellow / Red | Cluster health | allocation/explain, _cat/shards |
| NodeCPUUtilization max > 80% | CPU overload | _nodes/hot_threads, _tasks |
| NodeHeapMemoryUtilization max > 85% | Memory pressure | _nodes/stats/breaker, GET /_cluster/settings?include_defaults=true (indices.breaker.* in transient / persistent) |
| Thread pool rejected > 0 | Performance | _nodes/hot_threads, _nodes/stats/thread_pool |
| Inter-node resource CV > 0.3 | Load imbalance | _cat/shards, _cat/allocation |
| Write failures or index read-only | Disk / watermark / blocks | _cluster/settings, _all/_settings?filter_path=*.settings.index.blocks, _cat/allocation |
| Intermittent Elasticsearch API timeouts + CMS CPU > 80% | Possible cascading failure | _nodes/hot_threads, _nodes/stats/thread_pool, _tasks |
Thread-pool row: interpret search vs write / bulk using sop-query-thread-pool.md vs sop-write-performance.md (see also Write-path / bulk saturation below).
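The MUST-trigger table can be read as a lookup from fired conditions to required evidence. A minimal sketch under stated assumptions: the `signals` dict and its keys (`cluster_status`, `cpu_max`, `heap_max`, and so on) are hypothetical names chosen for this example, not fields emitted by `check_es_instance_health.py`; the thresholds and endpoint names are taken from the table.

```python
def required_es_evidence(signals: dict) -> list:
    """Return the de-duplicated Elasticsearch evidence list for fired MUST rows."""
    cpu_hot = signals.get("cpu_max", 0) > 80
    rows = [
        # (condition fired?, evidence listed in the table row)
        (signals.get("cluster_status") in ("Yellow", "Red"),
         ["allocation/explain", "_cat/shards"]),
        (cpu_hot, ["_nodes/hot_threads", "_tasks"]),
        (signals.get("heap_max", 0) > 85,
         ["_nodes/stats/breaker", "_cluster/settings?include_defaults=true"]),
        (signals.get("thread_pool_rejected", 0) > 0,
         ["_nodes/hot_threads", "_nodes/stats/thread_pool"]),
        (signals.get("resource_cv", 0) > 0.3,
         ["_cat/shards", "_cat/allocation"]),
        (bool(signals.get("write_failures") or signals.get("index_read_only")),
         ["_cluster/settings",
          "_all/_settings?filter_path=*.settings.index.blocks",
          "_cat/allocation"]),
        (bool(signals.get("api_timeouts")) and cpu_hot,
         ["_nodes/hot_threads", "_nodes/stats/thread_pool", "_tasks"]),
    ]
    evidence = []
    for fired, endpoints in rows:
        if not fired:
            continue
        for endpoint in endpoints:
            if endpoint not in evidence:  # one call per endpoint per session
                evidence.append(endpoint)
    return evidence
```

De-duplicating keeps the collection round small when several rows share endpoints such as `_nodes/hot_threads`.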
Rule-engine MUST: If `check_es_instance_health.py` prints a §5 MUST / §5–§7 callout for this run, treat it like a row above: collect that listed ES evidence when feasibility holds.
Binding rule (MUST triggers): If any MUST-trigger row or the rule-engine MUST line above applies, necessity is satisfied for that evidence set; OpenAPI/CMS cannot replace those calls for engine-level root cause (cluster-health: `allocation/explain` + `_cat/shards` for Yellow/Red). Confirm feasibility per Feasibility order below. If reachable with auth, run the MUST-listed endpoints in Step 2 in parallel with control-plane collection. If still blocked after authenticated `GET /_cluster/health`, lead with the blocking reason: unset `ES_*`; transport failure (000, refused, timeout); 401 with `-u`; scheme/TLS mismatch. A 401 on an unauthenticated probe when `ES_PASSWORD` is SET is not a blocking reason.
If `ThreadPool.WriteRejected` or `write` pool stress matches high-QPS bulk indexing, read and follow `references/sop-write-performance.md` §2, subsection “Evidence interpretation: bulk QPS → write pool”, for the evidence chain, `rejected` semantics (cumulative since node start), report ordering vs Old GC / heap (causal chain or dual P0, with the write path before a JVM-only headline), per-node `rejected` / `completed` numbers (reject share), per-node asymmetry, and write-only vs search. Do not lead with a JVM-only narrative when that subsection applies. For write-queue–style acceptance prompts, the opening conclusion should read as write-capacity (data-plane counters + optional CMS rule names), not only a GC/heap headline.
When `_nodes/stats/thread_pool` shows `search.rejected` ≫ `write.rejected` on the same node(s) and `ThreadPool.SearchRejected` / query-driven overload applies, lead the executive summary and P0 ordering with `search` (high concurrent query / terms / slow query; hot index when verified), not `write` first. `write.rejected` may remain P0/P1 as parallel or secondary (bulk, catch-up); Old GC / CPU / node disconnect stay co-stress or cascade. Checker listing order is not proof of narrative order; see acceptance-criteria.md §6.5 and sop-query-thread-pool.md Report narrative.
Recency overrides this magnitude default when time-resolved evidence exists: do not rank the opening story by `search.rejected` vs `write.rejected` alone, because cumulative counters lack timestamps. Full rubric: acceptance-criteria.md §6.5 (P0 / executive order vs `search` ≫ `write`: unless write dominated by time) and §6.6 (Executive order, No false recency from counters). Binding: Timeline and recency (MUST) below (same skill).
`activating` / change workflow stuck (cross-layer root cause)
When an instance stays in `activating`, a change is unfinished, and Red or unassigned shards coexist, follow `references/sop-activating-change-stuck.md` end-to-end (MUST includes `ListActionRecords`, `DescribeInstance` before/after remediation, collection order section 3.1, reporting section 4).
[IMPORTANT]
`ES_ENDPOINT` must match the diagnosed instance
Compare `publicDomain` / `domain` and `protocol` from `DescribeInstance` with `ES_ENDPOINT`. If they differ, warn: ⚠️ ES_ENDPOINT does not match the current instance; run `export ES_ENDPOINT="http://{publicDomain}:9200"` when `protocol` is `HTTP`, or `https://…` only when `protocol` is `HTTPS` (adjust host/port to match the deployment).
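The comparison described here can be sketched with the standard library's URL parser. As before, this assumes the `DescribeInstance` fields have been flattened into a dict with `publicDomain`, `domain`, and `protocol`; `endpoint_matches` is a hypothetical helper, and a bare `host:port` value is treated as `http://`, matching the normalization used by the §7 samples.

```python
from urllib.parse import urlsplit


def endpoint_matches(es_endpoint: str, instance: dict) -> bool:
    """True when ES_ENDPOINT's host and scheme agree with the instance fields."""
    # Accept a bare host:port by assuming plain HTTP.
    url = es_endpoint if "://" in es_endpoint else f"http://{es_endpoint}"
    parts = urlsplit(url)
    known_hosts = {
        str(host).lower()
        for host in (instance.get("publicDomain"), instance.get("domain"))
        if host
    }
    scheme_ok = parts.scheme.upper() == str(instance.get("protocol", "")).upper()
    # urlsplit lowercases the hostname, so compare against lowercased domains.
    return parts.hostname in known_hosts and scheme_ok
```

When the function returns `False`, the agent would emit the warning quoted above instead of reusing evidence gathered from another instance's endpoint.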
[CRITICAL] Guide the user to fix connectivity explicitly; classify failure modes (do not default persistent timeouts to “allowlist only”). Do not imply the agent “forgot” Elasticsearch. If the first answer is CMS/OpenAPI-heavy, give the blocking reason per Feasibility order below: unset `ES_*`; transport errors; 401 with valid `-u`; TLS/scheme. A 401 on a probe without `-u` when `ES_PASSWORD` is SET is not a blocking reason (use authenticated `curl` first).
Progressive playbook (read in order): references/es-api-call-failures.md (sections 1 → 4).
MUST / strategy context: references/es-api-diagnosis-strategy.md (sections 1–3 and 3.5 summary table).
[CRITICAL] If a MUST trigger fires but data-plane evidence is missing, put a warning at the top of the report: follow section 4 of references/es-api-call-failures.md (blocking reason first, then MUST list, missing evidence; if `ES_*` is unset, point to section 2.2 of this SKILL; if vars are set, use es-api-call-failures sections 1–2 for auth vs transport).
Run the lightweight rules engine (17 metric rules) to list P0 / P1 / P2 findings and steer deeper collection:
python3 scripts/check_es_instance_health.py -i <InstanceId> -r <RegionId> [--window <minutes, default 60>] [--profile <profile_name>]
Feasibility order:
1. `ES_*` checks (password = SET only). Do not skip; never infer feasibility without this step.
2. `ES_ENDPOINT` matches `DescribeInstance` `domain` / `publicDomain` (scheme/port).
3. Authenticated `GET /_cluster/health`. Do not stop at 401 on an unauthenticated probe if `ES_PASSWORD` is SET.
Based on Step 1, run collection in parallel (prioritize dimensions with signals).
If a MUST-trigger row or rule-engine MUST applies: run Feasibility order, then run that Required Elasticsearch evidence via curl in the same round (see §7). If no MUST applies, add optional data-plane curl only when feasibility and necessity both hold per the strategy doc.
Re-run check_es_instance_health.py with the same invocation pattern as Step 1; for this parallel round, --window 120 and explicit --profile <profile_name> are common.
To backfill control-plane evidence (DescribeInstance, ListSearchLog, CMS-style calls), use aliyun patterns in references/verification-method.md (epoch times, profiles, namespaces).
Note: data-plane access still requires `ES_ENDPOINT` / `ES_PASSWORD`; the Aliyun CLI cannot replace `curl` to the cluster.
For MUST-trigger rows, necessity for the listed endpoints is already established; do not skip them when feasibility including reachability holds. Outside those rows, avoid unrelated bulk `curl` solely because `ES_*` is set; use the strategy doc’s feasibility + necessity test instead.
Map signals to SOPs and read for deeper reasoning. With multiple signals, process P0 → P1 → P2 for severity, then apply Timeline and recency (MUST) in Step 4 so the narrative order matches when signals mattered in the window—not only static rule-engine print order.
| Observed signal | Read |
|---|---|
| Cluster Red/Yellow, node loss, pending tasks | references/sop-cluster-health.md |
| Long activating, unfinished change records, Red / unassigned shards | references/sop-cluster-health.md + references/sop-activating-change-stuck.md |
| High CPU, load, imbalance | references/sop-cpu-load.md |
| Per-node load imbalance (CPU/memory/disk/shard count) | references/sop-node-load-imbalance.md |
| JVM pressure, GC, circuit breaker, OOM | references/sop-memory-gc.md |
| Disk watermark, IO, write failures (read-only) | references/sop-disk-storage.md |
| Watermark misconfiguration, index blocks, “normal” disk % but write failures | references/sop-disk-storage.md (Section 3 — watermark misconfiguration) |
| Write timeouts / rejections / latency / QPS drop | references/sop-write-performance.md |
| Query timeouts / rejections / slow queries | references/sop-query-thread-pool.md |
| Nodes look down but CPU still reported; all shards failed | references/sop-service-avalanche.md |
| Intermittent Elasticsearch timeouts + CMS CPU > 80% | references/sop-service-avalanche.md |
| Risky settings, Ngram issues, API anomalies | references/sop-configuration.md |
| Event code definitions | references/health-events-catalog.md |
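For the per-node load imbalance row, the MUST table in §5 uses an inter-node CV (coefficient of variation) threshold of 0.3. A minimal sketch of that calculation, assuming the population standard deviation is used; the source does not say whether the rule engine uses the population or sample form, so treat that choice as an assumption.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values) -> float:
    """CV = standard deviation / mean for one per-node metric (e.g. CPU %)."""
    avg = mean(values)
    if avg == 0:
        return 0.0  # all-zero metric: no imbalance to report
    # Population standard deviation (assumption, see lead-in).
    return pstdev(values) / avg


def is_imbalanced(values, threshold: float = 0.3) -> bool:
    """Apply the CV > 0.3 trigger from the MUST table."""
    return coefficient_of_variation(values) > threshold
```

For example, three nodes at 20%, 20%, and 80% CPU have a mean of 40 and a population standard deviation of about 28.3, giving a CV of about 0.71, which exceeds the threshold.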
Acceptance-style optional checklists: references/acceptance-criteria.md §6.1–§6.6, covering Red/Yellow; read-heavy CPU + `search` pool (+ CMS alignment); JVM / breakers / fielddata; write-queue vs GC + `rejected` / `completed`; read-heavy search pool vs GC-only headline (expand in sop-query-thread-pool.md Report narrative: search pool vs GC / CPU headlines); timeline/recency. Bulk/write: references/sop-write-performance.md §2. Shard `reroute`: references/sop-node-load-imbalance.md §1.3 (allocator / change control only).
[CRITICAL] Remediation must match the diagnosed root cause; avoid generic templates. Wrong breaker or concurrency fixes (e.g. `in_flight_requests` vs `request`, “split query” when concurrency is the issue) → see `sop-memory-gc.md` and the fired signal’s SOP.
`activating` + data-plane anomaly: include the one-line cross-layer root cause; see `references/sop-activating-change-stuck.md` section 4.
Report skeleton (copy/fill): references/report-template.md.
Problem: `check_es_instance_health.py` and P0/P1/P2 bands express severity, not when a signal mattered most within the analysis window. Cumulative engine counters (`search.rejected`, `write.rejected`) do not encode recency: write and search issues can both be “real” while only one path dominated the recent past (e.g. search pressure closer to window end than write pressure).
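Because cumulative counters carry no timestamps, one way to recover recency is to take two `_nodes/stats/thread_pool` samples at known times and diff them. The sketch below assumes both arguments are the parsed JSON bodies of that API; `rejected_delta` is a hypothetical helper, and defining reject share as rejected / (rejected + completed) is an assumption made for this example.

```python
def rejected_delta(earlier: dict, later: dict, pool: str) -> dict:
    """Per-node growth of rejected/completed for one pool between two samples."""
    result = {}
    earlier_nodes = earlier.get("nodes", {})
    for node_id, node in later.get("nodes", {}).items():
        before = earlier_nodes.get(node_id)
        if before is None:
            continue  # node absent from the earlier sample
        new = node["thread_pool"][pool]
        old = before["thread_pool"][pool]
        d_rejected = new["rejected"] - old["rejected"]
        d_completed = new["completed"] - old["completed"]
        if d_rejected < 0 or d_completed < 0:
            continue  # counters reset: the node restarted between samples
        total = d_rejected + d_completed
        result[node.get("name", node_id)] = {
            "rejected": d_rejected,
            "completed": d_completed,
            # Assumed definition of reject share (see lead-in).
            "reject_share": d_rejected / total if total else 0.0,
        }
    return result
```

Running it once for `search` and once for `write` over the same pair of samples shows which pool was actually rejecting work during the interval, independent of what accumulated since node start.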
Binding rules for the agent:
- Include a `### Incident timeline (recency-ordered)` (or equivalent) block before or immediately after the executive summary, unless the user explicitly asks for a minimal report. In that block:
  - Analysis window: `{begin} ~ {end}`.
  - `NodeCPUUtilization`, `NodeHeapMemoryUtilization`, GC-related metrics, `ThreadPool.*` if exposed as rates or non-cumulative series in the collected JSON.
  - `ListSearchLog`: correlate query vs index slow entries to minutes.
  - `_nodes/stats/thread_pool` samples at known times to show delta on `rejected` / `completed`; or `_tasks` / `hot_threads` for current skew vs historical cumulative counters.
Use the same `check_es_instance_health.py` command as §5 Step 1 (optional `--window` / `--profile`; default window 60 minutes if omitted).
check_es_instance_health.py accepts external JSON to avoid duplicate calls:
python3 scripts/check_es_instance_health.py \
-i <InstanceId> -r <RegionId> \
--data-source input \
--input-json-file /path/to/diag-input.json
Input JSON shape:
{
"status_info": {},
"metrics": {},
"events": [],
"logs": []
}
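Before passing a file to `--input-json-file`, it can help to confirm that it has the four top-level keys shown above. A minimal sketch: the expected types are read off the sample shape (objects for `status_info` and `metrics`, arrays for `events` and `logs`), and whether the script itself requires every key is an assumption; `validate_diag_input` is a hypothetical helper.

```python
import json

# Top-level keys and JSON types taken from the input shape shown in the text.
EXPECTED_SHAPE = {"status_info": dict, "metrics": dict, "events": list, "logs": list}


def validate_diag_input(text: str) -> list:
    """Return a list of problems found in a diag-input JSON document."""
    data = json.loads(text)
    if not isinstance(data, dict):
        return ["top level should be a JSON object"]
    problems = []
    for key, expected_type in EXPECTED_SHAPE.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            kind = "object" if expected_type is dict else "array"
            problems.append(f"{key} should be a JSON {kind}")
    return problems
```

An empty list means the document matches the skeleton; the contents of each field are not checked.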
--data-source modes:
- `auto`: prefer injected fields; backfill gaps via Aliyun CLI.
- `cli`: ignore injection; fetch everything via CLI.
- `input`: injection only; no OpenAPI calls.
For additional OpenAPI examples, see references/verification-method.md.
When feasibility holds (including reachability), execute the REST calls required by any MUST-trigger row (§5). For endpoints not listed in a fired MUST row, call them only when feasibility and necessity both hold per the strategy doc.
`ES_ENDPOINT` may be `host:port` or a full URL. For the samples below, normalize to `http://${ES_ENDPOINT#http://}` (use `https://` consistently when the cluster serves TLS).
Timeouts: every `curl` must use `--connect-timeout 10 --max-time 30`.
Scope: The cluster-health MUST row uses ClusterStatus max ≥ Yellow (includes Red). Use this set for unassigned / misallocated shard root cause on the engine.
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cluster/health?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
-H "Content-Type: application/json" \
-X POST "http://${ES_ENDPOINT#http://}/_cluster/allocation/explain?pretty" \
-d '{}'
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason&s=state"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cluster/pending_tasks?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_nodes/stats/thread_pool?pretty"
Include `_cluster/settings` when heap / GC / breaker rules fired in Step 1 or `_nodes/stats/breaker` shows concern; read transient and persistent `indices.breaker.*` / `network.breaker.*`.
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_nodes/hot_threads?threads=3"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_nodes/stats/breaker?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cluster/settings?include_defaults=true&pretty"
`GET /_cluster/pending_tasks` and `GET /_nodes/stats/thread_pool` are also listed under Red / Yellow (MUST) above; make one call each per session when both sections apply. If you run only this performance block, add those two `curl` lines from that block.
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cat/nodes?v&s=cpu:desc&h=name,ip,cpu,heap.percent,ram.percent,load_1m"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_nodes/stats/jvm?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cat/allocation?v&bytes=gb"
`GET /_cluster/settings?include_defaults=true` also appears under Query / write performance (MUST) above; reuse one response when both blocks apply. If you run only this SHOULD block, add the same `curl` line from that block.
Protocol sanity (avoid WRONG_VERSION_NUMBER): usually http/https scheme mismatch on ES_ENDPOINT — fix scheme/port and retry.
Scenario → endpoint index: references/es-api-catalog.md.
The knowledge base covers 48+ health-event-style rules and chained scenarios (e.g. disk pressure → allocation → Red). Per-category counts, P0/P1/P2 mix, and event codes: references/health-events-catalog.md — scenario runbooks: references/sop-*.md (index: references/README.md).
Read-only: no mutating control-plane APIs; no teardown.
`rejected`: cumulative unless you show a delta. See sop-query-thread-pool.md §1–2; write/bulk: sop-write-performance.md §2.
- references/verification-method.md: Verification (how to validate diagnosis; metrics, APIs, workflows)
- references/report-template.md: Structured diagnosis report skeleton
- references/README.md: Language map (reference assets and sop-*.md runbooks; English in this repo)
- references/ram-policies.md: RAM policy list
- references/acceptance-criteria.md: Correct/incorrect patterns and acceptance (includes credential and safety anti-patterns)
- references/cli-installation-guide.md: Aliyun CLI installation
- references/es-api-catalog.md: Elasticsearch REST API catalog
- references/health-events-catalog.md: Health event catalog
- references/sop-*.md: Scenario SOPs (e.g. sop-activating-change-stuck.md for activating / change stuck, cross-layer root cause)
- references/es-api-diagnosis-strategy.md: Elasticsearch API diagnosis strategy