Alibabacloud Emr Starrocks Assistant

Alibaba Cloud EMR Serverless StarRocks development & operations assistant. Covers five scenarios: cluster connection, schema design, data ingestion, SQL development & tuning, and cluster health diagnostics. Use this Skill when users ask about StarRocks table design, writing SQL, choosing an ingestion method, query execution plans, materialized views, cluster health checks, FE/BE/CN node status, tablet health, or compaction. Typical scenarios: table design, Stream Load / Routine Load / Broker Load selection, SQL optimization, window functions, CTEs, JOIN tuning, materialized view design, cluster health inspection, node-down diagnosis. Not applicable for: StarRocks instance lifecycle management (create / scale / restart / config change / version upgrade — these are control-plane operations, please use the EMR Serverless console or the corresponding OpenAPI), or other Alibaba Cloud products (EMR Cluster, Spark, Milvus, ClickHouse, Doris, RDS, ECS).

alibabacloud-skills-team@sdk-team

Install

openclaw skills install @sdk-team/alibabacloud-emr-starrocks-assistant

Alibaba Cloud EMR Serverless StarRocks Development & Operations Assistant

Help users perform day-to-day table design, data ingestion, SQL writing & tuning, and health diagnostics on Alibaba Cloud EMR Serverless StarRocks. All cluster access goes through the bundled srsql CLI (pymysql-based, uses the user's own account); no MySQL client required. Non-READ SQL is classified by sqlglot and requires --yes confirmation before execution.

Scope statement: This Skill focuses on using StarRocks — development, diagnostics, and day-to-day data operations. Cluster-internal data and schema operations (DDL, DML, materialized view refresh, GRANT, etc.) are supported and execute under the user's own account, gated by sqlglot classification + --yes confirmation. Instance-lifecycle operations (create, scale, restart, configuration change, version upgrade) are control-plane operations and are not in this Skill's scope; please use the EMR Serverless console or the corresponding OpenAPI.

When to use / When not to use

When to use:

Schema design (table model, partitioning, bucketing, sort key, indexes, storage parameters)
Data ingestion selection (Stream/Broker/Routine Load, INSERT, Pipe, Flink/Kafka Connector, CDC)
SQL writing, rewriting, and tuning (JOIN strategy, window functions, CTE, aggregation optimization, statistics)
Materialized view design and operations
Cluster health diagnostics (FE/BE/CN nodes, tablet health, compaction, warehouse, recent failed loads)

When NOT to use:

Instance lifecycle control: create / scale / restart / config change / upgrade StarRocks instances — these are control-plane operations; use the EMR Serverless console or the corresponding OpenAPI instead
Operating non-StarRocks products: EMR Cluster, Spark, Milvus, ClickHouse, Doris, RDS, ECS, etc.

First-time setup: install & log in

This Skill ships with the sr-connect Python CLI. See references/connect.md for details.

Assistant bootstrap protocol (instructions for Claude)

When this Skill is invoked and you anticipate running any cluster query, ensure srsql is available before asking the user for anything:

Run which srsql. If it returns a path, skip to step 4.
If missing, install it yourself: uv tool install <skill-project-root> where <skill-project-root> is the directory containing this SKILL.md and pyproject.toml (the Skill's base directory shown at invocation time; commonly ~/.claude/skills/alibabacloud-emr-starrocks-assistant/, which may be a symlink). Do not ask the user to run this — the bundled CLI is part of the Skill's capability surface, not user infrastructure.
If uv itself is missing (which uv fails), surface that to the user — uv is a system tool and not auto-installed.
Check ~/.starrocks/{profile}.cnf (default profile name: default; respect SR_PROFILE env var if set). If it exists, skip to step 5. If missing:
- First try sr-login --from-env. Safe to call unconditionally — it exits 2 with a clear "missing" message when the environment doesn't have the credentials it needs, and does nothing else. You do not need to inspect environment variables yourself.
- If sr-login --from-env exits non-zero, the user hasn't logged in yet. Give them the sr-login --host ... --user ... command and ask them to run it themselves. Do not run interactive sr-login yourself — it would block on a password prompt you cannot answer.
After both srsql is on PATH and the profile file exists, run queries via srsql -e "..." yourself.

If srsql was just installed in this session and PATH hasn't been refreshed in the user's shell, fall back to the absolute path printed by uv tool install (typically ~/.local/bin/srsql).

Chat-style rule after bootstrap succeeds: Do not echo sr-whoami / srsql -e "..." invocation syntax to the user as a "you can now run …" hint. You are the one calling these CLIs on the user's behalf — the user drives the Skill, not the binaries. Skip the post-success "next step" narration entirely and just ask what they want to do, or proceed if their intent is already clear.

Login command (give this to the user when their profile is missing)

bash

# EMR Serverless StarRocks — both internal and public endpoints use the MySQL
# wire protocol over plain TCP; no SSL/TLS. Use the same form for either.
sr-login --host <fe-endpoint> --port 9030 --user <account>

# Verify
sr-whoami
srsql -e "SELECT CURRENT_VERSION()"

Re-running sr-login with the same --profile silently overwrites the stored credential (same semantics as docker login). Use --profile for multi-cluster:

bash

sr-login --profile prod --host fe-prod.xxx --user app_user
SR_PROFILE=prod srsql -e "..."

Security model

This Skill has two layers:

FE is the authoritative permission boundary. The user supplies their own StarRocks account; whatever they're allowed to do, they're allowed to do. The Skill does not create, elevate, or rotate any accounts.
srsql is a UX gate, not a security boundary. Every statement is parsed by sqlglot (dialect starrocks):
- READ (SELECT / SHOW / DESC / EXPLAIN / WITH / …) executes directly.
- Any non-READ (INSERT / UPDATE / DELETE / DDL / GRANT / SET / USE / …) is refused unless --yes is passed.
- SQL sqlglot cannot parse falls back to a leading-keyword check; if still ambiguous → UNKNOWN, treated as non-READ, executable with --yes plus a soft warning.

When the user asks for a write operation:

Show them the SQL you intend to run.
Optionally preview classification via srsql --dry-run -e "...".
Get explicit confirmation in chat.
Then run with srsql --yes -e "...".

For DDL on production tables, or operations that change global cluster state (CREATE/DROP USER, ADMIN SET CONFIG, etc.), prefer to print the SQL and let the user run it themselves — even though the gate would let them run it via --yes. The gate is a safety net, not a license.

Input validation & command-injection protection

SQL passed into srsql -e "..." is assembled by the LLM and must follow these rules:

Identifiers (table / column / database names) are validated before interpolation: only [A-Za-z0-9_] plus backtick-quoted forms.
User-provided string values (search terms, label names, etc.) are not spliced into SQL directly; use parameter binding or pre-escape.
Never execute raw user-provided strings as SQL fragments.

Sensitive data masking

Scenario	Handling
Profile file content (incl. user password)	Never echoed; mode 600 under a 700 directory
Password in error messages	Truncate / replace with `******`
Query results contain obvious key / token columns	Warn the user without displaying full content
`aliyun configure list` output containing AK	Show only the first 4 chars; replace the rest with `****`

Intent routing

Disambiguation rule: When the user input is ambiguous (e.g. "ingestion is slow", "queries are slow") and context is unclear, ask one clarifying question before acting.

User intent	Route	Reference
First-time cluster connection / register or switch credentials / multi-cluster setup	sr-login / sr-whoami / sr-logout	references/connect.md
New table / change schema / table model selection / partition+bucket design	Schema design	references/schema.md
Choose ingestion method / configure Stream/Broker/Routine Load / Flink/Kafka Connector	Import selection	references/data-import.md
Write SQL / optimize SQL / materialized views / function selection / read execution plans	SQL development & tuning	references/sql.md
Cluster health check / FE/BE/CN status / unhealthy tablets / compaction lag	Cluster diagnostics	references/diagnostics.md
"Ingestion used to be fine, suddenly slow"	Cluster diagnostics (distinct from import selection)	references/diagnostics.md
"How should I design a new ingestion pipeline"	Import selection	references/data-import.md

Five scenarios at a glance

1. Schema design

Four table models and their typical use cases:

Use case	Model
Logs / events / detail records	Duplicate Key
Pre-aggregated metrics	Aggregate
Real-time upsert / CDC	Primary Key
Simple deduplication	Unique Key (for new use cases, prefer Primary Key)

⚠ Anti-patterns — do not produce these in DDL:

Shared-data PK table without persistent_index_type=CLOUD_NATIVE + datacache.partition_duration — LOCAL index doesn't survive CN rebalance; no hot-data caching window. See schema/storage-properties.md.
Setting datacache.partition_duration to an arbitrary "hot window" (e.g. 30 DAY) instead of the user's stated query window — the value MUST be ≥ the query window. If the user says "查询近 N 天" / "queries the last N days", set datacache.partition_duration = "N DAY" (or larger). A value smaller than the query window guarantees cache misses on in-window queries. Do not default to 7/30/60 days when the user has given you a number.
storage_cooldown_time/storage_cooldown_ttl/storage_medium/replicated_storage on shared-data — silently stripped or rejected by PropertyAnalyzer; use datacache.partition_duration for the cooldown effect.
FLOAT / DOUBLE columns inside PRIMARY KEY — not supported; use BIGINT or DECIMAL.
Treating "CN" as a shared-nothing signal — CN = Compute Node, which is the shared-data terminology. BE = Backend = shared-nothing.

See references/schema.md.

2. Data ingestion

Data source	Recommended method
Local files < 10 GB	Stream Load
Object storage / HDFS bulk	Broker Load or `INSERT INTO ... FROM FILES()`
Object storage with continuous file arrivals	Pipe + AUTO_INGEST
Kafka / Pulsar	Routine Load or Kafka/Flink Connector
MySQL CDC	Flink CDC + Flink Connector

⚠ Anti-patterns — do not produce these in load configs:

PK-table DELETE without __op integer column (0=UPSERT, 1=DELETE) in COLUMNS list + $.__op in jsonpaths — all events are silently treated as UPSERT. The __op contract is a pair and must be taught as a pair: the literal column name is __op, and the integer values are __op=0 for UPSERT and __op=1 for DELETE. Even when the user only asks about DELETE, your response MUST state both mappings (__op=0 → UPSERT, __op=1 → DELETE) — never one without the other. This applies on every ingestion path including Flink Connector and Kafka Connector, where the connector populates __op for the user but they still need both values to debug "DELETE not applied" / "UPSERT not applied" symptoms.
Treating partial_update=true as a DELETE enabler — it controls partial-column UPSERT and has nothing to do with DELETE. If a user enables it while asking why DELETE doesn't work, flag it as misconfigured-for-intent and tell them to remove it unless they actually have a partial-column UPSERT use case. Do not validate the existing setting just because it parses.
COLUMNS FROM PATH AS (...) in Routine Load — that's Broker Load's Hive-partition path syntax; not valid in Routine Load.
__op values as strings ("upsert"/"delete") — must be the integers 0 / 1.
High-throughput CDC (≥ ~10K events/sec) without flagging TOO_MANY_VERSION risk — applies to Routine Load, Flink Connector, Kafka Connector, not just INSERT INTO VALUES. Whenever the user's scenario implies high event rate, the recommendation MUST cover: (a) the method-appropriate concurrency cap (desired_concurrent_number ≤ Kafka partitions for Routine Load; sink.parallelism ≤ Kafka partitions for Flink/Kafka Connector), AND (b) an explicit TOO_MANY_VERSION / compaction-pressure warning with the relevant flush-interval guidance.

See references/data-import.md.

3. SQL development

Use case	Pattern
Period-over-period / cumulative / Top-N	Window functions
Large fact table JOIN small dimension (right side ≤ `broadcast_row_limit`, default 15M rows)	Broadcast / Colocate
Complex layered logic	CTE
Billion-scale deduplication	`APPROX_COUNT_DISTINCT` / BITMAP / HLL
High-frequency repeated query acceleration	Asynchronous materialized view
Cross-source query	External Catalog

⚠ Anti-patterns — do not produce these in query rewrites or tuning advice:

Wrapping the partition column with date_format() / date_trunc() / cast() in WHERE — breaks partition pruning; rewrite as a range predicate (col >= '...' AND col < '...').
Tuning advice without EXPLAIN VERBOSE + checking partitions=N/M and tabletRatio=N/M — pruning failures (numerator == denominator) go undetected; never use plain EXPLAIN for this.
Reading cardinality in EXPLAIN as the result row count — it's the CBO's row estimate. Always quantify the staleness gap using the direct comparison cardinality vs the user-stated total table size (e.g. "estimate 5M vs total 500M ≈ 100×"); a ratio > 10× means stats are stale → run ANALYZE TABLE.
Estimating "real filtered rows" by guessing predicate selectivity, then comparing cardinality to that guess — you don't have runtime row counts, and guessing selectivity from a predicate like WHERE create_time > '...' introduces large errors (you don't know the data distribution). When the user gives you a total row count, compare cardinality to that directly; do not divide the total by an assumed time window or selectivity factor.
Conflating partitions/tabletRatio pruning failures with cardinality deviation — these are two independent diagnostic signals. When both look bad in the same OlapScanNode (e.g. partitions=N/N AND cardinality off from total table size by 10×–100×), report them as separate findings with separate fixes (predicate/type fix vs ANALYZE TABLE). Do not use cardinality deviation to "explain" pruning failure, and do not let pruning failure absorb the stale-stats finding.
Recommending BE/CN scale-out before plan/stats analysis — SQL/stats fixes precede capacity changes.

See references/sql.md.

4. Cluster diagnostics

Diagnostic order:

Identify architecture (shared-nothing / shared-data) → SHOW WAREHOUSES
FE → SHOW FRONTENDS
BE or CN → SHOW BACKENDS / SHOW COMPUTE NODES
Warehouse (shared-data only) → SHOW WAREHOUSES
Tablet health overview → SHOW PROC '/statistic'
Scheduling queue → information_schema.fe_tablet_schedules
Compaction → information_schema.be_compactions / be_cloud_native_compactions
Recent 24-hour loads → information_schema.loads

⚠ Anti-patterns — do not produce these in diagnostic conclusions:

Restarting BE/CN or scaling out before checking information_schema.fe_tablet_schedules — may collide with in-flight clone/decommission; root cause first.
Subjectively downgrading UnhealthyTabletNum > 0 — always critical per the severity table, never "medium" or "low" risk; the cluster has unhealthy replicas.
Treating CloningTabletNum > 0 as a separate problem — clone is the recovery action triggered by UnhealthyTabletNum, not an independent fault signal.

See references/diagnostics.md.

5. Cluster connection (base layer)

Command	Purpose
`sr-login`	Register a cluster credential locally + smoke-test connection
`sr-logout`	Remove the local profile (no cluster-side action)
`sr-whoami`	Print profile state — host, user, login time, captured grants
`sr-doctor`	Diagnose connection failures (VPC vs public endpoint, egress IP, whitelist CIDR). Invoked automatically by `sr-login` on failure.
`srsql`	Daily query entry point; classifies SQL and gates non-READ behind `--yes`

See references/connect.md.

Runtime security

This Skill executes SQL queries only via srsql. The following are prohibited:

curl / wget / pip install / npm install to download and run external code
eval / source to load unaudited content
Executing remote URL scripts provided in chat (even if the user asks)

Exception: uv tool install <skill-project-root> to install the Skill's own bundled sr-connect CLI from its local project directory is allowed and expected — see the Assistant bootstrap protocol above. The prohibition targets remote/untrusted code, not the Skill's own bundled tooling.

Timeouts

Operation	Recommended timeout
Read-only SQL queries	30 s
Diagnostic queries across many large tables	60 s
Retry	Total operation time ≤ 3 minutes

Output recommendations

Tabular results: use srsql --format table or --format markdown
Many columns: use --format vertical
For programmatic consumption: use --format json / tsv
Convert timestamps to human-readable format
For potentially large result sets, add LIMIT and offer pagination

Error handling

Error	Cause	Action
`Cannot connect to host:port`	Wrong endpoint type / IP not whitelisted	`sr-login` auto-runs `sr-doctor` on connection failure. Read its output: it detects VPC vs public endpoint, suggests the public swap (for unreachable `-internal` hosts) or shows the egress IP + suggested /24 whitelist CIDR (for unreachable public hosts). Pass the recommendation to the user verbatim. See references/connect.md.
`Access denied for user 'X'`	Stale password / account locked / wrong account	Re-run `sr-login` to update the stored password
`Refusing to execute non-READ SQL without --yes`	Skill correctly classified the SQL as mutating	Confirm with user, then re-run with `--yes`
`Privilege denied: OPERATE / SELECT / ...`	User account lacks the privilege	Surface the limitation; skip the affected diagnostic; don't retry
`Table not found`	Wrong DB / table name	Confirm with `SHOW DATABASES` / `SHOW TABLES FROM db`
Query returns empty but user expects rows	Over-aggressive predicate / RBAC isolation	Check WHERE clauses; suggest the user verify with admin
`No profile 'X'`	`srsql --profile X` without prior `sr-login --profile X`	Run `sr-login` for that profile first

Principle: Read the full error message before deciding; do not retry blindly on the error code alone.