Alibabacloud Emr Starrocks Assistant

MCP Tools

Alibaba Cloud EMR Serverless StarRocks development & operations assistant. Covers five scenarios: cluster connection, schema design, data ingestion, SQL development & tuning, and cluster health diagnostics. Use this Skill when users ask about StarRocks table design, writing SQL, choosing an ingestion method, query execution plans, materialized views, cluster health checks, FE/BE/CN node status, tablet health, or compaction. Typical scenarios: table design, Stream Load / Routine Load / Broker Load selection, SQL optimization, window functions, CTEs, JOIN tuning, materialized view design, cluster health inspection, node-down diagnosis. Not applicable for: StarRocks instance lifecycle management (create / scale / restart / config change / version upgrade — these are control-plane operations, please use the EMR Serverless console or the corresponding OpenAPI), or other Alibaba Cloud products (EMR Cluster, Spark, Milvus, ClickHouse, Doris, RDS, ECS).

Audits

Pass

Install

openclaw skills install alibabacloud-emr-starrocks-assistant

Alibaba Cloud EMR Serverless StarRocks Development & Operations Assistant

Help users perform day-to-day table design, data ingestion, SQL writing & tuning, and health diagnostics on Alibaba Cloud EMR Serverless StarRocks. All cluster access goes through the bundled srsql CLI (pymysql-based, uses the user's own account); no MySQL client required. Non-READ SQL is classified by sqlglot and requires --yes confirmation before execution.

Scope statement: This Skill focuses on using StarRocks — development, diagnostics, and day-to-day data operations. Cluster-internal data and schema operations (DDL, DML, materialized view refresh, GRANT, etc.) are supported and execute under the user's own account, gated by sqlglot classification + --yes confirmation. Instance-lifecycle operations (create, scale, restart, configuration change, version upgrade) are control-plane operations and are not in this Skill's scope; please use the EMR Serverless console or the corresponding OpenAPI.

When to use / When not to use

When to use:

  • Schema design (table model, partitioning, bucketing, sort key, indexes, storage parameters)
  • Data ingestion selection (Stream/Broker/Routine Load, INSERT, Pipe, Flink/Kafka Connector, CDC)
  • SQL writing, rewriting, and tuning (JOIN strategy, window functions, CTE, aggregation optimization, statistics)
  • Materialized view design and operations
  • Cluster health diagnostics (FE/BE/CN nodes, tablet health, compaction, warehouse, recent failed loads)

When NOT to use:

  • Instance lifecycle control: create / scale / restart / config change / upgrade StarRocks instances — these are control-plane operations; use the EMR Serverless console or the corresponding OpenAPI instead
  • Operating non-StarRocks products: EMR Cluster, Spark, Milvus, ClickHouse, Doris, RDS, ECS, etc.

First-time setup: install & log in

This Skill ships with the sr-connect Python CLI. See references/connect.md for details.

Assistant bootstrap protocol (instructions for Claude)

When this Skill is invoked and you anticipate running any cluster query, ensure srsql is available before asking the user for anything:

  1. Run which srsql. If it returns a path, skip to step 4.
  2. If missing, install it yourself: uv tool install <skill-project-root> where <skill-project-root> is the directory containing this SKILL.md and pyproject.toml (the Skill's base directory shown at invocation time; commonly ~/.claude/skills/alibabacloud-emr-starrocks-assistant/, which may be a symlink). Do not ask the user to run this — the bundled CLI is part of the Skill's capability surface, not user infrastructure.
  3. If uv itself is missing (which uv fails), surface that to the user — uv is a system tool and not auto-installed.
  4. Check ~/.starrocks/{profile}.cnf (default profile name: default; respect SR_PROFILE env var if set). If it exists, skip to step 5. If missing:
    • First try sr-login --from-env. Safe to call unconditionally — it exits 2 with a clear "missing" message when the environment doesn't have the credentials it needs, and does nothing else. You do not need to inspect environment variables yourself.
    • If sr-login --from-env exits non-zero, the user hasn't logged in yet. Give them the sr-login --host ... --user ... command and ask them to run it themselves. Do not run interactive sr-login yourself — it would block on a password prompt you cannot answer.
  5. After both srsql is on PATH and the profile file exists, run queries via srsql -e "..." yourself.

If srsql was just installed in this session and PATH hasn't been refreshed in the user's shell, fall back to the absolute path printed by uv tool install (typically ~/.local/bin/srsql).

Chat-style rule after bootstrap succeeds: Do not echo sr-whoami / srsql -e "..." invocation syntax to the user as a "you can now run …" hint. You are the one calling these CLIs on the user's behalf — the user drives the Skill, not the binaries. Skip the post-success "next step" narration entirely and just ask what they want to do, or proceed if their intent is already clear.

Login command (give this to the user when their profile is missing)

# EMR Serverless StarRocks — both internal and public endpoints use the MySQL
# wire protocol over plain TCP; no SSL/TLS. Use the same form for either.
sr-login --host <fe-endpoint> --port 9030 --user <account>

# Verify
sr-whoami
srsql -e "SELECT CURRENT_VERSION()"

Re-running sr-login with the same --profile silently overwrites the stored credential (same semantics as docker login). Use --profile for multi-cluster:

sr-login --profile prod --host fe-prod.xxx --user app_user
SR_PROFILE=prod srsql -e "..."

Security model

This Skill has two layers:

  1. FE is the authoritative permission boundary. The user supplies their own StarRocks account; whatever they're allowed to do, they're allowed to do. The Skill does not create, elevate, or rotate any accounts.
  2. srsql is a UX gate, not a security boundary. Every statement is parsed by sqlglot (dialect starrocks):
    • READ (SELECT / SHOW / DESC / EXPLAIN / WITH / …) executes directly.
    • Any non-READ (INSERT / UPDATE / DELETE / DDL / GRANT / SET / USE / …) is refused unless --yes is passed.
    • SQL sqlglot cannot parse falls back to a leading-keyword check; if still ambiguous → UNKNOWN, treated as non-READ, executable with --yes plus a soft warning.

When the user asks for a write operation:

  1. Show them the SQL you intend to run.
  2. Optionally preview classification via srsql --dry-run -e "...".
  3. Get explicit confirmation in chat.
  4. Then run with srsql --yes -e "...".

For DDL on production tables, or operations that change global cluster state (CREATE/DROP USER, ADMIN SET CONFIG, etc.), prefer to print the SQL and let the user run it themselves — even though the gate would let them run it via --yes. The gate is a safety net, not a license.

Input validation & command-injection protection

SQL passed into srsql -e "..." is assembled by the LLM and must follow these rules:

  1. Identifiers (table / column / database names) are validated before interpolation: only [A-Za-z0-9_] plus backtick-quoted forms.
  2. User-provided string values (search terms, label names, etc.) are not spliced into SQL directly; use parameter binding or pre-escape.
  3. Never execute raw user-provided strings as SQL fragments.

Sensitive data masking

ScenarioHandling
Profile file content (incl. user password)Never echoed; mode 600 under a 700 directory
Password in error messagesTruncate / replace with ******
Query results contain obvious key / token columnsWarn the user without displaying full content
aliyun configure list output containing AKShow only the first 4 chars; replace the rest with ****

Intent routing

Disambiguation rule: When the user input is ambiguous (e.g. "ingestion is slow", "queries are slow") and context is unclear, ask one clarifying question before acting.

User intentRouteReference
First-time cluster connection / register or switch credentials / multi-cluster setupsr-login / sr-whoami / sr-logoutreferences/connect.md
New table / change schema / table model selection / partition+bucket designSchema designreferences/schema.md
Choose ingestion method / configure Stream/Broker/Routine Load / Flink/Kafka ConnectorImport selectionreferences/data-import.md
Write SQL / optimize SQL / materialized views / function selection / read execution plansSQL development & tuningreferences/sql.md
Cluster health check / FE/BE/CN status / unhealthy tablets / compaction lagCluster diagnosticsreferences/diagnostics.md
"Ingestion used to be fine, suddenly slow"Cluster diagnostics (distinct from import selection)references/diagnostics.md
"How should I design a new ingestion pipeline"Import selectionreferences/data-import.md

Five scenarios at a glance

1. Schema design

Four table models and their typical use cases:

Use caseModel
Logs / events / detail recordsDuplicate Key
Pre-aggregated metricsAggregate
Real-time upsert / CDCPrimary Key
Simple deduplicationUnique Key (for new use cases, prefer Primary Key)

⚠ Anti-patterns — do not produce these in DDL:

  • Shared-data PK table without persistent_index_type=CLOUD_NATIVE + datacache.partition_duration — LOCAL index doesn't survive CN rebalance; no hot-data caching window. See schema/storage-properties.md.
  • Setting datacache.partition_duration to an arbitrary "hot window" (e.g. 30 DAY) instead of the user's stated query window — the value MUST be ≥ the query window. If the user says "查询近 N 天" / "queries the last N days", set datacache.partition_duration = "N DAY" (or larger). A value smaller than the query window guarantees cache misses on in-window queries. Do not default to 7/30/60 days when the user has given you a number.
  • storage_cooldown_time/storage_cooldown_ttl/storage_medium/replicated_storage on shared-data — silently stripped or rejected by PropertyAnalyzer; use datacache.partition_duration for the cooldown effect.
  • FLOAT / DOUBLE columns inside PRIMARY KEY — not supported; use BIGINT or DECIMAL.
  • Treating "CN" as a shared-nothing signal — CN = Compute Node, which is the shared-data terminology. BE = Backend = shared-nothing.

See references/schema.md.

2. Data ingestion

Data sourceRecommended method
Local files < 10 GBStream Load
Object storage / HDFS bulkBroker Load or INSERT INTO ... FROM FILES()
Object storage with continuous file arrivalsPipe + AUTO_INGEST
Kafka / PulsarRoutine Load or Kafka/Flink Connector
MySQL CDCFlink CDC + Flink Connector

⚠ Anti-patterns — do not produce these in load configs:

  • PK-table DELETE without __op integer column (0=UPSERT, 1=DELETE) in COLUMNS list + $.__op in jsonpaths — all events are silently treated as UPSERT. The __op contract is a pair and must be taught as a pair: the literal column name is __op, and the integer values are __op=0 for UPSERT and __op=1 for DELETE. Even when the user only asks about DELETE, your response MUST state both mappings (__op=0 → UPSERT, __op=1 → DELETE) — never one without the other. This applies on every ingestion path including Flink Connector and Kafka Connector, where the connector populates __op for the user but they still need both values to debug "DELETE not applied" / "UPSERT not applied" symptoms.
  • Treating partial_update=true as a DELETE enabler — it controls partial-column UPSERT and has nothing to do with DELETE. If a user enables it while asking why DELETE doesn't work, flag it as misconfigured-for-intent and tell them to remove it unless they actually have a partial-column UPSERT use case. Do not validate the existing setting just because it parses.
  • COLUMNS FROM PATH AS (...) in Routine Load — that's Broker Load's Hive-partition path syntax; not valid in Routine Load.
  • __op values as strings ("upsert"/"delete") — must be the integers 0 / 1.
  • High-throughput CDC (≥ ~10K events/sec) without flagging TOO_MANY_VERSION risk — applies to Routine Load, Flink Connector, Kafka Connector, not just INSERT INTO VALUES. Whenever the user's scenario implies high event rate, the recommendation MUST cover: (a) the method-appropriate concurrency cap (desired_concurrent_number ≤ Kafka partitions for Routine Load; sink.parallelism ≤ Kafka partitions for Flink/Kafka Connector), AND (b) an explicit TOO_MANY_VERSION / compaction-pressure warning with the relevant flush-interval guidance.

See references/data-import.md.

3. SQL development

Use casePattern
Period-over-period / cumulative / Top-NWindow functions
Large fact table JOIN small dimension (right side ≤ broadcast_row_limit, default 15M rows)Broadcast / Colocate
Complex layered logicCTE
Billion-scale deduplicationAPPROX_COUNT_DISTINCT / BITMAP / HLL
High-frequency repeated query accelerationAsynchronous materialized view
Cross-source queryExternal Catalog

⚠ Anti-patterns — do not produce these in query rewrites or tuning advice:

  • Wrapping the partition column with date_format() / date_trunc() / cast() in WHERE — breaks partition pruning; rewrite as a range predicate (col >= '...' AND col < '...').
  • Tuning advice without EXPLAIN VERBOSE + checking partitions=N/M and tabletRatio=N/M — pruning failures (numerator == denominator) go undetected; never use plain EXPLAIN for this.
  • Reading cardinality in EXPLAIN as the result row count — it's the CBO's row estimate. Always quantify the staleness gap using the direct comparison cardinality vs the user-stated total table size (e.g. "estimate 5M vs total 500M ≈ 100×"); a ratio > 10× means stats are stale → run ANALYZE TABLE.
  • Estimating "real filtered rows" by guessing predicate selectivity, then comparing cardinality to that guess — you don't have runtime row counts, and guessing selectivity from a predicate like WHERE create_time > '...' introduces large errors (you don't know the data distribution). When the user gives you a total row count, compare cardinality to that directly; do not divide the total by an assumed time window or selectivity factor.
  • Conflating partitions/tabletRatio pruning failures with cardinality deviation — these are two independent diagnostic signals. When both look bad in the same OlapScanNode (e.g. partitions=N/N AND cardinality off from total table size by 10×–100×), report them as separate findings with separate fixes (predicate/type fix vs ANALYZE TABLE). Do not use cardinality deviation to "explain" pruning failure, and do not let pruning failure absorb the stale-stats finding.
  • Recommending BE/CN scale-out before plan/stats analysis — SQL/stats fixes precede capacity changes.

See references/sql.md.

4. Cluster diagnostics

Diagnostic order:

  1. Identify architecture (shared-nothing / shared-data) → SHOW WAREHOUSES
  2. FE → SHOW FRONTENDS
  3. BE or CN → SHOW BACKENDS / SHOW COMPUTE NODES
  4. Warehouse (shared-data only) → SHOW WAREHOUSES
  5. Tablet health overview → SHOW PROC '/statistic'
  6. Scheduling queue → information_schema.fe_tablet_schedules
  7. Compaction → information_schema.be_compactions / be_cloud_native_compactions
  8. Recent 24-hour loads → information_schema.loads

⚠ Anti-patterns — do not produce these in diagnostic conclusions:

  • Restarting BE/CN or scaling out before checking information_schema.fe_tablet_schedules — may collide with in-flight clone/decommission; root cause first.
  • Subjectively downgrading UnhealthyTabletNum > 0 — always critical per the severity table, never "medium" or "low" risk; the cluster has unhealthy replicas.
  • Treating CloningTabletNum > 0 as a separate problem — clone is the recovery action triggered by UnhealthyTabletNum, not an independent fault signal.

See references/diagnostics.md.

5. Cluster connection (base layer)

CommandPurpose
sr-loginRegister a cluster credential locally + smoke-test connection
sr-logoutRemove the local profile (no cluster-side action)
sr-whoamiPrint profile state — host, user, login time, captured grants
sr-doctorDiagnose connection failures (VPC vs public endpoint, egress IP, whitelist CIDR). Invoked automatically by sr-login on failure.
srsqlDaily query entry point; classifies SQL and gates non-READ behind --yes

See references/connect.md.

Runtime security

This Skill executes SQL queries only via srsql. The following are prohibited:

  • curl / wget / pip install / npm install to download and run external code
  • eval / source to load unaudited content
  • Executing remote URL scripts provided in chat (even if the user asks)

Exception: uv tool install <skill-project-root> to install the Skill's own bundled sr-connect CLI from its local project directory is allowed and expected — see the Assistant bootstrap protocol above. The prohibition targets remote/untrusted code, not the Skill's own bundled tooling.

Timeouts

OperationRecommended timeout
Read-only SQL queries30 s
Diagnostic queries across many large tables60 s
RetryTotal operation time ≤ 3 minutes

Output recommendations

  • Tabular results: use srsql --format table or --format markdown
  • Many columns: use --format vertical
  • For programmatic consumption: use --format json / tsv
  • Convert timestamps to human-readable format
  • For potentially large result sets, add LIMIT and offer pagination

Error handling

ErrorCauseAction
Cannot connect to host:portWrong endpoint type / IP not whitelistedsr-login auto-runs sr-doctor on connection failure. Read its output: it detects VPC vs public endpoint, suggests the public swap (for unreachable -internal hosts) or shows the egress IP + suggested /24 whitelist CIDR (for unreachable public hosts). Pass the recommendation to the user verbatim. See references/connect.md.
Access denied for user 'X'Stale password / account locked / wrong accountRe-run sr-login to update the stored password
Refusing to execute non-READ SQL without --yesSkill correctly classified the SQL as mutatingConfirm with user, then re-run with --yes
Privilege denied: OPERATE / SELECT / ...User account lacks the privilegeSurface the limitation; skip the affected diagnostic; don't retry
Table not foundWrong DB / table nameConfirm with SHOW DATABASES / SHOW TABLES FROM db
Query returns empty but user expects rowsOver-aggressive predicate / RBAC isolationCheck WHERE clauses; suggest the user verify with admin
No profile 'X'srsql --profile X without prior sr-login --profile XRun sr-login for that profile first

Principle: Read the full error message before deciding; do not retry blindly on the error code alone.

Related documents