<!-- Auto-generated for OpenClaw by pack-openclaw. Notes for OpenClaw users:
- Claude Code dynamic expressions (!`...`) in this file are NOT evaluated by OpenClaw
and appear as literal text. Run them manually at the start of the workflow.
- Invoke this skill only via slash command (e.g. /amg-check-cosmosdb-mongo-ru). Auto-invocation is
disabled on Claude Code but not on OpenClaw. -->
OpenClaw Setup (one-time)
This skill calls MCP tools prefixed with mcp__amg__*, so OpenClaw must have an MCP server registered under the exact name amg. Run this once per workspace before invoking the skill:
openclaw mcp set amg '{"url":"https://<your-grafana-instance>/api/azure-mcp","transport":"streamable-http","headers":{"Authorization":"Bearer <your-token>"}}'
Replace <your-grafana-instance> with your Azure Managed Grafana endpoint and <your-token> with a valid Grafana service-account token (starts with glsa_). The server name must be amg — the skill's allowed-tools reference mcp__amg__* and will not find tools under any other name.
Verify the server is registered:
openclaw mcp list
Official skill source: https://github.com/Azure/amg-skills
Runtime Context
- Current UTC time: !`date -u +%Y-%m-%dT%H:%M:%SZ`
- Config: !`cat memory/amg-check-cosmosdb-mongo-ru/config.md 2>/dev/null || echo "NOT_CONFIGURED"`
- Prior report: !`[ -f memory/amg-check-cosmosdb-mongo-ru/report.md ] && echo "exists ($(grep -c '^### BUG-' memory/amg-check-cosmosdb-mongo-ru/report.md) bugs documented)" || echo "not found"`
- Arguments: time-range=$0, subscription-override=$1
Known Issues: Before presenting findings, cross-reference results against memory/amg-check-cosmosdb-mongo-ru/report.md.
Cosmos DB for MongoDB (RU) Health Check
Critical Constraints
- No subagents for MCP. The Agent tool cannot access MCP tools — all MCP calls must be made from the main context.
- Scan every resource. No sampling or early stopping.
- Time format: ISO 8601 UTC with explicit from/to. NEVER use timespan (it causes errors).
- Safe interval: Always use PT1H; it works for all Cosmos DB metrics. PT6H is NOT supported. DataUsage, IndexUsage, and DocumentCount do NOT support P1D.
- Parallelism cap: 30 concurrent MCP calls per batch. Reduce to 4-5 if rate-limited.
- Result too large: Save to temp file and parse outside the context window. Prefer node -e "..." if installed; otherwise fall back to python -c "...", jq, or pwsh -Command "...". Bash permission for the chosen interpreter will be prompted on first use.
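The parallelism cap above can be sketched as a simple chunking helper. This is only an illustration: the function name, the constants, and the account names are hypothetical, and the rate-limited size of 5 is one choice from the 4-5 range stated above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

DEFAULT_BATCH_SIZE = 30      # parallelism cap from the constraint above
RATE_LIMITED_BATCH_SIZE = 5  # upper end of the 4-5 fallback range


def batches(items: Sequence[T], rate_limited: bool = False) -> Iterator[List[T]]:
    """Yield consecutive slices no larger than the active cap."""
    size = RATE_LIMITED_BATCH_SIZE if rate_limited else DEFAULT_BATCH_SIZE
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


accounts = [f"account-{i}" for i in range(65)]  # hypothetical account names
normal_sizes = [len(b) for b in batches(accounts)]
throttled_sizes = [len(b) for b in batches(accounts, rate_limited=True)]
print(normal_sizes)          # batch sizes under the normal cap
print(len(throttled_sizes))  # number of batches once rate-limited
```

Each yielded batch would be issued as one round of concurrent MCP calls, with the next batch starting only after the previous one finishes.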
Progress Tracking
Update checkboxes as you complete each phase:
Configuration
If Config shows NOT_CONFIGURED: Run First-Run Setup at the bottom of this file, then return here.
If Config is populated: Extract the datasource UID and subscription ID from the pre-loaded Runtime Context above and use them for all queries. Use $1 as the subscription override if provided.
- Datasource UID: from ## Azure Monitor Datasource > UID
- Subscription ID: from ## Subscription (or $1 if provided)
- Resource Type: microsoft.documentdb/databaseaccounts (lowercase) with kind == 'MongoDB'
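As a rough sketch of the extraction step, the config file can be read section by section. The parser below assumes the layout written by First-Run Setup (a "## Azure Monitor Datasource" section with a UID bullet and a "## Subscription" section with one bullet per ID); the function names and the sample subscription ID are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def parse_config(text: str) -> Tuple[Optional[str], List[str]]:
    """Return (datasource UID, subscription entries) from config.md text."""
    uid = None
    subscriptions = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            section = line[3:].strip()
        elif line.startswith("- "):
            item = line[2:].strip()
            if section == "Azure Monitor Datasource":
                match = re.match(r"\*\*UID\*\*:\s*(.+)", item)
                if match:
                    uid = match.group(1).strip()
            elif section == "Subscription":
                subscriptions.append(item)
    return uid, subscriptions


def resolve_subscriptions(configured: List[str], override: str = "") -> List[str]:
    """The $1 override, when provided, replaces the configured list."""
    return [override] if override else configured


sample = """# amg-check-cosmosdb-mongo-ru Configuration
## Azure Monitor Datasource
- **UID**: azure-monitor-oob
- **Name**: Azure Monitor
## Subscription
- 00000000-0000-0000-0000-000000000000
"""
uid, subs = parse_config(sample)
print(uid, resolve_subscriptions(subs))
```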
Time Range
Default: 7 days for metrics, 24 hours for logs. Override with $0 (e.g., 3d). Keep log queries to 1-2 days to avoid timeouts.
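A minimal sketch of turning the default or the $0 override into explicit from/to values follows. The accepted syntax (a number followed by d or h) is an assumption; only the 3d form appears above. The function name and the fixed timestamp are illustrative.

```python
import re
from datetime import datetime, timedelta, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
UNITS = {"d": "days", "h": "hours"}  # "h" is an assumed extension of the "3d" form


def window(now: datetime, override: str = "", default_days: int = 7):
    """Return (from, to) as ISO 8601 UTC strings ending at `now`."""
    span = timedelta(days=default_days)
    if override:
        match = re.fullmatch(r"(\d+)([dh])", override.strip().lower())
        if match is None:
            raise ValueError(f"unsupported time range: {override!r}")
        span = timedelta(**{UNITS[match.group(2)]: int(match.group(1))})
    return (now - span).strftime(ISO_FORMAT), now.strftime(ISO_FORMAT)


now = datetime(2025, 1, 10, 12, 0, 0, tzinfo=timezone.utc)  # fixed for the example
metrics_window = window(now)                  # default 7-day metrics window
override_window = window(now, "3d")           # override, as in the 3d example
logs_window = window(now, default_days=1)     # 24-hour default for logs
print(metrics_window, override_window, logs_window)
```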
Workflow
Phase 1a: Validate Datasource
Call amgmcp_datasource_list (no parameters). Find entry with type == "grafana-azure-monitor-datasource".
- Matches configured UID → proceed.
- Different UID → update memory/amg-check-cosmosdb-mongo-ru/config.md, warn user, use new UID.
- Not found → abort with error.
Phase 1b: Discover All Cosmos DB for MongoDB (RU) Accounts
azureMonitorDatasourceUid: {DATASOURCE_UID}
query: |
resources
| where type == 'microsoft.documentdb/databaseaccounts'
| where kind == 'MongoDB'
| project name, resourceGroup, location, subscriptionId, id, properties.provisioningState
| order by location asc, name asc
If the config specifies subscription IDs (not "all"), add | where subscriptionId in ('{ID1}', '{ID2}'). Derive region summary by counting accounts per location. Flag accounts not in "Succeeded" state. Stop if zero accounts found.
Why kind == 'MongoDB'? Filters for RU-based MongoDB API accounts. vCore-based MongoDB uses microsoft.documentdb/mongoclusters.
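The region summary and state check for Phase 1b could look roughly like the following. The row shape is an assumption: each row is taken to be a dict with name, location, and a provisioningState key, and the real column name returned for properties.provisioningState may differ. Account names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize(rows: List[dict]) -> Tuple[Dict[str, int], List[str]]:
    """Count accounts per region and list accounts not in 'Succeeded' state."""
    by_region = Counter(row["location"] for row in rows)
    not_succeeded = [
        row["name"] for row in rows if row.get("provisioningState") != "Succeeded"
    ]
    return dict(sorted(by_region.items())), not_succeeded


rows = [  # hypothetical Resource Graph rows
    {"name": "mongo-a", "location": "westus2", "provisioningState": "Succeeded"},
    {"name": "mongo-b", "location": "eastus", "provisioningState": "Updating"},
    {"name": "mongo-c", "location": "eastus", "provisioningState": "Succeeded"},
]
regions, flagged = summarize(rows)
print(regions, flagged)
if not rows:
    raise SystemExit("No Cosmos DB for MongoDB (RU) accounts found; stopping.")
```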
Phase 1c: Activity Log for Non-Succeeded Accounts
If any accounts are not in "Succeeded" state, query the activity log for up to 3 of them:
azureMonitorDatasourceUid: {DATASOURCE_UID}
scope: {account's full ARM resource ID}
startTime: now-3d
endTime: now
select: eventTimestamp,operationName,status,caller,subStatus
If the response exceeds 500 KB, retry with startTime: now-1d. Summarize: operations performed, caller type, success/in-progress status, likely cause.
Phase 2: Validate Available Metrics
Call amgmcp_query_resource_metric_definition on the first account from Phase 1. Confirm expected metrics exist. Run only once — definitions are the same across all accounts.
Phase 3: Tier 1 — Fleet-Wide Pulse Check
azureMonitorDatasourceUid: {DATASOURCE_UID}
pastDays: 7
scenarios: cosmosdb_mongo
Scans all accounts across 3 scenarios: cosmosdb_mongo_ru, cosmosdb_mongo_throttling, cosmosdb_mongo_availability.
Before moving to Phase 4, verify:
- scanSummary.totalResourcesScanned matches Phase 1 account count.
- All 3 scenarios show status: "completed" in scenarioResults.
- If errors is non-empty, retry affected scenarios individually.
- If >10% of accounts are missing, fall back to batched amgmcp_query_resource_metric for unscanned accounts.
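The verification checks above can be sketched as one function. The response shape is assumed, not confirmed: scenarioResults is treated as a list of objects with scenario and status keys, and errors as a list of objects that name the affected scenario. Only scanSummary.totalResourcesScanned, scenarioResults, status, and errors come from the text above.

```python
def verify_scan(result: dict, expected_accounts: int) -> dict:
    """Apply the pre-Phase-4 checks to a pulse-check response."""
    scanned = result.get("scanSummary", {}).get("totalResourcesScanned", 0)

    retry = []
    for entry in result.get("scenarioResults", []):
        if entry.get("status") != "completed":
            retry.append(entry["scenario"])
    for error in result.get("errors", []):
        name = error.get("scenario")  # assumed field name
        if name and name not in retry:
            retry.append(name)

    missing = max(expected_accounts - scanned, 0)
    missing_ratio = missing / expected_accounts if expected_accounts else 0.0
    return {
        "count_matches": scanned == expected_accounts,
        "retry_scenarios": retry,
        "use_batched_fallback": missing_ratio > 0.10,
    }


result = {  # hypothetical response
    "scanSummary": {"totalResourcesScanned": 17},
    "scenarioResults": [
        {"scenario": "cosmosdb_mongo_ru", "status": "completed"},
        {"scenario": "cosmosdb_mongo_throttling", "status": "failed"},
        {"scenario": "cosmosdb_mongo_availability", "status": "completed"},
    ],
    "errors": [{"scenario": "cosmosdb_mongo_throttling", "message": "timeout"}],
}
check = verify_scan(result, expected_accounts=20)
print(check)
```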
Accounts in the findings array are abnormal. Also flag any non-Succeeded accounts from Phase 1.
Note: Sustained-high detection (>50% for 6+ hours), RU spike pattern detection (>30pp jump in 1h), and latency analysis require hourly time-series data and are performed in Phase 4 on flagged accounts only.
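To make the two hourly patterns in the note concrete, here is a small sketch. It assumes one NormalizedRU percentage per hour (PT1H) with no gaps, reads "6+ hours" as six or more consecutive points above 50%, and reads a spike as an increase of more than 30 percentage points between consecutive hours. The function names and sample series are illustrative.

```python
from typing import List, Sequence


def longest_run_above(values: Sequence[float], threshold: float) -> int:
    """Length of the longest run of consecutive points above the threshold."""
    longest = current = 0
    for value in values:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def has_sustained_high(values: Sequence[float], threshold: float = 50.0,
                       min_hours: int = 6) -> bool:
    return longest_run_above(values, threshold) >= min_hours


def spike_hours(values: Sequence[float], jump: float = 30.0) -> List[int]:
    """Indices where the value rose by more than `jump` points in one hour."""
    return [i for i in range(1, len(values)) if values[i] - values[i - 1] > jump]


hourly_ru = [20, 22, 55, 60, 58, 62, 70, 65, 30, 25, 80, 40]  # hypothetical data
sustained = has_sustained_high(hourly_ru)
spikes = spike_hours(hourly_ru)
print(sustained, spikes)
```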
Phase 4: Tier 2 — Deep Metrics for Abnormal Accounts
Read reference/phase4-deep-metrics.md before starting Phase 4. It contains:
- Response size management (critical — fleet-wide PT1H queries exceed 500 KB)
- Fleet-wide triage strategy (when >50% accounts are flagged)
- Core and secondary metrics tables
- Batch strategy and correlation analysis patterns (use ultrathink)
Phase 5: Resource Logs for Abnormal Accounts
Read reference/phase5-resource-logs.md before starting Phase 5. It contains:
- 5 KQL query templates: throttling, high latency, request volume, top RU operations, error codes
- Fallback table guidance (CDBDataPlaneRequests if CDBMongoRequests is empty)
Output
Present the report using the structure in reference/output-format.md.
Classification:
| Severity | Criteria |
|---|---|
| CRITICAL | NormalizedRU = 100% sustained, OR ServiceAvailability < 99.9%, OR latency avg > 50ms |
| HIGH | NormalizedRU max 85-100% with frequent spikes, OR ReplicationLatency > 1000ms |
| WARNING | NormalizedRU max 70-85% sustained, OR sustained RU > 50% for 6h+, OR RU spike >30pp in 1h, OR ServiceAvailability < 99.99%, OR latency avg > 10ms, OR ReplicationLatency > 100ms |
| MODERATE | NormalizedRU max 50-70% |
| HEALTHY | All metrics within normal ranges (NormalizedRU < 50%) |
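One possible reading of the classification table as code is sketched below. Several points are assumptions rather than rules from the table: the field names, treating "sustained" and "frequent spikes" as precomputed flags, assigning the 85% boundary to HIGH, and falling back to MODERATE for any max at or above 50% that no higher row matches.

```python
from dataclasses import dataclass


@dataclass
class AccountMetrics:
    ru_max: float                          # NormalizedRU max, percent
    availability: float = 100.0            # ServiceAvailability, percent
    latency_avg_ms: float = 0.0
    replication_latency_ms: float = 0.0
    ru_pinned_at_100: bool = False         # NormalizedRU = 100% sustained
    ru_sustained: bool = False             # 70-85% band held over time
    frequent_spikes: bool = False
    sustained_over_50_for_6h: bool = False
    spike_over_30pp_in_1h: bool = False


def classify(m: AccountMetrics) -> str:
    """Return the first matching severity, checked from most to least severe."""
    if m.ru_pinned_at_100 or m.availability < 99.9 or m.latency_avg_ms > 50:
        return "CRITICAL"
    if (m.ru_max >= 85 and m.frequent_spikes) or m.replication_latency_ms > 1000:
        return "HIGH"
    if ((70 <= m.ru_max < 85 and m.ru_sustained)
            or m.sustained_over_50_for_6h
            or m.spike_over_30pp_in_1h
            or m.availability < 99.99
            or m.latency_avg_ms > 10
            or m.replication_latency_ms > 100):
        return "WARNING"
    if m.ru_max >= 50:  # assumed fallback for cases the table leaves open
        return "MODERATE"
    return "HEALTHY"


print(classify(AccountMetrics(ru_max=92, frequent_spikes=True)))
print(classify(AccountMetrics(ru_max=30, availability=99.95)))
```

Checking rows from most to least severe means an account that matches several rows is reported at its highest severity.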
Update Known Issues
After presenting findings, update memory/amg-check-cosmosdb-mongo-ru/report.md:
- Read the current file.
- Rebuild the Resource Inventory table at the end: every account, full ARM ID, region, subscription, state. Group by region, sorted alphabetically.
- Update existing bug status from today's telemetry (resolved / improving / worsening / still active).
- Add new bugs with: severity, account name, region, metric evidence, log evidence, root cause, recommended action.
- Update the "Updated" date header.
Only add genuine issues: sustained throttling, availability drops, high latency patterns, or replication problems. Skip transient single-hour spikes or expected maintenance windows.
Error Handling
See reference/error-handling.md for the full recovery table.
Analysis Guidance
Reference
- Cosmos DB resource type: microsoft.documentdb/databaseaccounts (kind: MongoDB)
- vCore resource type (different): microsoft.documentdb/mongoclusters
- Latency metrics: ServerSideLatencyDirect and ServerSideLatencyGateway (the old ServerSideLatency is deprecated)
- Resource log tables: CDBMongoRequests (primary), CDBDataPlaneRequests (fallback)
- Key error codes: 429 / 16500 (throttling), 50 (server error), 13 (unauthorized)
- Safe metric interval: PT1H for all metrics (PT6H NOT supported)
- Known issues: memory/amg-check-cosmosdb-mongo-ru/report.md
- User config: memory/amg-check-cosmosdb-mongo-ru/config.md
First-Run Setup
Run only when Config shows NOT_CONFIGURED. After completing, return to the Workflow above.
1. Discover Datasource UID: Call amgmcp_datasource_list. Filter type == "grafana-azure-monitor-datasource". Prefer uid == "azure-monitor-oob" if multiple match. Abort if zero match.
2. Discover Subscription ID: Run this Resource Graph query to list all subscriptions that contain Cosmos DB for MongoDB (RU) accounts:
resources
| where type == 'microsoft.documentdb/databaseaccounts'
| where kind == 'MongoDB'
| join kind=inner (
resourcecontainers
| where type == 'microsoft.resources/subscriptions'
| project subscriptionId, subscriptionName=name
) on subscriptionId
| summarize AccountCount=count() by subscriptionId, subscriptionName
| order by AccountCount desc
Present the results as a table with columns: Subscription Name, Subscription ID, Account Count. Then ask the user: "Which subscription ID(s) should I configure for this health check? Or type 'all' to scan all subscriptions."
3. Write config: Write memory/amg-check-cosmosdb-mongo-ru/config.md:
# amg-check-cosmosdb-mongo-ru Configuration
User-specific values for the Cosmos DB for MongoDB (RU) health check skill.
This file is auto-generated on first run and can be edited manually.
## Azure Monitor Datasource
- **UID**: {discovered_uid}
- **Name**: {discovered_name}
## Subscription
- {subscription_id_or_"all"}
4. Confirm: Show the resolved config and ask for confirmation before proceeding.