Install
openclaw skills install agent-causal
A causal decision and audit tool for AI agents. Evaluate product changes using A/B testing (frequentist and Bayesian), Difference-in-Differences, cohort/segment analysis, and sequential early stopping.
Source: https://github.com/ZhuMorris/agent-causal-decision-tool
Agent Causal Decision Tool helps you and your AI agents answer one question from experiment data: "should we ship this change, keep running the test, or roll it back?" It takes in simple A/B or rollout summaries and returns a structured JSON decision, key statistics, and an audit record you can store or review later.
Rather than being a full experimentation platform, it is a decision engine. You bring the data (from your logs, BI tool, or CSV); it handles the stats, decision logic, and audit trail.
In many teams, experiment decisions happen in ad hoc spreadsheets or dashboards. People glance at lift, argue about whether the sample size is enough, and sometimes ship features based on noisy or biased results. Agents make this worse if they are wired to react to any small uplift they see.
This tool wraps a few standard methods into one consistent, agent‑friendly interface:
- Easy mode (`decide`): no need to know which statistical method to use. Paste your numbers and it auto-selects A/B, Bayesian, DiD, or planning from your input fields.

The goal is not to replace your analytics stack, but to give agents a small, reliable decision block they can call inside workflows.
Use this tool whenever you or your agents have experiment or rollout results and need a decision you can defend:
- Not sure which method applies? Let `decide` auto-detect from your numbers.

This skill runs as a local CLI tool only. After one-time setup, analysis requires no network access.
Setup (one-time, before first use):
# Download the release tarball — no git clone needed
curl -sL https://github.com/ZhuMorris/agent-causal-decision-tool/archive/refs/tags/v0.10.2.tar.gz -o agent-causal.tar.gz
tar -xzf agent-causal.tar.gz
pip install agent-causal-decision-tool-0.10.2/ -q
After installation, the agent-causal command is available locally. The tool performs only local statistical calculations.
When network access occurs: The PostHog connector (agent-causal connect posthog) makes outbound HTTPS requests to your PostHog instance only when explicitly invoked — never automatically. If you do not use the connector, no outbound network access is needed at any point.
No runtime network access during analysis. The decision engine, audit, and cohort analysis do not make outbound requests.
Tools used: exec (for running the agent-causal CLI commands you specify). Commands are fully hardcoded with no user-supplied strings interpolated into shell execution.
PostHog token scope: Use a read-only API token with minimal scopes. Do not use tokens with write or admin permissions.
Credential handling: PostHog API credentials are read from env vars (POSTHOG_API_KEY/POSTHOG_PROJECT_ID) or a local ~/.posthogrc file — never hardcoded or logged.
Alternatively, install from a git clone (one-time). The examples below that run cd ~/clawd/agent-causal-decision-tool assume this location:
git clone https://github.com/ZhuMorris/agent-causal-decision-tool.git ~/clawd/agent-causal-decision-tool
pip install ~/clawd/agent-causal-decision-tool -q
After this, agent-causal is available as a local command. No further network access is required.
For AI agent integrations, Agent Causal exposes a JSON-RPC 2.0 API over both stdio and HTTP.
python -m src.api stdio
python -m src.api http --port 8000
| Action | Description |
|---|---|
| `decide` | Easy-mode dispatcher: auto-selects A/B, Bayesian, DiD, or planning from your input fields |
| `decide_ab` | Frequentist A/B test (mode: frequentist) or Bayesian (mode: bayesian) |
| `decide_rollout` | DiD for staged rollouts / quasi-experiments |
| `plan_test` | Experiment planning (sample size, MDE, feasibility) |
| `audit_result` | Full audit of a stored result by ID |
| `save_result` | Persist a decision result to SQLite history |
| `get_result` | Retrieve a stored result by ID |
| `compare_results` | Compare multiple stored experiments |
| `connect` | Fetch experiment data from an external connector (e.g. PostHog) |
Request format:
{
"jsonrpc": "2.0",
"method": "decide_ab",
"params": {
"mode": "frequentist",
"input": {
"control_conversions": 100,
"control_total": 5000,
"variant_conversions": 130,
"variant_total": 5000
}
},
"id": 1
}
Response format:
{
"decision": "ship|keep_running|reject|escalate",
"recommended_next_action": "Deploy variant — statistical significance achieved with positive lift.",
"selected_method": "ab_test|bayesian_ab|did|planning",
"selection_reason": "Why this method was chosen",
"confidence": "high|medium|low",
"effect_summary": "Estimated lift: +30.00% (positive)",
"warnings": [{"code": "LOW_TRAFFIC", "message": "...", "severity": "info"}],
"limitations": ["No multiple testing correction applied"],
"audit_summary": "ab_test: Decision",
"source_metadata": {"connector": "langsmith", "dataset_id": "ds-001"},
"internal_result": { ... }
}
Error format:
{
"code": "VALIDATION_ERROR",
"message": "Invalid A/B test inputs",
"data": {
"details": [{"field": "control_total", "issue": "must be >= 1"}],
"request_id": null
}
}
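A minimal Python sketch of how an agent might wrap the request and error shapes shown above. The helper names `build_request`, `unwrap`, and `RpcError` are hypothetical, and the sketch assumes the decision arrives under the standard JSON-RPC 2.0 `result` key and the error object under `error`; the tool's actual transport details are not documented here.

```python
import json
from itertools import count

_ids = count(1)  # simple increasing request ids


def build_request(method: str, params: dict) -> str:
    """Serialize a JSON-RPC 2.0 request for one of the listed actions."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": next(_ids),
    })


class RpcError(Exception):
    """Raised when a response carries an error object instead of a result."""


def unwrap(raw_response: str) -> dict:
    """Return the result payload, or raise RpcError with field-level details."""
    response = json.loads(raw_response)
    if "error" in response:  # assumption: standard JSON-RPC 2.0 envelope
        err = response["error"]
        details = (err.get("data") or {}).get("details", [])
        raise RpcError(f"{err['code']}: {err['message']} {details}")
    return response["result"]


request = build_request("decide_ab", {
    "mode": "frequentist",
    "input": {
        "control_conversions": 100,
        "control_total": 5000,
        "variant_conversions": 130,
        "variant_total": 5000,
    },
})
print(request)
```

The string returned by `build_request` can be written to the stdio server or sent as an HTTP body; raising on the error object keeps an agent from acting on a response that never contained a decision.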
Easy mode (`decide`)
Don't know which method you need? `decide` auto-detects from your input fields, so there is no need to pick the right command:
# A/B test (auto-detected)
PYTHONPATH=. python3 -m src.cli decide --control 100/5000 --variant 130/5000
PYTHONPATH=. python3 -m src.cli decide --control 100/5000 --variant 130/5000 --format text
# Bayesian A/B (--bayesian flag)
PYTHONPATH=. python3 -m src.cli decide --control 100/5000 --variant 130/5000 --bayesian
# DiD / staged rollout (auto-detected from pre/post treated fields)
PYTHONPATH=. python3 -m src.cli decide --pre-control 1000 --post-control 1200 --pre-treated 200 --post-treated 280
# Experiment planning (auto-detected from --baseline + --mde)
PYTHONPATH=. python3 -m src.cli decide --baseline 0.05 --mde 10 --traffic 10000
Auto-detection matrix:
| You provide... | It runs... |
|---|---|
| `--control` + `--variant` | Frequentist A/B |
| `--control` + `--variant` + `--bayesian` | Bayesian A/B |
| `--pre-control` + `--post-control` + `--pre-treated` + `--post-treated` | DiD (Difference-in-Differences) |
| `--baseline` + `--mde` | Experiment planning |
JSON-RPC: {"jsonrpc":"2.0","method":"decide","params":{...fields...},"id":"1"} — same auto-detection over stdio or HTTP.
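The auto-detection matrix can be read as a small dispatch function. The following is an illustrative sketch only: `select_method` is a hypothetical name, the field names are the CLI flags with underscores, and the precedence order (DiD first, then A/B, then planning) is an assumption rather than the tool's documented behavior.

```python
def select_method(fields: dict) -> str:
    """Pick a method from which input fields are present (mirrors the matrix)."""
    did_fields = {"pre_control", "post_control", "pre_treated", "post_treated"}
    present = {name for name, value in fields.items() if value is not None}

    if did_fields <= present:
        return "did"
    if {"control", "variant"} <= present:
        # The --bayesian flag switches the A/B analysis to the Bayesian model.
        return "bayesian_ab" if fields.get("bayesian") else "ab_test"
    if {"baseline", "mde"} <= present:
        return "planning"
    raise ValueError(f"Cannot infer a method from fields: {sorted(present)}")


print(select_method({"control": "100/5000", "variant": "130/5000"}))
print(select_method({"baseline": 0.05, "mde": 10, "traffic": 10000}))
```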
Estimate required sample size, duration, and feasibility before running an experiment:
cd ~/clawd/agent-causal-decision-tool
PYTHONPATH=. python3 -m src.cli plan --baseline 0.02 --mde 5 --traffic 5000
Parameters:
- `--baseline` (required): Baseline conversion rate (e.g., 0.02 for 2%)
- `--mde` (required): Minimum detectable effect as % lift (e.g., 5 for 5% lift)
- `--traffic` (required): Daily traffic per arm
- `--confidence` (default 0.95): Confidence level
- `--power` (default 0.8): Statistical power
- `--allocation`: equal (default) or custom
- `--allocation-ratio`: Custom ratio when allocation=custom (e.g., 0.3/0.7)
- `--format`: json (default) or text

Planning output:
{
"mode": "planning",
"recommendation": {
"decision": "feasible|slow|not_recommended",
"confidence": "high|medium|low",
"summary": "..."
},
"planning": {
"required_sample_per_arm": 182934,
"total_required": 365868,
"estimated_days": 36.6,
"feasibility": "slow",
"allocation_used": {"control": 0.5, "variant": 0.5}
},
"warnings": [...]
}
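As a rough illustration of how `estimated_days` and `feasibility` in the output above relate, the sketch below divides the required sample per arm by daily traffic per arm and applies the documented thresholds (≤14 days feasible, 15–60 days slow, >60 days not recommended). The function name is hypothetical, and treating durations between 14 and 15 days as slow is an assumption; the tool's own sample-size formula is not reproduced here.

```python
def classify_feasibility(required_per_arm: int, daily_traffic_per_arm: int) -> tuple[float, str]:
    """Estimate run time in days and map it to a feasibility label."""
    days = required_per_arm / daily_traffic_per_arm
    if days <= 14:
        label = "feasible"
    elif days <= 60:
        label = "slow"  # assumption: anything above 14 and up to 60 days
    else:
        label = "not_recommended"
    return round(days, 1), label


# Values taken from the planning output example: 182934 per arm at 5000 per day.
print(classify_feasibility(182934, 5000))
```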
Feasibility thresholds:
- `feasible`: ≤14 days
- `slow`: 15–60 days
- `not_recommended`: >60 days

cd ~/clawd/agent-causal-decision-tool
PYTHONPATH=. python3 -m src.cli ab --control 100/5000 --variant 130/5000
Parameters:
- `--control`: Control group conversions/total (e.g., 100/5000)
- `--variant`: Variant group conversions/total (e.g., 130/5000)
- `--name`: Variant name (optional, default: variant_1)
- `--format`: Output format, json (default) or text

Sequential / Early Stopping (optional):
- `--sequential`/`--no-sequential`: Enable sequential early stopping evaluation
- `--experiment-start`, `--experiment-end`: ISO 8601 timestamps for runtime calculation
- `--min-runtime-days` (default 7): Minimum days before early stop is considered
- `--min-sample-per-arm` (default 2000): Minimum sample per arm before early stop
- `--early-stop-p` (default 0.01): p-value threshold for early stop
- `--max-runtime-days`: Hard cap; escalates if exceeded without strong result

Trigger logic: Both min-runtime AND min-sample-per-arm must be met, AND the p-value must be below `--early-stop-p`. Exceeding max runtime always escalates.
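The trigger logic can be sketched as a single function. This is an illustration under stated assumptions, not the tool's implementation: the function name and the return labels `stop_early` and `continue` are hypothetical, and the sketch follows the flag description in escalating only when the hard cap is exceeded without a strong result.

```python
from datetime import datetime
from typing import Optional


def early_stop_status(
    p_value: float,
    min_arm_size: int,
    start: str,
    end: str,
    min_runtime_days: float = 7,
    min_sample_per_arm: int = 2000,
    early_stop_p: float = 0.01,
    max_runtime_days: Optional[float] = None,
) -> str:
    """Apply the documented early-stop conditions to one interim look."""
    elapsed = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    runtime_days = elapsed.total_seconds() / 86400

    runtime_ok = runtime_days >= min_runtime_days
    sample_ok = min_arm_size >= min_sample_per_arm
    strong_result = p_value < early_stop_p

    if runtime_ok and sample_ok and strong_result:
        return "stop_early"
    if max_runtime_days is not None and runtime_days > max_runtime_days:
        return "escalate"
    return "continue"


print(early_stop_status(0.004, 5000, "2026-04-01T00:00:00+00:00", "2026-04-10T00:00:00+00:00"))
```

Requiring both minimums before looking at the p-value guards against stopping on an early, noisy spike.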
PYTHONPATH=. python3 -m src.cli ab --control 100/5000 --variant 130/5000
Example Output:
{
"schema_version": "0.8.0",
"mode": "ab_test",
"recommendation": {
"decision": "ship",
"confidence": "medium",
"summary": "Variant performs 30.00% better (p=0.0454). Ship it.",
"primary_metricLift": 30.0,
"p_value": 0.045361
},
"statistics": {
"control_rate": 0.02,
"variant_rate": 0.026,
"relative_lift_pct": 30.0,
"z_score": 2.0013,
"p_value": 0.045361,
"lift_ci_95": [0.000124, 0.011876],
"relative_lift_ci_95": [0.619, 59.381]
},
"traffic_stats": {
"control_size": 5000,
"variant_size": 5000,
"total_size": 10000
},
"warnings": [],
"next_steps": ["Deploy variant", "Monitor over time for regression"],
"audit": {
"decision_path": [
{"step": "Input validation", "passed": true},
{"step": "Traffic check", "passed": true},
{"step": "Conversion rate calculation", "passed": true},
{"step": "Statistical significance test", "passed": true},
{"step": "Effect size check", "passed": true},
{"step": "Decision", "passed": true}
]
}
}
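The `statistics` block above is consistent with a pooled two-proportion z-test. The sketch below shows that calculation using only the standard library; it is an independent illustration, not the tool's source, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(c_conv: int, c_total: int, v_conv: int, v_total: int) -> dict:
    """Pooled two-proportion z-test with a two-sided p-value."""
    p_c = c_conv / c_total
    p_v = v_conv / v_total
    pooled = (c_conv + v_conv) / (c_total + v_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / c_total + 1 / v_total))
    z = (p_v - p_c) / se
    # Two-sided tail probability of a standard normal, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        "relative_lift_pct": (p_v - p_c) / p_c * 100,
        "z_score": z,
        "p_value": p_value,
    }


stats = two_proportion_z_test(100, 5000, 130, 5000)
print(round(stats["z_score"], 4), round(stats["p_value"], 6))
```

With 100/5000 versus 130/5000 the result sits just under the 0.05 threshold, which is why the example reports medium rather than high confidence.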
cd ~/clawd/agent-causal-decision-tool
PYTHONPATH=. python3 -m src.cli bayes --control 100/5000 --variant 130/5000
Uses a Beta-Binomial conjugate model with a Jeffreys prior.
Parameters:
- `--control`, `--variant`: Conversions/total (same as `ab`)
- `--name`: Variant name
- `--format`: json (default) or text
- `--samples`: Monte Carlo samples (default: 20000)

Example output:
{
"schema_version": "0.8.0",
"timestamp": "2026-05-06T13:00:00.000Z",
"mode": "bayesian_ab",
"recommendation": {
"decision": "ship",
"confidence": "medium",
"summary": "Variant wins with P(better)=0.976. Median lift=30.10%. Ship.",
"primary_metricLift": 30.10,
"p_value": 0.9758
},
"statistics": {
"control_rate_observed": 0.0200,
"variant_rate_observed": 0.0260,
"relative_lift_pct": 30.00,
"posterior_control": {"alpha": 100.5, "beta": 4900.5, "mean": 0.0201},
"posterior_variant": {"alpha": 130.5, "beta": 4870.5, "mean": 0.0261},
"p_variant_wins": 0.9758,
"p_control_wins": 0.0239,
"p_tie": 0.0003,
"lift_median_pct": 30.10,
"lift_95ci_pct": [0.20, 69.15],
"expected_lift_hdi_95": [0.0004, 0.0014],
"relative_lift_hdi_95": [2.00, 70.00],
"monte_carlo_samples": 20000,
"prior_used": {"alpha": 0.5, "beta": 0.5, "type": "Jeffreys"}
},
"traffic_stats": {
"control_size": 5000,
"variant_size": 5000,
"total_size": 10000
},
"warnings": [],
"next_steps": ["Deploy variant", "Monitor for regression"],
"audit": {
"experiment_type": "bayesian_ab",
"thresholds_applied": {"ship": 0.95, "reject": 0.05},
"assumptions": ["Independent observations between groups", "No selection bias in group assignment", "Jeffrey's prior is appropriate for conversion rates"],
"limitations": ["Monte Carlo simulation has finite sampling error", "No multiple testing correction applied"]
},
"inputs": {
"control_conversions": 100,
"control_total": 5000,
"variant_conversions": 130,
"variant_total": 5000
}
}
When calling the tool from Python, results are Pydantic models: access fields by attribute (e.g. result.recommendation.decision), convert to a dict with .model_dump(), or serialize with .model_dump_json().
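To make the Bayesian output more concrete, here is a minimal sketch of a Beta-Binomial Monte Carlo comparison with a Jeffreys prior (alpha = beta = 0.5), using only the standard library. The function name and the fixed seed are assumptions for illustration; the tool's own sampler may differ, so the estimate will only approximately match `p_variant_wins` in the example output.

```python
import random


def p_variant_wins(c_conv, c_total, v_conv, v_total, samples=20000, prior=(0.5, 0.5), seed=0):
    """Estimate P(variant rate > control rate) from Beta posteriors."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    a0, b0 = prior
    # Conjugate update: add successes to alpha and failures to beta.
    a_c, b_c = a0 + c_conv, b0 + (c_total - c_conv)
    a_v, b_v = a0 + v_conv, b0 + (v_total - v_conv)
    wins = sum(
        rng.betavariate(a_v, b_v) > rng.betavariate(a_c, b_c)
        for _ in range(samples)
    )
    return wins / samples


estimate = p_variant_wins(100, 5000, 130, 5000)
print(f"P(variant better) is roughly {estimate:.3f}")
```

Comparing the estimate against the 0.95 ship threshold listed under `thresholds_applied` gives the same kind of decision the example output reports.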
When to use Bayesian vs Frequentist:
cd ~/clawd/agent-causal-decision-tool
PYTHONPATH=. python3 -m src.cli did --pre-control 1000 --post-control 1100 --pre-treated 900 --post-treated 1150
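The point estimate behind a Difference-in-Differences run like the command above can be sketched in a few lines. This is an illustration only: the function name is hypothetical, expressing the effect relative to the treated group's pre-period value is an assumption, and the bootstrap confidence interval the tool computes is not reproduced.

```python
def did_estimate(pre_control: float, post_control: float,
                 pre_treated: float, post_treated: float) -> dict:
    """Difference-in-Differences: treated change minus control change."""
    control_change = post_control - pre_control
    treated_change = post_treated - pre_treated
    effect = treated_change - control_change
    return {
        "control_change": control_change,
        "treated_change": treated_change,
        "effect": effect,
        # Assumption: lift is expressed against the treated pre-period baseline.
        "relative_effect_pct": effect / pre_treated * 100,
    }


print(did_estimate(1000, 1100, 900, 1150))
```

Subtracting the control group's change removes the shared trend, so only the movement specific to the treated group is attributed to the rollout.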
Parameters:
- `--pre-control`: Control group metric before treatment
- `--post-control`: Control group metric after treatment
- `--pre-treated`: Treated group metric before treatment
- `--post-treated`: Treated group metric after treatment
- `--n-bootstrap` (default 2000, range 500–10000): Number of bootstrap resamples for DiD CI

When an aggregate A/B or DiD result is inconclusive, break down results by user segment to find hidden signals:
cd ~/clawd/agent-causal-decision-tool
PYTHONPATH=. python3 -m src.cli cohort-breakdown --file segments.json
Input format (JSON):
{
"experiment_id": "checkout-v3",
"metric": "conversion_rate",
"prior_result_id": "dec_20260501_001",
"prior_decision": "wait",
"segments": [
{
"segment_name": "new_users",
"segment_definition_note": "Users registered within last 30 days",
"control_conversions": 21,
"control_total": 1000,
"variant_conversions": 67,
"variant_total": 1000
},
{
"segment_name": "returning_users",
"segment_definition_note": "Users registered more than 30 days ago",
"control_conversions": 220,
"control_total": 4000,
"variant_conversions": 228,
"variant_total": 4000
}
]
}
Input format (CSV alternative):
segment_name,segment_definition_note,arm,conversions,total
new_users,Users registered within last 30 days,control,21,1000
new_users,Users registered within last 30 days,variant,67,1000
returning_users,Users registered more than 30 days ago,control,220,4000
returning_users,Users registered more than 30 days ago,variant,228,4000
Parameters:
- `--file`: Path to JSON or CSV segment file
- `--json`: JSON input string (alternative to `--file`)
- `--format`: Output format, json (default) or text
- `--save`: Save result to experiment history

Multiple comparison correction:
- `--method bonferroni`

Example output:
{
"method": "experiment_cohort_breakdown",
"cohort_decision_override": true,
"cohort_override_reason": "Strong positive signal in 'new_users' (lift=219.0%, adj-p=0.0000) contradicts aggregate decision 'wait'",
"interaction_flag": false,
"segments": [
{
"segment_name": "new_users",
"control_rate": 0.021,
"variant_rate": 0.067,
"relative_lift_pct": 219.05,
"p_value_raw": 0.0000,
"p_value_adjusted": 0.0000,
"decision": "strongly_positive",
"priority_rank": 1
}
],
"priority_ranking": [
{"rank": 1, "segment": "new_users", "rationale": "Strong positive effect (lift=219.1%, adj-p=0.0000). Highest priority."}
],
"summary": "new_users drives the effect. 1 segment(s) positive.",
"recommended_next_action": "targeted_rollout"
}
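A small sketch of per-segment testing with a Bonferroni correction, using the two segments from the input example. It assumes a pooled two-proportion z-test per segment, which this document does not confirm as the tool's per-segment method; the function names are hypothetical.

```python
import math

segments = [
    {"segment_name": "new_users", "control_conversions": 21, "control_total": 1000,
     "variant_conversions": 67, "variant_total": 1000},
    {"segment_name": "returning_users", "control_conversions": 220, "control_total": 4000,
     "variant_conversions": 228, "variant_total": 4000},
]


def segment_stats(seg: dict) -> tuple[float, float]:
    """Return (relative lift in %, raw two-sided p-value) for one segment."""
    p_c = seg["control_conversions"] / seg["control_total"]
    p_v = seg["variant_conversions"] / seg["variant_total"]
    pooled = (seg["control_conversions"] + seg["variant_conversions"]) / (
        seg["control_total"] + seg["variant_total"])
    se = math.sqrt(pooled * (1 - pooled) * (1 / seg["control_total"] + 1 / seg["variant_total"]))
    z = (p_v - p_c) / se
    return (p_v - p_c) / p_c * 100, math.erfc(abs(z) / math.sqrt(2))


def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


results = [segment_stats(s) for s in segments]
adjusted = bonferroni([p for _, p in results])
for seg, (lift, raw), adj in zip(segments, results, adjusted):
    print(f"{seg['segment_name']}: lift={lift:.2f}% raw_p={raw:.4f} adj_p={adj:.4f}")
```

The correction matters because testing several segments raises the chance that at least one looks significant by luck alone.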
When to use cohort breakdown:
- The aggregate decision is `keep_running` or `escalate`: segment analysis may reveal a hidden signal

Key features:
Reconstruct and explain a previous decision:
# Save result to file
PYTHONPATH=. python3 -m src.cli ab --control 100/5000 --variant 130/5000 > /tmp/result.json
# Audit it (human-readable)
PYTHONPATH=. python3 -m src.cli audit /tmp/result.json --format text
# Audit with experiment maturity assessment
PYTHONPATH=. python3 -m src.cli audit /tmp/result.json --maturity
Maturity assessment (with --maturity flag):
- `mature` (≥90), `adequate` (≥70), `immature` (≥50), `inadequate` (<50)

Example audit output:
-- DECISION PATH --
1. Input validation [✓]
control_total: 5000, variant_total: 5000
2. Traffic check [✓]
control_size: 5000, min_required: 1000
3. Conversion rate calculation [✓]
control_rate: 0.02, variant_rate: 0.026
4. Statistical significance test [✓]
p_value: 0.045361, alpha: 0.05
5. Effect size check [✓]
lift_pct: 30.0, threshold: 1
6. Decision [✓]
decision: ship, confidence: medium
-- FINAL DECISION --
Decision: SHIP
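The maturity tiers reported with `--maturity` map a score to a label. A minimal sketch, assuming the score is numeric and that each tier runs up to the next boundary; how the tool derives the score is not described in this document, and the function name is hypothetical.

```python
def maturity_tier(score: float) -> str:
    """Map a maturity score to the documented tier labels."""
    if score >= 90:
        return "mature"
    if score >= 70:
        return "adequate"
    if score >= 50:
        return "immature"
    return "inadequate"


for score in (95, 72, 50, 31):
    print(score, maturity_tier(score))
```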
All commands support --save to persist results to local SQLite history:
# Run and save in one step
PYTHONPATH=. python3 -m src.cli ab --control 100/5000 --variant 130/5000 --save
PYTHONPATH=. python3 -m src.cli did --pre-control 1000 --post-control 1100 --pre-treated 900 --post-treated 1150 --save
PYTHONPATH=. python3 -m src.cli plan --baseline 0.02 --mde 5 --traffic 5000 --save
History commands:
# List recent experiments
PYTHONPATH=. python3 -m src.cli history
PYTHONPATH=. python3 -m src.cli history --mode ab_test --limit 10
# Compare multiple experiments by ID
PYTHONPATH=. python3 -m src.cli compare 1 2 3
# Save a prior JSON result file to history
PYTHONPATH=. python3 -m src.cli save /tmp/result.json --name "checkout-v3-test"
History output example:
ID  Date        Mode     Decision  Lift   P-value  Summary
--------------------------------------------------------------------
3   2026-04-30  did      ship      16.67  -        Treatment effect is 150.00...
2   2026-04-30  ab_test  escalate  6.25   0.6947   Results not conclusive...
1   2026-04-30  ab_test  ship      30.00  0.0454   Variant performs 30.00%...
Compare output example:
EXPERIMENT COMPARISON
==================================================
Experiments compared: 3
Summary by decision:
SHIP: 2 experiment(s)
ESCALATE: 1 experiment(s)
Summary by mode:
ab_test: 2 experiment(s)
did: 1 experiment(s)
Lift summary: max=30.00%, min=6.25%, avg=17.64% (3 experiments)
Attention needed: 2 experiments recommend ship. Review if they test the same metric.
Persistence:
- Stored in `~/.agent-causal/history.db`
- Modes: `ab_test`, `did`, `planning`

| Decision | Meaning | When |
|---|---|---|
| `ship` | Deploy variant | p < 0.05 AND positive lift |
| `keep_running` | Continue experiment | p < 0.3, trending positive |
| `reject` | Do not deploy | p < 0.05 AND negative lift |
| `escalate` | Needs human review | Not conclusive or critical warnings |
| `targeted_rollout` | Ship to specific segment only | Strong signal in one segment, aggregate inconclusive |
| `full_rollout` | Ship to all users | All segments positive |
| `abandon_segment` | Do not ship to specific segment | Strong negative in one segment despite aggregate ship |
| `confirm_rejection` | Confirm abandonment | All segments negative |
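The first four rows of the decision table can be read as a simple rule. The sketch below is a simplified reading of that table, not the tool's full logic: the function name and the `critical_warnings` flag are hypothetical, and the traffic checks and confidence grading are left out.

```python
def basic_decision(p_value: float, relative_lift_pct: float,
                   critical_warnings: bool = False) -> str:
    """Simplified mapping from p-value and lift to a decision label."""
    if critical_warnings:
        return "escalate"
    if p_value < 0.05 and relative_lift_pct > 0:
        return "ship"
    if p_value < 0.05 and relative_lift_pct < 0:
        return "reject"
    if p_value < 0.3 and relative_lift_pct > 0:
        return "keep_running"  # trending positive but not yet significant
    return "escalate"


print(basic_decision(0.0454, 30.0))
print(basic_decision(0.6947, 6.25))
```

The two calls use the p-values and lifts from the history output example, where the same experiments were recorded as `ship` and `escalate`.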
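import os
import sys

# Python does not expand "~" in sys.path entries, so expand it explicitly.
sys.path.insert(0, os.path.expanduser('~/clawd/agent-causal-decision-tool'))

from src.ab_test import calculate_ab

# Run a frequentist A/B analysis on raw conversion counts.
result = calculate_ab({
    "control_conversions": 100,
    "control_total": 5000,
    "variant_conversions": 130,
    "variant_total": 5000
})

if result.recommendation.decision == "ship":
    # Deploy variant
    pass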
The tool exposes a versioned schema contract for agent consumption:
cd ~/clawd/agent-causal-decision-tool
PYTHONPATH=. python3 -m src.cli schema
This prints schema.json — a wrapper containing schema_version, schema_coverage (ab, did, plan, bayes), schema_coverage_pending (cohort), severity_contract, and definitions (JSON Schema from Pydantic models).
All output models include a schema_version field that is injected from package metadata and never hardcoded.
~/clawd/agent-causal-decision-tool/
Fetch experiment data directly from external sources. The connect action normalizes external data into the internal experiment schema before running a decision.
# Health check (validates credentials, no data fetched)
PYTHONPATH=. python3 -m src.cli connect posthog --dry-run
# Fetch experiment and print normalized data
PYTHONPATH=. python3 -m src.cli connect posthog --experiment-id <id>
# Fetch and run through decision workflow automatically
PYTHONPATH=. python3 -m src.cli connect posthog --experiment-id <id> --decide
# JSON-RPC call
{"jsonrpc":"2.0","method":"connect","params":{"source":"posthog","experiment_id":"<id>"},"id":"1"}
Environment / config:
- `POSTHOG_API_KEY` + `POSTHOG_PROJECT_ID` env vars, OR
- `~/.posthogrc` with `api_key`, `project_id`, `instance_url` fields

Connector result schema:
{
"data": { "control_conversions": 120, "control_total": 5000, "variant_conversions": 145, "variant_total": 5000 },
"source_metadata": { "connector": "posthog", "experiment_id": "...", "fetch_timestamp": "..." },
"warnings": []
}
Errors:
- `INSUFFICIENT_DATA`: experiment found but missing required fields
- `ConnectorAuthError`: invalid/missing API key
- `ConnectorNotFoundError`: experiment not found
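The `data` object in the connector result carries the same four fields that `decide_ab` takes as input, so a connector result can be turned into a decision request by hand. A minimal sketch, assuming the connector result has already been parsed into a dict; the function name is hypothetical, and raising `ValueError` for missing fields is an illustrative stand-in for the tool's `INSUFFICIENT_DATA` error.

```python
REQUIRED_FIELDS = (
    "control_conversions",
    "control_total",
    "variant_conversions",
    "variant_total",
)


def connector_to_request(connector_result: dict, request_id: int = 1) -> dict:
    """Build a decide_ab JSON-RPC request from a normalized connector result."""
    data = connector_result.get("data") or {}
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"INSUFFICIENT_DATA: missing {missing}")
    return {
        "jsonrpc": "2.0",
        "method": "decide_ab",
        "params": {
            "mode": "frequentist",
            "input": {field: data[field] for field in REQUIRED_FIELDS},
        },
        "id": request_id,
    }


connector_result = {
    "data": {"control_conversions": 120, "control_total": 5000,
             "variant_conversions": 145, "variant_total": 5000},
    "source_metadata": {"connector": "posthog"},
    "warnings": [],
}
print(connector_to_request(connector_result))
```

In normal use the `--decide` flag covers this step; the manual form is mainly useful when an agent wants to inspect `warnings` before requesting a decision.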