Install
openclaw skills install ab-interpreter
Interpret A/B test results for ecommerce campaigns and pages by checking statistical significance and practical effect size, then recommending a next step.
Most ecommerce A/B tests get called too early, too late, or on the wrong metric, and the team ships whichever variant "looks like" it won without a clean read on whether the lift was real, large enough to matter, or durable past the novelty window. This skill turns raw test results into a disciplined go/no-go verdict covering statistical significance, practical effect size, and segment checks, plus a prescribed next step, so every test actually moves the business.
| Check | Strong signal | Acceptable | Weak / Redesign |
|---|---|---|---|
| Statistical significance | p < 0.05, two-tailed, sample powered to 80% | p between 0.05 and 0.10 with large sample | p > 0.10 or sample underpowered |
| Practical effect size | Lift exceeds MDE and the lower bound of the 95% CI is positive | Lift exceeds MDE but CI barely crosses zero | Lift below MDE even if "significant" |
| Test duration | Ran across at least two full weekly cycles | Ran 10 to 14 days, no major anomalies | Under 7 days or overlaps a holiday/promo |
| Sample ratio | Observed split within 1% of planned | Split within 2% | Split off by more than 2% (sample ratio mismatch, SRM): suspect tracking |
| Guardrail metrics | All guardrails flat or positive | One guardrail flat, one minor regression | Any revenue, refund, or AOV guardrail regressed |
| Segment stability | Lift positive across all key segments | Lift positive in 3 of 4 segments | Contradicting segments (new vs returning, device, geo) |
| Novelty risk | Lift holds in final week of test | Lift decays mildly over time | Lift front-loaded and decays past week two |
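
The sample ratio row of the table can be sketched as a small helper. This is a minimal illustration, not part of the skill itself: the function name `grade_sample_ratio` is hypothetical, and it assumes "within 1%" means percentage points of traffic share relative to the planned split, which the table does not spell out.

```python
def grade_sample_ratio(n_control, n_variant, planned_control_share=0.5):
    """Grade the observed traffic split against the planned split.

    Assumption: deviation is measured in percentage points of the
    control arm's share of total traffic.
    """
    total = n_control + n_variant
    observed_share = n_control / total
    deviation_pp = abs(observed_share - planned_control_share) * 100

    if deviation_pp <= 1.0:
        return "strong"
    if deviation_pp <= 2.0:
        return "acceptable"
    return "weak"  # likely sample ratio mismatch; check tracking first


# Illustrative counts only, not taken from a real test.
print(grade_sample_ratio(50_100, 49_900))
print(grade_sample_ratio(53_000, 47_000))
```

A chi-square goodness-of-fit test on the arm counts is a common, more formal SRM check; the fixed thresholds here simply mirror the table.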
Inputs. Hypothesis: "Buy Now" outperforms "Add to Cart" on the product page. 14-day test, 50/50 split, 48,200 visitors each arm. Control conversions: 1,446 (3.00%). Variant conversions: 1,640 (3.40%). MDE set at 0.25 percentage points. Guardrails: refund rate and AOV.
Analysis. Two-proportion z-test gives z ≈ 3.55 and p ≈ 0.0004 (two-tailed), with a 95% CI on the lift of +0.18 to +0.62 pp. Observed lift of 0.40 pp exceeds the MDE of 0.25 pp. Sample split was 50.1 / 49.9. New and returning visitors both show directional lift. Refund rate flat. AOV down 1.2% (not significant; CI crosses zero). Week-one lift 0.45 pp, week-two lift 0.35 pp: mild decay, within tolerance.
Verdict. Ship. The result is statistically and practically significant; guardrails are intact. Next step: monitor AOV weekly for four weeks post-rollout and re-test "Buy Now" against a bundled offer CTA in Q3.
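
The analysis step in the example above can be reproduced from the raw counts. The following is a minimal sketch using only the Python standard library. The function name `two_proportion_ztest` is hypothetical, and it assumes a pooled standard error for the test and an unpooled (Wald) standard error for the confidence interval, which is one common convention rather than a method the skill prescribes.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (lift, z, two-tailed p, (ci_low, ci_high)) for rate_b - rate_a."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    lift = rate_b - rate_a

    # Pooled standard error for the hypothesis test (H0: rates are equal).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = lift / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the lift.
    se_unpooled = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (lift - z_crit * se_unpooled, lift + z_crit * se_unpooled)
    return lift, z, p_value, ci


# Counts from the "Buy Now" vs "Add to Cart" example.
lift, z, p_value, (ci_low, ci_high) = two_proportion_ztest(1446, 48200, 1640, 48200)
print(f"lift = {lift * 100:.2f} pp, z = {z:.2f}, p = {p_value:.4f}")
print(f"95% CI = {ci_low * 100:+.2f} to {ci_high * 100:+.2f} pp")
```

Computing the interval from counts rather than copying it from a dashboard makes it easy to confirm that the reported p-value and CI agree with each other.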
Inputs. Two subject lines sent to 8,000 recipients each. Opens: A = 1,440 (18.0%), B = 1,560 (19.5%). MDE of 2 percentage points was required to justify the winner for list-wide rollout.
Analysis. Two-proportion z-test gives p ≈ 0.015, nominally significant. Observed lift of 1.5 pp is below the 2.0 pp MDE. The 95% CI on the lift is +0.3 to +2.7 pp, straddling the MDE. No segment breakdown provided.
Verdict. Do not ship yet. The result is statistically significant but falls short of the practical threshold the team pre-committed to. Extend the test or re-run with a larger list to narrow the CI. Do not rationalize the "win" just because p < 0.05 — the pre-committed MDE exists precisely to prevent that.
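
The pre-committed MDE rule behind this verdict can be written down as a simple gate. This is a sketch under stated assumptions: `mde_gate` and its return labels are hypothetical names, a lift exactly equal to the MDE is treated as clearing it, and guardrails, segments, and novelty checks are deliberately left out.

```python
def mde_gate(significant: bool, lift_pp: float, mde_pp: float) -> str:
    """Apply a pre-committed MDE gate to a test result.

    significant: whether the two-tailed p-value is below the alpha
                 chosen before the test started.
    lift_pp:     observed lift in percentage points (variant - control).
    mde_pp:      minimum detectable effect the team committed to up front.
    """
    if not significant:
        return "inconclusive"
    if lift_pp <= 0:
        return "do_not_ship"  # significant, but the variant lost
    if lift_pp < mde_pp:
        return "extend_or_rerun"  # real but too small to justify rollout
    return "ship_candidate"  # still subject to guardrail and segment checks


# Subject-line example: significant, 1.5 pp lift against a 2.0 pp MDE.
print(mde_gate(True, 1.5, 2.0))
# CTA example: significant, 0.40 pp lift against a 0.25 pp MDE.
print(mde_gate(True, 0.40, 0.25))
```

Keeping the MDE as an explicit argument makes it harder to move the threshold after the results are in.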
- references/output-template.md: The four-block readout structure
- references/statistical-tests.md: Choosing the right test for the metric type
- references/segment-playbook.md: Standard segments to cut and what each contradiction signals
- assets/ab-readout-checklist.md: Pre-flight checklist for every readout you deliver