Install
openclaw skills install bookforge-growth-experiment-prioritization-scorer

Score and rank a growth experiment backlog using the ICE framework (Impact, Confidence, Ease — each rated 1–10 and averaged), then select the top experiments for the next sprint. The output is a structured scored backlog the team can bring directly to their weekly growth review meeting and begin acting on.
ICE — Impact, Confidence, Ease — was developed at GrowthHackers.com as a way to convert a pile of unordered experiment ideas into a ranked, actionable queue. The scoring is deliberately simple: three dimensions, a 1–10 scale each, averaged to one number. The value is not precision — it is shared calibration. A team that scores ideas against a common rubric before the meeting arrives at a defensible ranking in minutes rather than spending the full meeting debating which idea to try first.
Use this skill when:
- The team has run high-tempo-experiment-cycle to install a weekly cadence and needs to score the initial backlog before the first growth meeting

Prerequisites before using this skill:

- A confirmed North Star metric (from north-star-metric-selector or from the team's existing documentation). Impact scoring is meaningless without a shared reference metric.

Before scoring, collect two inputs.
Input 1 — Experiment backlog file
Ask the user: "Please share the path to your experiment-backlog.md file, or paste the backlog content directly." Read the file. Each idea in the backlog should have:

- A short, descriptive name
- A one-sentence hypothesis (what is expected to change, and why)
- An effort estimate, if available
If ideas are missing a hypothesis, flag them. You can score ideas without an effort estimate (Ease will be estimated), but you cannot score ideas without a hypothesis — without knowing what the experiment is supposed to change, Impact and Confidence scores are fabricated.
Input 2 — North Star metric
Ask: "What is your team's current North Star metric and focus area (e.g., 'weekly revenue per active user, focused on activation this sprint')?" If the user ran north-star-metric-selector, pull the NSM from north-star-recommendation.md. If unavailable, ask directly. Do not proceed to scoring without this — Impact scores are defined relative to the NSM.
Read experiment-backlog.md in full. Count the ideas. Note:
Why this step: Scoring a vague idea produces a fake ICE score. An idea named "improve checkout" with no hypothesis could mean anything — a copy tweak (days of work) or a full redesign (weeks). Flagging incomplete ideas before scoring forces the submitter to sharpen the idea, which is itself valuable and usually surfaces the key assumption the experiment is testing.
State the confirmed NSM explicitly: "NSM for this sprint: [metric name], defined as [precise definition with time dimension]."
Why this step: Impact scores are always relative to the NSM. "10" means "expected to move the NSM meaningfully in the next sprint." Without this anchor, a "9" from one team member means something completely different from a "9" from another — the scores are not comparable, and the ranking is not useful.
For each idea, assign an Impact score using this rubric:
| Score | Meaning |
|---|---|
| 9–10 | Expected to produce a large, direct movement in the NSM. The mechanism is clear and well-grounded. |
| 7–8 | Expected to produce a meaningful, measurable improvement in an NSM input metric. Direct NSM effect is plausible but not certain. |
| 5–6 | Expected to improve a metric adjacent to the NSM. Effect may be indirect or conditional. |
| 3–4 | Narrow improvement — affects a small user segment or a metric weakly connected to the NSM. |
| 1–2 | No credible mechanism to move the NSM or any meaningful input metric. May produce a local improvement that does not compound. |
Key calibration rule: a low-impact score is not a reason to discard an idea — it is a reason to pair it with high Ease or reconsider what it is actually testing. A quick, low-cost idea with modest impact still belongs in the queue; it just runs after the high-impact work.
Why this step: Impact calibration is the most common ICE failure mode. Teams inflate impact scores ("everything is an 8!") because they want their ideas to win, not because they have a clear mechanism for NSM movement. Anchoring to the NSM rubric forces the question: "What specifically will change in the NSM if this test wins?"
For each idea, assign a Confidence score using this evidence ladder:
| Score | Evidence Base |
|---|---|
| 9–10 | Strong prior evidence: internal data showing user behavior consistent with the hypothesis, plus a published case study or industry benchmark from a comparable company. Direct iteration of a past winning experiment at this company ("doubling down"). |
| 7–8 | Solid evidence: internal behavioral data supports the hypothesis, or a well-documented case study from a comparable product type. Team has run a related (not identical) experiment with positive results. |
| 5–6 | Moderate evidence: user survey data or interview data suggests the problem exists. No direct experimental evidence, but logic is sound and the mechanism is plausible. |
| 3–4 | Weak evidence: hypothesis is based primarily on team intuition or general industry patterns. No internal data support. May be correct, but the submitter is guessing. |
| 1–2 | Pure conjecture: no data, no precedent, no user evidence. The idea is speculative. May still be worth testing if Ease is high. |
Why this step: Confidence rewards epistemic honesty. A submitter who scores their own idea at 9 on confidence is claiming they have strong evidence — that is a testable claim the growth lead can check. Teams that cannot differentiate a data-backed hypothesis from a gut feeling are systematically over-investing in weak ideas. The evidence ladder gives submitters a shared vocabulary for calibrating uncertainty.
For each idea, assign an Ease score using this rubric:
| Score | Effort Level |
|---|---|
| 9–10 | Less than 1 day, single person, no dependencies. Copy change, image swap, minor UI adjustment. Can be shipped and measured within the current sprint. |
| 7–8 | 1–3 days, one or two people, minimal coordination. Small feature flag, minor backend change, email copy variant. |
| 5–6 | 3–7 days, involves a second function (e.g., engineering + design). Can probably be launched in the current sprint if started immediately. |
| 3–4 | 1–2 weeks, cross-team coordination required (e.g., engineering + product + legal review). Target the sprint after this one. |
| 1–2 | More than 2 weeks, significant product investment or cross-functional coordination. Requires a scheduled target date, not a sprint slot. |
Penalty rule for cross-team dependencies: if an experiment requires sign-off or execution from a team that is not on the growth team (e.g., a feature that requires a full product sprint, or an email campaign that requires legal review), reduce the Ease score by at least 2 points. Cross-team dependencies predictably slow experiments beyond the initial estimate.
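The penalty rule can be applied mechanically before averaging. A minimal sketch (the function name and signature are illustrative, not part of the skill's interface):

```python
def apply_dependency_penalty(ease: int, has_cross_team_dependency: bool,
                             penalty: int = 2) -> int:
    """Reduce Ease by at least `penalty` points when an experiment needs
    sign-off or execution from a team outside the growth team.
    Scores are clamped so they stay on the 1-10 scale."""
    if has_cross_team_dependency:
        ease = max(1, ease - penalty)
    return ease
```

For example, an email campaign estimated at Ease 7 that requires legal review drops to 5 — moving it from "1–3 days" territory into the band where a second function is involved, which matches how these experiments actually play out.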
Why this step: Ease provides the cycle's pacing mechanism. Without it, teams fill the sprint with high-impact, low-ease experiments that never finish and crowd out the learning the cycle needs to compound. Ease also surfaces low-hanging fruit — quick tests that might surprise the team, the way a newsletter form relocation at GrowthHackers produced a 700% lift despite a low initial impact estimate.
For each idea: ICE Score = (Impact + Confidence + Ease) / 3. Round to one decimal place.
Sort all ideas by ICE score, highest to lowest. This is the ranked backlog.
Why this step: Averaging — not summing — means the three dimensions are equally weighted and no single dimension can dominate. A 10/10/1 idea scores 7.0; a 7/7/7 idea also scores 7.0. The parity is intentional: an extremely high-impact idea that is nearly impossible to ship should not outrank a solid, executable idea just because the impact score is higher.
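The scoring and ranking steps above can be sketched in a few lines. This is a minimal illustration of the averaging rule, not the skill's implementation; the `Experiment` class and `rank` function are assumed names:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    impact: int      # 1-10, anchored to the North Star metric
    confidence: int  # 1-10, per the evidence ladder
    ease: int        # 1-10, after any cross-team dependency penalty

    @property
    def ice(self) -> float:
        # Average, not sum: all three dimensions weigh equally,
        # so a 10/10/1 idea and a 7/7/7 idea both score 7.0.
        return round((self.impact + self.confidence + self.ease) / 3, 1)

def rank(backlog: list[Experiment]) -> list[Experiment]:
    # Highest ICE first; Python's sort is stable, so ties keep
    # their original submission order.
    return sorted(backlog, key=lambda e: e.ice, reverse=True)
```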
Produce a table in this format:
| Rank | Experiment Name | Impact | Confidence | Ease | ICE Avg | Status |
|------|-----------------|--------|------------|------|---------|--------|
| 1 | [Name] | 8 | 7 | 9 | 8.0 | Launch now |
| 2 | [Name] | 9 | 8 | 4 | 7.0 | Pipeline: [date] |
| ... | ... | ... | ... | ... | ... | ... |
After ranking, apply the triage:
Launch now — Ideas the team should select for the current sprint. Criteria: ICE average high enough to justify the team's time relative to other ideas in the queue, and Ease score high enough to be completable within one sprint (Ease ≥ 5 as a default threshold, adjustable by the growth lead). The number of "launch now" ideas should match team capacity — a 4-person team running 2 experiments per week selects 2–4 ideas; a 12-person team running 8 per week selects 8–12.
Pipeline with target date — Ideas that are worth running but cannot launch in the current sprint. This includes:

- High-ICE ideas whose Ease score is too low to complete within one sprint
- Ideas blocked on a cross-team dependency or on infrastructure that is not yet in place
For each pipeline idea, assign a target sprint date using input from the relevant team (engineers estimate product work; marketers estimate channel experiments). Record the date in the scored backlog.
Drop — Ideas with ICE average below a threshold (default: ≤ 3.5) that lack a clear mechanism for NSM movement and have no strong evidence base. Before dropping, verify: is the low score due to a poorly written hypothesis that could be improved? If so, return it to the submitter with feedback. If the idea is genuinely low-value, mark it as dropped with a one-sentence reason.
Why this step: The pipeline is the team's most important asset. An empty pipeline means the weekly meeting runs out of candidates. But a pipeline filled with vague, undifferentiated ideas is equally useless — it creates the illusion of a full queue while the real work is selecting from noise. The three-bucket split forces explicit decisions rather than allowing ideas to accumulate indefinitely without a commitment.
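The three-bucket split can be expressed as a mechanical first pass; the thresholds below are the defaults stated above, and the function name is illustrative. The growth lead still overrides the output for team capacity and for ideas with no credible NSM mechanism (as in the mascot example later in this document):

```python
def triage(ice: float, ease: int,
           launch_ease_min: int = 5, drop_max: float = 3.5) -> str:
    """First-pass triage using the default thresholds from the rubric.
    The growth lead adjusts for capacity and hypothesis quality."""
    if ice <= drop_max:
        return "Drop"          # after checking the hypothesis could not be improved
    if ease >= launch_ease_min:
        return "Launch now"    # capped at team capacity in practice
    return "Pipeline"          # assign a target sprint date with the relevant team
```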
Write experiment-scored-backlog.md with the following structure:
Impact is always relative to the North Star metric. There is no such thing as an impact score without a reference metric. An idea that scores 9 on impact for a retention-focused sprint would score 4 in an acquisition-focused sprint. Before scoring, confirm the NSM and the current focus area — every Impact score is anchored there.
Confidence rewards evidence over enthusiasm. Submitters naturally overestimate confidence in their own ideas. The evidence ladder exists to make the claim testable: "You scored this a 9 — show me the internal data and the case study you're referencing." If the submitter cannot produce evidence matching the ladder, the score should come down.
Ease penalizes cross-team dependencies because they predictably slow the cycle. An experiment that requires sign-off from a team outside the growth team almost always takes longer than estimated. The 2-point penalty is a calibration correction for the team's natural optimism about coordination costs.
ICE is deliberately blunt — the goal is a relative ranking, not an absolute forecast. A score of 7.3 vs. 7.1 is not meaningful. A score of 8.5 vs. 5.2 is. The system's value is in separating the top tier from the middle tier from the bottom tier, not in creating a precise numerical ranking. When two ideas are close, the growth lead uses judgment — the score is the starting point for the selection conversation, not the final word.
Run the scoring before the meeting, not in it. ICE scoring done synchronously in the weekly growth meeting burns the entire meeting on a task that can be done asynchronously. The scored backlog is prep work. The meeting uses the ranked list to make the selection decision in 15 minutes.
The lowest-scoring ideas can still be the biggest winners. The 700% newsletter lift came from a form relocation that scored 4 on Impact. Score informs selection order, not selection cutoff. If an idea is quick to run and the downside of being wrong is minimal, run it — the cost of not learning is real.
A B2B SaaS team of 5 (growth PM, engineer, designer, marketer, data analyst) has 20 experiment ideas targeting activation — getting new trial users to the "aha moment" (≥3 team members using a shared project within 7 days). North Star: weekly collaborative sessions per team.
The scored backlog (abbreviated) looks like:
| Rank | Experiment | Impact | Confidence | Ease | ICE | Status |
|---|---|---|---|---|---|---|
| 1 | Add progress bar to team setup wizard | 8 | 7 | 9 | 8.0 | Launch now |
| 2 | In-app prompt to invite 2nd team member at step 3 | 8 | 8 | 7 | 7.7 | Launch now |
| 3 | Email sequence: days 1, 3, 7 post-signup | 7 | 8 | 6 | 7.0 | Launch now |
| 4 | Rebuild onboarding checklist UI | 9 | 6 | 3 | 6.0 | Pipeline: next sprint |
| 5 | Integration with Slack for task notifications | 8 | 5 | 2 | 5.0 | Pipeline: 3 weeks (eng scoping) |
| ... | ... | ... | ... | ... | ... | ... |
| 19 | Add mascot to empty state screens | 3 | 4 | 8 | 5.0 | Drop |
| 20 | Redesign marketing site hero | 4 | 3 | 2 | 3.0 | Drop |
Sprint selection: The growth PM selects experiments 1, 2, and 3 for the sprint (matching the team's capacity of 2–3 tests per week). Experiment 4 is slotted for the following sprint pending design mockups. Experiment 5 goes to engineering for a scope estimate. Experiments 19 and 20 are dropped — neither has a clear mechanism to increase collaborative sessions, which is the current NSM.
A marketplace team of 12 (two growth squads: acquisition + retention) needs to groom 50 ideas for a retention sprint targeting 90-day repeat purchase rate. North Star: orders per buyer per quarter.
Scoring challenge: The retention squad notices that 15 of the 50 ideas are acquisition ideas submitted by the other squad — they score well on Impact but target a different focus area. The growth lead moves all 15 to a separate acquisition backlog without scoring them in the retention sprint.
The remaining 35 retention ideas are scored. The top 8 (ICE ≥ 6.5) are selected for the sprint — 4 per squad per week. Ideas ranked 9–20 (ICE 4.5–6.4) are pipelined with dates. Ideas 21–35 (ICE ≤ 4.4) are reviewed for hypothesis quality: 5 are returned to submitters for clarification; 10 are dropped.
Key finding from the scoring: Three ideas that the retention squad assumed were quick (scoring themselves 8 on Ease) were recalibrated to Ease 4 by the growth lead after consulting the engineering team — all three required A/B test infrastructure that was not yet in place. Moving them to the pipeline freed up sprint slots for two mid-ranked ideas (ICE 6.2) that were genuinely fast to ship. The sprint launched on schedule.
research/growth-experiment-prioritization-scorer.md — source passages from Chapter 4 on ICE scoring, evidence definitions, and pipeline management

references/ice-scoring-guide.md — worked scoring examples by experiment type

orchestration/specs/skill-spec.md — BookForge skill authoring standards

This skill is licensed under CC-BY-SA-4.0. Source: BookForge — Hacking Growth by Sean Ellis and Morgan Brown.
This skill is the Prioritize stage of the high-tempo experiment cycle and consumes the North Star metric produced upstream:
# Required upstream — NSM must be confirmed before scoring
clawhub install bookforge-north-star-metric-selector
# The cycle this skill plugs into — Prioritize is Stage 3
clawhub install bookforge-high-tempo-experiment-cycle
# Feeds activation experiment ideas into the backlog
clawhub install bookforge-activation-funnel-diagnostic
# Feeds retention experiment ideas into the backlog
clawhub install bookforge-retention-phase-intervention-selector
Browse more: bookforge-skills