Install

openclaw skills install bug-bash-facilitator

Plan, run, and debrief a structured pre-release bug bash session. Defines scope (areas, devices, user roles), invites the right participant mix (eng/PM/design/CS/sales), generates exploratory test charters, runs in-session severity triage, post-session de-duplication, fix prioritization, and the retro. Use when asked to organize a bug bash, run a pre-launch test sprint, set up a hackathon-style QA day, write test charters, triage bugs across teams, or build a bug intake form. Triggers on "bug bash", "bug hunt", "test sprint", "exploratory testing", "test charter", "session-based testing", "pre-release QA", "release readiness", "triage session", "go/no-go review".

Plan, run, and debrief a high-signal bug bash before a release. Acts as an experienced release manager / staff QA who has run dozens of these for SaaS products. Outputs the artifacts you need to actually ship: scope doc, charter board, intake form, triage rubric, prioritized fix list, retro notes.
Invoke this skill when a release is 1-2 weeks out and you want a coordinated, time-boxed exploratory test pass — not a substitute for unit/integration/E2E automation, but the layer that catches the things automation cannot.
Basic invocation:
"Plan a bug bash for our v3.2 release shipping next Friday"
"We launch onboarding redesign in 10 days, run a bug bash for it"
"Generate test charters for the billing area"
"Build the bug intake template for tomorrow's bash"
With context:
"12 engineers, 3 PMs, 2 designers, 2 CS, 1 sales — design the invite list"
"Here are 47 bugs filed during yesterday's bash, de-dupe and triage"
"We have 6 hours total on Tuesday — give me the run-of-show"
"Last bash had 80% noise, redesign the intake form"
The agent produces the eight runbook artifacts plus templates ready to paste into Slack, Notion, Linear, or Jira.
A bash without a scope doc devolves into "click around for an hour." The agent forces the scope doc to declare four axes before invites go out.
| Axis | What to declare | Example |
|---|---|---|
| Surfaces | Named features and screens. Exclude legacy. | New onboarding flow, billing dashboard, settings v2 — exclude admin console |
| Devices | OS + browser + form factor matrix | macOS Chrome, Windows Edge, iOS Safari, Android Chrome, iPad Safari |
| User roles | Personas to test as | Free trial, paid solo, paid team admin, paid team member, suspended |
| Out of scope | Explicit non-goals, pinned to stop scope creep | Performance, accessibility audit, copy review, integrations beyond Slack |
Anti-pattern: "Test the whole product." A 6-hour bash with 15 people on the whole product yields shallow coverage everywhere and depth nowhere. Pick 3-5 surfaces, exhaust them.
Scope doc template — declare goal, time box, in-scope surfaces, device matrix, roles with seeded test accounts, explicit out-of-scope items, severity rubric link, intake form link, and war room (Slack channel + always-on Zoom). Lock every section before invites go out — late additions destroy charter assignments.
The participant ratio decides what kinds of bugs get found. Engineers find race conditions and edge cases; CS and sales find "users will complain about this." A bash that's 100% engineers misses the entire UX layer.
Recommended mix for an 11-person bash:
| Role | Count | What they catch |
|---|---|---|
| Engineering | 5 | Edge cases, error states, concurrency, console errors, API misuse |
| Product Management | 2 | Spec drift, missing flows, requirements ambiguity |
| Design | 1 | Visual regressions, micro-interactions, copy tone |
| Customer Success / Support | 2 | "This will generate tickets" pattern recognition |
| Sales / Solutions | 1 | Demo-killer bugs, enterprise-customer landmines |
Scaling rules: for smaller orgs, the agent scales the mix proportionally (a code sketch follows these examples):
6-person bash: 3 eng / 1 PM / 1 design / 1 CS
11-person bash: 5 eng / 2 PM / 1 design / 2 CS / 1 sales
20-person bash: 9 eng / 3 PM / 2 design / 3 CS / 2 sales / 1 marketing
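The scaling is simple proportional arithmetic. A minimal sketch in Python, assuming the 11-person ratios as weights and largest-remainder rounding (function and role names are illustrative):

```python
from math import floor

# Baseline ratios from the 11-person mix: 5 eng / 2 PM / 1 design / 2 CS / 1 sales.
BASE_MIX = {"eng": 5, "pm": 2, "design": 1, "cs": 2, "sales": 1}

def scale_mix(headcount: int) -> dict[str, int]:
    """Scale the baseline mix to a target headcount."""
    total = sum(BASE_MIX.values())
    quotas = {role: headcount * n / total for role, n in BASE_MIX.items()}
    mix = {role: floor(q) for role, q in quotas.items()}
    # Hand leftover seats to the roles with the largest fractional parts.
    leftovers = sorted(quotas, key=lambda r: quotas[r] - mix[r], reverse=True)
    for role in leftovers[: headcount - sum(mix.values())]:
        mix[role] += 1
    return mix

print(scale_mix(6))  # {'eng': 3, 'pm': 1, 'design': 1, 'cs': 1, 'sales': 0}
```

The 6-person output matches the 3 eng / 1 PM / 1 design / 1 CS mix above; sales drops out first because it carries the smallest weight.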
A charter is a 1-2 sentence mission for a 60-90 minute exploratory session. It frames what to explore without scripting steps. Charters outperform scripted test cases for finding novel bugs because they preserve human creativity inside a bounded mission.
Charter formula (Cem Kaner / James Bach session-based testing):
Explore [target]
with [resources]
to discover [information]
Sample charters by SaaS surface:
AUTH: Explore signup, login, password reset with multiple browsers,
+alias emails, and an account that already exists, to discover
state confusion, lockouts, concurrent-session races, and email
deliverability failures.
BILLING: Explore plan upgrade/downgrade/seat changes with trial-to-paid,
annual-to-monthly, prorated mid-cycle upgrade, and card decline
mid-checkout, to discover charge correctness, invoice line-item drift,
webhook idempotency failures, and access-rights timing bugs.
DASHBOARD: Explore with empty, partial, and large datasets (0, 5, 5000
rows) under slow network, stale tokens, and tab-switching, to discover
loading-state regressions, infinite spinners, data freshness, and
pagination edge cases.
ONBOARDING: Explore as first-time user on desktop and mobile, plus a
returning user who abandoned at step 3 yesterday, to discover step
skipping, back-button breakage, missing validation, and persistence
bugs.
MOBILE: Explore fresh-install iOS with Wi-Fi off mid-session, push
disabled, Face ID disabled, and 30+ min backgrounded, to discover
offline handling, notification opt-in, session expiry, cold-start.
INTEGRATIONS: Connect/disconnect Slack, Google, Stripe webhooks with
revoked tokens, expired tokens, and rate-limited upstreams, to discover
error clarity, retry behavior, and orphaned data after disconnect.
PERMISSIONS: Explore as admin/member/viewer/suspended with role changes
during active session and cross-tab role drift, to discover privilege
escalation, stale UI, and 403 handling.
DATA EXPORT/IMPORT: CSV with empty, large, malformed files; non-ASCII;
CRLF vs LF; Excel-saved-as-CSV, to discover encoding bugs, parser
brittleness, and partial-failure reporting.
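Every sample above instantiates the same three slots, so charters can be generated from structured inputs. A minimal sketch (the `Charter` class and its fields are illustrative, not part of the skill):

```python
from dataclasses import dataclass

@dataclass
class Charter:
    target: str       # what to explore
    resources: str    # with what: tools, accounts, conditions
    information: str  # what kind of bugs we hope to surface

    def render(self) -> str:
        return (f"Explore {self.target} "
                f"with {self.resources} "
                f"to discover {self.information}.")

billing = Charter(
    target="plan upgrade/downgrade/seat changes",
    resources="trial-to-paid, prorated mid-cycle upgrade, and card decline mid-checkout",
    information="charge correctness, invoice drift, and webhook idempotency failures",
)
print(billing.render())
```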
Charter assignment: pair each charter with 1-2 testers and a target device. Two testers per charter is the sweet spot — one explores while the other watches and asks "what about…", which catches what a solo tester habituates past.
A bad intake form produces 50 bugs that all say "Submit button broken on Safari" with no repro. The agent enforces a strict template — required fields gate submission.
Intake form fields (all required unless noted):
Title: [Surface] — [Action] — [Result] — e.g. "Billing — Annual toggle — Price doesn't update"
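A minimal sketch of the required-field gating, assuming a dict-based intake record (field names are assumptions drawn from this template and the required-field rules later in this doc):

```python
# Fields the form refuses to submit without; attachment is conditionally required.
REQUIRED = ["title", "surface", "severity", "repro_steps", "charter_id"]

def validate_intake(bug: dict) -> list[str]:
    """Return the gating errors; an empty list means the bug may be filed."""
    errors = [f"missing required field: {f}" for f in REQUIRED if not bug.get(f)]
    # UI bugs must carry a screenshot or recording.
    if bug.get("is_ui_bug") and not bug.get("attachment"):
        errors.append("UI bugs require an attachment")
    return errors

bug = {"title": "Billing — Annual toggle — Price doesn't update",
       "surface": "Billing", "severity": "P2", "is_ui_bug": True}
print(validate_intake(bug))  # flags missing repro_steps, charter_id, attachment
```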
Triage in real time, not after. Stale bugs become stale priorities. The agent runs a 15-minute triage huddle every 60-90 minutes during the bash.

P0/P1/P2/P3 rubric with hard tests:
| Severity | Hard test | Examples |
|---|---|---|
| P0 | Affects all users OR involves money/data/security/legal. Blocks release, no exceptions. | Payment double-charge, data loss on save, auth bypass, GDPR/PII leak, total outage |
| P1 | Core happy path is broken for a non-trivial segment (>5% users or any paid tier). Blocks release unless explicit waiver. | Login fails on Safari, signup form rejects valid emails, dashboard blank for paid users |
| P2 | Workaround exists, affects narrow segment, or is recoverable. Ship-with allowed. | Tooltip cuts off, sort order wrong, edge case on iPad landscape |
| P3 | Cosmetic, polish, or enhancement. Backlog. | 1px off alignment, copy nit, missing animation |
The "would I tweet about this?" test: if a normal user would tweet a screenshot of this bug to complain, it's at least P1.
The "would Stripe/Slack ship this?" test: cosmetic regressions in core flows (typography in checkout, alignment in the inbox) are P1 for category-leading products even if functionally fine.
Triage huddle script (15 min):
0:00 Facilitator screen-shares the new bug list (filter: filed since last triage)
0:02 Round-robin — filer reads the title only, says "P0/P1/P2/P3"
0:08 Group challenges any P0/P1; facilitator assigns owner if confirmed
0:12 P2/P3 marked "post-bash review" — no debate now
0:14 Facilitator notes blockers and unblocks testers (test data, env, perms)
0:15 Back to charters
After the bash ends, the raw list is 30-80% noise: dupes, P3s, and "not bugs" (intentional behavior). The agent runs a structured de-dupe pass.
De-dupe rubric: group by surface, then by symptom. Within each cluster, keep the canonical report with the cleanest repro and link the rest as duplicates. Five or more reports of the same bug is signal in itself: escalate the cluster one severity level. Reports of intentional behavior ("not a bug") close as invalid.

Ranking pass, sort by:

1. Severity (P0 first)
2. Reach (duplicate count as a proxy)
3. Estimated effort (cheapest first)
The agent outputs a ranked CSV ready to import to Linear/Jira with severity, owner, reach, est-effort, and ship-decision columns.
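A minimal sketch of that de-dupe and ranking pass, assuming the raw CSV carries surface, symptom, severity, repro_steps, and est_effort_h columns (column names are illustrative):

```python
import csv
from collections import defaultdict

SEV_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}

def dedupe_and_rank(rows: list[dict]) -> list[dict]:
    # Cluster by (surface, symptom): the duplicate key.
    clusters = defaultdict(list)
    for row in rows:
        if row.get("resolution") == "not a bug":
            continue  # intentional behavior: close as invalid, exclude from ranking
        clusters[(row["surface"], row["symptom"])].append(row)

    canonical = []
    for group in clusters.values():
        # Canonical report: the one with the most complete repro steps.
        best = max(group, key=lambda r: len(r.get("repro_steps", "")))
        best["dupe_count"] = len(group)
        # 5+ duplicate reports is signal: escalate one severity level.
        if len(group) >= 5 and best["severity"] != "P0":
            best["severity"] = f"P{SEV_RANK[best['severity']] - 1}"
        canonical.append(best)

    # Rank: severity first, then reach (dupe count as proxy), then cheapest fix.
    canonical.sort(key=lambda r: (SEV_RANK[r["severity"]],
                                  -r["dupe_count"],
                                  float(r.get("est_effort_h") or 0)))
    return canonical

with open("bash_raw.csv") as f:
    ranked = dedupe_and_rank(list(csv.DictReader(f)))
with open("bash_ranked.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=ranked[0].keys())
    writer.writeheader()
    writer.writerows(ranked)
```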
Not every P1 gets fixed before ship. The agent applies explicit rules so the trade-offs are visible.
Decision matrix:
P0 → ALWAYS FIX before ship. No exceptions, no waivers.
P1 → Fix unless ALL three waiver conditions are met:
     a. Workaround documented and shippable in release notes
     b. Reach < 5% of the impacted user segment
     c. Fix is costly (>4h estimated; a cheap fix just gets fixed)
     If any condition fails: fix.
P2 → Fix if low-effort (≤2h) OR if it touches a marketed feature.
Otherwise: ship-with, file as fast-follow patch within 2 weeks.
P3 → Backlog. Review at next quarterly polish sprint.
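The matrix reduces to a small function. A sketch with the thresholds taken from the rules above (argument names are illustrative):

```python
def ship_decision(severity: str, *, workaround_documented: bool = False,
                  reach_pct: float = 0.0, est_effort_h: float = 0.0,
                  marketed_feature: bool = False) -> str:
    if severity == "P0":
        return "fix before ship"  # always, no waivers
    if severity == "P1":
        # Ship-with allowed only if ALL three waiver conditions hold.
        if workaround_documented and reach_pct < 5 and est_effort_h > 4:
            return "ship-with (waiver required)"
        return "fix before ship"
    if severity == "P2":
        if est_effort_h <= 2 or marketed_feature:
            return "fix before ship"
        return "ship-with (fast-follow within 2 weeks)"
    return "backlog"  # P3

print(ship_decision("P1", workaround_documented=True, reach_pct=3, est_effort_h=8))
# -> ship-with (waiver required)
```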
Fast-follow patch SLA:
A bash typically generates 3-8 P2 bugs that ship-with. Commit to a patch release within 2 weeks of the main release with all of them fixed. Without this commitment, P2s rot to P3 and accumulate as platform debt.
Waiver template (when shipping with a known P1):
WAIVER — [Bug ID] — [Title]
Severity: P1
Reach: [%] of [segment]
Workaround: [user-facing instruction in release notes]
Fix ETA: [date] in patch release [version]
Approver: [name] Date: [date]
Communicated to: [#support, #cs, status page]
The bash ends with a 30-minute retro — separate from the triage. Format: "Start / Stop / Continue" plus three numeric metrics.
Retro template:
# Bug Bash Retro — [Release] — [Date]
Headline metrics:
Bugs filed: [N]
After de-dupe: [N] (signal ratio: [%])
P0 found: [N] (released-with: [N])
P1 found: [N] (released-with: [N], waivers: [link])
Coverage: [surfaces tested fully / surfaces in scope]
Start (do this next time):
-
Stop (don't repeat):
-
Continue (worked well):
-
Charters that produced the most signal: [top 3]
Charters that produced the least signal: [bottom 3] — replace?
Tooling debt:
- (e.g. seed-data script broken, test accounts not enough,
intake form too slow)
People debt:
- (e.g. need a CS rep next time, designer was overbooked)
Action items (owner + date):
-
Calendar-anchored. Adapt spacing for shorter release windows but never drop a step.
T-7d 1. KICKOFF — lock scope doc with EM + PM, save-the-date,
identify scribe and backup facilitator
T-5d 2. INVITES — send invite (mix per Step 2), share scope doc +
severity rubric + intake link, request 15-min pre-read
T-3d 3. TEST DATA + ENV — seed accounts for every role, verify
staging mirrors prod, file a fake bug yourself end-to-end
T-1d 4. CHARTERS — generate 8-15 charters across surfaces × devices,
assign 2 testers per charter, post board, confirm war room
T-0 5. RUN OF SHOW (6h):
0:00 Kickoff 15m (scope, rules, severity, intake demo)
0:15 Round 1 charters (75m)
1:30 Triage huddle (15m)
1:45 Round 2 charters (75m, partner rotation)
3:00 Lunch (30m, keep filing)
3:30 Triage huddle (15m)
3:45 Round 3 charters (75m, focus on triage gaps)
5:00 Final triage (30m)
5:30 Retro (30m)
T+0pm 6. DE-DUPE + RANK — within 4h of end, output ranked CSV to EM
T+1d 7. FIX SPRINT — standup with owners, P0/P1 due dates assigned,
P2 fast-follow milestone created, waivers drafted for ship-with P1
T+1d 8. RETRO PUBLISH — send retro doc, file tooling/people debt as tickets
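Because every step is a fixed offset from bash day, the calendar dates fall out mechanically once T-0 is picked. A minimal sketch:

```python
from datetime import date, timedelta

# T-offsets in days relative to bash day (T-0), per the runbook above.
RUNBOOK = [("kickoff", -7), ("invites", -5), ("test data + env", -3),
           ("charters", -1), ("bash day", 0), ("de-dupe + rank", 0),
           ("fix sprint", 1), ("retro publish", 1)]

def schedule(bash_day: date) -> list[tuple[str, date]]:
    """Turn the T-offsets into concrete calendar dates."""
    return [(step, bash_day + timedelta(days=off)) for step, off in RUNBOOK]

for step, day in schedule(date(2025, 6, 10)):
    print(f"{day:%a %b %d}  {step}")
```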
Slack invite: announce date/timezone/duration, role on the matrix, link to scope doc and intake form, war room channel and always-on Zoom, what to bring (laptop + secondary device).
Notion charter board: columns are Charter ID, Surface, Charter sentence, Tester 1, Tester 2, Device, Status. One row per charter, kanban-style with Active/Done states.
Slack triage huddle ping: @channel triage huddle in 2 min — [N] new bugs since last. Filers ready to read title + propose severity.
Context: a 25-person seed-stage SaaS shipping a redesigned signup-to-first-value onboarding flow. Release in 7 days. Available: 9 people for one afternoon.
Scope: signup, email verification, plan selection, workspace creation, invite teammates, first-task creation. Out of scope: existing user upgrade flow, mobile app (separate bash next month).
Mix: 4 eng + 1 PM + 1 design + 2 CS + 1 sales = 9 people.
Charters generated (10 charters, 2 testers each, rotating across 3 rounds):
C-01 Signup flow: Chrome desktop, valid emails, +aliases
C-02 Signup flow: mobile Safari, deferred email verification
C-03 Workspace creation: name collisions, special chars, very long names
C-04 Invite teammates: email typos, already-member, expired invites
C-05 Plan selection: trial-to-paid mid-onboarding, card decline
C-06 First-task creation: empty state, large paste, abandoned mid-create
C-07 Browser back button across every step
C-08 Tab duplication mid-onboarding
C-09 Returning user who abandoned at step 3 yesterday
C-10 Slow network (DevTools throttle, "Slow 3G") through entire flow
Outcome: 31 bugs filed, 19 after de-dupe. 1 P0 (workspace creation race condition double-creates on slow network), 4 P1 (back-button data loss in 3 steps, Safari email validation), 8 P2, 6 P3. Released on schedule with the P0 fixed and 2 P1 waivers (Safari ones, fixed in patch 9 days later).
Context: bash just ended, raw intake CSV has 47 rows.
Agent groups by surface then symptom: Billing has a 5-dupe cluster on "annual toggle doesn't update price" — pick canonical with the cleanest repro, escalate P2 → P1 because 5+ reports is signal. A 3-dupe cluster on card-decline error stays P2 (workaround = retry). Onboarding has a 4-dupe cluster on back-button losing workspace name → P1 (data loss, however small). Singletons kept as-is unless tagged "not a bug" (close as invalid).
Outcome: 47 raw → 22 canonical → 1 P0, 5 P1, 9 P2, 6 P3, 1 invalid. Output is CSV with one row per canonical bug, dupes referenced in a child column.
The agent produces the eight runbook artifacts end to end: scope doc, invite plan, test-data checklist, charter board, run-of-show, ranked fix CSV, fix-sprint plan with waivers, and the retro doc.
"Last bash had 80% noise."
Tighten the intake form: required severity, required repro steps, required attachment for UI bugs, required charter ID. Add a dupe-check checkbox. Most noise comes from missing fields, not bad testers.
"We only have 2-3 hours, not a full day."
Cut to 2 charter rounds (45 min each) on the highest-risk surface only. Skip the formal retro; do a 10-min Slack-thread retro async. Coverage will be narrower but the signal-per-hour is the same.
"How do we get people to actually show up?"
Tie the bash to the release readiness review — exec sign-off requires bash participation by the on-call eng for the release. Make it a calendar block, not an "if you have time."
"Our bashes keep finding the same regressions every release."
Your automation pyramid is upside-down. Bashes should find novel exploration bugs, not regressions. File the recurring bugs as missing test coverage and require an automated test before the next release. The bash should not catch what unit + E2E tests should.
"What if testers disagree on severity?"
Use the rubric strictly — read the hard tests aloud. If two engineers disagree on P1 vs P2, default to P1 (the higher severity) and let the EM downgrade in writing afterwards. Debate async, not in the room.
"Does this work for remote teams?"
Yes — remote bashes match in-person on bug count if the war room is loud (Zoom always-on, Slack channel pinged on each filing, triage huddles on camera). They lose the side-conversation discoveries but gain device variety (everyone's home setup is different).