Toil Tracker

v1.0.0

Identify, measure, and reduce operational toil — repetitive manual work that scales linearly with service growth. Categorize toil by type, estimate engineeri...

License: MIT-0

Install

openclaw skills install toil-tracker

Find the manual work that's eating your engineering time. Toil is repetitive, automatable, tactical work that scales with service size and has no lasting value. Identify it, measure it, prioritize what to automate first, and track reduction over time.

Use when: "how much toil do we have", "what should we automate", "toil budget", "manual operational work", "repetitive tasks", "SRE toil reduction", or during quarterly planning to justify automation projects.

Commands

1. survey — Catalog Toil Sources

Step 1: Identify Toil Categories

Interview the team or analyze work tracking systems. Common toil categories:

| Category | Examples | Signal |
|----------|----------|--------|
| Deploys | Manual deploy steps, config changes, rollbacks | "Someone has to click..." |
| Tickets | Password resets, access requests, cert renewals | "Every week we get..." |
| Monitoring | False alerts, manual alert triage, dashboard watching | "We page about this but..." |
| Scaling | Manual capacity adjustments, resource provisioning | "When traffic spikes we..." |
| Data | Manual data fixes, migrations, backfills | "Users file tickets to..." |
| Maintenance | Dependency updates, cert rotations, key rotations | "Every quarter we have to..." |
| Onboarding | Setting up dev environments, granting access | "New hire setup takes..." |

Step 2: Quantify Each Toil Source

# Analyze ticket systems for repetitive patterns
# Jira/Linear: find recurring ticket types
# Example: count tickets by label/type in the last quarter

# Analyze on-call alerts for noise
# -g (--globoff) keeps curl from treating the [] in statuses[] as a glob range.
# limit=100 is the largest page size; this reads one page only, so page with
# offset when the response reports more results.
curl -sg "https://api.pagerduty.com/incidents?since=2026-01-01&until=2026-04-01&statuses[]=resolved&limit=100" \
  -H "Authorization: Token token=$PD_TOKEN" | python3 -c "
import json, sys, collections
incidents = json.load(sys.stdin)['incidents']
# Count resolved incidents per service; noisy services are toil candidates
by_service = collections.Counter(i['service']['summary'] for i in incidents)
print('Incidents by service (potential toil):')
for service, count in by_service.most_common(10):
    print(f'  {count:>4}x  {service}')
"

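The ticket-system comments above leave the counting step open. A minimal sketch of it, assuming tickets have already been exported into a list of dicts; the `labels` and `created` field names, the `count_recurring_labels` helper, and the sample data are hypothetical, and real Jira or Linear exports will use different field names:

```python
from collections import Counter
from datetime import date


def count_recurring_labels(tickets, since, until, min_count=3):
    """Count tickets per label created in [since, until).

    Returns (label, count) pairs, most frequent first, keeping only
    labels seen at least min_count times.
    """
    counts = Counter()
    for ticket in tickets:
        if since <= ticket["created"] < until:
            # set() so a ticket with a duplicated label is counted once
            counts.update(set(ticket["labels"]))
    return [(label, n) for label, n in counts.most_common() if n >= min_count]


# Hypothetical sample data standing in for a ticket-system export
tickets = [
    {"labels": ["access-request"], "created": date(2026, 1, 12)},
    {"labels": ["access-request"], "created": date(2026, 2, 3)},
    {"labels": ["access-request", "vpn"], "created": date(2026, 3, 9)},
    {"labels": ["cert-renewal"], "created": date(2026, 3, 20)},
    {"labels": ["access-request"], "created": date(2025, 12, 30)},  # outside window
]

recurring = count_recurring_labels(tickets, date(2026, 1, 1), date(2026, 4, 1))
print(recurring)
```

Labels that clear the threshold are candidates for the toil catalog; the threshold of three per quarter is an arbitrary starting point, not a recommendation from the source.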
For each toil source, estimate:

  • Frequency: How often does this happen? (daily, weekly, per-deploy)
  • Duration: How long does it take each time? (minutes, hours)
  • People involved: How many engineers touch this?
  • Scaling: Does it grow with service count, traffic, or team size?
  • Risk: What happens if someone does it wrong?
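The frequency, duration, and people estimates combine into quarterly hours by simple multiplication. A minimal sketch, assuming a 13-week quarter; `ToilEstimate` and `WEEKS_PER_QUARTER` are hypothetical names, and the dict keys mirror the ones `calculate_toil_budget` reads in Step 3:

```python
from dataclasses import dataclass

WEEKS_PER_QUARTER = 13  # assumption: one quarter is 13 working weeks


@dataclass
class ToilEstimate:
    task: str
    occurrences_per_week: float
    minutes_per_occurrence: float
    people_involved: int = 1

    def quarterly_hours(self) -> float:
        """Occurrences per quarter x hours per occurrence x people."""
        per_quarter = self.occurrences_per_week * WEEKS_PER_QUARTER
        return per_quarter * (self.minutes_per_occurrence / 60) * self.people_involved

    def as_budget_item(self) -> dict:
        """Convert to the item shape expected by calculate_toil_budget."""
        return {
            "task": self.task,
            "frequency_per_quarter": self.occurrences_per_week * WEEKS_PER_QUARTER,
            "hours_per_occurrence": self.minutes_per_occurrence / 60,
            "people_involved": self.people_involved,
        }


access = ToilEstimate("Access requests", occurrences_per_week=20, minutes_per_occurrence=15)
print(access.quarterly_hours())  # 65.0
```

Scaling and risk are left out of the arithmetic on purpose: they affect prioritization rather than the hour count.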

Step 3: Calculate Toil Budget

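def calculate_toil_budget(toil_items, team_size, hours_per_quarter=520):
    """
    Summarize quarterly toil against team capacity.

    Each item needs 'frequency_per_quarter', 'hours_per_occurrence', and
    'people_involved'; 'quarterly_hours' is added to each item in place.
    The default of 520 hours is 40 h/week x 13 weeks per engineer.
    Google SRE recommends: max 50% of SRE time on toil.
    """
    if team_size <= 0 or hours_per_quarter <= 0:
        raise ValueError('team_size and hours_per_quarter must be positive')

    total_toil_hours = 0

    for item in toil_items:
        quarterly_hours = item['frequency_per_quarter'] * item['hours_per_occurrence'] * item['people_involved']
        total_toil_hours += quarterly_hours
        item['quarterly_hours'] = quarterly_hours

    team_capacity = team_size * hours_per_quarter
    toil_percentage = (total_toil_hours / team_capacity) * 100

    # Under 30% is healthy, 30% to under 50% needs watching, 50% or more is over budget
    if toil_percentage < 30:
        status = '🟢 Healthy'
    elif toil_percentage < 50:
        status = '🟡 Watch'
    else:
        status = '🔴 Over budget'

    return {
        'total_toil_hours': total_toil_hours,
        'team_capacity_hours': team_capacity,
        'toil_percentage': toil_percentage,
        'status': status,
        'items_ranked': sorted(toil_items, key=lambda x: -x['quarterly_hours']),
    }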

Step 4: Generate Report

# Toil Report — Q2 2026

## Summary
- Team size: 6 SREs
- Total toil: 420h/quarter (5.4h/person/week)
- Toil budget: 13.5% of capacity 🟢 (target: <30%)

## Top Toil Sources (ranked by hours)
| Rank | Category | Task | Freq | Duration | Hours/Q | Automatable? |
|------|----------|------|------|----------|---------|-------------|
| 1 | Tickets | Access requests | 20/week | 15 min | 65h | ✅ Self-serve portal |
| 2 | Deploys | Manual prod deploy | 3/week | 45 min | 58.5h | ✅ CI/CD pipeline |
| 3 | Monitoring | False alert triage | 10/week | 20 min | 43h | ✅ Tune thresholds |
| 4 | Data | Customer data fixes | 5/week | 30 min | 32.5h | ✅ Admin tool |
| 5 | Maintenance | Cert renewals | 12/quarter | 2h | 24h | ✅ auto-renew |

## Automation ROI
| Project | Est. Effort | Toil Saved/Q | Payback |
|---------|------------|-------------|---------|
| Self-serve access portal | 80h | 65h | 1.2 quarters |
| CD pipeline | 120h | 58.5h | 2.1 quarters |
| Alert tuning sprint | 20h | 43h | 0.5 quarters |
| Admin data tool | 60h | 32.5h | 1.8 quarters |
| Auto cert renewal | 8h | 24h | 0.3 quarters |

## Recommendation
Start with auto cert renewal (fastest payback and lowest effort) and alert tuning (next-fastest payback). Then tackle the self-serve access portal. Defer the CD pipeline to Q3 (highest effort and longest payback, but the second-largest toil saving).
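The per-person and percentage figures in a report summary follow directly from total toil hours and team size. A minimal sketch of that arithmetic, assuming a 13-week quarter and 520 capacity hours per engineer; `summarize` is a hypothetical helper, not part of the skill:

```python
WEEKS_PER_QUARTER = 13  # assumption: 40 h/week x 13 weeks = 520 h per engineer


def summarize(total_toil_hours, team_size, hours_per_quarter=520):
    """Derive the two headline numbers for a toil report summary."""
    capacity = team_size * hours_per_quarter
    return {
        "hours_per_person_per_week": total_toil_hours / team_size / WEEKS_PER_QUARTER,
        "toil_percentage": total_toil_hours / capacity * 100,
    }


summary = summarize(total_toil_hours=420, team_size=6)
print(
    f"{summary['hours_per_person_per_week']:.1f} h/person/week, "
    f"{summary['toil_percentage']:.1f}% of capacity"
)
```

Deriving both numbers from the same inputs keeps the summary lines consistent with each other and with the ranked table.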

2. prioritize — Rank Automation Candidates

Score each toil source by:

  • Hours saved per quarter (impact)
  • Automation effort (cost)
  • Risk of manual error (safety)
  • Growth rate (will it get worse?)

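Calculate ROI = hours_saved_per_quarter / automation_hours. Higher is better; the payback period in quarters is the inverse, automation_hours / hours_saved_per_quarter.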
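The ROI ranking can be sketched as follows. This is an illustrative version only: `rank_by_roi` and the dict keys are hypothetical names, the figures are taken from the sample report above, and the risk and growth factors are left to human judgment rather than folded into a score:

```python
def rank_by_roi(projects):
    """Sort automation projects by ROI, highest first.

    ROI = hours saved per quarter / automation hours.
    Payback (in quarters) is the inverse of ROI.
    Assumes both numbers are positive for every project.
    """
    ranked = []
    for project in projects:
        roi = project["hours_saved_per_quarter"] / project["automation_hours"]
        ranked.append({**project, "roi": roi, "payback_quarters": 1 / roi})
    return sorted(ranked, key=lambda p: p["roi"], reverse=True)


projects = [
    {"name": "Self-serve access portal", "automation_hours": 80, "hours_saved_per_quarter": 65},
    {"name": "CD pipeline", "automation_hours": 120, "hours_saved_per_quarter": 58.5},
    {"name": "Alert tuning sprint", "automation_hours": 20, "hours_saved_per_quarter": 43},
    {"name": "Admin data tool", "automation_hours": 60, "hours_saved_per_quarter": 32.5},
    {"name": "Auto cert renewal", "automation_hours": 8, "hours_saved_per_quarter": 24},
]

ranking = rank_by_roi(projects)
for p in ranking:
    print(f"{p['name']:<26} ROI {p['roi']:.2f}  payback {p['payback_quarters']:.1f} quarters")
```

With these inputs, auto cert renewal ranks first and the CD pipeline last.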

3. track — Monitor Toil Reduction Over Time

Compare toil hours quarter-over-quarter:

  • Total toil hours trending up or down?
  • Which automation projects delivered expected savings?
  • New toil sources appearing?
  • Toil percentage within SRE budget (< 50%)?
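These checks can be run mechanically once each quarter's toil hours are recorded per task. A minimal sketch, assuming each quarter is stored as a dict of task name to hours; `compare_quarters`, its parameters, and the sample figures are hypothetical:

```python
def compare_quarters(previous, current, team_capacity_hours,
                     expected_savings=None, budget_pct=50.0):
    """Compare per-task toil hours between two quarters.

    previous/current: dict of task name -> toil hours in that quarter.
    expected_savings: optional dict of task name -> hours the automation
    project for that task was expected to save.
    """
    all_tasks = previous.keys() | current.keys()
    deltas = {t: current.get(t, 0.0) - previous.get(t, 0.0) for t in all_tasks}
    current_total = sum(current.values())
    toil_pct = current_total / team_capacity_hours * 100
    # A project "delivered" if the actual reduction met the expectation
    delivered = {
        task: (previous.get(task, 0.0) - current.get(task, 0.0)) >= expected
        for task, expected in (expected_savings or {}).items()
    }
    return {
        "total_change_hours": current_total - sum(previous.values()),
        "deltas": deltas,
        "new_sources": sorted(current.keys() - previous.keys()),
        "delivered": delivered,
        "toil_percentage": toil_pct,
        "within_budget": toil_pct < budget_pct,
    }


# Hypothetical figures for two consecutive quarters
q1 = {"Access requests": 65.0, "Cert renewals": 24.0}
q2 = {"Access requests": 65.0, "Cert renewals": 2.0, "Log rotation": 10.0}

summary = compare_quarters(
    q1, q2, team_capacity_hours=6 * 520, expected_savings={"Cert renewals": 20.0}
)
print(summary["total_change_hours"], summary["new_sources"], summary["delivered"])
```

A negative `total_change_hours` means toil is trending down; entries in `new_sources` are tasks that did not exist in the previous quarter's catalog.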
