Install
openclaw skills install retry-policy-designer
Design retry policies that actually work in production, not the naive "retry 3 times" that causes cascading failures. Analyze failure modes, calculate backoff curves, integrate circuit breakers, set timeout budgets, and generate production-ready retry configurations.
Use when: "design retry policy", "how should we handle retries", "exponential backoff", "retry strategy", "timeout design", "cascading failure prevention", "how many retries", or when adding resilience to service-to-service communication.
design: Create Retry Policy for a Service
Ask about the operation to design retries for:
| Property | Options | Impact on retry design |
|---|---|---|
| Idempotent? | Yes/No | Non-idempotent = max 0-1 retries |
| Latency tolerance | Real-time / Batch | Determines total timeout budget |
| Failure mode | Transient / Persistent | Transient = retry, persistent = fail fast |
| Downstream | Internal / External | External = more conservative retries |
| Impact of duplicate | None / Data corruption | Determines retry safety |
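
One way to read the table is as a small decision function. The sketch below is illustrative only: the `Operation` class, the `max_retries_for` name, and the ceilings of 2 and 3 retries are assumptions, not values defined by this skill.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    idempotent: bool
    transient_failure: bool     # True = transient, False = persistent
    external_downstream: bool   # True = external, False = internal
    duplicate_corrupts_data: bool = False


def max_retries_for(op: Operation) -> int:
    """Map the table's properties to a retry ceiling (illustrative values)."""
    if not op.transient_failure:
        return 0  # persistent failure: fail fast
    if not op.idempotent:
        # Table: non-idempotent = max 0-1 retries
        return 0 if op.duplicate_corrupts_data else 1
    # Assumed ceilings: external calls get a more conservative limit
    return 2 if op.external_downstream else 3


internal_read = Operation(idempotent=True, transient_failure=True, external_downstream=False)
print(max_retries_for(internal_read))  # 3
```

The ceiling is only an upper bound; the timeout budget can still reduce the number of retries that fit.
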
def design_retry_policy(
    timeout_budget_ms: int,      # Total time we can afford
    base_delay_ms: int = 100,    # Initial backoff
    max_delay_ms: int = 30000,   # Cap on single retry delay
    multiplier: float = 2.0,     # Exponential multiplier
    jitter_factor: float = 0.5,  # Random jitter (0-1)
):
    """Calculate how many retries fit in the timeout budget.

    Only backoff waits are counted; time spent inside each attempt is not.
    """
    if base_delay_ms <= 0:
        raise ValueError("base_delay_ms must be positive")
    total_delay = 0.0
    retries = 0
    while True:
        delay = min(base_delay_ms * (multiplier ** retries), max_delay_ms)
        # Conservative allowance for jitter: half of the maximum jitter
        avg_jitter = delay * jitter_factor * 0.5
        # Stop before adding a wait that would exceed the budget
        if total_delay + delay + avg_jitter > timeout_budget_ms:
            break
        total_delay += delay + avg_jitter
        retries += 1
    return {
        "max_retries": retries,
        "base_delay_ms": base_delay_ms,
        "max_delay_ms": max_delay_ms,
        "multiplier": multiplier,
        "jitter": f"±{jitter_factor*100:.0f}%",
        "total_budget_ms": timeout_budget_ms,
        "estimated_total_delay_ms": int(total_delay),
        "retry_delays": [
            f"{min(base_delay_ms * (multiplier ** i), max_delay_ms):.0f}ms"
            for i in range(retries)
        ],
    }

# Example: API call with 30s budget
policy = design_retry_policy(timeout_budget_ms=30000)
print(f"Max retries: {policy['max_retries']}")
print(f"Delays: {' → '.join(policy['retry_delays'])}")
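
`design_retry_policy` works with nominal delays only. As a minimal sketch of how the jitter itself might be applied at runtime, the hypothetical helper below scales each nominal delay by a random factor in the ±`jitter_factor` range; the function name and the `rng` parameter are assumptions for illustration.

```python
import random


def jittered_delay_ms(attempt, base_delay_ms=100, max_delay_ms=30000,
                      multiplier=2.0, jitter_factor=0.5, rng=random):
    """Delay before retry number `attempt` (0-based), with symmetric jitter."""
    delay = min(base_delay_ms * (multiplier ** attempt), max_delay_ms)
    # Scale by a random factor in [1 - jitter_factor, 1 + jitter_factor]
    return delay * (1 + jitter_factor * rng.uniform(-1.0, 1.0))


rng = random.Random(42)  # seeded only so the demo is repeatable
for attempt in range(4):
    print(f"retry {attempt + 1}: {jittered_delay_ms(attempt, rng=rng):.0f}ms")
```

Note that the cap is applied before jitter here, so a jittered wait can exceed `max_delay_ms` by up to the jitter fraction. Clamp after jittering if a hard ceiling is needed.
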
Always retry:
Never retry:
Retry with caution:
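
As a rough illustration of these three buckets, the sketch below classifies HTTP error statuses using the retryable and non-retryable code lists from the example configuration in this document. Treating every unlisted error status as "caution" is an assumption, as is the `classify_status` name.

```python
# Status lists taken from the example retry configuration in this document
RETRYABLE = {429, 502, 503, 504}
NON_RETRYABLE = {400, 401, 403, 404, 409, 422}


def classify_status(status: int) -> str:
    """Return 'always', 'never', or 'caution' for an HTTP error status."""
    if status in RETRYABLE:
        return "always"
    if status in NON_RETRYABLE:
        return "never"
    # Assumption: anything not explicitly listed needs case-by-case judgment
    return "caution"


for code in (503, 404, 500):
    print(code, classify_status(code))
```
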
# Retry Policy: [Service Name]
retry:
  max_attempts: 4                # 1 initial + 3 retries
  backoff:
    type: exponential
    initial_interval: 200ms
    max_interval: 10s
    multiplier: 2.0
    randomization_factor: 0.5    # ±50% jitter
  timeout_budget: 30s            # Total time for all attempts
  retryable_status_codes:
    - 429
    - 502
    - 503
    - 504
  non_retryable_status_codes:
    - 400
    - 401
    - 403
    - 404
    - 409
    - 422
  circuit_breaker:
    threshold: 5                 # 5 failures to open
    reset_timeout: 30s           # Time before half-open
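
The `circuit_breaker` settings can be pictured with a small state machine. This is a minimal sketch, not the skill's implementation: the `CircuitBreaker` class and its method names are hypothetical, and it assumes "5 failures" means five consecutive failures. The `clock` parameter exists only so the demo can run without sleeping.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-opens after `reset_timeout_s`."""

    def __init__(self, threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow trial requests once the reset timeout has passed
        return self.clock() - self.opened_at >= self.reset_timeout_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # (re)open the circuit


now = [0.0]  # fake clock for the demo
cb = CircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    cb.record_failure()
print(cb.allow_request())  # False: circuit is open
now[0] = 30.0
print(cb.allow_request())  # True: half-open, a trial request may pass
```

A retry loop would check `allow_request()` before each attempt and skip the call while the circuit is open. This simplified version lets any number of trial requests through in the half-open state; a stricter breaker would admit only one.
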
Language-specific examples:
Go:
retryPolicy := retry.NewExponentialBackoff(
    retry.WithInitialInterval(200 * time.Millisecond),
    retry.WithMaxInterval(10 * time.Second),
    retry.WithMultiplier(2.0),
    retry.WithRandomizationFactor(0.5),
    retry.WithMaxElapsedTime(30 * time.Second),
)
Python:
from tenacity import retry, retry_if_exception_type, stop_after_delay, wait_exponential_jitter

@retry(
    stop=stop_after_delay(30),
    # jitter is an absolute maximum in seconds added to each wait, not a fraction
    wait=wait_exponential_jitter(initial=0.2, max=10, jitter=5),
    retry=retry_if_exception_type((ConnectionError, TimeoutError)),
)
def call_service():
    ...
JavaScript:
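const retry = require('async-retry');

async function callService(url) {
  return retry(async () => {
    const res = await fetch(url);
    // fetch resolves on HTTP error statuses, so throw to trigger a retry
    if ([429, 502, 503, 504].includes(res.status)) {
      throw new Error(`retryable status ${res.status}`);
    }
    return res;
  }, {
    retries: 3,
    factor: 2,
    minTimeout: 200,
    maxTimeout: 10000,
    randomize: true,
  });
}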
analyze: Audit Existing Retry Logic
Scan the codebase for retry patterns and flag issues:
# Find retry implementations (ripgrep skips binary files by default)
rg "retry|backoff|exponential|attempt.*max|max.*retry" \
  -g '!node_modules' -g '!vendor' 2>/dev/null
Common anti-patterns to flag:
simulate: Visualize Retry Behavior
Show the timeline of retry attempts with delays:
Attempt 1: t=0ms → FAIL (503)
  [wait 200ms ±50%]
Attempt 2: t=180ms → FAIL (503)
  [wait 400ms ±50%]
Attempt 3: t=520ms → FAIL (503)
  [wait 800ms ±50%]
Attempt 4: t=1180ms → SUCCESS (200)
Total time: 1.2s
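
A timeline like the one above could be generated with a few lines of Python. This is a sketch under stated assumptions: the `simulate` function and its parameters are hypothetical, request duration is ignored (as in the timeline), and the jittered waits depend on the random seed, so the timestamps will differ from the sample.

```python
import random


def simulate(outcomes, initial_ms=200, multiplier=2.0, jitter=0.5, seed=None):
    """Return timeline lines for a list of per-attempt HTTP status codes."""
    rng = random.Random(seed)
    t = 0.0
    lines = []
    for i, status in enumerate(outcomes):
        ok = 200 <= status < 300
        lines.append(f"Attempt {i + 1}: t={t:.0f}ms → {'SUCCESS' if ok else 'FAIL'} ({status})")
        if ok or i == len(outcomes) - 1:
            break  # stop on success or when attempts are exhausted
        nominal = initial_ms * multiplier ** i
        lines.append(f"  [wait {nominal:.0f}ms ±{jitter * 100:.0f}%]")
        # Actual wait: nominal delay scaled by a random factor in [1 - jitter, 1 + jitter]
        t += nominal * (1 + jitter * rng.uniform(-1, 1))
    lines.append(f"Total time: {t / 1000:.1f}s")
    return lines


print("\n".join(simulate([503, 503, 503, 200], seed=1)))
```
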