Install
openclaw skills install bookforge-dos-defense-and-mitigation
Design DoS-resistant systems or respond to an active denial-of-service attack. Use this skill when you are designing a new service and want to evaluate its attack surface and build in layered defenses, assessing whether a production system's architecture is DoS-hardened, investigating a traffic spike to determine whether it is an attack or a self-inflicted surge, diagnosing a client retry storm that needs backoff and jitter fixes, building or reviewing a DoS mitigation system (detection + response pipeline), or deciding how to respond strategically to an ongoing attack without leaking information about your defenses to the adversary.
Apply this skill when:
The core economic model: A DoS attack is a supply-and-demand imbalance. The adversary drives demand above your supply capacity. Simply absorbing all attacks by overprovisioning is rarely the most cost-effective approach. Instead, eliminate as much attack traffic as possible at each successive layer, so the expensive inner layers only have to handle what the cheap outer layers could not stop.
Dependency note: This skill builds on load shedding, throttling, and graceful degradation concepts from resilience-and-blast-radius-design. If those mechanisms are not yet defined for the system under review, complete that skill first.
Before starting, confirm you have:
If a system description is provided, scan for:
For active traffic anomalies, additionally scan for:
You have enough to proceed when:
Understanding how an attacker would approach the system lets you find its weakest points before they do.
Map the request dependency chain for a typical user request:
An attack that disrupts any link in this chain disrupts the service. The attacker will target the link with the lowest cost-to-disrupt.
Attacker resource efficiency: A sophisticated attacker will not flood with simple requests — they will generate requests that are more expensive to answer than to send. Examples: triggering search functionality, initiating sessions that exhaust connection state, or exploiting high-cost API endpoints.
Attack types by scale requirement:
| Attack type | How it works | Primary defense |
|---|---|---|
| Volumetric flood | Saturate bandwidth or CPU with high packet/request rate | Edge throttling, anycast dispersal |
| Amplification (DDoS) | Spoof victim IP; small requests generate large responses from third-party servers (DNS, NTP, memcache) | Router ACLs blocking UDP from abusable protocols; network-layer filtering |
| Application-layer | Legitimate-looking requests targeting expensive operations | Application-layer rate limiting; CAPTCHA challenges |
| Botnet / DDoS | Distributed attack from many machines — cannot be blocked by single-source filtering | Shared infrastructure defenses; collaboration with upstream providers |
Threat model priority: Use the number of machines an attacker would need to control to cause user-visible disruption as a proxy for attack cost. Prioritize defending the attack vectors that are cheapest for an adversary to mount against your specific architecture.
Output for this step: Dependency chain map with the weakest link annotated. Attack type assessment ranked by attacker cost-to-mount.
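As a rough illustration of the cost proxy described above, the sketch below estimates how many attacker machines would be needed to saturate each link in a dependency chain, then ranks the links from cheapest to most expensive to disrupt. The link names, capacities, and per-machine request rates are hypothetical values chosen for the example, not measurements from any real system.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    capacity_rps: float              # requests/second the link can serve (assumed)
    attacker_rps_per_machine: float  # rate one attacker machine can sustain (assumed)


def machines_to_disrupt(link: Link) -> int:
    """Proxy for attack cost: machines needed to exceed the link's capacity."""
    return math.ceil(link.capacity_rps / link.attacker_rps_per_machine)


def rank_by_attack_cost(links):
    """Cheapest-to-attack link first; that is the weakest link to annotate."""
    return sorted(links, key=machines_to_disrupt)


# Hypothetical dependency chain for one user request.
chain = [
    Link("dns", 200_000, 1_000),
    Link("load_balancer", 1_000_000, 1_000),
    Link("search_backend", 5_000, 50),  # expensive queries: low rate still hurts
]

for link in rank_by_attack_cost(chain):
    print(f"{link.name}: ~{machines_to_disrupt(link)} machines")
```

In this made-up chain the expensive search backend is the cheapest target even though each attacker machine sends far fewer requests to it, which matches the point about requests that cost more to answer than to send.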
Layered defenses eliminate attack traffic as early as possible, protecting expensive inner layers from having to absorb what cheaper outer layers can stop.
The three-layer stack to evaluate:
Internet traffic
|
[Edge routers] ← throttle high-bandwidth attacks; drop suspicious traffic via ACLs
|
[Network load balancers] ← throttle packet-flood attacks; protect application load balancers
|
[Application load balancers] ← throttle application-specific attacks; protect service frontends
|
[Service frontends]
|
[Backends / databases]
For each layer, ask:
Shared infrastructure advantage: Defenses at the network and load-balancer layers protect every service behind them. A single investment covers a broad range of services. This is the most cost-effective place to deploy defenses — do not skip it in favor of service-level-only fixes.
Anycast for geographic distribution: If a large DDoS targets a single datacenter, anycast routing automatically disperses traffic across all locations announcing the same IP address. No reactive system is needed — traffic is naturally absorbed across the global footprint.
Caching proxies near the edge: Deploy caching proxies close to the edge with correct Cache-Control headers. Cached responses require zero backend processing. This reduces both attack impact and normal operating costs.
Amplification defense: Router ACLs that throttle or block UDP traffic from protocols used for amplification (DNS, NTP, memcache) stop reflected amplification attacks at the edge. These attacks are identifiable by their well-known source ports.
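The filtering decision behind such an ACL can be modeled in a few lines. This is only a sketch of the logic, not router configuration: the function name and the `expected_reply_ports` parameter are hypothetical, and the ports listed are the standard ones for DNS (53), NTP (123), and memcached (11211).

```python
# Well-known UDP source ports of protocols commonly abused for reflection.
AMPLIFICATION_SOURCE_PORTS = {53: "DNS", 123: "NTP", 11211: "memcached"}


def classify_udp(src_port: int, expected_reply_ports: frozenset = frozenset()) -> str:
    """Return 'drop' for inbound UDP whose source port marks it as a likely
    reflected response, unless the service legitimately expects replies from
    that protocol (for example, a host that sends its own DNS queries)."""
    if src_port in AMPLIFICATION_SOURCE_PORTS and src_port not in expected_reply_ports:
        return "drop"
    return "allow"


print(classify_udp(123))                  # NTP reply nobody asked for
print(classify_udp(53, frozenset({53})))  # DNS replies are expected here
print(classify_udp(40000))                # ordinary ephemeral source port
```

A real deployment would more likely throttle than drop outright, and would scope the exception to known resolver addresses rather than to a whole port.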
Output for this step: Layer-by-layer defense inventory table. Mark each layer: defended / partially defended / undefended. Identify the first undefended layer that attack traffic reaches.
Service and application design choices have a significant impact on how well a service survives a DoS attack — and how much it costs to run in normal operation.
Three design levers to evaluate:
Caching proxies (highest impact)
Cache-Control and related headers so proxy servers can serve repeated requests without hitting the application backend
Minimize application requests
Minimize egress bandwidth
Output for this step: Service-level defense checklist with gap annotations. Estimate the cache hit rate for the highest-traffic endpoints.
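The caching lever depends on responses carrying headers that let a proxy reuse them. The helper below is a minimal sketch of building a `Cache-Control` value; the function name, parameters, and lifetimes are assumptions for illustration, while the directives themselves (`public`, `private`, `no-store`, `max-age`, `s-maxage`) are standard HTTP.

```python
from typing import Optional


def cache_control_header(
    cacheable_by_proxy: bool,
    max_age_s: int = 0,
    proxy_max_age_s: Optional[int] = None,
) -> str:
    """Build a Cache-Control value.

    Per-user responses are kept out of shared caches; everything else is
    marked public so an edge proxy can answer repeats without the backend.
    """
    if not cacheable_by_proxy:
        return "private, no-store"
    directives = ["public", f"max-age={max_age_s}"]
    if proxy_max_age_s is not None:
        # s-maxage applies only to shared caches such as edge proxies.
        directives.append(f"s-maxage={proxy_max_age_s}")
    return ", ".join(directives)


print(cache_control_header(True, max_age_s=60, proxy_max_age_s=300))
print(cache_control_header(False))
```

Letting the proxy hold a response longer than browsers do (`s-maxage` above `max-age`) is one way to raise the edge hit rate without making end users see stale content for as long.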
Outage resolution time is dominated by mean time to detection (MTTD) and mean time to repair (MTTR). A DoS attack may appear as a spike in CPU, memory exhaustion, or error rate — not obviously as a traffic anomaly — unless request-rate monitoring is in place.
Minimum monitoring requirements:
Alerting principles:
Why this matters: Noisy alerts that fire before human action is required train teams to ignore pages. Alert only when human intervention may actually change the outcome.
Output for this step: Monitoring metric list with alert thresholds and escalation conditions. Flag any layer with no request-rate visibility.
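One possible reading of the alerting principle, sketched under assumptions: page a human only when demand exceeds capacity and users are visibly affected, so that automated defenses handling a spike on their own do not generate a page. The function name and every threshold here are hypothetical.

```python
def should_page(
    request_rate: float,
    capacity: float,
    error_rate: float,
    max_error_rate: float,
) -> bool:
    """Page only when the service is over capacity AND users see errors.

    A spike that the automated defenses absorb (error rate stays within
    bounds) is recorded but does not wake anyone up.
    """
    over_capacity = request_rate > capacity
    users_affected = error_rate > max_error_rate
    return over_capacity and users_affected


# Spike absorbed by automation: no page.
print(should_page(request_rate=1500, capacity=1000, error_rate=0.001, max_error_rate=0.01))
# Spike causing user-visible errors: page.
print(should_page(request_rate=1500, capacity=1000, error_rate=0.05, max_error_rate=0.01))
```

Elevated errors at normal traffic levels return `False` here because they are not a demand problem; they would belong to a separate alert.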
When absorbing an attack is not feasible, the goal is to reduce user-facing impact to the minimum. This step relies on the load shedding and throttling mechanisms defined in resilience-and-blast-radius-design.
Throttle, do not block outright:
Quality-of-service (QoS) prioritization:
Application degraded modes:
CAPTCHA as a mitigation bridge:
Output for this step: Degraded mode definitions per service component. QoS priority assignments for critical traffic. CAPTCHA/challenge strategy if applicable.
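Throttling rather than blocking can be implemented in several ways; the source does not prescribe one. A token bucket is a common choice and is sketched here as an assumption. The class name and parameters are illustrative, and time is passed in explicitly so the behavior is easy to test.

```python
class TokenBucket:
    """Allow a sustained rate with a bounded burst; excess requests are
    throttled instead of the source being blocked outright."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0):
        self.rate = rate_per_s   # tokens added per second
        self.burst = burst       # maximum tokens the bucket can hold
        self.tokens = burst      # start full
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at burst.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_s=1.0, burst=2.0)
print([bucket.allow(now=0.0) for _ in range(3)])  # burst of 2, then throttled
print(bucket.allow(now=1.0))                      # one token refilled
```

Because the source keeps receiving some service, a legitimate client that was misclassified is slowed down rather than cut off.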
An automated DoS mitigation system provides fast, consistent response that does not depend on human reaction time. It must be designed to handle its own failure modes safely.
Two required components:
Detection:
Response:
Failure mode design:
Output for this step: Detection + response component design. Failure mode policy (fail static confirmed). Canary deployment plan for automated responses.
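A fail-static policy can be sketched as a holder that keeps the last successfully fetched policy whenever the controller is unreachable. The class, the `fetch` callable, and the policy shape are all hypothetical; the point is only that a failure neither clears the policy (fail open) nor tightens it (fail closed).

```python
class PolicyHolder:
    """Keeps enforcing the last known-good policy if a refresh fails."""

    def __init__(self, initial_policy: dict):
        self._policy = dict(initial_policy)

    def refresh(self, fetch) -> dict:
        """Try to fetch a new policy; on any failure, keep the current one."""
        try:
            new_policy = fetch()
        except Exception:
            # Fail static: controller is down, so freeze the current policy.
            return self._policy
        self._policy = dict(new_policy)
        return self._policy


def controller_down():
    raise ConnectionError("mitigation controller unreachable")


holder = PolicyHolder({"per_ip_limit_rps": 100})
print(holder.refresh(lambda: {"per_ip_limit_rps": 50}))  # normal update
print(holder.refresh(controller_down))                   # frozen at last value
```

A production version would also want to record how stale the frozen policy is, since a policy frozen mid-attack may need manual review once the attack ends.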
Responding purely reactively — filtering the attack traffic signature immediately — teaches the adversary what your defenses can see. A strategic response exploits the adversary's uncertainty about your capabilities.
Do not expose your detection method:
Example: An attack arrived with User-Agent: I AM BOT NET. Rather than dropping all traffic with that string (which would teach the attacker to change User-Agents), enumerate the IPs sending that traffic and intercept all of their requests with CAPTCHAs — including any future requests, even with a changed User-Agent. This blocked the botnet's A/B testing capability.
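The response in this example can be sketched as two steps: use the signature only to enumerate sources, then key the challenge on the source rather than on the signature. The request shape (a dict with `ip` and `user_agent`) and the function names are assumptions for illustration; the addresses come from documentation-reserved ranges.

```python
def sources_matching(requests, signature):
    """Use the attack signature only to enumerate offending source IPs."""
    return {r["ip"] for r in requests if signature(r)}


def action_for(request, challenged_ips):
    """Challenge by source IP, so changing the signature does not help."""
    return "captcha" if request["ip"] in challenged_ips else "serve"


observed = [
    {"ip": "192.0.2.10", "user_agent": "I AM BOT NET"},
    {"ip": "192.0.2.11", "user_agent": "I AM BOT NET"},
    {"ip": "198.51.100.5", "user_agent": "Mozilla/5.0"},
]
challenged = sources_matching(observed, lambda r: r["user_agent"] == "I AM BOT NET")

# The same bot returns with a changed User-Agent and is still challenged.
print(action_for({"ip": "192.0.2.10", "user_agent": "Chrome"}, challenged))
print(action_for({"ip": "198.51.100.5", "user_agent": "Mozilla/5.0"}, challenged))
```

From the attacker's side, both the old and the new User-Agent get the same treatment, so the experiment reveals nothing about which signal triggered the defense.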
Adversary capability inference:
When the adversary may not be an attacker:
Collaboration and escalation:
Output for this step: Response plan that avoids revealing the detection method. Adversary capability assessment. Escalation path to upstream providers if needed.
Not all traffic spikes are attacks. During an incident, the natural instinct is to look for an adversary — but a self-inflicted surge can look identical to a volumetric attack and will be worsened by adversarial countermeasures.
Two categories of self-inflicted surge:
Organic traffic surge (synchronized user behavior):
Client retry storm (misbehaving software):
Distinguishing attack from self-inflicted surge — diagnostic checklist:
| Signal | Suggests attack | Suggests self-inflicted surge |
|---|---|---|
| Requests match real browser/OS distribution | No | Yes |
| Requests originate from expected geographic regions | No | Yes |
| Requests target a diverse set of queries/endpoints | No (attacks are focused) | Yes |
| Traffic arrives in correlated waves | Possible (botnet scanning) | Yes (event-driven) |
| Traffic correlates with a known external event | No | Yes |
| DNS retry rate is 30x normal | No | Yes (retry storm) |
Output for this step: Self-inflicted surge assessment: organic event or retry storm. Fix: design change (organic) or backoff + jitter implementation (retry storm).
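The checklist above can be turned into a simple tally to structure an investigation. This sketch is an assumption layered on the table: the signal keys are hypothetical names for the unambiguous rows, the "correlated waves" row is left out because it points both ways, and the majority rule is a placeholder for human judgment rather than a validated classifier.

```python
# Signals where "yes" suggests a self-inflicted surge and "no" suggests attack.
SELF_INFLICTED_SIGNALS = (
    "matches_browser_distribution",
    "expected_regions",
    "diverse_endpoints",
    "correlates_with_known_event",
    "retry_rate_far_above_normal",
)


def tally(observations: dict) -> tuple:
    """Count signals pointing each way; unobserved signals are ignored."""
    yes = sum(1 for k in SELF_INFLICTED_SIGNALS if observations.get(k) is True)
    no = sum(1 for k in SELF_INFLICTED_SIGNALS if observations.get(k) is False)
    return yes, no


def leaning(observations: dict) -> str:
    """Simple majority (an assumed rule), with ties reported as inconclusive."""
    yes, no = tally(observations)
    if yes > no:
        return "self-inflicted"
    if no > yes:
        return "attack"
    return "inconclusive"


# Hypothetical observations from an incident.
obs = {
    "matches_browser_distribution": True,
    "expected_regions": True,
    "diverse_endpoints": False,
    "correlates_with_known_event": True,
}
print(tally(obs), leaning(obs))
```

The result is a prompt for where to look next, not a decision; an inconclusive or mixed tally argues for holding back adversarial countermeasures until more signals are gathered.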
Produce a DoS defense assessment report with the following sections:
Simply absorbing attacks by overprovisioning is not cost-effective at scale. The defender's strategy is to eliminate attack traffic at each layer for the minimum cost, so the expensive inner layers see only residual attack volume. Shared infrastructure defenses (edge, network LB) are the highest-leverage investment because they protect all services at once.
Each defense layer should be able to handle the attack traffic that breaches the outer layer. Defenses near the edge are cheap (bandwidth is shared); defenses deep in the stack are expensive (CPU, database connections). Drop as early as possible.
A DoS mitigation controller that fails open lets attack traffic through. One that fails closed creates a self-inflicted outage. Failing static — freezing the current policy — is the correct tradeoff: the system continues functioning at whatever state it was in when the controller went down, without making things worse in either direction.
Immediately dropping traffic matching the attack signature reveals exactly what your detection sees. Instead, respond in ways that do not fingerprint your detection method — for example, challenging all traffic from identified sources rather than only traffic matching the current attack signature. This blocks the adversary's ability to A/B test your defenses.
Applying rate limiting and blocking to an organic surge or a client retry storm will worsen the situation. Always verify the traffic profile before applying adversarial countermeasures. The correct response to a retry storm is to serve as many requests as possible while backoff + jitter propagates; the correct response to an organic surge is a design change that reduces demand.
Exponential backoff without jitter still produces synchronized bursts when many clients fail simultaneously. Jitter without backoff spreads retries out in time but does not reduce the total retry volume, so aggregate load stays high. Both are necessary. At Google, exponential backoff with jitter is standard in all client software.
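A minimal sketch of a retry delay that combines both, assuming the "full jitter" variant (a uniform random delay between zero and an exponentially growing ceiling). The function name, base delay, and cap are illustrative defaults, not values from the source.

```python
import random


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 60.0, rng=random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Backoff: the ceiling doubles with each failed attempt, up to cap_s.
    Jitter: the actual delay is drawn uniformly below that ceiling, so
    clients that failed at the same moment do not retry at the same moment.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


rng = random.Random(42)  # seeded only to make the demonstration repeatable
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt, rng=rng), 3))
```

The cap keeps a long outage from pushing delays to unbounded values, at the cost of a steady floor of retry traffic once every client has reached it.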
Google received an attack where all traffic contained User-Agent: I AM BOT NET. Rather than dropping that User-Agent (which would immediately teach the attacker to use User-Agent: Chrome), SREs enumerated all IPs sending that traffic and applied CAPTCHA challenges to all of their requests — regardless of User-Agent. This prevented the attacker from using A/B testing to discover which signals the defense was keying on, and blocked future requests even after the User-Agent changed.
In 2009, Google Search received a burst of traffic for German words with identical character prefixes, arriving in three waves roughly 10 minutes apart. Initial suspicion: a botnet conducting a dictionary attack. Investigation found the traffic originated from machines in Germany and matched real browser distributions. Root cause: a televised game show challenged contestants to find word completions with the most Google search results — and viewers at home searched along. The response was a design change: adding word-completion suggestions as users type, which reduced the number of queries users submitted. No adversarial countermeasures were needed.
An authoritative DNS server experiences an outage. Recursive DNS servers controlled by external organizations immediately begin retrying, escalating to 30x normal traffic. This prevents the server from recovering: each attempted recovery is overwhelmed by the retry flood. The correct response is to serve as many requests as possible (each successful answer lets one DNS resolver escape its retry loop) while applying upstream request throttling to preserve server health. The long-term fix is for all clients to implement exponential backoff with jitter, but the server operator cannot enforce this on resolvers it does not control.
When blocking by IP, legitimate users behind the same NAT as an attacker are blocked. A CAPTCHA challenge allows them to prove they are human and receive a browser-based exemption cookie. The cookie must contain: a pseudo-anonymous identifier (allows abuse detection and revocation), the challenge type (allows requiring harder challenges for suspicious behaviors), a timestamp (allows expiring old cookies), the solving IP address (prevents botnets from sharing a single exemption across many machines), and a cryptographic signature (prevents forgery).
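The five cookie fields listed above can be sketched with an HMAC-signed payload. The encoding (JSON plus URL-safe base64), the field names, the one-hour lifetime, and the function names are assumptions for illustration; only the set of fields and their purposes comes from the text.

```python
import base64
import hashlib
import hmac
import json


def issue_cookie(key: bytes, client_id: str, challenge_type: str,
                 solved_at: int, solver_ip: str) -> str:
    """Serialize the exemption fields and append an HMAC-SHA256 signature."""
    payload = json.dumps(
        {"id": client_id, "challenge": challenge_type, "ts": solved_at, "ip": solver_ip},
        sort_keys=True, separators=(",", ":"),
    ).encode()
    sig = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(sig).decode())


def verify_cookie(key: bytes, cookie: str, request_ip: str, now: int,
                  max_age_s: int = 3600):
    """Return the fields if the cookie is authentic, unexpired, and presented
    from the IP that solved the challenge; otherwise return None."""
    try:
        payload_b64, sig_b64 = cookie.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # malformed cookie or invalid base64
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return None
    fields = json.loads(payload)
    if fields["ip"] != request_ip or now - fields["ts"] > max_age_s:
        return None
    return fields


key = b"example-key-not-for-production"
cookie = issue_cookie(key, "anon-7f3a", "image", solved_at=1_000, solver_ip="192.0.2.1")
print(verify_cookie(key, cookie, request_ip="192.0.2.1", now=1_500))
print(verify_cookie(key, cookie, request_ip="198.51.100.9", now=1_500))
```

The signature is checked before the payload is parsed, so forged or altered cookies are rejected without their contents being trusted. Revocation by identifier and escalation by challenge type would be separate lookups on the returned fields.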
resilience-and-blast-radius-design skill for load shedding and throttling implementation
This skill is licensed under CC-BY-SA-4.0. Source: BookForge — Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield.
Install related skills from ClawHub:
clawhub install bookforge-resilience-and-blast-radius-design
Or install the full book set from GitHub: bookforge-skills