API Failover
v1.0.1

Detect AI API/provider/model failures and route requests to healthy fallback providers or downgraded models. Use when creating or maintaining automatic failover…
Create or improve a lightweight failover layer for AI APIs.
Goals
Build systems that:
- detect unavailable or degraded providers/models
- classify failures before retrying blindly
- switch to a safe fallback chain
- avoid hammering broken endpoints
- recover back to preferred providers after cooldown
Workflow
- Identify the call path.
- Classify failure modes.
- Define a fallback policy.
- Add health memory.
- Implement guarded retries.
- Emit observable logs.
- Validate with forced-failure tests.
Use the detailed rules below and the bundled scripts instead of re-inventing routing logic each time.
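The workflow above reduces to a small routing loop. A minimal sketch, assuming hypothetical `providers` (ordered list of callables) and `classify` (exception to category) helpers — the bundled scripts implement their own version of this:

```python
import time

def call_with_fallback(request, providers, classify, max_retries=2):
    """Try each provider in order; retry transient failures briefly,
    fail fast on non-transient ones, skip quota-exhausted providers."""
    last_error = None
    for provider in providers:
        for attempt in range(max_retries):
            try:
                return provider(request)
            except Exception as exc:
                category = classify(exc)
                last_error = exc
                if category in ("AUTH_ERROR", "BAD_REQUEST"):
                    raise  # fail fast; retrying won't fix the request
                if category == "QUOTA_EXCEEDED":
                    break  # move straight to the next provider
                time.sleep(0.5 * (attempt + 1))  # brief linear backoff
    raise RuntimeError("all providers failed") from last_error
```

In a real deployment the inner loop would also consult the circuit breaker before each attempt, so open providers are skipped without a network call.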
Practical defaults
Error classes
Use these normalized categories:
- `AUTH_ERROR`
- `BAD_REQUEST`
- `RATE_LIMIT`
- `TIMEOUT`
- `SERVER_ERROR`
- `NETWORK_ERROR`
- `MODEL_UNAVAILABLE`
- `QUOTA_EXCEEDED`
- `UNKNOWN_TRANSIENT`
Suggested routing behavior
- `AUTH_ERROR`, `BAD_REQUEST`: fail fast; do not retry other providers unless config explicitly maps to another credential set.
- `RATE_LIMIT`: short backoff, then fallback.
- `TIMEOUT`, `SERVER_ERROR`, `NETWORK_ERROR`, `MODEL_UNAVAILABLE`, `UNKNOWN_TRANSIENT`: retry briefly, then fallback.
- `QUOTA_EXCEEDED`: mark provider unavailable for a longer cooldown and fall back immediately.
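One way to derive these categories from transport-level signals. This is a sketch: the status-to-category mapping below is an assumption, not the bundled scripts' exact logic, and real SDK exceptions carry richer information worth inspecting:

```python
def classify_status(status_code, timed_out=False, network_error=False):
    """Map an HTTP status (or transport failure) to a normalized category."""
    if timed_out:
        return "TIMEOUT"
    if network_error:
        return "NETWORK_ERROR"
    if status_code in (401, 403):
        return "AUTH_ERROR"
    if status_code == 400:
        return "BAD_REQUEST"
    if status_code == 404:
        return "MODEL_UNAVAILABLE"  # assumption: 404 means unknown model/route
    if status_code == 429:
        return "RATE_LIMIT"  # may also signal QUOTA_EXCEEDED; check the body
    if 500 <= status_code < 600:
        return "SERVER_ERROR"
    return "UNKNOWN_TRANSIENT"
```

Note that many providers use 429 for both rate limiting and quota exhaustion; distinguishing them usually requires parsing the error body.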
Circuit breaker defaults
Start with:
- open after `3` consecutive transient failures
- cooldown `60-180s`
- half-open with `1` probe
- close after `1-2` successful probes
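These defaults map to a small per-provider breaker. A sketch that closes after a single successful probe (the simpler end of the range above) and takes an injectable clock for testability:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive transient failures; after
    `cooldown` seconds allow one half-open probe; close on success."""

    def __init__(self, threshold=3, cooldown=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed (or half-open probing)

    def allow_request(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Half-open: permit one probe; one more failure re-opens.
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

Keep one breaker instance per provider (or per provider+model pair) so one broken endpoint does not shadow healthy ones.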
Configuration pattern
Keep policy in config, not hard-coded logic.
Recommended shape:
- provider registry
- task profiles with ordered fallback chains
- retry policy
- circuit-breaker policy
- per-provider overrides
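A config with that shape might look like the following. All names, URLs, and numbers here are illustrative assumptions; see `references/config-example.yaml` for the bundled version:

```yaml
providers:            # provider registry
  openai-main:
    base_url: https://api.openai.com/v1
    api_key_env: OPENAI_API_KEY
  local-llm:
    base_url: http://localhost:11434/v1

profiles:             # task profiles with ordered fallback chains
  default:
    chain: [openai-main, local-llm]
  critical:
    chain: [openai-main]

retry:                # retry policy
  max_attempts: 2
  backoff_seconds: [0.5, 2.0]

circuit_breaker:      # circuit-breaker policy
  failure_threshold: 3
  cooldown_seconds: 120

overrides:            # per-provider overrides
  local-llm:
    retry: {max_attempts: 1}
```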
Design guidance
- Prefer fewer, well-understood providers over large fallback chains.
- Keep the fallback chain semantically compatible when possible.
- Separate "best quality" from "must return something" behavior.
- Keep downgrade rules explicit; avoid large, silent capability drops on critical tasks.
- For tool-using agents, treat provider switching as a reliability event and report it when user-visible quality may change.
Semi-automatic deployment model
Use this skill to discover the environment, generate a production-ish config, run a local HTTP failover proxy, and verify health.
Do not claim full autonomous takeover unless the environment-specific integration is actually completed.
References
Read these only when needed:
- `references/config-example.yaml` for a compact policy example
- `references/config-realworld-example.yaml` for a more practical multi-provider template
- `references/config-production.yaml` for a ready-to-edit production template
- `references/test-scenarios.md` for failure-injection and validation cases
- `references/realworld-notes.md` for local proxy deployment and environment-variable setup
- `references/api-failover.service` for a user-systemd service example
Bundled scripts
`scripts/discover_env.py`
Inspect the current environment.
`scripts/generate_config.py`
Generate a production-ish YAML config from simple defaults.
`scripts/failover_proxy.py`
Run a minimal CLI failover call path.
`scripts/http_proxy.py`
Expose a single local OpenAI-compatible entrypoint.
Endpoints:
- `POST /v1/chat/completions`
- `GET /health`
Optional request header:
`X-Failover-Profile: cheap|default|critical|local-first`
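For example, a request routed through the `cheap` profile. The port and model name are assumptions; match them to your proxy config:

```python
import json
import urllib.request

# Assumed proxy address; http_proxy.py's actual port may differ.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "gpt-4o-mini",  # hypothetical name; the proxy routes via the chain
    "messages": [{"role": "user", "content": "ping"}],
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "X-Failover-Profile": "cheap",  # pick the 'cheap' fallback chain
    },
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment once the proxy is running
```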
scripts/selfcheck.py
Validate that the local proxy is reachable and can process a minimal chat request.
scripts/bootstrap_failover.py
Run the semi-automatic bootstrap flow:
- discover environment
- generate config
- optionally start the proxy
- run self-check
- print next actions
Example:
```
python3 scripts/bootstrap_failover.py \
  --default-model custom-ai-td-ee/gpt-5.4 \
  --start-proxy
```
Keep these scripts small and inspectable. Extend them instead of turning SKILL.md into code-heavy instructions.