DevOps & Security
Outage Response Playbook
Outage Response Playbook generates complete, scenario-specific runbooks — not generic templates. It produces severity tiers with measurable criteria, role assignments for every step, step-by-step response procedures, escalation trees with observable triggers, communication templates, resolution checklists, and blameless post-mortem templates. Engineering managers, platform teams, and SRE leads use it to document failure modes before they happen instead of improvising under pressure. A playbook built from this skill is immediately usable — not a starting point that needs another hour of editing before it is safe to hand to an on-call engineer. What makes it production-grade is specificity. Every step has an explicit role owner. Severity tiers use measurable thresholds. Escalation contacts are role-based. Post-mortems are blameless by construction. Write for 3 AM, not for ideal conditions.
One-Time Purchase
$19.99
Outage Playbook — Database Connection Pool Exhaustion
When to run this
Run this when the db-pool-exhaustion alert fires (available connections under 10% of max pool size for over 60 seconds) or when API error rate spikes above 1% with database errors in logs. Customer-facing impact is the default assumption — escalate first, diagnose second.
Severity Classification
| Tier | Symptoms | Action |
|---|---|---|
| P1 | >50% of API requests returning 500s for >5 minutes; customer impact confirmed | Page IC + Tech Lead; status page within 10 min |
| P2 | >20% of API requests returning 500s; partial functionality | Page Tech Lead; IC monitors; internal status update |
| P3 | Pool warnings in logs; no customer-facing impact yet | Investigate during business hours; no page |
Roles
Who does what
Detection & Triage (0–5 min)
First five minutes
Response Steps
-- 1. Confirm pool exhaustion
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- If > 90% of max_connections, pool is exhausted.
-- 2. Identify long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;
-- 3. Kill queries running > 5 minutes that are not critical batch jobs
SELECT pg_terminate_backend(pid);
If killing queries restores the pool within 5 minutes, monitor for 15 minutes and proceed to the resolution checklist. If the pool remains exhausted:
kubectl rollout restart deployment/api-server -n production
Escalation — 15 minutes in
If the pool has not recovered 15 minutes after the alert and the application restart did not help, page the Database SRE on-call. The root cause is likely a runaway query plan or a connection-leaking deploy that needs a rollback, not a transient spike.
Escalation — 30 minutes in
If still active at 30 minutes: page the VP Engineering, name a dedicated comms lead, and stand up a status-page incident update every 15 minutes until resolved. Consider whether to fail over read traffic to the replica even if write traffic is still degraded.
Resolution Checklist
- Connection pool utilization below 60% sustained for 15 minutes
- Error rate returned to baseline (under 0.1%)
- Status page updated to Resolved
- PagerDuty incident resolved
- Post-mortem scheduled within 48 hours, doc owner assigned
Post-incident
Capture the live query log from the active-queries dump above before clearing it — the postmortem needs the actual offenders, not just the symptom that the pool was full.
This sample illustrates the skill's output format. Adapt thresholds, dashboards, and runbook links to your environment before relying on this in production.
View full sample →
All sales final. No refunds on digital products.
Includes support for Claude Code, Codex, OpenClaw, and Google Antigravity in the same license.
Also in Incident Response
Bundle price: $55. Compare this skill with the full workflow bundle or Pro access.
Best for
Platform teams and engineering managers writing per-failure-mode runbooks before incidents instead of during them — database failover, regional outage, payments provider downtime. Most useful for teams with a real on-call rotation and a culture of practicing the runbooks rather than filing them.
Not ideal for
Teams that don’t yet have basic monitoring and on-call coverage — the runbook will reference observability surfaces that don’t exist. Also a poor fit as a substitute for incident response training; the document is the playbook, not the muscle memory of running it under pressure.
Included in this purchase
- Claude Code, Codex, OpenClaw, and Google Antigravity skill files.
- Setup guidance for the right adapter in your workspace.
- One-time license for the purchased skill version.
Setup
Plan for a short copy-and-configure setup in your preferred agent workspace. No custom integration is required for the skill file itself.
Related Skills
$19.99
One-time license
$19.99
One-time license
$19.99
One-time license
Future Updates
This purchase includes the current version of the skill. If you want future adapter updates — meaning compatibility and packaging updates as supported platforms evolve — plus new catalog additions included automatically, upgrade to Pro.