
DevOps & Security

Outage Response Playbook

Outage Response Playbook generates complete, scenario-specific runbooks, not generic templates. It produces severity tiers with measurable criteria, role assignments for every step, step-by-step response procedures, escalation trees with observable triggers, communication templates, resolution checklists, and blameless post-mortem templates. Engineering managers, platform teams, and SRE leads use it to document failure modes before they happen instead of improvising under pressure. A playbook built from this skill is immediately usable, not a starting point that needs another hour of editing before it is safe to hand to an on-call engineer.

What makes it production-grade is specificity: every step has an explicit role owner, severity tiers use measurable thresholds, escalation contacts are role-based, and post-mortems are blameless by construction. Each playbook is written for 3 AM, not for ideal conditions.

Nexus Certified
Claude Code, Codex, OpenClaw, Google Antigravity
outage-response, reliability, runbooks, on-call, operations

One-Time Purchase

$19.99

Sample Output

Outage Playbook — Database Connection Pool Exhaustion

When to run this

Run this when the db-pool-exhaustion alert fires (available connections under 10% of max pool size for over 60 seconds) or when API error rate spikes above 1% with database errors in logs. Customer-facing impact is the default assumption — escalate first, diagnose second.
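The "under 10% for over 60 seconds" trigger is a sustained-threshold condition, which can be sketched as follows. This is only an illustration of the rule as worded above, not the alerting system's actual configuration; `PoolSample` and `pool_alert_firing` are hypothetical names, and the samples are assumed to come from whatever metrics source your pool exposes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PoolSample:
    """One reading of the connection pool (hypothetical shape)."""
    t: float         # seconds since an arbitrary epoch
    available: int   # free connections at time t
    max_size: int    # configured maximum pool size


def pool_alert_firing(samples: Iterable[PoolSample],
                      threshold: float = 0.10,
                      hold_seconds: float = 60.0) -> bool:
    """Return True if available/max_size has stayed under `threshold`
    continuously for more than `hold_seconds`, up to the latest sample."""
    ordered = sorted(samples, key=lambda s: s.t)
    breach_start = None
    for s in ordered:
        if s.available / s.max_size < threshold:
            if breach_start is None:
                breach_start = s.t   # start of the current breach
        else:
            breach_start = None      # pool recovered; reset the timer
    if breach_start is None:
        return False
    return ordered[-1].t - breach_start > hold_seconds
```

A single healthy reading resets the timer, so brief dips do not page anyone.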


Severity Classification

Tier | Symptoms | Action
P1 | >50% of API requests returning 500s for >5 minutes; customer impact confirmed | Page IC + Tech Lead; status page within 10 min
P2 | >20% of API requests returning 500s; partial functionality | Page Tech Lead; IC monitors; internal status update
P3 | Pool warnings in logs; no customer-facing impact yet | Investigate during business hours; no page
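Because the tiers use measurable thresholds, the table can be expressed as a small function. The sketch below is one possible reading of the table; the function name and parameters are hypothetical, and error rates that match no row (for example 5% with no pool warnings) are deliberately left unclassified because the table does not cover them.

```python
from typing import Optional


def classify_severity(error_rate: float,
                      minutes_elevated: float,
                      customer_impact: bool,
                      pool_warnings: bool) -> Optional[str]:
    """Map observed symptoms to a tier from the table above.

    error_rate is the fraction of API requests returning 500s (0.0 to 1.0).
    Returns None when no row of the table applies.
    """
    # P1: >50% errors for >5 minutes with confirmed customer impact
    if error_rate > 0.50 and minutes_elevated > 5 and customer_impact:
        return "P1"
    # P2: >20% errors (partial functionality)
    if error_rate > 0.20:
        return "P2"
    # P3: pool warnings in logs, no customer-facing impact yet
    if pool_warnings and not customer_impact:
        return "P3"
    return None
```

Checking P1 first means a severe incident that has not yet lasted five minutes falls through to P2 until the duration criterion is met.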

Roles

Who does what

IC: On-call Engineering Manager. Coordinates response, owns severity calls + comms
Tech Lead: On-call Backend Engineer. Diagnoses root cause, executes remediation
Comms: IC doubles as comms lead unless incident exceeds 30 minutes, then a separate owner is named

Detection & Triage (0–5 min)

First five minutes

1. [IC] Acknowledge PagerDuty within 2 min; open #incidents Slack channel
2. [Tech Lead] Verify alert against Grafana "DB Pool Health" dashboard
3. [IC] Classify severity using the tier table; declare in #incidents
4. [IC] If P1, post initial status-page update within 10 minutes

Response Steps

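-- 1. Confirm pool exhaustion
SELECT count(*) AS active_connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
WHERE state = 'active';
-- If active_connections > 90% of max_connections, the pool is exhausted.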

-- 2. Identify long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;

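-- 3. Kill queries running > 5 minutes that are not critical batch jobs
-- Review the output of step 2 first, and add a filter (for example on
-- application_name or usename) to exclude critical batch jobs.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND now() - query_start > interval '5 minutes'
  AND pid <> pg_backend_pid();  -- never terminate this session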

If killing queries restores the pool within 5 minutes, monitor for 15 minutes and proceed to the resolution checklist. If the pool remains exhausted:

kubectl rollout restart deployment/api-server -n production

Escalation — 15 minutes in

If the pool has not recovered 15 minutes after the alert and the application restart did not help, page the Database SRE on-call. The root cause is likely a runaway query plan or a connection-leaking deploy that needs a rollback, not a transient spike.

Escalation — 30 minutes in

If the incident is still active at 30 minutes: page the VP of Engineering, name a dedicated comms lead, and post a status-page update every 15 minutes until resolved. Consider failing over read traffic to the replica even if write traffic is still degraded.
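The two escalation checkpoints above are time-based triggers and can be sketched as a lookup from elapsed time to required actions. This is a minimal illustration under the assumptions stated in the escalation text; the function name, parameters, and action strings are hypothetical and not part of any paging tool's API.

```python
from typing import List


def escalation_actions(minutes_since_alert: float,
                       pool_recovered: bool,
                       restart_helped: bool) -> List[str]:
    """Return the escalation actions due at this point in the incident."""
    if pool_recovered:
        return []  # nothing to escalate once the pool is healthy
    actions: List[str] = []
    # 15-minute checkpoint: pool still exhausted and the restart did not help
    if minutes_since_alert >= 15 and not restart_helped:
        actions.append("Page Database SRE on-call")
    # 30-minute checkpoint: incident still active
    if minutes_since_alert >= 30:
        actions += [
            "Page VP Engineering",
            "Name dedicated comms lead",
            "Post status-page update every 15 minutes",
        ]
    return actions
```

The checkpoints are cumulative: at 30 minutes the 15-minute action is still listed if it applies, so a late-joining responder sees everything that should already have happened.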


Resolution Checklist

  • Connection pool utilization below 60% sustained for 15 minutes
  • Error rate returned to baseline (under 0.1%)
  • Status page updated to Resolved
  • PagerDuty incident resolved
  • Post-mortem scheduled within 48 hours, doc owner assigned

Post-incident

Capture the live query log from the active-queries dump above before clearing it. The post-mortem needs the actual offenders, not just the symptom that the pool was full.


This sample illustrates the skill's output format. Adapt thresholds, dashboards, and runbook links to your environment before relying on this in production.


All sales final. No refunds on digital products.

Includes support for Claude Code, Codex, OpenClaw, and Google Antigravity in the same license.

Also in Incident Response

Bundle price: $55. Compare this skill with the full workflow bundle or Pro access.

Best for

Platform teams and engineering managers writing per-failure-mode runbooks before incidents instead of during them — database failover, regional outage, payments provider downtime. Most useful for teams with a real on-call rotation and a culture of practicing the runbooks rather than filing them.

Not ideal for

Teams that don’t yet have basic monitoring and on-call coverage — the runbook will reference observability surfaces that don’t exist. Also a poor fit as a substitute for incident response training; the document is the playbook, not the muscle memory of running it under pressure.

Included in this purchase

  • Claude Code, Codex, OpenClaw, and Google Antigravity skill files.
  • Setup guidance for the right adapter in your workspace.
  • One-time license for the purchased skill version.

Setup

Plan for a short copy-and-configure setup in your preferred agent workspace. No custom integration is required for the skill file itself.


Related Skills

Incident Response
Incident Postmortem Writer
Generates a structured blameless postmortem from incident timelines, alerts, and deploy logs with root cause analysis, impact assessment, and owned action items. Useful for producing first-draft postmortems under operational pressure.
Claude Code, Codex, OpenClaw, Google Antigravity
postmortems, incident-response, operations

$19.99

One-time license

Security Scanning
OWASP Top 10 Scanner
Scans code for OWASP Top 10 vulnerability patterns including injection, XSS, IDOR, and insecure deserialization with severity ratings and remediation snippets. Useful for pre-commit security checks and enterprise compliance.
Claude Code, Codex, OpenClaw, Google Antigravity
security, owasp, vulnerabilities

$19.99

One-time license

Security Scanning
Secret Leakage Preventer
Scans code and commits for hardcoded secrets, API keys, connection strings, and credentials, then proposes secure alternatives. Useful for catching leaked credentials, a common source of security incidents in AI-assisted development.
Claude Code, Codex, OpenClaw, Google Antigravity
security, secrets, credentials

$19.99

One-time license


Future Updates

This purchase includes the current version of the skill. If you want future adapter updates — meaning compatibility and packaging updates as supported platforms evolve — plus new catalog additions included automatically, upgrade to Pro.
