DevOps & Security
Rollback Runbook Writer
Produces a deployment rollback runbook with verification commands, notification steps, and escalation contacts, versioned alongside each release. Useful for writing rollback plans before incidents instead of during them. SRE and platform teams writing per-release runbooks, engineering leads responsible for deploy safety, on-call engineers who inherit rollback responsibility for systems they did not build. Teams that do write them often produce high-level descriptions that lack the specific commands and verification steps needed at 3am. When rollback fails, the recovery path gets longer and the incident turns from a quick fix into a war-room session. A rollback runbook generated at deploy time — calm, specific, with verified commands — shortens incidents by design.
One-Time Purchase
$19.99
# ROLLBACK-1.4.2.md — Payments API
**Release:** 1.4.2 (rolling back to 1.4.1)
**Deploy target:** Kubernetes cluster `prod-us-east`, namespace `payments`
**Author of runbook:** Generated at deploy time; verify commands before executing
**Rollback owner on call:** primary SRE on rotation (see PagerDuty)
<div data-callout="critical" data-label="When to run this">
Trigger this runbook **immediately** if any of the following are true after the 1.4.2 rollout completes:
- Error rate on `/v1/charges` exceeds 1% for 3+ minutes
- p99 latency on `/v1/charges` exceeds 2s for 5+ minutes
- Any unrecoverable schema or contract error in the logs (`column does not exist`, `unknown field`)
- Stripe webhook delivery success rate drops below 99%
Do not wait for a customer report. The deploy notes for 1.4.2 flag a refactor of the charge-capture path; the rollback path below has been tested in staging against this exact release.
</div>
<div data-stack data-stack-title="Procedure summary">
<div data-row data-value="1">Page secondary SRE + payments engineering lead</div>
<div data-row data-value="2">Snapshot current state (image digest, replica count, last 5m of metrics)</div>
<div data-row data-value="3">Roll back via `kubectl rollout undo`</div>
<div data-row data-value="4">Verify pods, probes, RED metrics, Stripe webhook deliverability</div>
<div data-row data-value="5">Post incident note in #payments-eng + write postmortem ticket</div>
</div>
---
## Step-by-step
### 1. Snapshot the current state
```bash
# Image and rollout history (capture both)
kubectl get deployment payments-api -n payments -o yaml | grep image:
kubectl rollout history deployment/payments-api -n payments
# Current replica count (so we can confirm the rollback preserves it)
kubectl get deployment payments-api -n payments -o jsonpath='{.spec.replicas}'
```
Paste the output into the incident channel before continuing.
### 2. Roll back
```bash
kubectl rollout undo deployment/payments-api -n payments
# Watch the rollout — should complete in 60–90 seconds for 3 replicas
kubectl rollout status deployment/payments-api -n payments --timeout=3m
```
If `rollout undo` reports `no previous revision`, see the **escalation** section.
### 3. Verify
```bash
# 3a. All pods running and ready
kubectl get pods -n payments -l app.kubernetes.io/name=payments-api
# 3b. Image matches the previous-release digest
kubectl get deployment payments-api -n payments -o jsonpath='{.spec.template.spec.containers[0].image}'
# 3c. Probe a known-good endpoint
curl -sf -o /dev/null -w "%{http_code}\n" https://payments.example.com/ready
# 3d. Inspect the RED dashboard — error rate should return to baseline within 2 minutes
open "https://grafana.example.com/d/payments-red"
```
---
## Decision tree
| Symptom after rollback | Action | Severity |
|---|---|---|
| All pods Ready, error rate falls to baseline | Mark incident mitigated; schedule postmortem | <span data-pill="positive">resolved</span> |
| Pods Ready but Stripe webhook success rate still degraded | Confirm Stripe dashboard, then page Stripe integration on-call | <span data-pill="caution">partial</span> |
| Pods CrashLoopBackOff after rollback | Stop. Database migration likely involved — go to escalation | <span data-pill="critical">escalate</span> |
| `rollout undo` reports no previous revision | Stop. Hand off to platform on-call | <span data-pill="critical">escalate</span> |
| New rate of 5xx introduced by rollback itself | Stop. Forward-fix may be required | <span data-pill="critical">escalate</span> |
<div data-callout="caution" data-label="If 1.4.2 included a migration">
1.4.2 added a non-null column to `charges`. The migration is **backward-compatible** (column has a default), so the rollback works without a schema revert. **But** if any 1.4.2 pod has already written rows that depend on app logic not present in 1.4.1, those rows may be unreachable until forward-fix. Confirm with the payments engineering lead before declaring "all clear."
</div>
<div data-callout="critical" data-label="Escalation triggers">
Page the **payments engineering lead** and **platform on-call** if any of the following happen:
- `kubectl rollout undo` exits non-zero
- More than 1 pod is in CrashLoopBackOff for over 2 minutes after rollback
- Database queries return `column does not exist` after rollback (indicates migration drift)
- Customer-visible payment failures continue beyond 5 minutes after rollback completes
Escalation channel: PagerDuty service **payments-prod**. Bridge: Zoom link in #incident.
</div>
---
## Postmortem ticket template
```markdown
- Detection: <how it was caught — alert ID, customer report, etc.>
- Time to mitigation: <rollback complete - first alert>
- Root cause (hypothesis): <fill after investigation>
- Action items:
- [ ] <add automated check that would have caught this pre-deploy>
- [ ] <improve canary criteria>
```
*This runbook was generated at deploy time against release 1.4.2. Re-generate for every release; rollback paths are version-specific.*
This sample illustrates the skill's output format. Names, metrics, and operational details are illustrative unless the artifact explicitly analyzes public information.
View full sample →
All sales final. No refunds on digital products.
Includes support for Claude Code, Codex, OpenClaw, and Google Antigravity in the same license.
Also in Incident Response
Bundle price: $55. Compare this skill with the full workflow bundle or Pro access.
Best for
SRE and platform teams that ship to production weekly or more and want a per-release rollback runbook generated at deploy time — with the actual commands, dashboards, and verification steps for this release, not a generic template. Especially valuable for on-call engineers who inherit rollback responsibility for services they didn’t build.
Not ideal for
Deploys that are functionally not rollback-able (database migrations that have run, feature flags already rolled forward through the cache, regulatory state changes) — pretending a runbook exists for those is worse than acknowledging the rollback path is forward-fix. Also a poor fit for teams without real deploy automation; the runbook references commands that have to actually work.
Included in this purchase
- Claude Code, Codex, OpenClaw, and Google Antigravity skill files.
- Setup guidance for the right adapter in your workspace.
- One-time license for the purchased skill version.
Setup
Plan for a short setup in the repository or workspace where the skill will run. Some coding familiarity helps for implementation-heavy outputs.
Related Skills
$19.99
One-time license
$19.99
One-time license
$19.99
One-time license
Future Updates
This purchase includes the current version of the skill. If you want future adapter updates — meaning compatibility and packaging updates as supported platforms evolve — plus new catalog additions included automatically, upgrade to Pro.