Incident Postmortem Writer
Generates a structured blameless postmortem from incident timelines, alerts, and deploy logs with root cause analysis, impact assessment, and owned action items. Useful for producing first-draft postmortems under operational pressure.
Intended for SRE and platform teams running incident retros, engineering leads writing postmortems under time pressure, compliance teams preparing SOC 2 / ISO 27001 incident-response evidence, and founders who do incident response for their own small teams.
After an incident, the people with the most context are the ones too tired to write, and the ones with the energy to write don't have the context. The result is either a postmortem that gets written weeks late (losing accuracy) or a shallow one that doesn't surface real learnings. A structured writer takes the raw material — timeline of pages, deploy logs, chat messages — and produces a first-draft postmortem that the team reviews and corrects, cutting the emotional and time cost by 80%.
One-Time Purchase
$19.99
Postmortem — INC-2026-042: Payments API 5xx Spike
Severity: SEV2 (degraded primary service, no full outage) · Author: @oncall-platform
One-paragraph summary
A deploy to the payments service landed last week with a regression in the idempotency-key cache. The cache returned stale write-state for ~9% of POST /charges requests, causing a 5xx spike that lasted 26 minutes from first alert to rollback. No money moved incorrectly. Mitigation was a rollback; resolution was a forward-fix the next day. Detection was fast (1 minute alert-to-page), mitigation was slow (we hesitated to roll back during a peak window).
Total detect-to-resolve. Detect-to-mitigate (rollback) was 26 minutes; the second 26 minutes was forward-fix + verification. SLO impact: 0.04% of the error budget for the month.
Impact
| Dimension | Value |
|---|---|
| Requests affected | ~14,200 POST /charges with 5xx response |
| Customers affected | 1,840 unique merchants |
| Funds at risk | $0 — no incorrect charges |
| Downstream services degraded | 2 (Subscriptions retried; Webhooks fell behind by ~8 min) |
| Error budget consumed | 0.04% of the monthly budget |
Timeline (UTC)
What happened, in order
Contributing Factors
| Factor | Severity | Notes |
|---|---|---|
| Idempotency cache returned stale write-state for some keys | High | Direct cause; introduced in v3.2.0 |
| Canary metrics didn't catch the regression | Medium | Cache hit-rate threshold was too lenient; 10% canary was not enough traffic to surface |
| Rollback hesitation during peak hour | Medium | We waited ~6 minutes debating forward-fix vs rollback; should have been a rollback the moment cause was identified |
| Runbook didn't list the cache as a possible 5xx source | Low | Slowed root-cause hypothesis by ~4 minutes |
Action Items
| # | Action | Owner | Due | Priority |
|---|---|---|---|---|
| 1 | Add cache-consistency check to canary gates | @platform-team | This sprint | P0 |
| 2 | Tighten 5xx canary threshold; promote at 25% not 10% | @platform-team | This sprint | P0 |
| 3 | Update payments runbook: cache as 5xx source | @oncall-platform | Next week | P1 |
| 4 | Rollback-first policy doc: cause-clear → rollback within 2 min | @eng-leadership | Two weeks | P1 |
| 5 | Idempotency-cache integration test for stale-read scenario | @payments-team | Next sprint | P2 |
What Went Well
Detection
Alert fired within 60 seconds of the deploy. On-call acked in under a minute. The instrumentation we shipped last quarter is paying for itself.
Comms
Status-page update went out at 14:21 — nine minutes after alert. Customers reported they appreciated the early heads-up even before we had a root cause.
What to Change
Default to rollback
The biggest single time loss was deliberating between rollback and forward-fix while customers were getting 5xx. If a deploy is identified as the cause and the rollback is clean, the rollback should start within 2 minutes of cause identification. The forward-fix is a separate decision.
Canary gates need teeth
A 10% canary that runs for under 10 minutes will not catch a 9%-of-requests bug at any reasonable threshold. Either expand the canary or add cache-specific signals as gates.
This sample illustrates the skill's output format. Names, numbers, and timelines are illustrative.
All sales final. No refunds on digital products.
Includes support for Claude Code, Codex, OpenClaw, and Google Antigravity in the same license.
Best for
On-call engineers and SREs who need to ship a postmortem within 48 hours of an incident while context is fresh but writing energy is low. Most useful for teams running a real blameless retro practice where the first draft is meant to be reviewed and corrected, not published as-is.
Not ideal for
A standalone source for regulatory incident disclosures (HIPAA breach notification, GDPR Article 33 reports, public-company material event filings). It is fine to use the output as a first draft for those, but counsel must review and edit before anything that carries legal weight is filed externally.
Included in this purchase
- Claude Code, Codex, OpenClaw, and Google Antigravity skill files.
- Setup guidance for the right adapter in your workspace.
- One-time license for the purchased skill version.
Setup
Plan for a short copy-and-configure setup in your preferred agent workspace. No custom integration is required for the skill file itself.
Future Updates
This purchase includes the current version of the skill. If you want future adapter updates — meaning compatibility and packaging updates as supported platforms evolve — plus new catalog additions included automatically, upgrade to Pro.