[Incident Title] - [Incident Date]

Summary

A good summary is 3-5 sentences and includes a high-level explanation of what happened, why it happened, and what the impact was. The summary should be written so that non-technical stakeholders can understand it.

After the summary, the next sections are organized chronologically:

Lead Up
Fault
Detection
Resolution
Impact

Lead Up

What happened immediately before the incident? This section will usually describe code changes, deployment activities, and/or changes to processes. These are the activities and events that led to the actual production incident.

Fault

What wasn't working? How did failures manifest to end users? If we have data or metrics on error rates or degraded performance, or screenshots of visible errors, this is where they go.

Detection

How did we learn about the failure? Did we get an alert from our monitoring systems, or did we get an email from a client or customer? What time did this occur, and at what times was the message seen and acknowledged?
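The timestamps asked for above can be turned into two simple intervals: how long the failure went unnoticed, and how long the message waited before someone acknowledged it. The following is a minimal sketch, not a prescribed tool. The function name, the field names, and the example timestamps are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class DetectionTimes(NamedTuple):
    time_to_detect: timedelta       # failure began -> message first seen
    time_to_acknowledge: timedelta  # message first seen -> message acknowledged


def detection_times(
    fault_began: datetime,
    message_seen: datetime,
    message_acknowledged: datetime,
) -> DetectionTimes:
    """Compute detection intervals from three timestamps in chronological order."""
    if not fault_began <= message_seen <= message_acknowledged:
        raise ValueError("timestamps must be in chronological order")
    return DetectionTimes(
        time_to_detect=message_seen - fault_began,
        time_to_acknowledge=message_acknowledged - message_seen,
    )


# Hypothetical example values, not taken from a real incident.
times = detection_times(
    fault_began=datetime(2024, 1, 5, 11, 30),
    message_seen=datetime(2024, 1, 5, 11, 34),
    message_acknowledged=datetime(2024, 1, 5, 11, 42),
)
print(f"Time to detect: {times.time_to_detect}")
print(f"Time to acknowledge: {times.time_to_acknowledge}")
```

Recording these intervals alongside the raw timestamps makes it easier to compare detection across incidents.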

Resolution

How did we respond to mitigate the effects of the failure? Did we perform a rollback, increase database CPU capacity, or fix a defect? Include details about the things we tried that didn't work, as well as partial fixes and temporary band-aids. What time did normal operation resume? Did we wind up turning on a maintenance mode? If so, at what time was it turned on, and at what time was the app restored?

Impact

The Impact section is a final reckoning of the consequences of the incident. Try to answer the question "what did this incident cost?". It will be rare that we can answer that question in dollars with much precision, but we can make statements like "On Friday, from 11:30 to 12:30, customers were not able to place orders. During that hour we receive 230 orders on average".
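The arithmetic behind a statement like the one above can be sketched as outage duration multiplied by the average rate for that period. This is a rough estimate only, assuming the historical average would have held during the outage. The function name and the calendar date are hypothetical. The times and the 230 orders per hour come from the example sentence.

```python
from datetime import datetime


def estimate_missed_orders(
    outage_start: datetime,
    outage_end: datetime,
    avg_orders_per_hour: float,
) -> float:
    """Estimate orders not placed during an outage from an average hourly rate."""
    if outage_end < outage_start:
        raise ValueError("outage_end must not be earlier than outage_start")
    outage_hours = (outage_end - outage_start).total_seconds() / 3600
    return outage_hours * avg_orders_per_hour


# 11:30 to 12:30 on a Friday (the date itself is a placeholder).
start = datetime(2024, 1, 5, 11, 30)
end = datetime(2024, 1, 5, 12, 30)
print(estimate_missed_orders(start, end, avg_orders_per_hour=230))
```

An estimate like this is a ceiling rather than a measurement, since some customers may have returned and ordered later.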

Timeline

The timeline section provides a chronological list of events and actions taken related to the incident. There is no analysis in this section; it is simply a factual accounting. Please see these external examples [1, 2] of what a timeline looks like.
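If events are collected from several sources (chat logs, alerts, deploy history), a small script can put them in order before they are pasted into the document. The following is a minimal sketch. The function name, the output format, and every event shown are hypothetical placeholders, not entries from a real incident.

```python
from datetime import datetime


def format_timeline(events: list[tuple[datetime, str]]) -> list[str]:
    """Return timeline lines sorted by timestamp, one factual entry per line."""
    ordered = sorted(events, key=lambda event: event[0])
    return [f"{timestamp:%Y-%m-%d %H:%M} - {description}" for timestamp, description in ordered]


# Hypothetical entries gathered out of order.
events = [
    (datetime(2024, 1, 5, 11, 42), "On-call engineer acknowledges the alert"),
    (datetime(2024, 1, 5, 11, 30), "Deployment completes"),
    (datetime(2024, 1, 5, 12, 30), "Rollback completes and normal operation resumes"),
    (datetime(2024, 1, 5, 11, 34), "Monitoring alert fires"),
]

for line in format_timeline(events):
    print(line)
```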

Root Cause

Root cause is one of the most important sections in the document. The root cause analysis is fundamental to learning from the incident and to reducing the chance that similar incidents recur. Problems are almost always rooted in some systemic failure. A good root cause analysis will find gaps in process or strategy or, in very rare cases, rule these out.

The "five whys" is a common, lightweight root cause analysis exercise that encourages digging deeper into the fundamental and systemic problems that led to the incident. Five whys can be done alone, but it is more powerful as a group exercise.

Lessons Learned

What went well? What went wrong? Where did we get lucky? The lessons learned example from Google is a great one to copy.

Corrective Action

Briefly catalogue the countermeasures and other action items to be planned as a result of this investigation. It will usually be appropriate to link to other planning docs (or GitHub issues) in this section.

Resources

Five Whys - Lean Enterprise Institute
5 Whys Group Exercise - Atlassian Team Playbook

[Incident Title] - [Incident Date]

Summary

Lead Up

Fault

Detection

Resolution

Impact

Timeline

Root Cause

Lessons Learned

Corrective Action

Resources