Postmortems
Be tolerant of all mistakes the first time. Never allow the same mistake to be made twice - The Lean Startup (Eric Ries, 2011)
Introduction
At Synapse we do our best to avoid mistakes and to build mistake-proof systems (especially in production), but we also understand that some failure is unavoidable. To learn from our failures and build more resilient systems, we conduct blameless postmortems.
A postmortem is an investigation and analysis that occurs after an incident is resolved. When you conduct a postmortem, you collect information to document what happened, and you also analyze why it happened. The analysis is expected to produce a set of suggestions and action items that can prevent similar failures in the future.
Employees performing a postmortem investigation are expected to produce a document with their findings and to publish it so that we can all learn from one another.
Blameless
It is extremely important that all postmortem activities remain blameless. Nobody wants to work in a place where they risk being shamed by their peers, and blame discourages people from sharing important information. To truly understand what happened and to learn from our failures, we must not single out individuals.
Please follow these guidelines to help set the right tone for any investigative activities:
- Always look for and emphasize systemic root causes rather than individual mistakes.
  - For example, it may be true that an individual broke prod, but what missing policies or processes allowed that individual to make that mistake? Is there missing or inaccurate documentation?
- Never include individuals' names in the postmortem report. If you need to refer to an action taken by an individual, use a descriptor like "an engineer" rather than their name. Prefer collective nouns and pronouns like "the team" and "we".
  - Example: "The team merged PR #1234 on Thursday 3/28/24 at 11:28 AM, which triggered an automatic deploy."
Who Should Conduct Postmortems?
It's important for team members to understand that postmortems are a team effort. These collaborative efforts promote a dependable team culture. When conducting a postmortem, ensure that there is sufficient buy-in from leadership and your peers. This way everyone reading the postmortem, whether at Synapse or at the client, can see that the team as a whole was sufficiently represented in the investigation of the incident. Teams that work together to solve problems help build a learning framework that keeps those problems from happening again.
There may be times when you need to conduct a postmortem with fewer team members than would be ideal. When the team conducting the postmortem is thin, keep the following in mind:
- If your peers are not around, or only one or two of them are available, try to consider all perspectives (see the Blameless section).
- Try to include at least one layer of leadership.
- Encourage honesty and completeness.
- Ask for external review or buy-in from members outside your team.
Types of Postmortem Investigations
We recognize that not all incidents are severe, and not all incidents will require a thorough investigation. Therefore we define two types of postmortem investigations: formal and informal.
A formal postmortem is expected to contain a great deal of detail and analysis of what exactly happened, the impact of the incident, and how it can be avoided in the future. Formal postmortems also contain a timeline of important events. See the formal postmortem template for more details.
An informal postmortem investigation should take much less time and can contain much less detail. The timeline is usually skipped. The informal postmortem report will often only contain a handful of sections.
When to Perform a Postmortem
This policy sets no formal requirement for when to perform a postmortem investigation. As of the writing of this document, you will usually be asked to perform a postmortem when one is needed. As a general rule of thumb, assume that a formal postmortem investigation will be required for any critical incident and an informal postmortem investigation for any major incident. Please refer to this document to learn about incident severity.
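For teams that want to encode this rule of thumb in incident tooling, it could be sketched roughly as below. This is only an illustration: the function name, the mapping name, and the lowercase severity labels are assumptions made for the sketch, not part of this policy, and the incident severity document remains the source of truth for how severities are defined.

```python
from typing import Optional

# Rule of thumb from this policy: critical -> formal, major -> informal.
# The string labels are assumptions made for this sketch.
POSTMORTEM_BY_SEVERITY = {
    "critical": "formal",
    "major": "informal",
}


def suggested_postmortem(severity: str) -> Optional[str]:
    """Return the postmortem type the rule of thumb suggests.

    Returns None when the rule of thumb says nothing about the given
    severity; a postmortem may still be requested in that case.
    """
    return POSTMORTEM_BY_SEVERITY.get(severity.strip().lower())


if __name__ == "__main__":
    for level in ("critical", "major", "minor"):
        print(level, "->", suggested_postmortem(level))
```

A result of None means only that the rule of thumb is silent for that severity, not that a postmortem is ruled out, since this policy sets no formal requirement either way.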
Resources
- Atlassian's postmortem template
- Postmortem Culture: Learning from Failure from Google's SRE handbook
- Example Postmortem from Google's SRE handbook