Learning from Production Incidents
Note: This was originally posted internally at Walmart, and has since been sanitized for public consumption.
The postmortem process is a tool that we use to better understand failures within our systems. There are two ways to view failures within complex systems: "That failure cost us $250,000" or "The company spent $250,000 to learn this lesson". Taking the second approach, this document aims to outline a process which wrings as much value from that lesson as possible.
There are a few critically important aspects of a postmortem that we'll follow within the company.
Postmortems are meant to be understandable to those without our expertise (those in other internal organizations). The failure modes of complex systems often contain learnings for folks in different domains. We learn a lot about how to prevent failure, for instance, by studying industries that have higher safety requirements (e.g. seat belt manufacturing or aerospace engineering). By making these postmortems understandable to other teams within the company, we amplify our hard-won learnings so that the entire company can benefit from our investment. Practically speaking, this means referring to your "primary database" rather than "db01" when writing your document.
Postmortems are not tools to blame others. They are a way to drive change in processes and decision making so that we may better serve our customers. To that end, we do not name individuals within postmortems, but reference them by role if necessary. Examples of this would be "Operator restarted the Cassandra node to clear up the out of memory issues" or "Operator escalated to director to approve change to production within freeze window".
Postmortems must be timely. There is a real risk to postmortems that linger, because the data we rely on has a shelf life within the company. We don't keep logs and metrics indefinitely, and what we do keep loses value over time (e.g. code changes drift away from the logs, which makes forensic analysis more difficult, or metrics are no longer kept at the appropriate granularity). Because of this timeliness concern, we'll complete our postmortems within 1 week of the incident.
Postmortems must be reviewed. Review helps us disseminate learnings, and an outside perspective has a way of uncovering learnings that people close to the problem might have missed. To address this, we conduct a regular meeting to read and discuss postmortems within our organization. To ensure that everyone is on the same page during this review process, we'll use a common template across the company. Externally, we can look at this repository, which is similar to the template used within Amazon. A common template keeps postmortems consistent and makes them easier for reviewers to follow.
Postmortems must have action items. We put in a lot of effort to uncover root causes and identify resolutions. This value is lost, however, if we do not hold ourselves accountable for when this work needs to be done. To this end, each action item the team finds will require a due date, which is set by the team. Teams will be notified as they near these deadlines. We will escalate missed deadlines to management so that they may help teams make the necessary time to prevent these issues from happening in the future.
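The action item lifecycle described above (a team-set due date, a reminder as the deadline nears, escalation on a miss) can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually use: the `ActionItem` type, the `classify` function, and the three-day `REMINDER_WINDOW` are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Status(Enum):
    ON_TRACK = "on_track"
    REMIND = "remind"      # deadline is near: notify the team
    ESCALATE = "escalate"  # deadline missed: involve management
    DONE = "done"


@dataclass
class ActionItem:
    description: str
    owner_role: str  # a role, never an individual's name
    due: date        # set by the team that owns the item
    completed: bool = False


# Assumed value for illustration; the real reminder lead time is not specified.
REMINDER_WINDOW = timedelta(days=3)


def classify(item: ActionItem, today: date) -> Status:
    """Decide what, if anything, should happen for this action item today."""
    if item.completed:
        return Status.DONE
    if today > item.due:
        return Status.ESCALATE
    if item.due - today <= REMINDER_WINDOW:
        return Status.REMIND
    return Status.ON_TRACK
```

A scheduled job could run `classify` over every open action item once a day, sending a reminder to the owning team for `REMIND` and notifying management for `ESCALATE`. Note that the owner is recorded as a role, in keeping with the blameless principle above.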