It happens: something goes wrong, the system goes down, it stops processing orders or serving ads or … the situation gets tense.
I would say that how an organisation handles those critical situations is a pretty good indicator of its shape. I’ve seen companies put their employees under huge pressure during incidents and blame them afterwards.
Then I worked for Google and… When you make a mistake there, the impact is measured in thousands of QPS. You look at a graph and see some stats drop or spike like crazy. You roll back or do an emergency release and it starts working again (ok, it’s more complex, but hey). You (in most cases more than just you) are responsible for writing a postmortem.
In short, it’s a document describing what happened. I know it looks different in many companies; many adopted it after some ex-Googler joined them (can we call it Googleism? or Googleisation?). What is important about Google’s postmortems is their purpose. The main aim of a postmortem is to learn and never repeat old mistakes.
That changes the perspective dramatically. Writing about your own fuckup is not trivial, and writing in a no-blame way is even harder. It’s not easy even for local stars (local genius theory). How do you write a postmortem? What should you put in it? How should it look? Check out the links:
- Blameless postmortems - how and what? (archive)
- Google Chrome postmortem template.
- How to write Incident Report / Postmortem - video
The document should usually be created within 24-48h after the incident, unless the incident is very complex or not yet understood. Usually, postmortems are open for comments, so anyone can ask about the details. Once the document is ready and has gone through some ‘peer review’, it should be sent to all interested parties (mailing group, #slack, whatever) and it should be discoverable. That means it should end up in the bug tracker under the issue (because you have an open issue about the outage, right?). It should also be kept in some postmortem database. That way, even if it’s not you who pushed the binary or wrote the code, you will learn. And in the future, you can always look for similar cases.
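
If you don’t have a postmortem database yet, even something tiny helps. Below is a minimal sketch of the idea in Python, not how Google or anyone else actually does it. The `Postmortem` fields, the `written_in_time` and `find_similar` helpers, the tags and the `BUG-…` issue ids are all made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Postmortem:
    # All fields are illustrative assumptions, not an official template.
    title: str
    issue_id: str          # bug tracker issue opened for the outage
    incident_at: datetime  # when the incident happened
    written_at: datetime   # when the document was created
    tags: set[str] = field(default_factory=set)


def written_in_time(pm: Postmortem, limit_hours: int = 48) -> bool:
    """True if the document was created within the limit after the incident."""
    return pm.written_at - pm.incident_at <= timedelta(hours=limit_hours)


def find_similar(database: list[Postmortem], tags: set[str]) -> list[Postmortem]:
    """Return postmortems sharing at least one tag, most shared tags first."""
    matches = [pm for pm in database if pm.tags & tags]
    return sorted(matches, key=lambda pm: len(pm.tags & tags), reverse=True)


# A toy "database" with two hypothetical incidents.
database = [
    Postmortem("Ads serving outage", "BUG-1",
               datetime(2020, 1, 1, 10), datetime(2020, 1, 2, 9),
               {"ads", "rollout"}),
    Postmortem("Orders stuck", "BUG-2",
               datetime(2020, 2, 1, 10), datetime(2020, 2, 5, 10),
               {"orders", "database"}),
]

# Looking for past cases similar to a new incident tagged like this:
similar = find_similar(database, {"rollout", "ads", "config"})
```

The point is not the code itself but the two properties it encodes: every postmortem is tied to an issue, and you can search old ones by what they were about.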