Tuesday, May 10, 2016

Postmortem culture

Culture

It happens: something goes wrong, the system goes down, it stops processing orders or serving ads or … the situation gets tense.

I would say that how an organisation handles those critical situations is a pretty good indicator of its shape. I’ve seen companies putting employees under huge pressure during incidents and blaming them afterwards.

Then I worked for Google and… When you make a mistake, the impact is measured in thousands of QPS. You look at a graph and see some stats drop or spike like crazy. You roll back or do an emergency release and things start recovering (ok, it’s more complex than that, but hey). Then you (in most cases not only you) are responsible for writing a postmortem.

What is a postmortem?

In short, it’s a document describing what happened. I know it looks different in many companies; many adopted the practice after some ex-Googler joined them (can we call it Googleism? or Googleisation?). What is important about Google’s postmortems is their purpose, and the main aim of a postmortem is: to learn and never repeat old mistakes.

That changes the perspective dramatically. Writing about your own fuckup is not trivial, and writing it in a no-blame way is even harder. It’s not easy even for local stars (the local genius theory). How do you write a postmortem? What do you put in it? How should it look? Check out the links, and see the rough outline below.
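To give a rough idea (this is not an official template, just the typical shape popularised by Google’s SRE practices; section names vary from team to team), a postmortem usually covers something like:

- Summary and impact: what broke, for how long, who and how many users/requests were affected.
- Root cause(s) and trigger: what was actually wrong and what set it off.
- Detection and resolution: how the problem was noticed and how the bleeding was stopped.
- Timeline: a timestamped sequence of events from first symptom to all-clear.
- Action items: concrete follow-ups with owners, tracked in the bug tracker.
- Lessons learned: what went well, what went wrong, where we got lucky.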

What’s next?

The document should usually be created within 24-48h after the incident, unless the incident is very complex or not yet understood. Usually the postmortem is open for comments, so everyone may ask questions about the details. When it’s ready and has gone through some ‘peer review’, it should be sent to all interested parties (mailing group, #slack, whatever) and it should be discoverable. That means it should end up in the bug tracker attached to the issue (because you have an open issue about the outage, right?) and it should be kept in some postmortems database. Thus, even if it wasn’t you who pushed the binary or wrote the code, you will learn. And in the future, you can always look up similar cases.

Wednesday, May 4, 2016

Read from file: Length-prefixed protocol buffers

File format

In short, it’s a binary-encoded protobuf message prefixed with its size. For a longer description please check Length-prefix framing for protocol buffers.
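As a minimal sketch of the format (assuming a hypothetical generated message class MyMessage and using the protobuf library’s internal varint encoder), writing such a file boils down to prepending each serialized message with its varint-encoded length, mirroring Java’s writeDelimitedTo:

    from google.protobuf.internal import encoder

    def WriteItm(f, messages):
        ''' Writes length-prefixed protobuf messages to an open binary file. '''
        for msg in messages:
            data = msg.SerializeToString()
            # _VarintBytes returns the varint encoding of the message length.
            f.write(encoder._VarintBytes(len(data)))
            f.write(data)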

In most languages the file is read as a stream. If you need to open and parse a file which holds length-prefixed protobuf messages in Python, you search the internet and find code which reads the whole file into memory first. That’s a huge bottleneck: the performance difference between reading the whole file and parsing it byte after byte versus using a BufferedReader was around 1000x. So enjoy my little piece of code:

import gzip
from io import BufferedReader
from google.protobuf.internal import decoder


def ReadItm(fname, constructor, size_limit=0):
    ''' Reads and parses length-prefixed protobuf messages from a file.
        The file MUST not be corrupted. The parsing is equivalent to
        Java's parseDelimitedFrom. `constructor` is the message class;
        `size_limit` (in bytes) optionally stops reading early.
    '''
    if fname.endswith('.gzip'):
        f = gzip.open(fname, 'rb')
    else:
        f = open(fname, 'rb')
    try:
        reader = BufferedReader(f)
        bytes_read = 0
        while size_limit <= 0 or bytes_read < size_limit:
            # Peek at the next few bytes; a varint length takes at most 10 bytes.
            buffer = reader.peek(10)
            if len(buffer) == 0:
                break  # end of file
            (size, position) = decoder._DecodeVarint(buffer, 0)
            # Consume the varint itself, then read and parse the message body.
            reader.read(position)
            itm = constructor()
            itm.ParseFromString(reader.read(size))
            bytes_read += position + size
            yield itm
    finally:
        f.close()
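For completeness, a minimal usage sketch (assuming a hypothetical generated module my_pb2 with a message class MyMessage, and a file records.gzip written in this format):

    from my_pb2 import MyMessage  # hypothetical generated protobuf module

    for itm in ReadItm('records.gzip', MyMessage):
        print(itm)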