Wednesday, May 4, 2016

Read from file: Length-prefixed protocol buffers

File format

In short, each record is a binary-encoded protobuf message prefixed with its size, written as a varint. For a longer description, see Length-prefix framing for protocol buffers.
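To make the layout concrete, here is a minimal sketch of how one record could be framed by hand. The helper names `encode_varint` and `frame` are made up for illustration and are not part of the protobuf library; the payload stands in for the bytes you would get from a message's `SerializeToString()`.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("length prefix must be non-negative")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low_bits)         # last byte: continuation bit clear
            return bytes(out)


def frame(payload):
    """Prefix a serialized message with its length, giving one record."""
    return encode_varint(len(payload)) + payload


# Stand-in for msg.SerializeToString(); any bytes are framed the same way.
record = frame(b"abc")
```

A file in this format is simply such records written one after another, so a reader only needs the length prefix to know where the next message starts.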

In most languages the file is read as a stream. If you need to open and parse a file that holds length-prefixed protobuf messages in Python, the code you find online typically reads the whole file into memory first. That is a huge bottleneck: reading the whole file and then parsing it byte by byte was about 1000x slower than using BufferedReader. So enjoy my little piece of code:

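import gzip
from io import BufferedReader

from google.protobuf.internal import decoder


def ReadItm(fname, constructor, size_limit=0):
    ''' Reads and parses length-prefixed protobuf messages from a file.
        The file MUST NOT be corrupted. The parsing is equivalent to parseDelimitedFrom.
        constructor is the generated message class; size_limit <= 0 means no limit.
    '''
    # Files ending in '.gzip' are decompressed on the fly.
    f = gzip.open(fname, 'rb') if fname.endswith('.gzip') else open(fname, 'rb')
    try:
        reader = BufferedReader(f)
        bytes_read = 0
        while size_limit <= 0 or bytes_read < size_limit:
            # Read the varint prefix one byte at a time; peek() may return
            # fewer bytes than requested at a buffer boundary.
            prefix = bytearray(reader.read(1))
            if len(prefix) == 0:
                break  # clean end of file
            while prefix[-1] & 0x80:  # continuation bit set: more prefix bytes
                nxt = reader.read(1)
                if not nxt:
                    raise EOFError('file ends inside a length prefix')
                prefix += nxt
            # _DecodeVarint is a private helper of the protobuf package.
            (size, position) = decoder._DecodeVarint(bytes(prefix), 0)
            itm = constructor()
            itm.ParseFromString(reader.read(size))
            bytes_read = bytes_read + position + size
            yield itm
    finally:
        # Runs even if the caller stops iterating early.
        f.close()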
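If you want to try the streaming pattern without generated protobuf classes, the following dependency-free sketch does the same thing for raw payloads. It assumes Python 3; `read_varint` and `read_frames` are hypothetical names, and the hand-written varint decoder stands in for the protobuf library's own.

```python
import io


def read_varint(reader):
    """Read one varint from reader; return None at a clean end of stream."""
    value = 0
    shift = 0
    while True:
        byte = reader.read(1)
        if not byte:
            if shift == 0:
                return None  # no more records
            raise EOFError("stream ended inside a length prefix")
        value |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # continuation bit clear: prefix complete
            return value
        shift += 7


def read_frames(stream):
    """Yield the raw payload of each length-prefixed record in stream."""
    reader = io.BufferedReader(stream)
    while True:
        size = read_varint(reader)
        if size is None:
            break
        payload = reader.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a record")
        yield payload


# Three records: b"abc", an empty payload, and b"hi".
data = b"\x03abc" + b"\x00" + b"\x02hi"
frames = list(read_frames(io.BytesIO(data)))
```

Each yielded payload could then be handed to a message's `ParseFromString`. Because the reader is buffered, the single-byte reads for the prefix are served from memory rather than hitting the file each time.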
